README.rdoc in muddyit_fu-0.2.11 vs README.rdoc in muddyit_fu-0.2.12

- old
+ new

@@ -1,24 +1,22 @@
 = muddyit_fu

 Muddy is an information extraction platform. For further details see the
 '{Getting Started with Muddy}[http://blog.muddy.it/2009/11/getting-started-with-muddy]'
-article. This gem provides access to the Muddy platform via it's API :
+article. This gem provides access to the Muddy platform via it's API (see {Muddy Developer Guide}[http://muddy.it/developers/]).

-{Muddy Developer Guide}[http://muddy.it/developers/]
-
 == Installation

   sudo gem install gemcutter
   sudo gem tumble
   sudo gem install muddyit_fu

 == Authentication and authorisation

 Muddy supports OAuth and HTTP Basic auth for authentication and authorisation.
 We recommend you use OAuth wherever possible when accessing Muddy. An example
-of using OAuth with the muddy platform is descibed in the
+of using OAuth with the Muddy platform is described in the
 {Building with Muddy and OAuth}[http://blog.muddy.it/2010/01/building-with-muddy-and-oauth] article.

 === Example muddyit.yml for OAuth
@@ -57,23 +55,32 @@
 == Storing extraction results in a collection

 Muddy allows you to store the entity extraction results so aggregate operations
 can be performed over a collection of content (a 'collection' has many analysed
 'pages').

-A basic muddy account provides a single 'collection' where extraction results
+A basic Muddy account provides a single 'collection' where extraction results
 can be stored.

 To store a page against a collection, the collection must first be found :

   collection = muddyit.collections.find(:all).first

 Once a collection has been found, entity extraction results can be stored in it:

   collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minium_confidence => 0.2})

-== Viewing all analysed pages in a collection
+== Working with a collection

+A collection allows aggregate operations to be perfomed on itself and on it's
+members. A collection is identified by it's 'collection token'. This is an
+alphanumeric six character string (e.g. 'a0ret4'). A collection can be found if
+it's token is known :
+
+  collection = muddyit.collections.find('a0ret4')
+
+=== Viewing all analysed pages
+
 You can iterate through all the analysed pages in a collection, be aware that
 the Muddy API provides the pages as paginated sets, so it may take some time to
 page through a complete set of pages in a collection (due to repeated HTTP
 requests for each new paginated set of results).
@@ -85,29 +92,46 @@
     page.entities.each do |entity|
       puts "\t#{entity.uri}"
     end
   end

-== Working with a collection
+=== Finding a particular page or pages

-A collection allows aggregate operations to be perfomed on itself and on it's
-members. A collection is identified by it's 'collection token'. This is an
-alphanumeric six character string (e.g. 'a0ret4'). A collection can be found if
-it's token is known :
+Each page in a collection is assigned a unique alphanumeric identifier. Whilst
+this can be used to find a given page in a collection, it is possible to search
+for the page using other attributes :

-  collection = muddyit.collections.find('a0ret4')
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
+  page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first

-=== View all pages containing 'Gordon Brown'
+=== Rereshing a page's results

-If we want to find all references to the grounded entity for 'Gordon Brown 'then
+A page can be 'refereshed' (the entity extraction is run again) by calling the
+refresh method on a page object :
+
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  updated_page = page.update
+
+=== Deleting a page from a collection
+
+A page can be removed from a collection by calling the 'destroy' method on a
+page object :
+
+  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+  page.destroy
+
+=== View all pages containing entity 'Gordon Brown'
+
+If we want to find all pages that reference the grounded entity for 'Gordon Brown' then
 it can be searched for using it's DBpedia URI :

   require 'muddyit_fu'
   muddyit = Muddyit.new('./config.yml')
   collection = muddyit.collections.find('a0ret4')
   collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
-    puts page.identifier
+    puts "#{page.identifier} - #{page.title}"
   end

 === Find related entities for 'Gordon Brown'
@@ -116,11 +140,11 @@
   require 'muddyit_fu'
   muddyit = Muddyit.new('./config.yml')
   collection = muddyit.collections.find('a0ret4')
   puts "Related entity\tOccurance
   collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
-    puts "#{entry[:enity].uri}\t#{entry[:count]}"
+    puts "#{entry[:entity].uri}\t#{entry[:count]}"
   end

 === Find related content for : http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm

 To find other content in the collection that shares similar entities with the
@@ -132,9 +156,20 @@
   page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm').first
   puts "Page : #{page.title}\n\n"
   page.related_content.each do |results|
     puts "#{results[:page].title} #{results[:count]}"
   end
+
+== Batch processing content and the Muddy queue
+
+The Muddy platform runs a background job queue that allows many requests to be
+made in quick succession (rather than waiting for the full extraction request to
+complete), with analysis of the pages happening asynchronously via the queue
+and being stored in the collection at a later time. This can be useful when trying
+to analyse large content collections. To send a request to the queue use :
+
+  collection = muddyit.collections.find('a0ret4')
+  collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})

 == Contact

 Author: Rob Lee
 Email: support [at] muddy.it
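
The README's remark that pages arrive as paginated sets, with one HTTP request per set, can be illustrated without the gem. The following is a minimal sketch under stated assumptions, not muddyit_fu's implementation: `FakePagedCollection`, its `fetch` method, and the page size are all hypothetical stand-ins for the remote calls the gem makes.

```ruby
# Hypothetical stand-in for a remote collection served in paginated sets.
# Nothing here is part of muddyit_fu; it only illustrates the access pattern.
class FakePagedCollection
  include Enumerable

  attr_reader :requests

  def initialize(items, per_page)
    @items = items
    @per_page = per_page
    @requests = 0
  end

  # Simulates one HTTP request for a single paginated set (1-based page number).
  def fetch(page_number)
    @requests += 1
    @items.each_slice(@per_page).to_a[page_number - 1] || []
  end

  # Yields every item, fetching the next set only once the previous
  # one has been consumed. An empty set signals the end of the results.
  def each
    return enum_for(:each) unless block_given?

    page_number = 1
    loop do
      batch = fetch(page_number)
      break if batch.empty?

      batch.each { |item| yield item }
      page_number += 1
    end
  end
end

source = FakePagedCollection.new(%w[page-a page-b page-c page-d page-e], 2)
titles = source.to_a
puts titles.inspect
puts "requests made: #{source.requests}"
```

Walking the whole collection costs one simulated request per set plus a final empty one, which is why a full iteration over a large collection can be slow. Stopping early (for example with `first(2)`) avoids fetching the later sets in this sketch.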