README.rdoc in muddyit_fu-0.2.11 vs README.rdoc in muddyit_fu-0.2.12
- old
+ new
@@ -1,24 +1,22 @@
= muddyit_fu
Muddy is an information extraction platform. For further
details see the '{Getting Started with Muddy}[http://blog.muddy.it/2009/11/getting-started-with-muddy]'
-article. This gem provides access to the Muddy platform via it's API :
+article. This gem provides access to the Muddy platform via its API (see {Muddy Developer Guide}[http://muddy.it/developers/]).
-{Muddy Developer Guide}[http://muddy.it/developers/]
-
== Installation
sudo gem install gemcutter
sudo gem tumble
sudo gem install muddyit_fu
== Authentication and authorisation
Muddy supports OAuth and HTTP Basic auth for authentication and authorisation.
We recommend you use OAuth wherever possible when accessing Muddy. An example
-of using OAuth with the muddy platform is descibed in the
+of using OAuth with the Muddy platform is described in the
{Building with Muddy and OAuth}[http://blog.muddy.it/2010/01/building-with-muddy-and-oauth]
article.
=== Example muddyit.yml for OAuth
@@ -57,23 +55,32 @@
== Storing extraction results in a collection
Muddy allows you to store the entity extraction results so aggregate operations
can be performed over a collection of content (a 'collection' has many analysed 'pages').
-A basic muddy account provides a single 'collection' where extraction results
+A basic Muddy account provides a single 'collection' where extraction results
can be stored.
To store a page against a collection, the collection must first be found :
collection = muddyit.collections.find(:all).first
Once a collection has been found, entity extraction results can be stored in it:
collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minium_confidence => 0.2})
-== Viewing all analysed pages in a collection
+== Working with a collection
+A collection allows aggregate operations to be performed on itself and on its
+members. A collection is identified by its 'collection token'. This is an
+alphanumeric six character string (e.g. 'a0ret4'). A collection can be found if
+its token is known :
+
+ collection = muddyit.collections.find('a0ret4')
+
+=== Viewing all analysed pages
+
You can iterate through all the analysed pages in a collection. Be aware that
the Muddy API provides the pages as paginated sets, so it may take some time to
page through a complete set of pages in a collection (due to repeated HTTP requests
for each new paginated set of results).
@@ -85,29 +92,46 @@
page.entities.each do |entity|
puts "\t#{entity.uri}"
end
end
-== Working with a collection
+=== Finding a particular page or pages
-A collection allows aggregate operations to be perfomed on itself and on it's
-members. A collection is identified by it's 'collection token'. This is an
-alphanumeric six character string (e.g. 'a0ret4'). A collection can be found if
-it's token is known :
+Each page in a collection is assigned a unique alphanumeric identifier. Whilst
+this can be used to find a given page in a collection, it is possible to search
+for the page using other attributes :
- collection = muddyit.collections.find('a0ret4')
+ page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+ page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
+ page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first
-=== View all pages containing 'Gordon Brown'
+=== Refreshing a page's results
-If we want to find all references to the grounded entity for 'Gordon Brown 'then
+A page can be 'refreshed' (the entity extraction is run again) by calling the
+refresh method on a page object :
+
+ page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+ updated_page = page.update
+
+=== Deleting a page from a collection
+
+A page can be removed from a collection by calling the 'destroy' method on a
+page object :
+
+ page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
+ page.destroy
+
+=== View all pages containing entity 'Gordon Brown'
+
+If we want to find all pages that reference the grounded entity for 'Gordon Brown' then
it can be searched for using its DBpedia URI :
require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find('a0ret4')
collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
- puts page.identifier
+ puts "#{page.identifier} - #{page.title}"
end
=== Find related entities for 'Gordon Brown'
To find other entities that occur frequently with 'Gordon Brown' in this
@@ -116,11 +140,11 @@
require 'muddyit_fu'
muddyit = Muddyit.new('./config.yml')
collection = muddyit.collections.find('a0ret4')
puts "Related entity\tOccurrence"
collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
- puts "#{entry[:enity].uri}\t#{entry[:count]}"
+ puts "#{entry[:entity].uri}\t#{entry[:count]}"
end
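
If the related entities are not already returned in order of frequency, they can be ranked before printing. The following is a minimal sketch that assumes only the entry shape used above (a hash with :entity and :count keys). The Entity struct, the related_table helper and the sample URIs are hypothetical stand-ins so the example runs without a Muddy account:

```ruby
# Hypothetical stand-in for the entity objects returned by the API;
# only the 'uri' reader used in the README example is modelled.
Entity = Struct.new(:uri)

# Illustrative helper (not part of muddyit_fu): rank entries by descending
# count and format the top 'limit' rows as tab-separated lines.
def related_table(entries, limit = 5)
  ranked = entries.sort_by { |entry| -entry[:count] }.first(limit)
  ranked.map { |entry| "#{entry[:entity].uri}\t#{entry[:count]}" }
end

# Made-up sample data in the same shape as the find_related results.
sample = [
  { :entity => Entity.new('http://dbpedia.org/resource/Example_A'), :count => 3 },
  { :entity => Entity.new('http://dbpedia.org/resource/Example_B'), :count => 7 }
]

puts "Related entity\tOccurrence"
puts related_table(sample)
```

With a real collection, the result of collection.entities.find_related(...) would presumably be passed in place of the sample array.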
=== Find related content for : http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm
To find other content in the collection that shares similar entities with the
@@ -132,9 +156,20 @@
page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm').first
puts "Page : #{page.title}\n\n"
page.related_content.each do |results|
puts "#{results[:page].title} #{results[:count]}"
end
+
+== Batch processing content and the Muddy queue
+
+The Muddy platform runs a background job queue, so many requests can be made in
+quick succession without waiting for each extraction to complete. Pages are
+analysed asynchronously via the queue and stored in the collection at a later
+time. This can be useful when analysing large content collections. To send a
+request to the queue use :
+
+ collection = muddyit.collections.find('a0ret4')
+ collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})
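
When many URIs need to be queued, the same call can be wrapped in a loop. The sketch below is an assumption-laden illustration: FakePages is a hypothetical stand-in that only records calls (so the example runs offline), enqueue_all is an illustrative helper that is not part of muddyit_fu, and the way a failed request surfaces (here, as a raised exception) is assumed rather than documented:

```ruby
# Hypothetical stand-in for collection.pages. It mirrors only the
# create(uri, options) call shape shown in this README and records requests.
class FakePages
  attr_reader :requests

  def initialize
    @requests = []
  end

  def create(uri, options = {})
    @requests << [uri, options]
    true
  end
end

# Illustrative helper: send each URI to the queue with :realtime => false
# and return the URIs whose request raised an error.
def enqueue_all(pages, uris)
  failed = []
  uris.each do |uri|
    begin
      pages.create(uri, { :realtime => false })
    rescue StandardError => e
      warn "Could not queue #{uri}: #{e.message}"
      failed << uri
    end
  end
  failed
end

pages = FakePages.new
failed = enqueue_all(pages, [
  'http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm',
  'http://news.bbc.co.uk/1/hi/business/8186840.stm'
])
puts "Queued #{pages.requests.size} page(s), #{failed.size} failure(s)"
```

With a real collection, collection.pages would be passed in place of the FakePages instance.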
== Contact
Author: Rob Lee
Email: support [at] muddy.it