= muddyit_fu

Muddy is an information extraction platform.  For further
details see the '{Getting Started with Muddy}[http://blog.muddy.it/2009/11/getting-started-with-muddy]'
article.  This gem provides access to the Muddy platform via it's API (see {Muddy Developer Guide}[http://muddy.it/developers/]).

== Installation

  sudo gem install gemcutter
  sudo gem tumble
  sudo gem install muddyit_fu

== Authentication and authorisation

Muddy supports OAuth and HTTP Basic auth for authentication and authorisation.
We recommend you use OAuth wherever possible when accessing Muddy.  An example
of using OAuth with the Muddy platform is described in the
{Building with Muddy and OAuth}[http://blog.muddy.it/2010/01/building-with-muddy-and-oauth]
article.

=== Example muddyit.yml for OAuth

  ---
  consumer_key: YOUR_CONSUMER_KEY
  consumer_secret: YOUR_CONSUMER_SECRET
  access_token: YOUR_ACCESS_TOKEN
  access_token_secret: YOUR_ACCESS_TOKEN_SECRET

=== Example muddyit.yml for HTTP Basic Auth

  ---
  username: YOUR_USERNAME
  password: YOUR_PASSWORD

== Simplest entity extraction example

This example uses the basic 'extract' method to retrieve a list of entities from
a piece of source text.

  require 'muddyit_fu'
  muddyit =  Muddyit.new('./config.yml')
  page = muddyit.extract(ARGV[0])
  page.entities.each do |entity|
    puts "\t#{entity.term}, #{entity.uri}, #{entity.classification}"
  end

== Working with web pages instead of text

Muddy uses an intelligent extraction method to identify the key text on any given
web page, meaning that the entities extracted are relevant to the article and don't
include spurious results from navigation sidebars or page footers.  To work with a
URL rather than text, just specify a URL instead :

  page = muddyit.extract('http://news.bbc.co.uk/1/hi/northern_ireland/8450854.stm')

== Storing extraction results in a collection

Muddy allows you to store the entity extraction results so aggregate operations
can be performed over a collection of content (a 'collection' has many analysed 'pages').
A basic Muddy account provides a single 'collection' where extraction results
can be stored.

To store a page against a collection, the collection must first be found :

  collection = muddyit.collections.find(:all).first

Once a collection has been found, entity extraction results can be stored in it:

  collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:minium_confidence => 0.2})

== Working with a collection

A collection allows aggregate operations to be perfomed on itself and on it's
members.  A collection is identified by it's 'collection token'.  This is an
alphanumeric six character string (e.g. 'a0ret4').  A collection can be found if
it's token is known :

  collection = muddyit.collections.find('a0ret4')

=== Viewing all analysed pages

You can iterate through all the analysed pages in a collection, be aware that
the Muddy API provides the pages as paginated sets, so it may take some time to
page through a complete set of pages in a collection (due to repeated HTTP requests
for each new paginated set of results).

  require 'muddyit_fu'
  muddyit =  Muddyit.new('./config.yml')
  collection = muddyit.collections.find(:all).first
  collection.pages.find(:all) do |page|
    puts page.title
    page.entities.each do |entity|
      puts "\t#{entity.uri}"
    end
  end

=== Finding a particular page or pages

Each page in a collection is assigned a unique alphanumeric identifier.  Whilst
this can be used to find a given page in a collection, it is possible to search
for the page using other attributes :

  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
  page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/business/8186840.stm').first
  page = collection.pages.find(:all, :title => 'BBC NEWS | Business | ITV in 25m Friends Reunited sale').first

=== Rereshing a page's results

A page can be 'refereshed' (the entity extraction is run again) by calling the
refresh method on a page object :

  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
  updated_page = page.update

=== Deleting a page from a collection

A page can be removed from a collection by calling the 'destroy' method on a
page object :

  page = collection.pages.find('5d0e32b6-fd0b-400a-ac49-dae965a292df')
  page.destroy

=== View all pages containing entity 'Gordon Brown'

If we want to find all pages that reference the grounded entity for 'Gordon Brown' then
it can be searched for using it's DBpedia URI :

  require 'muddyit_fu'
  muddyit = Muddyit.new('./config.yml')
  collection = muddyit.collections.find('a0ret4')
  collection.pages.find_by_entity('http://dbpedia.org/resource/Gordon_Brown') do |page|
    puts "#{page.identifier} - #{page.title}"
  end

=== Find related entities for 'Gordon Brown'

To find other entities that occur frequently with 'Gordon Brown' in this
collection :

  require 'muddyit_fu'
  muddyit = Muddyit.new('./config.yml')
  collection = muddyit.collections.find('a0ret4')
  puts "Related entity\tOccurance
  collection.entities.find_related('http://dbpedia.org/resource/Gordon_Brown').each do |entry|
    puts "#{entry[:entity].uri}\t#{entry[:count]}"
  end

=== Find related content for : http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm

To find other content in the collection that shares similar entities with the
analysed page that has a uri 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm' :

  require 'muddyit_fu'
  muddyit = Muddyit.new('./config.yml')
  collection = muddyit.collections.find(:all).first
  page = collection.pages.find(:all, :uri => 'http://news.bbc.co.uk/1/hi/uk_politics/7878418.stm').first
  puts "Page : #{page.title}\n\n"
  page.related_content.each do |results|
    puts "#{results[:page].title} #{results[:count]}"
  end

== Batch processing content and the Muddy queue

The Muddy platform runs a background job queue that allows many requests to be
made in quick succession (rather than waiting for the full extraction request to
complete), with analysis of the pages happening asynchronously via the queue
and being stored in the collection at a later time.  This can be useful when trying
to analyse large content collections.  To send a request to the queue use :

  collection = muddyit.collections.find('a0ret4')
  collection.pages.create('http://news.bbc.co.uk/1/hi/uk_politics/8011321.stm', {:realtime => false})

== Contact

  Author: Rob Lee
  Email: support [at] muddy.it
  Main Repository: http://github.com/rattle/muddyit_fu/tree/master