= ankusa

Ankusa is a text classifier in Ruby that uses Hadoop's HBase for storage.  Because it uses HBase as a backend, the training corpus can be many terabytes in size.

Ankusa currently uses a Naive Bayes classifier.  It ignores common words (a.k.a. stop words) and stems all others.  Additionally, it applies Laplace smoothing when classifying, so a word never seen in a class during training does not force that class's probability to zero.
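To make the approach concrete, here is a minimal in-memory sketch of Naive Bayes with stop-word removal and add-one (Laplace) smoothing. It is not Ankusa's actual implementation: the class name +TinyNaiveBayes+, the helper +bag_of_words+, and the tiny stop-word list are all hypothetical, storage is a plain Hash rather than HBase, and stemming is left out for brevity.

```ruby
# Illustrative sketch only; names and stop-word list are assumptions,
# not part of Ankusa.
STOP_WORDS = %w[a an is the this not some].freeze

# Lowercase the text, split it into words, drop stop words, and count
# the rest. (A real classifier would also stem each word.)
def bag_of_words(text)
  words = text.downcase.scan(/[a-z]+/) - STOP_WORDS
  words.each_with_object(Hash.new(0)) { |word, bag| bag[word] += 1 }
end

class TinyNaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => word => count
    @doc_counts  = Hash.new(0)                            # class => training texts seen
  end

  def train(klass, text)
    bag_of_words(text).each { |word, n| @word_counts[klass][word] += n }
    @doc_counts[klass] += 1
  end

  # log P(class) + sum over words of log P(word | class),
  # where P(word | class) uses add-one smoothing.
  def log_likelihoods(text)
    total_docs = @doc_counts.values.sum
    vocab_size = @word_counts.values.flat_map(&:keys).uniq.size
    bag = bag_of_words(text)

    @doc_counts.keys.each_with_object({}) do |klass, scores|
      class_total = @word_counts[klass].values.sum
      score = Math.log(@doc_counts[klass].to_f / total_docs)
      bag.each do |word, n|
        smoothed = (@word_counts[klass][word] + 1.0) / (class_total + vocab_size)
        score += n * Math.log(smoothed)
      end
      scores[klass] = score
    end
  end

  # The most likely class is the one with the highest log score.
  def classify(text)
    log_likelihoods(text).max_by { |_, score| score }.first
  end
end

nb = TinyNaiveBayes.new
nb.train :spam, "This is some spammy text"
nb.train :good, "This is not the bad stuff"
puts nb.classify("This is some spammy text") # prints "spam"
```

Without the +1 in the numerator, any word unseen for a class would give a probability of zero (and a log of negative infinity), which is the problem smoothing is meant to avoid.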

== Installation
First, install HBase / Hadoop.  Make sure the HBase Thrift interface has been started as well.  Then:

  gem install ankusa

== Basic Usage
  require 'rubygems'
  require 'ankusa'

  # connect to HBase 
  storage = Ankusa::HBaseStorage.new 'localhost'
  c = Ankusa::Classifier.new storage

  # Each of these calls will return a bag-of-words
  # hash with stemmed words as keys and counts as values
  c.train :spam, "This is some spammy text"
  c.train :good, "This is not the bad stuff"

  # This will return the most likely class (as symbol)
  puts c.classify "This is some spammy text"

  # This will return a Hash with classes as keys and
  # membership probabilities as values
  puts c.classifications "This is some spammy text"

  # If you have a large corpus, the probabilities will
  # likely all be 0.  In that case, use the log
  # likelihood values instead
  puts c.log_likelihoods "This is some spammy text"

  # get a list of all classes
  puts c.classes

  # close connection
  storage.close
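The note above about probabilities collapsing to 0 is most likely a floating-point underflow effect: multiplying many small per-word probabilities falls below what a Float can represent, while summing their logarithms stays well within range. The sketch below illustrates this with made-up numbers. The helper +normalize_logs+ is hypothetical and is not part of Ankusa; it shows one common way to turn log scores back into relative probabilities when needed.

```ruby
# Hypothetical per-word probabilities for a long text:
# 500 words, each with probability 1e-4 under some class.
word_probs = Array.new(500, 1e-4)

# Multiplying them directly underflows to 0.0.
product = word_probs.reduce(1.0) { |acc, prob| acc * prob }

# Summing logs keeps a usable (large negative) score.
log_score = word_probs.sum { |prob| Math.log(prob) }

puts product    # 0.0
puts log_score  # roughly -4605

# Convert log scores to relative probabilities without underflow
# by subtracting the maximum before exponentiating.
def normalize_logs(log_scores)
  max = log_scores.values.max
  weights = log_scores.transform_values { |score| Math.exp(score - max) }
  total = weights.values.sum
  weights.transform_values { |weight| weight / total }
end

# Made-up log scores for two classes.
relative = normalize_logs(spam: -4605.0, good: -4610.0)
puts relative[:spam] > relative[:good] # true
```

Because only the ordering of the scores matters for picking a class, comparing log likelihoods directly gives the same winner as comparing probabilities would.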

Version Path
ankusa-0.0.6 README.rdoc