Sha256: 2d9e58764a5878d56a749e3626028fbe0eca3de29d044b3e2abd196fc913f32a
Contents?: true
Size: 1.34 KB
Versions: 1
Compression:
Stored size: 1.34 KB
Contents
= ankusa

Ankusa is a text classifier in Ruby that uses Hadoop's HBase for storage. Because it uses HBase as a backend, the training corpus can be many terabytes in size.

Ankusa currently uses a Naive Bayes classifier. It ignores common words (a.k.a. stop words) and stems all others. Additionally, it uses Laplacian smoothing in the classification method.

== Installation

First, install HBase / Hadoop. Make sure the HBase Thrift interface has been started as well. Then:

  gem install ankusa

== Basic Usage

  require 'rubygems'
  require 'ankusa'

  # connect to HBase
  storage = Ankusa::HBaseStorage.new 'localhost'
  c = Ankusa::Classifier.new storage

  # Each of these calls will return a bag-of-words
  # hash with stemmed words as keys and counts as values
  c.train :spam, "This is some spammy text"
  c.train :good, "This is not the bad stuff"

  # This will return the most likely class (as symbol)
  puts c.classify "This is some spammy text"

  # This will return a Hash with classes as keys and
  # membership probability as values
  puts c.classifications "This is some spammy text"

  # If you have a large corpus, the probabilities will
  # likely all be 0. In that case, you must use log
  # likelihood values
  puts c.log_likelihoods "This is some spammy text"

  # get a list of all classes
  puts c.classes

  # close connection
  storage.close
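To make the Naive Bayes, Laplacian smoothing, and log likelihood remarks above more concrete, here is a minimal in-memory sketch. It is not Ankusa's implementation: the class name TinyBayes and its internals are hypothetical, it keeps counts in plain Ruby hashes instead of HBase, and it skips stop-word removal and stemming. It only mirrors the method names train, classify, and log_likelihoods from the usage above. Summing logarithms instead of multiplying probabilities is what keeps the scores from underflowing to 0 on a large corpus.

```ruby
# Hypothetical in-memory sketch; NOT Ankusa's code, no HBase, no stemming.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # class => word => count
    @doc_counts  = Hash.new(0)                            # class => documents trained
    @vocabulary  = {}                                     # every distinct word seen
  end

  # Lowercase the text and split it into alphabetic words.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  # Record the words of one training document under the given class.
  def train(klass, text)
    @doc_counts[klass] += 1
    tokenize(text).each do |word|
      @word_counts[klass][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return a Hash of class => log(prior) + sum of log(word likelihoods).
  def log_likelihoods(text)
    total_docs = @doc_counts.values.sum.to_f
    vocab_size = @vocabulary.size
    words = tokenize(text)
    @doc_counts.keys.each_with_object({}) do |klass, scores|
      class_total = @word_counts[klass].values.sum
      score = Math.log(@doc_counts[klass] / total_docs)
      words.each do |word|
        count = @word_counts[klass].fetch(word, 0)
        # Laplacian (add-one) smoothing: an unseen word never yields log(0).
        score += Math.log((count + 1.0) / (class_total + vocab_size))
      end
      scores[klass] = score
    end
  end

  # Return the class with the highest log likelihood.
  def classify(text)
    log_likelihoods(text).max_by { |_, score| score }.first
  end
end

c = TinyBayes.new
c.train :spam, "This is some spammy text"
c.train :good, "This is not the bad stuff"
puts c.classify("This is some spammy text")
puts c.log_likelihoods("This is some spammy text").inspect
```

The scores are negative numbers, and only their relative order matters: the class with the largest (least negative) value wins.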
Version data entries
1 entries across 1 versions & 1 rubygems
Version | Path |
---|---|
ankusa-0.0.6 | README.rdoc |