README.md in yanbi-ml-0.1.2 vs README.md in yanbi-ml-0.2.0

- old
+ new

@@ -1,8 +1,8 @@
# YANBI-ML

-Yet Another Naive Bayes Implementation
+Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers

## Installation

Add this line to your application's Gemfile:
@@ -32,13 +32,31 @@
classifier.train_raw(:odd, "one three five seven")

classifier.classify_raw("one two three") => :odd
```

+## What is a Fisher Classifier?
+
+An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, computes a probability for each individual feature in a document. If the document does not belong to a given class, you would expect a random distribution of probabilities across its features - in fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to the class, you would expect the probabilities to be skewed towards higher values instead of being randomly distributed. A Fisher classifier uses Fisher's statistical method to compute a p-value measuring how far the feature probabilities diverge from the random distribution you would otherwise expect.
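
To make that concrete, here is a minimal sketch of the combination step Fisher's method performs. It is illustrative only - not code from either version of the gem - and the `fisher_score` and `inv_chi2` names are invented:

```ruby
# Combine per-feature probabilities into a single p-value.
# Under the null hypothesis (the document does NOT belong to the
# class), -2 * sum(ln p) follows a chi squared distribution with
# 2n degrees of freedom, so we ask how extreme our statistic is.

def inv_chi2(chi, df)
  # survival function of the chi squared distribution,
  # valid for an even number of degrees of freedom
  m = chi / 2.0
  sum = term = Math.exp(-m)
  (1...(df / 2)).each do |i|
    term *= m / i
    sum += term
  end
  [sum, 1.0].min
end

def fisher_score(probs)
  chi = -2.0 * probs.sum { |p| Math.log(p) }
  inv_chi2(chi, 2 * probs.size)
end

fisher_score([0.9, 0.8, 0.95])  # => ~0.99, skewed high: strong match
fisher_score([0.3, 0.6, 0.2])   # => ~0.35, looks random: weak match
```

A score near 1 means the feature probabilities are skewed the way class membership would skew them; probabilities that look random land closer to the middle of the range.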
+
+## I don't care, I just want to use it!
+
+Fortunately the interface is pretty consistent:
+
+```ruby
+classifier = Yanbi::Fisher.default(:even, :odd)
+classifier.train_raw(:even, "two four six eight")
+classifier.train_raw(:odd, "one three five seven")
+
+classifier.classify_raw("one two three") => :odd
+```
+
+See? Easy.
+
## Bags (of words)

-A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat word bags as corresponding to a single document.

A handful of classes are provided:

<ul>
<li>WordBag - basic, default bag of words</li>
@@ -161,10 +179,45 @@
docs.each_doc do |d|
  d.remove(STOP_WORDS)
end
```

+## Feature thresholds
+
+A method on the classifier is provided to prune infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that once you prune features there's no un-pruning them afterwards - so be sure you actually want to do it!
+
+```ruby
+classifier = Yanbi.default(:even, :odd)
+
+#...tons of training happens here...
+
+#we now have thousands of documents. Ignore any words we haven't
+#seen at least a dozen times
+
+classifier.set_significance(12)
+
+#actually, the 'odd' category is especially noisy, so let's make
+#that two dozen for odd items
+
+classifier.set_significance(24, :odd)
+```
+
+## Persisting
+
+After going to all the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+```ruby
+classifier.save('testclassifier')
+
+#...some time later
+
+newclassifier = Yanbi::Bayes.load('testclassifier')
+```
+
+Note that an .obj extension is added to saved classifiers by default - no need to explicitly include it.
+
## Putting it all together

```ruby
classifier = Yanbi.default(:stuff, :otherstuff)
@@ -174,14 +227,46 @@
other = Yanbi::Corpus.new
other.add_file('biglistofotherstuff.txt', '@@@@')

stuff.each_doc {|d| classifier.train(:stuff, d)}
other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#...classify all the things....
```

+A slightly fancier example:
+
+```ruby
+STOP_WORDS = %w(in the a and at of)
+
+#classify using stemmed words
+classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+#create our corpora
+stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+stuff.add_file('biglistofstuff.txt', '****')
+
+other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+other.add_file('biglistofotherstuff.txt', '@@@@')
+
+#get rid of those nasty stop words
+stuff.each_doc {|d| d.remove(STOP_WORDS)}
+other.each_doc {|d| d.remove(STOP_WORDS)}
+
+#train away!
+stuff.each_doc {|d| classifier.train(:stuff, d)}
+other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#get rid of the long tail
+classifier.set_significance(50)
+
+#...classify all the things....
+```
+
## Contributing

-Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.

## License

The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
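
To recap how the pieces added in 0.2.0 fit together, here is a round-trip sketch using only the calls demonstrated in the diff above - the categories, training strings, and file name are invented for illustration:

```ruby
classifier = Yanbi.default(:spam, :ham)

#train from raw strings, as in the quick-start examples
classifier.train_raw(:spam, "buy cheap pills cheap")
classifier.train_raw(:spam, "cheap pills going fast")
classifier.train_raw(:ham, "meeting notes attached")
classifier.train_raw(:ham, "notes from the meeting")

#prune words we haven't seen at least twice
classifier.set_significance(2)

#save to disk - the .obj extension is added for us
classifier.save('spamfilter')

#...in another process, some time later...

restored = Yanbi::Bayes.load('spamfilter')
restored.classify_raw("cheap cheap pills")  # => :spam
```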