README.md in yanbi-ml-0.1.2 vs README.md in yanbi-ml-0.2.0
- old
+ new
@@ -1,8 +1,8 @@
# YANBI-ML
-Yet Another Naive Bayes Implementation
+Yet Another Naive Bayes Implementation - Bayes and Fisher document classifiers
## Installation
Add this line to your application's Gemfile:
@@ -32,13 +32,31 @@
classifier.train_raw(:odd, "one three five seven")
classifier.classify_raw("one two three") #=> :odd
```
+## What is a Fisher Classifier?
+
+An alternative to the standard Bayesian classifier that can also give very accurate results. A Bayesian classifier works by computing a single, document-wide probability for each class that a document might belong to. A Fisher classifier, by contrast, computes a probability for each individual feature in a document. If the document does not belong to a given class, you would expect a random distribution of probabilities across its features - in fact, the eponymous Fisher showed that you would generally get a *chi squared distribution* of probabilities. If the document does belong to the class, you would expect the probabilities to be skewed towards higher values instead. A Fisher classifier uses Fisher's statistical method (a p-value) to measure the degree to which the features in the document diverge from the expected random distribution.
+
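+For the curious, here's a rough sketch of the math (an illustration only, not the gem's internals). Fisher's method folds the per-feature probabilities into a single chi squared statistic and asks how far it strays from chance:
+
+```ruby
+# chi squared survival function for even degrees of freedom,
+# computed with the closed-form series
+def chi2_sf(chi, df)
+  m = chi / 2.0
+  term = Math.exp(-m)
+  sum = term
+  (1...(df / 2)).each do |k|
+    term *= m / k
+    sum += term
+  end
+  [sum, 1.0].min
+end
+
+# under the null hypothesis (document not in the class),
+# -2 * sum of log probabilities is chi squared distributed
+# with 2n degrees of freedom
+def fisher_combine(probs)
+  chi = -2.0 * probs.sum { |p| Math.log(p) }
+  chi2_sf(chi, 2 * probs.size)
+end
+
+fisher_combine([0.9, 0.8, 0.95])  #=> ~0.99 - skewed high, looks like a match
+fisher_combine([0.3, 0.5, 0.4])   #=> ~0.47 - consistent with chance
+```
+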
+## I don't care, I just want to use it!
+
+Fortunately the interface is pretty consistent:
+
+```ruby
+classifier = Yanbi::Fisher.default(:even, :odd)
+classifier.train_raw(:even, "two four six eight")
+classifier.train_raw(:odd, "one three five seven")
+
+classifier.classify_raw("one two three") #=> :odd
+```
+
+See? Easy.
+
## Bags (of words)
-A bag of words is a just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural location for various kinds of pre-processing you might want to do to the words (features) of the text before training with or classifying them.
+A bag of words is just a Hash of word counts (a multi-set of word frequencies, to ML folk). This makes a useful abstraction because you can use it with more than one kind of classifier, and because the bag provides a natural place for the various kinds of pre-processing you might want to do to the words (features) of a text before training on or classifying them. Although a single bag can contain as many documents as you want, in practice it's a good idea to treat each bag as corresponding to a single document.
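+To see just how simple the idea is, here's a tiny illustrative sketch (plain Ruby, not the gem's internals) of counting words into a Hash:
+
+```ruby
+text = "one two two three three three"
+counts = Hash.new(0)
+text.downcase.scan(/[a-z']+/).each { |word| counts[word] += 1 }
+counts  #=> {"one"=>1, "two"=>2, "three"=>3}
+```
+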
A handful of classes are provided:
<ul>
<li>WordBag - basic, default bag of words</li>
@@ -161,10 +179,45 @@
docs.each_doc do |d|
d.remove(STOP_WORDS)
end
```
+## Feature thresholds
+
+A method on the classifier is provided to prune infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real-world applications. Note that when you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
+
+```ruby
+classifier = Yanbi.default(:even, :odd)
+
+#...tons of training happens here...
+
+#we now have thousands of documents. Ignore any words we haven't
+#seen at least a dozen times
+
+classifier.set_significance(12)
+
+#actually, the 'odd' category is especially noisy, so let's make
+#that two dozen for odd items
+
+classifier.set_significance(24, :odd)
+```
+
+## Persisting
+
+After going to all of the trouble of training a classifier on a large corpus, it would be very useful to save it to disk for later use. You can do just that with the appropriately named save and load functions:
+
+```ruby
+classifier.save('testclassifier')
+
+#...some time later
+
+newclassifier = Yanbi::Bayes.load('testclassifier')
+```
+
+Note that a .obj extension is added to saved classifiers by default - no need to explicitly include it.
+
## Putting it all together
```ruby
classifier = Yanbi.default(:stuff, :otherstuff)
@@ -174,14 +227,46 @@
other = Yanbi::Corpus.new
other.add_file('biglistofotherstuff.txt', '@@@@')
stuff.each_doc {|d| classifier.train(:stuff, d)}
other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#...classify all the things....
```
+A slightly fancier example:
+
+```ruby
+STOP_WORDS = %w(in the a and at of)
+
+#classify using stemmed words
+classifier = Yanbi::Bayes.new(Yanbi::StemmedWordBag, :stuff, :otherstuff)
+
+#create our corpora
+stuff = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+stuff.add_file('biglistofstuff.txt', '****')
+
+other = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+other.add_file('biglistofotherstuff.txt', '@@@@')
+
+#get rid of those nasty stop words
+stuff.each_doc {|d| d.remove(STOP_WORDS)}
+other.each_doc {|d| d.remove(STOP_WORDS)}
+
+#train away!
+stuff.each_doc {|d| classifier.train(:stuff, d)}
+other.each_doc {|d| classifier.train(:otherstuff, d)}
+
+#get rid of the long tail
+classifier.set_significance(50)
+
+#...classify all the things....
+```
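+
+Once trained, classifying new documents follows the same pattern (a sketch - newstuff.txt is a made-up file, and a classify method taking a bag is assumed here, mirroring the train/train_raw split):
+
+```ruby
+unseen = Yanbi::Corpus.new(Yanbi::StemmedWordBag)
+unseen.add_file('newstuff.txt', '****')
+
+unseen.each_doc do |d|
+  d.remove(STOP_WORDS)
+  puts classifier.classify(d)  #=> :stuff or :otherstuff
+end
+```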
+
## Contributing
-Bug reports and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
+Bug reports, corrections of any tragic mathematical misunderstandings, and pull requests are welcome on GitHub at https://github.com/rdormer/yanbi-ml.
## License
The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).