README.md in yanbi-ml-0.2.4 vs README.md in yanbi-ml-0.3.0

- old
+ new

@@ -179,9 +179,36 @@
 docs.each_doc do |d|
   d.remove(STOP_WORDS)
 end
 ```
 
+There's also a single, global bag of words that contains all of the words seen in every document in the corpus. This is accessed (unsurprisingly) through the 'all' attribute.
+
+```ruby
+# Non-unique global list of words
+docs.all.words
+
+# Unique global list of words
+docs.all.words.uniq
+```
+
+Note that this global word bag is updated whenever you remove words by iterating through documents with each_doc.
+
+
+## Dictionaries
+
+Speaking of a global list of words, the corpus class also allows you to capture a snapshot of the unique list of words in a set of documents as a dictionary object. This object can then be used to encode strings as integer arrays of indices:
+
+```ruby
+my_dictionary = docs.to_index
+
+# Get an integer mapping of the words in this string
+indices = my_dictionary.to_idx('the quick brown fox')
+```
+
+Words not present in the dictionary are returned as nil. This is useful for working with other types of classifiers that might not be capable of accepting straight text.
+
+
 ## Feature thresholds
 
 A method on the classifier is provided to prune infrequently seen features. This is often one of the first things recommended for improving the accuracy of a classifier in real world applications. Note that when you prune features, there's no un-pruning afterwards - so be sure you actually want to do it!
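To make the dictionary encoding from the Dictionaries section above concrete, here is a minimal plain-Ruby sketch of the idea. It does not use the gem: `SketchDictionary` is a hypothetical stand-in written for illustration, and the whitespace tokenization and first-seen index order are assumptions rather than a description of how yanbi-ml builds its dictionary.

```ruby
# Hypothetical stand-in for illustration only; not yanbi-ml's implementation.
class SketchDictionary
  attr_reader :index

  # Map each unique word to an integer index, in first-seen order (an assumption).
  def initialize(words)
    @index = {}
    words.uniq.each_with_index { |word, i| @index[word] = i }
  end

  # Encode a string as an array of indices. Hash lookups for words that are
  # not in the dictionary return nil, so unknown words become nil entries.
  def to_idx(text)
    text.split.map { |word| @index[word] }
  end
end

corpus_words = %w[the quick brown fox the lazy dog]
dictionary = SketchDictionary.new(corpus_words)

# 'red' never appeared in the corpus, so its slot is nil
indices = dictionary.to_idx('the quick red fox')

# Drop the nils if the downstream classifier cannot handle them
known_only = indices.compact
```

If the downstream classifier cannot accept nil entries, dropping them with `compact` (as in the last line) or mapping them to a reserved "unknown" index are two common options. Which one suits a given model is a design choice the README does not prescribe.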