# fastText Ruby

[fastText](https://fasttext.cc) - efficient text classification and representation learning - for Ruby

[![Build Status](https://github.com/ankane/fastText-ruby/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/fastText-ruby/actions)

## Installation

Add this line to your application’s Gemfile:

```ruby
gem "fasttext"
```

## Getting Started

fastText has two primary use cases:

- [text classification](#text-classification)
- [word representations](#word-representations)

## Text Classification

Prep your data

```ruby
# documents
x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

# labels
y = ["ham", "ham", "spam"]
```

> Use an array if a document has multiple labels

Train a model

```ruby
model = FastText::Classifier.new
model.fit(x, y)
```

Get predictions

```ruby
model.predict(x)
```

Save the model to a file

```ruby
model.save_model("model.bin")
```

Load the model from a file

```ruby
model = FastText.load_model("model.bin")
```

Evaluate the model

```ruby
model.test(x_test, y_test)
```

Get words and labels

```ruby
model.words
model.labels
```

> Use `include_freq: true` to get their frequency

Search for the best hyperparameters

```ruby
model.fit(x, y, autotune_set: [x_valid, y_valid])
```

Compress the model - significantly reduces size but sacrifices a little performance

```ruby
model.quantize
model.save_model("model.ftz")
```

## Word Representations

Prep your data

```ruby
x = [
  "text from document one",
  "text from document two",
  "text from document three"
]
```

Train a model

```ruby
model = FastText::Vectorizer.new
model.fit(x)
```

Get nearest neighbors

```ruby
model.nearest_neighbors("asparagus")
```

Get analogies

```ruby
model.analogies("berlin", "germany", "france")
```

Get a word vector

```ruby
model.word_vector("carrot")
```

Get a sentence vector

```ruby
model.sentence_vector("sentence text")
```

Get words

```ruby
model.words
```

Save the model to a file

```ruby
model.save_model("model.bin")
```

Load the model from a file

```ruby
model = FastText.load_model("model.bin")
```

Use continuous bag-of-words

```ruby
model = FastText::Vectorizer.new(model: "cbow")
```

## Parameters

Text classification

```ruby
FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__",  # label prefix
  verbose: 2,                 # verbose
  pretrained_vectors: nil,    # pretrained word vectors (.vec file)
  autotune_metric: "f1",      # autotune optimization metric
  autotune_predictions: 1,    # autotune predictions
  autotune_duration: 300,     # autotune search time in seconds
  autotune_model_size: nil    # autotune model size, like 2M
)
```

Word representations

```ruby
FastText::Vectorizer.new(
  model: "skipgram",          # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,                   # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 5,               # minimal number of word occurrences
  minn: 3,                    # min length of char ngram
  maxn: 6,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "ns",                 # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  verbose: 2                  # verbose
)
```

## Input Files

Input can be read directly from files

```ruby
model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")
```

Each line should be a document

```txt
text from document one
text from document two
text from document three
```

For text classification, lines should start with a list of labels prefixed with `__label__`

```txt
__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
```

## Pretrained Models

There are a number of [pretrained models](https://fasttext.cc/docs/en/supervised-models.html) you can download

### Language Identification

Download one of the [pretrained models](https://fasttext.cc/docs/en/language-identification.html) and load it

```ruby
model = FastText.load_model("lid.176.ftz")
```

Get language predictions

```ruby
model.predict("bon appétit")
```

## History

View the [changelog](https://github.com/ankane/fastText-ruby/blob/master/CHANGELOG.md)

## Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- [Report bugs](https://github.com/ankane/fastText-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/fastText-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

```sh
git clone --recursive https://github.com/ankane/fastText-ruby.git
cd fastText-ruby
bundle install
bundle exec rake compile
bundle exec rake test
```
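
## Example: Building an Input File

The labeled line format described under Input Files can be produced from the same `x` and `y` arrays used in the Text Classification section. The sketch below is plain Ruby and does not call the gem; `labeled_lines` is a hypothetical helper name, not part of the fastText Ruby API, and collapsing whitespace so each document stays on one line is an assumption about how you want to clean your text.

```ruby
# Hypothetical helper (not part of the fasttext gem): turns documents and
# labels into lines in the "__label__<name> text" format.
def labeled_lines(documents, labels, prefix: "__label__")
  unless documents.length == labels.length
    raise ArgumentError, "documents and labels must have the same length"
  end

  documents.zip(labels).map do |doc, doc_labels|
    # A document may carry a single label or an array of labels
    tags = Array(doc_labels).map { |label| "#{prefix}#{label}" }
    # Collapse newlines and repeated spaces so each document is one line
    text = doc.gsub(/\s+/, " ").strip
    (tags + [text]).join(" ")
  end
end

x = [
  "text from document one",
  "text from document two",
  "text from document three"
]
y = ["ham", "ham", "spam"]

# Write one document per line
File.write("train.txt", labeled_lines(x, y).join("\n") + "\n")
```

The resulting `train.txt` should then be usable with `model.fit("train.txt")` as shown under Input Files. If you change `label_prefix` on the classifier, pass the same value as `prefix:` here.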