# tomoto.rb

:tomato: [tomoto](https://github.com/bab2min/tomotopy) - high performance topic modeling - for Ruby

[![Build Status](https://github.com/ankane/tomoto-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/tomoto-ruby/actions)

## Installation

Add this line to your application’s Gemfile:

```ruby
gem "tomoto"
```

## Getting Started

Train a model

```ruby
model = Tomoto::LDA.new(k: 2)
model.add_doc("text from document one")
model.add_doc("text from document two")
model.add_doc("text from document three")
model.train(100) # iterations
```

Get the summary

```ruby
model.summary
```

Get topic words

```ruby
model.topic_words
```

Save the model to a file

```ruby
model.save("model.bin")
```

Load the model from a file

```ruby
model = Tomoto::LDA.load("model.bin")
```

Get topic probabilities for a document

```ruby
doc = model.docs[0]
doc.topics
```

Get the number of words for each topic

```ruby
model.count_by_topics
```

Get the vocab

```ruby
model.vocabs
```

Get the log likelihood per word

```ruby
model.ll_per_word
```

Perform inference for unseen documents

```ruby
doc = model.make_doc("unseen doc")
topic_dist, ll = model.infer(doc)
```

## Models

Supports:

- Latent Dirichlet Allocation (`LDA`)
- Labeled LDA (`LLDA`)
- Partially Labeled LDA (`PLDA`)
- Supervised LDA (`SLDA`)
- Dirichlet Multinomial Regression (`DMR`)
- Generalized Dirichlet Multinomial Regression (`GDMR`)
- Hierarchical Dirichlet Process (`HDP`)
- Hierarchical LDA (`HLDA`)
- Multi Grain LDA (`MGLDA`)
- Pachinko Allocation (`PA`)
- Hierarchical PA (`HPA`)
- Correlated Topic Model (`CT`)
- Dynamic Topic Model (`DT`)

## API

This library follows the [tomotopy API](https://bab2min.github.io/tomotopy/v0.9.0/en/).
There are a few changes to make it more Ruby-like:

- The `get_` prefix has been removed from methods (`topic_words` instead of `get_topic_words`)
- Methods that return booleans use `?` instead of `is_` (`live_topic?` instead of `is_live_topic`)

If a method or option you need isn’t supported, feel free to open an issue.

## Examples

- [LDA](examples/lda_basic.rb)
- [HDP](examples/hdp_basic.rb)

## Tokenization

Documents are tokenized by whitespace by default, or you can perform your own tokenization.

```ruby
model.add_doc(["tokens", "from", "document", "one"])
```

## Performance

tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support it. Check which instruction set architecture it’s using with:

```ruby
Tomoto.isa
```

## Parallelism

Choose a [parallelism algorithm](https://bab2min.github.io/tomotopy/v0.9.0/en/#parallel-sampling-algorithms) with:

```ruby
model.train(parallel: :partition)
```

Supported values are `:default`, `:none`, `:copy_merge`, and `:partition`.

## History

View the [changelog](https://github.com/ankane/tomoto-ruby/blob/master/CHANGELOG.md)

## Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- [Report bugs](https://github.com/ankane/tomoto-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/tomoto-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

```sh
git clone --recursive https://github.com/ankane/tomoto-ruby.git
cd tomoto-ruby
bundle install
bundle exec rake compile
bundle exec rake test
```
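
## Custom Tokenization Sketch

The Tokenization section notes that you can pass pre-tokenized documents to `add_doc` as an array. As a rough illustration of what "your own tokenization" might look like, here is a minimal sketch in plain Ruby. The `tokenize` helper is hypothetical and not part of tomoto; it simply lowercases the text, replaces punctuation with spaces, and splits on whitespace. Real projects will likely want stop-word removal or stemming on top of this.

```ruby
# Hypothetical helper (not provided by tomoto): lowercases the text,
# replaces anything that is not a letter, digit, or whitespace with a space,
# then splits on whitespace.
def tokenize(text)
  text.downcase.gsub(/[^\p{Alnum}\s]/, " ").split
end

tokens = tokenize("Text from Document One!")

# The resulting array can be passed to a model in place of a raw string,
# as shown in the Tokenization section:
#   model.add_doc(tokens)
```

Tokenizing up front like this keeps preprocessing decisions in your own code and out of the model, so the same helper can be reused for both training documents and unseen documents.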