# Bling Fire Ruby

[Bling Fire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby

[![Build Status](https://github.com/ankane/blingfire-ruby/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/blingfire-ruby/actions)

## Installation

Add this line to your application’s Gemfile:

```ruby
gem "blingfire"
```

## Getting Started

Create a model

```ruby
model = BlingFire::Model.new
```

Tokenize words

```ruby
model.text_to_words(text)
```

Tokenize sentences

```ruby
model.text_to_sentences(text)
```

Get offsets for words

```ruby
words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)
```

Get offsets for sentences

```ruby
sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
```

## Pre-trained Models

Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:

- [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin), [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin), [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin), [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
- [GPT-2](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/gpt2.bin)
- [Laser 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser100k.bin), [Laser 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser250k.bin), [Laser 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser500k.bin)
- [RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/roberta.bin)
- [Syllab](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/syllab.bin)
- [URI 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri100k.bin), [URI 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri250k.bin), [URI 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri500k.bin)
- [XLM-RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlm_roberta_base.bin)
- [XLNet](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet.bin), [XLNet No Norm](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet_nonorm.bin)
- [WBD](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/wbd_chuni.bin)

Load a model

```ruby
model = BlingFire.load_model("bert_base_tok.bin")
```

Convert text to ids

```ruby
model.text_to_ids(text)
```

Get offsets for ids

```ruby
ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
```

Disable prefix space

```ruby
model = BlingFire.load_model("roberta.bin", prefix: false)
```

## Ids to Text

Load a model

```ruby
model = BlingFire.load_model("bert_base_tok.i2w")
```

Convert ids to text

```ruby
model.ids_to_text(ids)
```
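For reference, here is a minimal end-to-end sketch combining the calls above. It assumes the `bert_base_tok.bin` and `bert_base_tok.i2w` models have already been downloaded to the working directory, and the sample text is only illustrative.

```ruby
require "blingfire"

# default model (NLTK-style word and sentence tokenization)
model = BlingFire::Model.new

text = "Bling Fire is fast. It works with Ruby."

words = model.text_to_words(text)          # array of word tokens
sentences = model.text_to_sentences(text)  # array of sentences

# pre-trained BERT tokenizer: text to token ids
# (assumes bert_base_tok.bin is in the working directory)
tok = BlingFire.load_model("bert_base_tok.bin")
ids = tok.text_to_ids(text)                # array of integer ids

# reverse model: token ids back to text
# (assumes bert_base_tok.i2w is in the working directory)
i2w = BlingFire.load_model("bert_base_tok.i2w")
i2w.ids_to_text(ids)                       # roughly the original text
```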
## History

View the [changelog](https://github.com/ankane/blingfire-ruby/blob/master/CHANGELOG.md)

## Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- [Report bugs](https://github.com/ankane/blingfire-ruby/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/blingfire-ruby/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

```sh
git clone https://github.com/ankane/blingfire-ruby.git
cd blingfire-ruby
bundle install
bundle exec rake vendor:all download:models
bundle exec rake test
```