Ruby port of [TinySegmenter.js](http://chasen.org/~taku/software/TinySegmenter/) for tokenizing Japanese text. Requires Ruby 1.9 or higher.

[![Build Status](https://secure.travis-ci.org/6/tiny_segmenter.png?branch=master)](http://travis-ci.org/6/tiny_segmenter)

### Install

Run `gem install tiny_segmenter`, or add `tiny_segmenter` to your `Gemfile`.

### Usage

```ruby
require "tiny_segmenter"

ts = TinySegmenter.new
p ts.segment("今晩は!良い天気ですね")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね"]
```

Input text must be UTF-8 encoded.

### How it works

The Naive Bayes model was trained on the [RWCP corpus](http://research.nii.ac.jp/src/list.html) and optimized with L1-norm regularization (see, for example, [this paper](https://research.microsoft.com/pubs/78900/andrew07scalable.pdf)). The resulting model is quite compact, yet according to the original author it is about 95% accurate.

### License

BSD. See http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt
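
### Appendix: toy sketch of boundary scoring

To make the "How it works" description more concrete, here is a rough, self-contained sketch of the general idea behind a compact statistical segmenter: for each gap between two characters, add up the weights of the features that fire, and split when the total is positive. This is not the gem's code. The names `toy_segment` and `char_type`, the two feature kinds, and every weight below are invented for illustration. The real model uses many more features, with weights learned from the corpus.

```ruby
# encoding: utf-8

# Hypothetical weights, made up for this example (NOT the trained model).
TOY_BIAS = -1.0
TOY_TYPE_WEIGHTS = {
  [:hiragana, :kanji] => 2.0, # hiragana followed by kanji often starts a word
  [:kanji, :hiragana] => 0.5  # kanji followed by hiragana is often okurigana
}
TOY_RIGHT_CHAR_WEIGHTS = {
  "で" => 1.0 # pretend this character tends to start a new word
}

# Classify a single character by script.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else                     :other
  end
end

# Split text wherever the summed feature weights exceed zero.
def toy_segment(text)
  chars = text.chars.to_a
  return [] if chars.empty?

  words = [chars.first]
  chars.each_cons(2) do |left, right|
    score = TOY_BIAS
    score += TOY_TYPE_WEIGHTS.fetch([char_type(left), char_type(right)], 0.0)
    score += TOY_RIGHT_CHAR_WEIGHTS.fetch(right, 0.0)

    if score > 0
      words << right           # boundary: start a new word
    else
      words[-1] += right       # no boundary: extend the current word
    end
  end
  words
end

p toy_segment("良い天気です")
# => ["良い", "天気", "です"]
```

Because a model like this is only a table of weights plus a bias, it stays small, which fits the compactness noted above. For real text, use `TinySegmenter#segment` rather than this toy.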