Ruby port of [TinySegmenter.js](http://chasen.org/~taku/software/TinySegmenter/) for tokenizing Japanese text. Requires Ruby 1.9 or higher.

[![Build Status](https://secure.travis-ci.org/6/tiny_segmenter.png?branch=master)](http://travis-ci.org/6/tiny_segmenter)

### Install

Run `gem install tiny_segmenter`, or add `tiny_segmenter` to your `Gemfile`.

### Usage

```ruby
require "tiny_segmenter"

ts = TinySegmenter.new
p ts.segment("今晩は!良い天気ですね")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね"]
```

Input text should be UTF-8 encoded.

### How it works

The Naive Bayes model was trained on the [RWCP corpus](http://research.nii.ac.jp/src/list.html) and optimized using L1-norm regularization (see, e.g., [this paper](https://research.microsoft.com/pubs/78900/andrew07scalable.pdf)). The resulting model is quite compact, yet (according to the author) achieves about 95% accuracy.

### License

BSD. See http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt
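
### Illustration: scoring a boundary

To make the idea under "How it works" more concrete, here is a minimal sketch of how a compact, weight-based segmenter can decide where to split text. It is not the gem's implementation: the feature set (only the script type of the characters on either side of a gap), the `TOY_WEIGHTS` table, and the names `char_type` and `toy_segment` are all invented for this example. The general idea it shows is that each gap between two characters gets a score from a small lookup table and a boundary is placed wherever the score is positive. L1-norm regularization tends to push many weights to exactly zero, which is one reason such a table can stay small.

```ruby
# encoding: utf-8
# Toy illustration only. The weights below are made up for this example
# and are NOT the trained model shipped with tiny_segmenter.

# Classify a single character into a coarse script type.
def char_type(ch)
  case ch
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  when /\p{Han}/      then :kanji
  else                     :other
  end
end

# Hypothetical weights keyed by [type of previous char, type of next char].
# A positive score means "split here"; a negative score means "keep together".
TOY_WEIGHTS = {
  [:kanji, :kanji]       => -3,
  [:hiragana, :hiragana] => -1,
  [:katakana, :katakana] => -3,
  [:kanji, :hiragana]    => 2,
  [:hiragana, :kanji]    => 2
}

# Score used for any type pair missing from the table (assumed: split).
TOY_DEFAULT = 1

# Split text by scoring every gap between adjacent characters.
def toy_segment(text)
  chars = text.chars
  return [] if chars.empty?

  words = [chars.first.dup]
  chars.each_cons(2) do |prev, nxt|
    score = TOY_WEIGHTS.fetch([char_type(prev), char_type(nxt)], TOY_DEFAULT)
    if score > 0
      words << nxt.dup   # start a new word at this gap
    else
      words.last << nxt  # extend the current word
    end
  end
  words
end
```

With these invented weights, `toy_segment("今晩は")` keeps the two kanji together and splits before the hiragana. A real model uses many more features per gap and weights learned from a corpus, so its output will differ from this toy on most inputs.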
Version data entries
1 entries across 1 versions & 1 rubygems
Version | Path |
---|---|
tiny_segmenter-0.0.4 | README.md |