Sha256: 6a67141076017c8791b5d754a8166639a97994f9a28dd2c61b7a2dc9aadf33c9

Contents?: true

Size: 971 Bytes

Versions: 1

Stored size: 971 Bytes

Contents

A Ruby port of [TinySegmenter.js](http://chasen.org/~taku/software/TinySegmenter/), a compact tokenizer for Japanese text. Requires Ruby 1.9 or higher.

[![Build Status](https://secure.travis-ci.org/6/tiny_segmenter.png?branch=master)](http://travis-ci.org/6/tiny_segmenter)

### Install

Run `gem install tiny_segmenter`, or add `gem "tiny_segmenter"` to your `Gemfile`.

### Usage

```ruby
require "tiny_segmenter"

# Split a UTF-8 Japanese string into an array of tokens.
ts = TinySegmenter.new
p ts.segment("今晩は!良い天気ですね")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね"]
```

Input text should be UTF-8 encoded.
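If your text arrives in another encoding (for example Shift_JIS), one option is to transcode it before segmenting. The following is a minimal sketch using Ruby's built-in `String#encode`; the `to_utf8` helper is a hypothetical name and is not part of `tiny_segmenter`.

```ruby
# encoding: utf-8

# Hypothetical helper (not provided by tiny_segmenter): return a UTF-8 copy
# of the text, transcoding from whatever encoding the string is tagged with.
# Assumes the string's encoding tag is correct; untagged binary data would
# raise an Encoding error here.
def to_utf8(text)
  return text if text.encoding == Encoding::UTF_8
  text.encode(Encoding::UTF_8)
end

# Simulate input that was read as Shift_JIS.
sjis_text = "今晩は".encode("Shift_JIS")
utf8_text = to_utf8(sjis_text)

# The result can then be passed to the segmenter, e.g.:
#   TinySegmenter.new.segment(utf8_text)
```

When reading from a file, passing the external encoding to `File.read` (for example `File.read(path, encoding: "Shift_JIS:UTF-8")`) does the same conversion at load time.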

### How it works

The Naive Bayes model was trained on the [RWCP corpus](http://research.nii.ac.jp/src/list.html) and optimized using L1-norm regularization (see, for example, [this paper on scalable training of L1-regularized models](https://research.microsoft.com/pubs/78900/andrew07scalable.pdf)). The resulting model is quite compact, yet according to the original author it achieves about 95% accuracy.
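As a rough intuition for how a compact character-level segmenter can work, the sketch below scores each gap between adjacent characters with a bias plus a weight looked up from the character types on either side, and starts a new token when the score is positive. It is a toy under stated assumptions: `char_type`, `TOY_WEIGHTS`, `BIAS`, and `toy_segment` are hypothetical names, the single weight is invented, and none of this is the gem's actual feature set or trained model.

```ruby
# encoding: utf-8

# Toy illustration only: NOT the features or weights used by tiny_segmenter.

# Classify a single character by script.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else :other
  end
end

# Invented weights keyed by [type before the gap, type after the gap].
TOY_WEIGHTS = {
  [:hiragana, :kanji] => 2
}
# Invented bias: with no supporting feature, prefer not to split.
BIAS = -1

# Walk over adjacent character pairs and split where the score is positive.
def toy_segment(text)
  chars = text.chars
  return [] if chars.empty?

  words = [chars.first.dup]
  chars.each_cons(2) do |left, right|
    score = BIAS + TOY_WEIGHTS.fetch([char_type(left), char_type(right)], 0)
    if score > 0
      words << right.dup   # boundary: start a new token
    else
      words.last << right  # no boundary: extend the current token
    end
  end
  words
end

p toy_segment("良い天気です")
# => ["良い", "天気です"]
```

With a single feature the toy cannot separate 天気 from です, which is the kind of case a trained model with many more features is meant to handle.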

### License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt

Version data entries

1 entry across 1 version and 1 rubygem

Version Path
tiny_segmenter-0.0.4 README.md