analects.rb
===========

[![Gem Version](https://badge.fury.io/rb/analects.png)][gem]
[![Build Status](https://secure.travis-ci.org/plexus/analects.png?branch=master)][travis]
[![Dependency Status](https://gemnasium.com/plexus/analects.png)][gemnasium]
[![Code Climate](https://codeclimate.com/github/plexus/analects.png)][codeclimate]

[gem]: https://rubygems.org/gems/analects
[travis]: https://travis-ci.org/plexus/analects
[gemnasium]: https://gemnasium.com/plexus/analects
[codeclimate]: https://codeclimate.com/github/plexus/analects

Public datasets on the Chinese language, accessible from Ruby

## Download the data

With Rake

```ruby
# Rakefile
require 'analects/rake_tasks'

Analects.init_rake_tasks do
  data_dir '/tmp/analects' # defaults to ~/.analects

  task :import_cedict do
    library.cedict.each do |entry|
      # ..
    end
  end
end
```

```sh
rake analects:download:all        # download all sources
rake analects:download:cedict     # download CC-CEDICT
rake analects:download:chise_ids  # download Chise-IDS
rake analects:download:hsk        # download HSK data
rake analects:download:unihan     # download Unihan database
```

Or from Ruby

```ruby
analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.retrieve
analects.chise_ids.retrieve
```

## Use the data

```ruby
analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.take(3)
# => [["AA制", "AA制", "A A zhi4", "/to split the bill/to go Dutch/"], ["A咖", "A咖", "A ka1", "/class \"A\"/top grade/"], ["A片", "A片", "A pian4", "/adult movie/pornography/"]]

analects.chise_ids.to_a.sample(3)
# [["U+59BF", "妿", "⿱加女"], ["U-0002441B", "𤐛", "⿰火閙"], ["U+83A1", "莡", "⿱艹足"]]
```

## Other stuff

Analects wraps RMMSeg for easy segmenting of Chinese text

```ruby
Analects::Tokenizer.new.tokenize("为待那个朋友拿哟出来，咿呀噢哎…")
# => ["为", "待", "那个", "朋友", "拿", "哟", "出来", "，", "咿", "呀", "噢", "哎", "…"]
```

If you have Chinese text in GB or BIG5 encoding, you can do stuff like this

```ruby
Analects::Encoding.valid_cjk(str)
Analects::Encoding.from_gb(str)   # returns UTF-8
Analects::Encoding.from_big5(str) # returns UTF-8
```

## License

Copyright ⓒ Arne Brasseur 2012-2014

Licensed as GPL-v3