Efficient Pure Ruby Unicode Normalization (eprun)
=================================================

(pronounced e-prune)

The Talk
--------

Please see the
[Internationalization & Unicode Conference 37](http://www.unicodeconference.org/)
talk on
[Implementing Normalization in Pure Ruby - the Fast and Easy Way](http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm/).

Directories and Files
---------------------

*   lib/normalize.rb: The core normalization code.
*   lib/string_normalize.rb: String#normalize.
*   lib/generate.rb: Generation script, generates lib/normalize_tables.rb
    from data/UnicodeData.txt and data/CompositionExclusions.txt.
    This needs to be run only once when updating to a new Unicode version.
*   lib/normalize_tables.rb: Data used for normalization,
    automatically generated by lib/generate.rb.
*   data/: All three files in this directory are downloaded from the
    [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/ucd/).
    They are currently at Unicode version 6.3 and need to be updated for each
    newer Unicode version (a new version is released about once a year).
*   test/test_normalize.rb: Tests for lib/string_normalize.rb,
    using data/NormalizationTest.txt.
*   benchmark/benchmark.rb: Runs the benchmark with example text files.
    Automatically checks for existing gems/libraries; if, e.g., the unicode_utils
    gem is not available, that part of the benchmark is skipped.
    This also applies to eprun itself, which is not run on Ruby 1.8.
*   benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt:
    example texts extracted from random Wikipedia pages
    (see http://en.wikipedia.org/wiki/Wikipedia:Random).
    The languages are chosen based on the number of characters affected
    by normalization (Deutsch < Japanese < Vietnamese < Korean).
    These files have somewhat differing lengths,
    so the results cannot directly be compared across languages.
    Adding other files whose names end in "_.txt" will include them in
    the benchmark.
*   benchmark/benchmark_results.rb:
    Results of benchmark for eprun, unicode_utils,
    ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem.
    The eprun, unicode_utils, and unicode gem normalizations are run 100 times
    each, ActiveSupport::Multibyte is run 10 times, and
    twitter_cldr is run only once (more runs would have taken too long).
*   benchmark/benchmark_results_jruby.txt:
    Results of the benchmark when using JRuby (excludes the unicode gem),
    version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
*   benchmark/benchmark.pl: Runs the benchmark using Perl, both with
    xsub (i.e. C) version (run 100 times) and pure Perl version
    (run 10 times).
*   benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
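
The way the benchmark skips unavailable libraries and picks up input files by the "_.txt" naming convention could be sketched roughly as follows. This is not the code from benchmark/benchmark.rb: the helper names `library_available?` and `benchmark_files` are hypothetical, and the library names in the example are placeholders chosen so that the sketch runs without any gems installed.

```ruby
# Hypothetical helper (not from benchmark.rb): try to load a library and
# report whether it is available, instead of aborting on LoadError.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Hypothetical helper: collect benchmark inputs by naming convention,
# i.e. every file in +dir+ whose name ends in "_.txt".
def benchmark_files(dir)
  Dir.glob(File.join(dir, '*_.txt')).sort
end

# Placeholder library names: 'set' ships with Ruby, the second one is
# deliberately bogus so that it ends up in the skipped group.
candidates = ['set', 'no_such_normalization_library']
available, skipped = candidates.partition { |name| library_available?(name) }

puts "would benchmark: #{available.join(', ')}"
puts "would skip:      #{skipped.join(', ')}"
puts "input files:     #{benchmark_files('benchmark').join(', ')}"
```

With this approach, a missing gem only removes its own part of the run, and dropping a new file such as `Francais_.txt` (a made-up name) into the directory would be enough to include it.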

TODOs and Ideas
---------------
*   Publish as a gem, or several gems.
*   Deal better with encodings other than UTF-8.
*   Add methods such as String#nfc, String#nfd,...
*   Add methods for normalization variants.
*   See [talk](http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm/) for more.
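
As a rough illustration of the String#nfc, String#nfd, ... idea above, the following sketch defines such shortcut methods. It is an assumption about what they might look like, not part of eprun: to stay self-contained it delegates to Ruby's built-in `String#unicode_normalize` (available since Ruby 2.2) rather than to lib/string_normalize.rb, whose exact signature is not described here.

```ruby
# Hypothetical shortcut methods for the four normalization forms.
# They delegate to Ruby's built-in String#unicode_normalize (Ruby 2.2+),
# NOT to eprun's own String#normalize.
class String
  def nfc
    unicode_normalize(:nfc)
  end

  def nfd
    unicode_normalize(:nfd)
  end

  def nfkc
    unicode_normalize(:nfkc)
  end

  def nfkd
    unicode_normalize(:nfkd)
  end
end

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed   = "\u00E9"    # precomposed LATIN SMALL LETTER E WITH ACUTE

puts decomposed.nfc == composed    # canonical composition
puts composed.nfd == decomposed    # canonical decomposition
puts "\uFB01".nfkc == "fi"         # compatibility mapping of the "fi" ligature
```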