Efficient Pure Ruby Unicode Normalization (eprun)
=================================================

(pronounced e-prune)

The Talk
--------

Please see the [Internationalization & Unicode Conference 37](http://www.unicodeconference.org/)
talk on [Implementing Normalization in Pure Ruby - the Fast and Easy Way](http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm/).

Directories and Files
---------------------

* lib/normalize.rb: The core normalization code.
* lib/string_normalize.rb: String#normalize.
* lib/generate.rb: Generation script; generates lib/normalize_tables.rb from
  data/UnicodeData.txt and data/CompositionExclusions.txt. This needs to be run
  only once when updating to a new Unicode version.
* lib/normalize_tables.rb: Data used for normalization, automatically generated
  by lib/generate.rb.
* data/: All three files in this directory are downloaded from the
  [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/ucd/).
  They are currently at Unicode version 6.3. They need to be updated for each
  newer Unicode version (a new version appears about once a year).
* test/test_normalize.rb: Tests for lib/string_normalize.rb, using
  data/NormalizationTest.txt.
* benchmark/benchmark.rb: Runs the benchmark with example text files.
  Automatically checks for existing gems/libraries; if e.g. the unicode_utils gem
  is not available, that part of the benchmark is skipped. This also applies to
  eprun, which will not be run on Ruby 1.8.
* benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt:
  Example texts extracted from random Wikipedia pages
  (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages are chosen
  based on the number of characters affected by normalization
  (Deutsch < Japanese < Vietnamese < Korean). These files have somewhat differing
  lengths, so the results cannot be compared directly across languages.
  Adding other files with names ending in "_.txt" will include them in the benchmark.
* benchmark/benchmark_results.rb: Results of the benchmark for eprun, unicode_utils,
  ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem.
  The eprun, unicode_utils, and unicode normalizations are run 100 times each,
  ActiveSupport::Multibyte is run 10 times each, and twitter_cldr is run
  only once (further runs would have taken too long).
* benchmark/benchmark_results_jruby.txt: Results of the benchmark when using JRuby
  (excludes the unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on
  Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
* benchmark/benchmark.pl: Runs the benchmark using Perl, both with the xsub (i.e. C)
  version (run 100 times) and the pure Perl version (run 10 times).
* benchmark/benchmark_results_pl.txt: Results of the Perl benchmarks.

TODOs and Ideas
---------------

* Publish as a gem, or several gems.
* Deal better with encodings other than UTF-8.
* Add methods such as String#nfc, String#nfd,...
* Add methods for normalization variants.
* See [talk](http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm/) for more.
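
The idea of adding methods such as String#nfc and String#nfd could look roughly like the sketch below. This is not eprun code: to stay self-contained, it assumes Ruby 2.2 or later and delegates to the built-in `String#unicode_normalize` rather than to eprun's String#normalize, whose exact signature is not shown in this README. The module name `NormalizationShortcuts` is hypothetical.

```ruby
# Hypothetical shortcut methods for the four Unicode normalization forms.
# Assumption: Ruby >= 2.2, where String#unicode_normalize is built in.
module NormalizationShortcuts
  %i[nfc nfd nfkc nfkd].each do |form|
    # Defines String#nfc, String#nfd, String#nfkc, and String#nfkd.
    define_method(form) { unicode_normalize(form) }
  end
end

String.include(NormalizationShortcuts)

composed   = "\u00E9"   # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

puts composed.nfd == decomposed  # prints "true"
puts decomposed.nfc == composed  # prints "true"
puts "\uFB01".nfkc == "fi"       # prints "true" (the "fi" ligature is a compatibility character)
```

Generating the four methods from one list keeps them in sync; swapping the delegate for eprun's own normalization call would be the only change needed to try the same idea against this library.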