# README ## Name groonga-normalizer-mysql ## Description Groonga-normalizer-mysql is a Groonga plugin. It provides MySQL compatible normalizers and a custom normalizers to Groonga. Here are MySQL compatible normalizers: * `NormalizerMySQLGeneralCI` for `utf8mb4_general_ci` * `NormalizerMySQLUnicodeCI` for `utf8mb4_unicode_ci` * `NormalizerMySQLUnicode520CI` for `utf8mb4_unicode_520_ci` * `NormalizerMySQLUnicode900` for `utf8mb4_900_ai_ci`, `utf8mb4_900_as_ci`, `utf8mb4_900_as_cs`, `utf8mb4_ja_900_as_cs` and `utf8mb4_ja_900_as_cs_ks`. Here are custom normalizers: * `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` * It's based on `NormalizerMySQLUnicodeCI` * `NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark` * It's based on `NormalizerMySQLUnicode520CI` They are self-descriptive name but long. They are variant normalizers of `NormalizerMySQLUnicodeCI` and `NormalizerMySQLUnicode520CI`. They have different behaviors. The followings are the different behaviors. They describes with `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` but they are true for `NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark`. * `NormalizerMySQLUnicodeCI` normalizes all small Hiragana such as `ぁ`, `っ` to Hiragana such as `あ`, `つ`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't normalize `ぁ` to `あ` nor `っ` to `つ`. `ぁ` and `あ` are different characters. `っ` and `つ` are also different characters. This behavior is described by `ExceptKanaCI` in the long name. This following behaviors ared described by `ExceptKanaWithVoicedSoundMark` in the long name. * `NormalizerMySQLUnicode` normalizes all Hiragana with voiced sound mark such as `が` to Hiragana without voiced sound mark such as `か`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't normalize `が` to `か`. `が` and `か` are different characters. * `NormalizerMySQLUnicode` normalizes all Hiragana with semi-voiced sound mark such as `ぱ` to Hiragana without semi-voiced sound mark such as `は`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't normalize `ぱ` to `は`. `ぱ` and `は` are different characters. * `NormalizerMySQLUnicode` normalizes all Katakana with voiced sound mark such as `ガ` to Katakana without voiced sound mark such as `カ`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't normalize `ガ` to `カ`. `ガ` and `カ` are different characters. * `NormalizerMySQLUnicode` normalizes all Katakana with semi-voiced sound mark such as `パ` to Hiragana without semi-voiced sound mark such as `ハ`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't normalize `パ` to `ハ`. `パ` and `ハ` are different characters. * `NormalizerMySQLUnicode` normalizes all halfwidth Katakana with voiced sound mark such as `ガ` to halfwidth Katakana without voiced sound mark such as `カ`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` normalizes all halfwidth Katakana with voided sound mark such as `ガ` to fullwidth Katakana with voiced sound mark such as `ガ`. `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` and `NormalizerMySQLUnicode520CIExceptKanaCIKanaWithVoicedSoundMark` and are MySQL incompatible normalizers but they are useful for Japanese text. For example, `ふらつく` and `ブラック` has different means. `NormalizerMySQLUnicodeCI` identifies `ふらつく` with `ブラック` but `NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark` doesn't identify them. ## Install ### Debian GNU/Linux [Add apt-line for the Groonga deb package repository](http://groonga.org/docs/install/debian.html) and install `groonga-normalizer-mysql` package: % sudo apt-get -y install groonga-normalizer-mysql ### Ubuntu [Add apt-line for the Groonga deb package repository](http://groonga.org/docs/install/ubuntu.html) and install `groonga-normalizer-mysql` package: % sudo apt-get -y install groonga-normalizer-mysql ### CentOS Install `groonga-repository` package: % sudo rpm -ivh https://packages.groonga.org/centos/groonga-release-latest.noarch.rpm % sudo yum makecache Then install `groonga-normalizer-mysql` package: % sudo yum install -y groonga-normalizer-mysql ### macOS - Homebrew Install `groonga-normalizer-mysql` package: % brew install groonga-normalizer-mysql ### Windows You need to build from source. Here are build instructions. #### Build system Install the following build tools: * [Microsoft Visual C++](https://visualstudio.microsoft.com/vs/features/cplusplus/) * [CMake](http://www.cmake.org/) #### Build Groonga Download the latest Groonga source from [packages.groonga.org](http://packages.groonga.org/source/groonga/). Source file name is formatted as `groonga-X.Y.Z.zip`. Extract the source and move to the source folder: > cd ...\groonga-X.Y.Z groonga-X.Y.Z> Run CMake. Here is a command line to install Groonga to `C:\groonga` folder: groonga-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga Build: groonga-X.Y.Z> cmake --build . --config Release Install: groonga-X.Y.Z> cmake --build . --config Release --target Install #### Build groonga-normalizer-mysql Download the latest groonga-normalizer-mysql source from [packages.groonga.org](http://packages.groonga.org/source/groonga-normalizer-mysql/). Source file name is formatted as `groonga-normalizer-X.Y.Z.zip`. Extract the source and move to the source folder: > cd ...\groonga-normalizer-mysql-X.Y.Z groonga-normalizer-mysql-X.Y.Z> IMPORTANT!!!: Set `PKG_CONFIG_PATH` environment variable: groonga-normalizer-mysql-X.Y.Z> set PKG_CONFIG_PATH=C:\groonga\local\lib\pkgconfig Run CMake. Here is a command line to install Groonga to `C:\groonga` folder: groonga-normalizer-mysql-X.Y.Z> cmake . -G "Visual Studio 14 Win64" -DCMAKE_INSTALL_PREFIX=C:\groonga Build: groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release Install: groonga-normalizer-mysql-X.Y.Z> cmake --build . --config Release --target Install ## Usage First, you need to register `normalizers/mysql` plugin: groonga> register normalizers/mysql Then, you can use `NormalizerMySQLGeneralCI` and `NormalizerMySQLUnicodeCI` as normalizers: groonga> table_create Lexicon TABLE_PAT_KEY --default_tokenizer TokenBigram --normalizer NormalizerMySQLGeneralCI ## Dependencies * Groonga >= 8.0.4 ## Mailing list * English: [groonga-talk@lists.sourceforge.net](https://lists.sourceforge.net/lists/listinfo/groonga-talk) * Japanese: [groonga-dev@lists.sourceforge.jp](http://lists.sourceforge.jp/mailman/listinfo/groonga-dev) ## Thanks * Alexander Barkov \: The author of `MYSQL_SOURCE/strings/ctype-utf8.c`. * ... ## Authors * Kouhei Sutou \ ## License LGPLv2 only. See doc/text/lgpl-2.0.txt for details. This program uses normalization table defined in MySQL source code. So this program is derived work of `MYSQL_SOURCE/strings/ctype-utf8.c`, `MYSQL_SOURCE/strings/uca900_data.h`, `MYSQL_SOURCE/strings/uca900_ja_data.h`. This program is the same license as them and they are licensed under LGPLv2 only.