groonga - An open-source fulltext search engine and column store.

7.7. Normalizers

7.7.1. Summary

Groonga has a normalizer module that normalizes text. It is used when tokenizing text and when storing table keys. For example, A and a are treated as the same character after normalization.

A normalizer module can be added as a plugin. You can customize text normalization by registering your own normalizer plugins with Groonga.

A normalizer module is attached to a table. A table can have zero or one normalizer module. You attach a normalizer module to a table with the normalizer option of table_create.

Here is an example table_create that uses the NormalizerAuto normalizer module:

Execution example:

table_create Dictionary TABLE_HASH_KEY ShortText --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]

Note

Groonga 2.0.9 or earlier doesn't have the --normalizer option in table_create. The KEY_NORMALIZE flag was used instead.

You can open an old database, that is, a database created by Groonga 2.0.9 or earlier, with Groonga 2.1.0 or later. Once you have done so, however, you can no longer open that database with Groonga 2.0.9 or earlier. When Groonga 2.1.0 or later opens the old database, the KEY_NORMALIZE flag information in it is converted to normalizer information, so Groonga 2.0.9 or earlier can no longer find the KEY_NORMALIZE flag information.

Keys of a table that has a normalizer module are normalized:

Execution example:

load --table Dictionary
[
{"_key": "Apple"},
{"_key": "black"},
{"_key": "COLOR"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Dictionary
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "_key",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "apple"
#       ],
#       [
#         2,
#         "black"
#       ],
#       [
#         3,
#         "color"
#       ]
#     ]
#   ]
# ]

The NormalizerAuto normalizer normalizes text to its downcased form. For example, "Apple" is normalized to "apple", "black" is normalized to "black" and "COLOR" is normalized to "color".
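The effect of key normalization can be sketched outside Groonga. The following Python snippet is only an approximation, assuming that NFKC plus downcasing is close enough to what NormalizerAuto does for UTF-8 text; normalize_key and load_keys are hypothetical helper names, not Groonga APIs.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Approximate NormalizerAuto for UTF-8 text: NFKC, then downcase.

    This is an assumption made for illustration, not Groonga's actual code.
    """
    return unicodedata.normalize("NFKC", text).lower()


def load_keys(keys):
    """Store each key under its normalized form and assign ids in load order."""
    table = {}
    for key in keys:
        normalized = normalize_key(key)
        if normalized not in table:
            table[normalized] = len(table) + 1
    return table


# "APPLE" normalizes to the same key as "Apple", so no fourth record is added.
dictionary = load_keys(["Apple", "black", "COLOR", "APPLE"])
print(dictionary)  # {'apple': 1, 'black': 2, 'color': 3}
```

In this sketch, keys that differ only by case collapse into a single record, which mirrors the select output above.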

If a table is a lexicon for fulltext search, its tokens are normalized, because tokens are stored as table keys and table keys are normalized as described above.
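As a rough illustration of why normalized tokens matter for searching, the sketch below builds a tiny lexicon in Python. It assumes whitespace splitting as a stand-in for a real Groonga tokenizer and NFKC plus downcasing as a stand-in for NormalizerAuto; normalize_token, build_lexicon and search are hypothetical names used only here.

```python
import unicodedata
from collections import defaultdict


def normalize_token(token: str) -> str:
    """Stand-in for a normalizer: NFKC, then downcase (an assumption)."""
    return unicodedata.normalize("NFKC", token).lower()


def build_lexicon(documents):
    """Map each normalized token (the lexicon key) to the ids of documents containing it."""
    lexicon = defaultdict(set)
    for doc_id, text in documents.items():
        # Whitespace splitting stands in for a real tokenizer.
        for token in text.split():
            lexicon[normalize_token(token)].add(doc_id)
    return lexicon


def search(lexicon, query):
    """Normalize the query the same way as the keys, then look it up."""
    return sorted(lexicon.get(normalize_token(query), set()))


documents = {1: "Apple pie", 2: "black APPLE", 3: "COLOR chart"}
lexicon = build_lexicon(documents)

print(search(lexicon, "apple"))  # [1, 2]
print(search(lexicon, "Color"))  # [3]
```

Because both the stored tokens and the query pass through the same normalization, "Apple", "APPLE" and "apple" all resolve to one lexicon key.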

7.7.2. Built-in normalizers

Here is a list of built-in normalizers:

  • NormalizerAuto
  • NormalizerNFKC51

7.7.2.1. NormalizerAuto

Normally you should use the NormalizerAuto normalizer. NormalizerAuto was the normalizer for Groonga 2.0.9 or earlier. The KEY_NORMALIZE flag in table_create on Groonga 2.0.9 or earlier is equivalent to the --normalizer NormalizerAuto option in table_create on Groonga 2.1.0 or later.

NormalizerAuto supports all encodings. It uses Unicode NFKC (Normalization Form Compatibility Composition) for UTF-8 encoded text. For other encodings it uses its own encoding-specific normalization, whose results are similar to NFKC.

For example, a half-width katakana letter (such as U+FF76 HALFWIDTH KATAKANA LETTER KA) followed by the half-width katakana voiced sound mark (U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK) is normalized to a full-width katakana letter with a voiced sound mark (U+30AC KATAKANA LETTER GA). The former is two characters but the latter is one character.
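The katakana example can be reproduced with any NFKC implementation. The snippet below uses Python's standard unicodedata module rather than Groonga itself, so it demonstrates the Unicode rule, not Groonga's own code path. The variable names are illustrative.

```python
import unicodedata

# U+FF76 HALFWIDTH KATAKANA LETTER KA + U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
half_width = "\uff76\uff9e"

# NFKC first decomposes both characters to their full-width forms,
# then composes the letter with the voiced sound mark.
normalized = unicodedata.normalize("NFKC", half_width)

print(len(half_width), [f"U+{ord(c):04X}" for c in half_width])  # 2 ['U+FF76', 'U+FF9E']
print(len(normalized), [f"U+{ord(c):04X}" for c in normalized])  # 1 ['U+30AC']
```

Note that Python applies the Unicode version bundled with the interpreter, which may differ from the version Groonga uses, although this particular mapping is the same.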

Here is an example that uses NormalizerAuto normalizer:

Execution example:

table_create NormalLexicon TABLE_HASH_KEY ShortText --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]

7.7.2.2. NormalizerNFKC51

NormalizerNFKC51 normalizes text using Unicode NFKC (Normalization Form Compatibility Composition) for Unicode version 5.1. It supports only UTF-8 encoding.

Normally you don't need to use NormalizerNFKC51 explicitly. You can use NormalizerAuto instead.

Here is an example that uses NormalizerNFKC51 normalizer:

Execution example:

table_create NFKC51Lexicon TABLE_HASH_KEY ShortText --normalizer NormalizerNFKC51
# [[0, 1337566253.89858, 0.000355720520019531], true]

7.7.3. Additional normalizers

Here is a list of additional normalizers provided by the groonga-normalizer-mysql plugin:

  • NormalizerMySQLGeneralCI
  • NormalizerMySQLUnicodeCI
  • NormalizerMySQLUnicodeCIExceptKanaCIKanaWithVoicedSoundMark

groonga-normalizer-mysql is a Groonga plugin. It provides MySQL-compatible normalizers to Groonga. NormalizerMySQLGeneralCI corresponds to utf8mb4_general_ci.

You need to register the normalizers/mysql plugin in advance.

Execution example:

register normalizers/mysql
# [[0, 1337566253.89858, 0.000355720520019531], true]

Here is an example that uses NormalizerMySQLGeneralCI normalizer:

Execution example:

table_create MySQLGeneralLexicon TABLE_HASH_KEY ShortText --normalizer NormalizerMySQLGeneralCI
# [[0, 1337566253.89858, 0.000355720520019531], true]

7.7.4. See also
