.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: commands_tokenize

``tokenize``
============

Summary
-------

``tokenize`` command tokenizes text with the specified tokenizer. It
is useful for debugging tokenization.

Syntax
------

``tokenize`` command has required parameters and optional parameters.
``tokenizer`` and ``string`` are required parameters. Others are
optional::

  tokenize tokenizer
           string
           [normalizer=null]
           [flags=NONE]
           [mode=ADD]
           [token_filters=NONE]

Usage
-----

Here is a simple example.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/simple_example.log
.. tokenize TokenBigram "Fulltext Search"

It has only required parameters. ``tokenizer`` is ``TokenBigram`` and
``string`` is ``"Fulltext Search"``. It returns the tokens that are
generated by tokenizing ``"Fulltext Search"`` with the ``TokenBigram``
tokenizer. It doesn't normalize ``"Fulltext Search"``.

Parameters
----------

This section describes all parameters. Parameters are categorized.

Required parameters
^^^^^^^^^^^^^^^^^^^

There are two required parameters, ``tokenizer`` and ``string``.

.. _tokenize-tokenizer:

``tokenizer``
"""""""""""""

It specifies the tokenizer name. ``tokenize`` command uses the
tokenizer that is named ``tokenizer``.

See :doc:`/reference/tokenizers` about built-in tokenizers.

Here is an example that uses the built-in ``TokenTrigram`` tokenizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/tokenizer_token_trigram.log
.. tokenize TokenTrigram "Fulltext Search"

If you want to use other tokenizers, you need to register an
additional tokenizer plugin by :doc:`register` command. For example,
you can use a `KyTea <http://www.phontron.com/kytea/>`_ based
tokenizer by registering ``tokenizers/kytea``.

.. _tokenize-string:

``string``
""""""""""

It specifies any string which you want to tokenize.

If you want to include spaces in ``string``, you need to quote
``string`` with single quotes (``'``) or double quotes (``"``).

Here is an example that uses spaces in ``string``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/string_include_spaces.log
.. tokenize TokenBigram "Groonga is a fast fulltext search engine!"

Optional parameters
^^^^^^^^^^^^^^^^^^^

There are optional parameters.

.. _tokenize-normalizer:

``normalizer``
""""""""""""""

It specifies the normalizer name. ``tokenize`` command uses the
normalizer that is named ``normalizer``. The normalizer is important
for N-gram family tokenizers such as ``TokenBigram``.

The normalizer detects the character type of each character while
normalizing. N-gram family tokenizers use those character types while
tokenizing.

Here is an example that doesn't use a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_none.log
.. tokenize TokenBigram "Fulltext Search"

All alphabet characters are tokenized by two characters. For example,
``Fu`` is a token.

Here is an example that uses a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use.log
.. tokenize TokenBigram "Fulltext Search" NormalizerAuto

Continuous alphabet characters are tokenized as one token. For
example, ``fulltext`` is a token.

If you want to tokenize by two characters with a normalizer, use
``TokenBigramSplitSymbolAlpha``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use_with_split_symbol_alpha.log
.. tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto

All alphabet characters are tokenized by two characters, and they are
normalized to lower case. For example, ``fu`` is a token.

.. _tokenize-flags:

``flags``
"""""""""

It specifies tokenization customization options. You can specify
multiple options separated by "``|``". For example,
``NONE|ENABLE_TOKENIZED_DELIMITER``.

Here are the available flags.

.. list-table::
   :header-rows: 1

   * - Flag
     - Description
   * - ``NONE``
     - Just ignored.
   * - ``ENABLE_TOKENIZED_DELIMITER``
     - Enables tokenized delimiter. See :doc:`/reference/tokenizers`
       about tokenized delimiter details.

Here is an example that uses ``ENABLE_TOKENIZED_DELIMITER``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/flags_enable_tokenized_delimiter.log
.. tokenize TokenDelimit "Full￾text Sea￾crch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER

``TokenDelimit`` tokenizer is one of the tokenizers that support the
tokenized delimiter. ``ENABLE_TOKENIZED_DELIMITER`` enables the
tokenized delimiter.

The tokenized delimiter is a special character, ``U+FFFE``, that
indicates a token border. No character is assigned to this code
point, so it doesn't appear in normal strings. That makes it a good
character for this purpose.

If ``ENABLE_TOKENIZED_DELIMITER`` is enabled, the target string is
treated as an already tokenized string. The tokenizer just splits it
by the tokenized delimiter.

.. _tokenize-mode:

``mode``
""""""""

It specifies the tokenization mode. If the mode is ``ADD``, the text
is tokenized by the rule for adding a document. If the mode is
``GET``, the text is tokenized by the rule for searching a document.
The default mode is ``ADD``; it is used when the mode is omitted.

Here is an example of the ``ADD`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/add_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode ADD

The last alphabet character is tokenized by one character.

Here is an example of the ``GET`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/get_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode GET

The last alphabet character is tokenized by two characters.

.. _tokenize-token-filters:

``token_filters``
"""""""""""""""""

It specifies the token filter names. ``tokenize`` command uses the
token filters that are named by ``token_filters``.

See :doc:`/reference/token_filters` about token filters.
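
Token filters, like additional tokenizers, are usually provided as
plugins, so you may need to register one by :doc:`register` command
before using it. Here is a sketch, assuming the ``token_filters/stem``
plugin is available; it applies ``TokenFilterStem`` to stem each
token::

  # Illustrative only: it assumes the token_filters/stem plugin is installed.
  register token_filters/stem
  tokenize TokenBigram "Groonga develops fast" NormalizerAuto --token_filters TokenFilterStem

With ``TokenFilterStem``, each token is reduced to its stem. For
example, ``develops`` would become ``develop``.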

.. _tokenize-return-value:

Return value
------------

``tokenize`` command returns the tokenized tokens. Each token has
some attributes in addition to the token itself. More attributes may
be added in the future::

  [HEADER, tokens]

``HEADER``

  See :doc:`/reference/command/output_format` about ``HEADER``.

``tokens``

  ``tokens`` is an array of tokens. Each token is an object that has
  the following attributes.

.. list-table::
   :header-rows: 1

   * - Name
     - Description
   * - ``value``
     - The token itself.
   * - ``position``
     - The N-th token.

See also
--------

* :doc:`/reference/tokenizers`
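* :doc:`/reference/token_filters`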