7.8. Tokenizers¶
7.8.1. Summary¶
Groonga has a tokenizer module that tokenizes text. It is used in the following cases: when indexing text and when searching by a query.
A tokenizer is an important module for full-text search. You can change the trade-off between precision and recall by changing the tokenizer.
Normally, TokenBigram is a suitable tokenizer. If you don't know much about tokenizers, it's recommended that you choose TokenBigram.
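A tokenizer is typically specified when you define a lexicon table for a full-text index. Here is a minimal sketch of such a schema; the table and column names (Memos, content, Terms, memos_content) are hypothetical:
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR Text
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
Text stored in Memos.content is tokenized by TokenBigram when it is indexed, and queries against that index are tokenized in the same way.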
You can try a tokenizer with the tokenize and table_tokenize commands. Here is an example that tries the TokenBigram tokenizer with tokenize:
Execution example:
tokenize TokenBigram "Hello World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "He"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o "
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": " W"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "Wo"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "or"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "rl"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ld"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "d"
# }
# ]
# ]
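table_tokenize works like tokenize but uses the tokenizer (and normalizer) configured on an existing lexicon table. A minimal sketch, assuming the hypothetical Terms table defined above:
table_tokenize Terms "Hello World"
Because Terms is defined with TokenBigram and NormalizerAuto, the output follows the normalized TokenBigram behavior described later in this section.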
7.8.2. What is "tokenize"?¶
"tokenize" is the process that extracts zero or more tokens from a text. There are some "tokenize" methods.
For example, Hello World
is tokenized to the following tokens by
bigram tokenize method:
He
el
ll
lo
o_
(_
means a white-space)_W
(_
means a white-space)Wo
or
rl
ld
In the above example, 10 tokens are extracted from one text Hello
World
.
For example, Hello World is tokenized to the following tokens by the white-space-separate tokenize method:
- Hello
- World
In the above example, 2 tokens are extracted from one text Hello World.
A token is used as a search key. You can find indexed documents only by tokens that are extracted by the tokenize method in use. For example, you can find Hello World by ll with the bigram tokenize method, but you can't find Hello World by ll with the white-space-separate tokenize method, because the white-space-separate tokenize method doesn't extract the ll token. It just extracts the Hello and World tokens.
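The following is a minimal, hypothetical sketch that shows tokens acting as search keys. The lexicon Bigrams uses TokenBigram without a normalizer, so the pure bigram method applies and an ll token is indexed for Hello World; the table and column names are assumptions for illustration:
table_create Docs TABLE_NO_KEY
column_create Docs content COLUMN_SCALAR Text
table_create Bigrams TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram
column_create Bigrams docs_content COLUMN_INDEX|WITH_POSITION Docs content
load --table Docs
[
{"content": "Hello World"}
]
select Docs --match_columns content --query ll
The select should match the Hello World record because the ll token exists in the index. With a white-space-separate tokenizer such as TokenDelimit, the same query would not match, because only Hello and World tokens are indexed.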
In general, a tokenize method that generates small tokens increases recall but decreases precision. A tokenize method that generates large tokens increases precision but decreases recall.
For example, we can find both Hello World and A or B by or with the bigram tokenize method. Hello World is noise for people who want to search for the logical operator or. It means that precision is decreased, but recall is increased.
We can find only A or B by or with the white-space-separate tokenize method, because World is tokenized to the single token World with the white-space-separate tokenize method. It means that precision is increased for people who want to search for the logical operator or, but recall is decreased because Hello World, which contains or, isn't found.
7.8.3. Built-in tokenizers¶
Here is a list of built-in tokenizers:
TokenBigram
TokenBigramSplitSymbol
TokenBigramSplitSymbolAlpha
TokenBigramSplitSymbolAlphaDigit
TokenBigramIgnoreBlank
TokenBigramIgnoreBlankSplitSymbol
TokenBigramIgnoreBlankSplitAlpha
TokenBigramIgnoreBlankSplitAlphaDigit
TokenUnigram
TokenTrigram
TokenDelimit
TokenDelimitNull
TokenMecab
TokenRegexp
7.8.3.1. TokenBigram¶
TokenBigram is a bigram-based tokenizer. It's recommended for most cases.
The bigram tokenize method tokenizes a text into tokens of two adjacent characters. For example, Hello is tokenized to the following tokens:
- He
- el
- ll
- lo
The bigram tokenize method is good for recall because you can find all texts by a query that consists of two or more characters.
In general, you can't find all texts by a query that consists of one character, because a one-character token doesn't exist. But you can find all texts by a one-character query in Groonga, because Groonga finds tokens that start with the query by predictive search. For example, Groonga can find the ll and lo tokens by the l query.
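As a quick check of this predictive search behavior, a one-character query against the hypothetical Docs table sketched earlier should also match, because Groonga expands l to indexed tokens such as ll and lo:
select Docs --match_columns content --query l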
The bigram tokenize method isn't good for precision because you can find texts that include the query inside a word. For example, you can find world by or. This is more noticeable for ASCII-only languages than for non-ASCII languages. TokenBigram has a solution for this problem, described below.
TokenBigram behaves differently depending on whether it works with a normalizer.
If no normalizer is used, TokenBigram uses the pure bigram tokenize method (all tokens except the last token have two characters):
Execution example:
tokenize TokenBigram "Hello World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "He"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o "
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": " W"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "Wo"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "or"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "rl"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ld"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "d"
# }
# ]
# ]
If a normalizer is used, TokenBigram uses a white-space-separate-like tokenize method for ASCII characters and the bigram tokenize method for non-ASCII characters.
You may be confused by this combined behavior, but it's reasonable for most use cases such as English text (only ASCII characters) and Japanese text (ASCII and non-ASCII characters mixed).
Most languages that consist of only ASCII characters use white-space as the word separator. The white-space-separate tokenize method is suitable for that case.
Languages that consist of non-ASCII characters don't use white-space as the word separator. The bigram tokenize method is suitable for that case.
The mixed tokenize method is suitable for the mixed-language case.
If you want to use the bigram tokenize method for ASCII characters, see the TokenBigramSplitXXX type tokenizers such as TokenBigramSplitSymbolAlpha.
Let's confirm TokenBigram behavior by example.
TokenBigram uses one or more white-spaces as the token delimiter for ASCII characters:
Execution example:
tokenize TokenBigram "Hello World" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "world"
# }
# ]
# ]
TokenBigram uses a character type change as the token delimiter for ASCII characters. The character type is one of the following:
- Alphabet
- Digit
- Symbol (such as (, ) and !)
- Hiragana
- Katakana
- Kanji
- Others
The following example shows two token delimiters:
- between 100 (digits) and cents (alphabets)
- between cents (alphabets) and !!! (symbols)
Execution example:
tokenize TokenBigram "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
Here is an example where TokenBigram uses the bigram tokenize method for non-ASCII characters:
Execution example:
tokenize TokenBigram "日本語の勉強" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語の"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "の勉"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "勉強"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "強"
# }
# ]
# ]
7.8.3.2. TokenBigramSplitSymbol¶
TokenBigramSplitSymbol is similar to TokenBigram. The difference between them is symbol handling. TokenBigramSplitSymbol tokenizes symbols by the bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.3. TokenBigramSplitSymbolAlpha¶
TokenBigramSplitSymbolAlpha is similar to TokenBigram. The difference between them is symbol and alphabet handling. TokenBigramSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "en"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "nt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "ts"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "s!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.4. TokenBigramSplitSymbolAlphaDigit¶
TokenBigramSplitSymbolAlphaDigit is similar to TokenBigram. The difference between them is symbol, alphabet and digit handling. TokenBigramSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "10"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "00"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "0c"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "en"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "nt"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "ts"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "s!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.5. TokenBigramIgnoreBlank¶
TokenBigramIgnoreBlank is similar to TokenBigram. The difference between them is blank handling. TokenBigramIgnoreBlank ignores white-spaces in continuous symbols and non-ASCII characters.
You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlank:
Execution example:
tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
7.8.3.6. TokenBigramIgnoreBlankSplitSymbol¶
TokenBigramIgnoreBlankSplitSymbol is similar to TokenBigram. The differences between them are the following:
- Blank handling
- Symbol handling
TokenBigramIgnoreBlankSplitSymbol ignores white-spaces in continuous symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbol tokenizes symbols by the bigram tokenize method.
You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbol:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.7. TokenBigramIgnoreBlankSplitSymbolAlpha¶
TokenBigramIgnoreBlankSplitSymbolAlpha is similar to TokenBigram. The differences between them are the following:
- Blank handling
- Symbol and alphabet handling
TokenBigramIgnoreBlankSplitSymbolAlpha ignores white-spaces in continuous symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method.
You can see the difference between them with the Hello 日 本 語 ! ! ! text because it has symbols and non-ASCII characters together with white-spaces and alphabets.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbolAlpha:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "he"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o日"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.8. TokenBigramIgnoreBlankSplitSymbolAlphaDigit¶
TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to TokenBigram. The differences between them are the following:
- Blank handling
- Symbol, alphabet and digit handling
TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores white-spaces in continuous symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method.
You can see the difference between them with the Hello 日 本 語 ! ! ! 777 text because it has symbols and non-ASCII characters together with white-spaces, alphabets and digits.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "777"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbolAlphaDigit:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "he"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o日"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!7"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "77"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "77"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "7"
# }
# ]
# ]
7.8.3.9. TokenUnigram¶
TokenUnigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token. TokenUnigram uses 1 character per token.
Execution example:
tokenize TokenUnigram "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
7.8.3.10. TokenTrigram¶
TokenTrigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token. TokenTrigram uses 3 characters per token.
Execution example:
tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "10000"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!!!"
# }
# ]
# ]
7.8.3.11. TokenDelimit¶
TokenDelimit extracts tokens by splitting the text on one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.
TokenDelimit is suitable for tag text. You can extract groonga, full-text-search and http as tags from groonga full-text-search http.
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groonga"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "full-text-search"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "http"
# }
# ]
# ]
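Because TokenDelimit extracts each space-separated value as one token, it fits a tag search index. Here is a minimal sketch; the table and column names (Items, tags, Tags, items_tags) are hypothetical:
table_create Items TABLE_NO_KEY
column_create Items tags COLUMN_SCALAR ShortText
table_create Tags TABLE_PAT_KEY ShortText --default_tokenizer TokenDelimit --normalizer NormalizerAuto
column_create Tags items_tags COLUMN_INDEX Items tags
With this schema, a record whose tags column is groonga full-text-search http can be found by the exact tag full-text-search.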
7.8.3.12. TokenDelimitNull¶
TokenDelimitNull is similar to TokenDelimit. The difference between them is the separator character. TokenDelimit uses a space character (U+0020) but TokenDelimitNull uses a NUL character (U+0000).
TokenDelimitNull is also suitable for tag text.
Here is an example of TokenDelimitNull:
Execution example:
tokenize TokenDelimitNull "Groonga\u0000full-text-search\u0000HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groongau0000full-text-searchu0000http"
# }
# ]
# ]
7.8.3.13. TokenMecab¶
TokenMecab is a tokenizer based on the MeCab part-of-speech and morphological analyzer.
MeCab doesn't depend on Japanese. You can use MeCab for other languages by creating a dictionary for those languages. You can use NAIST Japanese Dictionary for Japanese.
TokenMecab is good for precision rather than recall. You can find both 東京都 and 京都 texts by the 京都 query with TokenBigram, but 東京都 isn't an expected result. You can find only the 京都 text by the 京都 query with TokenMecab.
If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn't require dictionary maintenance because TokenBigram doesn't use a dictionary.) mecab-ipadic-NEologd: Neologism dictionary for MeCab may help you.
Here is an example of TokenMecab. 東京都 is tokenized to 東京 and 都. They don't include 京都:
Execution example:
tokenize TokenMecab "東京都"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "東京"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "都"
# }
# ]
# ]
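To use TokenMecab for indexing, specify it as the lexicon's tokenizer in the same way as the other tokenizers. This is a sketch that assumes your Groonga is built with MeCab support; the table name Words is hypothetical:
table_create Words TABLE_PAT_KEY ShortText --default_tokenizer TokenMecab --normalizer NormalizerAuto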
7.8.3.14. TokenRegexp¶
New in version 5.0.1.
Caution
This tokenizer is experimental. The specification may be changed.
Caution
This tokenizer can be used only with UTF-8. You can't use this tokenizer with EUC-JP, Shift_JIS and so on.
TokenRegexp is a tokenizer for supporting regular expression search by index.
In general, regular expression search is evaluated as sequential search. But the following cases can be evaluated as index search:
- Literal only case such as hello
- The beginning of text and literal case such as \A/home/alice
- The end of text and literal case such as \.txt\z
In most cases, index search is faster than sequential search.
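Here is a minimal sketch of an index that supports such regular expression searches. TokenRegexp is set on the lexicon, and the literal-only pattern alice is searched with the @~ operator; the table and column names are hypothetical:
table_create Paths TABLE_NO_KEY
column_create Paths path COLUMN_SCALAR ShortText
table_create RegexpTerms TABLE_PAT_KEY ShortText --default_tokenizer TokenRegexp --normalizer NormalizerAuto
column_create RegexpTerms paths_path COLUMN_INDEX|WITH_POSITION Paths path
load --table Paths
[
{"path": "/home/alice/test.txt"}
]
select Paths --filter 'path @~ "alice"'
Because the pattern is a literal, Groonga can evaluate this regular expression search with the RegexpTerms index instead of a sequential scan.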
TokenRegexp is based on the bigram tokenize method. TokenRegexp adds the beginning-of-text mark (U+FFEF) at the beginning of the text and the end-of-text mark (U+FFF0) at the end of the text when you index text:
Execution example:
tokenize TokenRegexp "/home/alice/test.txt" NormalizerAuto --mode ADD
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": ""
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "/h"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ho"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "om"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "me"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "e/"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "/a"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "al"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "li"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ic"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "e/"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "/t"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "es"
# },
# {
# "position": 15,
# "force_prefix": false,
# "value": "st"
# },
# {
# "position": 16,
# "force_prefix": false,
# "value": "t."
# },
# {
# "position": 17,
# "force_prefix": false,
# "value": ".t"
# },
# {
# "position": 18,
# "force_prefix": false,
# "value": "tx"
# },
# {
# "position": 19,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 20,
# "force_prefix": false,
# "value": "t"
# },
# {
# "position": 21,
# "force_prefix": false,
# "value": ""
# }
# ]
# ]