7.9. Token filters
7.9.1. Summary
Groonga has a token filter module system that processes tokenized tokens.
Token filter modules can be added as plugins.
You can customize how tokens are processed by registering your own token filter plugins with Groonga.
A table can have zero or more token filters. You can attach token filters to a table with the token_filters option of table_create.
Here is an example table_create
that uses the TokenFilterStopWord
token filter module:
Execution example:
register token_filters/stop_word
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto \
--token_filters TokenFilterStopWord
# [[0, 1337566253.89858, 0.000355720520019531], true]
7.9.2. Available token filters
Here is the list of available token filters:
TokenFilterStopWord
TokenFilterStem
7.9.2.1. TokenFilterStopWord
TokenFilterStopWord
removes stop words from tokenized tokens
at search time.
Because TokenFilterStopWord
removes tokens when searching the documents, you can specify
stop words after the documents have been added.
Stop words are specified with the is_stop_word
column on the lexicon table.
Here is an example that uses the TokenFilterStopWord
token filter:
Execution example:
register token_filters/stop_word
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto \
--token_filters TokenFilterStopWord
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms is_stop_word COLUMN_SCALAR Bool
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Terms
[
{"_key": "and", "is_stop_word": true}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]
load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "Good-bye"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "Hello and"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 2
# ],
# [
# [
# "_id",
# "UInt32"
# ],
# [
# "content",
# "ShortText"
# ]
# ],
# [
# 1,
# "Hello"
# ],
# [
# 2,
# "Hello and Good-bye"
# ]
# ]
# ]
# ]
The and
token is marked as a stop word in the Terms
table.
The record "Hello"
, which doesn't contain and
in its content, is also matched, because
and
is a stop word and is removed
from the query.
7.9.2.2. TokenFilterStem
TokenFilterStem
stems tokenized tokens.
Here is an example that uses the TokenFilterStem
token filter:
Execution example:
register token_filters/stem
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto \
--token_filters TokenFilterStem
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Memos
[
{"content": "I develop Groonga"},
{"content": "I'm developing Groonga"},
{"content": "I developed Groonga"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "develops"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 3
# ],
# [
# [
# "_id",
# "UInt32"
# ],
# [
# "content",
# "ShortText"
# ]
# ],
# [
# 1,
# "I develop Groonga"
# ],
# [
# 2,
# "I'm developing Groonga"
# ],
# [
# 3,
# "I developed Groonga"
# ]
# ]
# ]
# ]
All of the develop
, developing
, developed
and develops
tokens are stemmed to develop
. So we can find records containing develop
,
developing
and developed
with the develops
query.
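The idea behind stemming can be sketched with a naive suffix-stripping function in Python. This is a simplified illustration only; the suffix list is hypothetical and TokenFilterStem's actual stemming algorithm is more sophisticated:

```python
# Naive suffix-stripping stemmer, for illustration only.
# TokenFilterStem's real algorithm handles many more cases.

def naive_stem(token):
    """Strip a few common English suffixes from a lowercase token."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# All variants reduce to the same stem, so a query for "develops"
# also matches documents containing "developing" or "developed".
tokens = ["develop", "develops", "developed", "developing"]
print({t: naive_stem(t) for t in tokens})
# → {'develop': 'develop', 'develops': 'develop', 'developed': 'develop', 'developing': 'develop'}
```

Since both indexed tokens and query tokens pass through the same token filter, matching is done on stems rather than surface forms.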