7.9. Token filters

7.9.1. Summary

Groonga has token filter module that some processes tokenized token.

Token filter module can be added as a plugin.

You can customize tokenized token by registering your token filters plugins to Groonga.

A table can have zero or more token filters. You can attach token filters to a table by token_filters option in table_create.

Here is an example table_create that uses TokenFilterStopWord token filter module:

Execution example:

register token_filters/stop_word
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
# [[0, 1337566253.89858, 0.000355720520019531], true]

7.9.2. Available token filters

Here is the list of available token filters:

  • TokenFilterStopWord
  • TokenFilterStem

7.9.2.1. TokenFilterStopWord

TokenFilterStopWord removes stop words from tokenized token in searching the documents.

TokenFilterStopWord can specify stop word after adding the documents because it removes token in searching the documents.

The stop word is specified is_stop_word column on lexicon table.

Here is an example that uses TokenFilterStopWord token filter:

Execution example:

register token_filters/stop_word
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms is_stop_word COLUMN_SCALAR Bool
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Terms
[
{"_key": "and", "is_stop_word": true}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]
load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "Good-bye"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "Hello and"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "Hello"
#       ],
#       [
#         2,
#         "Hello and Good-bye"
#       ]
#     ]
#   ]
# ]

and token is marked as stop word in Terms table.

"Hello" that doesn't have and in content is matched. Because and is a stop word and and is removed from query.

7.9.2.2. TokenFilterStem

TokenFilterStem stems tokenized token.

Here is an example that uses TokenFilterStem token filter:

Execution example:

register token_filters/stem
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStem
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Memos
[
{"content": "I develop Groonga"},
{"content": "I'm developing Groonga"},
{"content": "I developed Groonga"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "develops"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "I develop Groonga"
#       ],
#       [
#         2,
#         "I'm developing Groonga"
#       ],
#       [
#         3,
#         "I developed Groonga"
#       ]
#     ]
#   ]
# ]

All of develop, developing, developed and develops tokens are stemmed as develop. So we can find develop, developing and developed by develops query.

7.9.3. See also