groonga - An open-source full-text search engine with column store

5.3. Completion

This section describes the following aspects of the completion feature:

  • How it works

  • How to use

  • How to learn

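5.3.1. How it works

The completion feature uses three kinds of search to compute completion candidates:

  1. Prefix RK search against registered words.

  2. Cooccurrence search against learned data.

  3. Prefix search against registered words. (This search may be skipped.)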

5.3.2. How to use

Groonga provides the suggest command for the completion feature. Pass the --types complete option to use completion.

For example, the following command gets the completion results for the input "en":

Execution example:

suggest --table item_query --column kana --types complete --frequency_threshold 1 --query en
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "complete": [
#       [
#         1
#       ],
#       [
#         [
#           "_key",
#           "ShortText"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "engine",
#         1
#       ]
#     ]
#   }
# ]
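
A client usually needs only the candidate words and their scores from this response. The following is a minimal Python sketch of that extraction, assuming the response layout shown above: a header whose first element is the return code, followed by an object whose "complete" entry holds the hit count, the column definitions, and then one row per candidate. The function name parse_complete is hypothetical and not part of Groonga.

```python
import json


def parse_complete(response_text):
    """Return [(key, score), ...] from a suggest response with a "complete" section."""
    header, body = json.loads(response_text)
    return_code = header[0]
    if return_code != 0:
        raise RuntimeError(f"suggest failed with return code {return_code}")
    section = body["complete"]
    # section[0] is [hit_count], section[1] lists the [name, type] column
    # definitions, and the remaining entries are the candidate rows.
    columns = [name for name, _type in section[1]]
    key_index = columns.index("_key")
    score_index = columns.index("_score")
    return [(row[key_index], row[score_index]) for row in section[2:]]


# The response from the execution example above, without the "# " prefixes.
response = """
[[0, 1337566253.89858, 0.000355720520019531],
 {"complete": [[1],
               [["_key", "ShortText"], ["_score", "Int32"]],
               ["engine", 1]]}]
"""
print(parse_complete(response))
```

Looking up the column positions by name rather than hard-coding them keeps the sketch working if the columns are returned in a different order.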

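5.3.3. How to learn

Cooccurrence search uses learned data. Learned data is built from sources such as query logs and access logs. To build it, you need the timestamped sequence of inputs and the timestamped input that was actually submitted as a search.

For example, suppose a user wants to search for "engine" and types the query in the following sequence:

  1. 2011-08-10T13:33:23+09:00: e
  2. 2011-08-10T13:33:23+09:00: en
  3. 2011-08-10T13:33:24+09:00: eng
  4. 2011-08-10T13:33:24+09:00: engi
  5. 2011-08-10T13:33:24+09:00: engin
  6. 2011-08-10T13:33:25+09:00: engine (submitted!)

Groonga can learn from this input sequence with the following command: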

load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
[
{"sequence": "1", "time": 1312950803.86057, "item": "e"},
{"sequence": "1", "time": 1312950803.96857, "item": "en"},
{"sequence": "1", "time": 1312950804.26057, "item": "eng"},
{"sequence": "1", "time": 1312950804.56057, "item": "engi"},
{"sequence": "1", "time": 1312950804.76057, "item": "engin"},
{"sequence": "1", "time": 1312950805.86057, "item": "engine", "type": "submit"}
]
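
Records like the ones above can be generated from a timestamped input log. The following Python sketch converts ISO 8601 timestamps to the epoch seconds used in the "time" field and marks the final input as the submitted one. The helper name to_load_records and the rule that the last input of a sequence is the submission are assumptions made for illustration. The log above has one-second resolution, so the generated times are whole seconds, unlike the fractional values in the command above.

```python
import json
from datetime import datetime


def to_load_records(sequence_id, inputs):
    """Build load records from (ISO 8601 timestamp, typed text) pairs."""
    records = []
    last = len(inputs) - 1
    for i, (timestamp, text) in enumerate(inputs):
        record = {
            "sequence": sequence_id,
            # fromisoformat() honors the +09:00 offset, so timestamp()
            # returns the correct number of seconds since the Unix epoch.
            "time": datetime.fromisoformat(timestamp).timestamp(),
            "item": text,
        }
        if i == last:
            # Assumption: the last input in a sequence is the submitted query.
            record["type"] = "submit"
        records.append(record)
    return records


inputs = [
    ("2011-08-10T13:33:23+09:00", "e"),
    ("2011-08-10T13:33:23+09:00", "en"),
    ("2011-08-10T13:33:24+09:00", "eng"),
    ("2011-08-10T13:33:24+09:00", "engi"),
    ("2011-08-10T13:33:24+09:00", "engin"),
    ("2011-08-10T13:33:25+09:00", "engine"),
]
records = to_load_records("1", inputs)
print(json.dumps(records, ensure_ascii=False, indent=1))
```

The printed JSON array can be used as the body of the load command shown above.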

5.3.4. How to update RK reading data

RK search requires each registered word together with its reading, so load this data in advance.

Here is an example that registers "日本", which means "Japan" in English.

Execution example:

load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
[
{"sequence": "1", "time": 1312950805.86058, "item": "日本", "type": "submit"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]

Here is an example that updates the RK data so that "日本" can be completed.

Execution example:

load --table item_query
[
{"_key":"日本", "kana":["ニホン", "ニッポン"]}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]

You can then complete the registered word "日本" from the RK input "nihon".

Execution example:

suggest --table item_query --column kana --types complete --frequency_threshold 1 --query nihon
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "complete": [
#       [
#         1
#       ],
#       [
#         [
#           "_key",
#           "ShortText"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "日本",
#         2
#       ]
#     ]
#   }
# ]

Without loading the RK data above, the query "nihon" cannot complete the registered word "日本".

Because the kana column of the item_query table is a vector column, you can register multiple readings for one registered word.

This is why the query "nippon" also completes the registered word "日本".

Execution example:

suggest --table item_query --column kana --types complete --frequency_threshold 1 --query nippon
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "complete": [
#       [
#         1
#       ],
#       [
#         [
#           "_key",
#           "ShortText"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "日本",
#         2
#       ]
#     ]
#   }
# ]

This feature is convenient because you can find registered words even when the Japanese input method (IM) is disabled.
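
To illustrate why both "nihon" and "nippon" reach "日本", here is a toy Python model of prefix matching on readings. It is not Groonga's implementation: the conversion table, the function names, and the matching strategy are all assumptions, and the table covers only the syllables needed for this example.

```python
# Toy romaji-to-katakana table (assumption: only the syllables used here).
ROMAJI_TO_KATAKANA = {"ni": "ニ", "ho": "ホ", "po": "ポ", "n": "ン"}

# Readings registered for each word, mirroring the kana column above.
READINGS = {"日本": ["ニホン", "ニッポン"]}


def romaji_to_katakana(romaji):
    """Convert romaji to katakana by greedy longest match."""
    result = []
    i = 0
    while i < len(romaji):
        # A doubled consonant other than "n" becomes the small tsu.
        if (i + 1 < len(romaji) and romaji[i] == romaji[i + 1]
                and romaji[i] not in "aiueon"):
            result.append("ッ")
            i += 1
            continue
        for length in (2, 1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_KATAKANA:
                result.append(ROMAJI_TO_KATAKANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"unsupported romaji: {romaji[i:]!r}")
    return "".join(result)


def complete_by_rk(query, readings=READINGS):
    """Return words that have at least one reading starting with the query."""
    prefix = romaji_to_katakana(query)
    return [word for word, kana_list in readings.items()
            if any(kana.startswith(prefix) for kana in kana_list)]


print(complete_by_rk("nihon"))   # matches the reading ニホン
print(complete_by_rk("nippon"))  # matches the reading ニッポン
```

Because a word matches when any one of its readings starts with the converted query, registering several readings for one word makes it reachable from each of them.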

If completion returns multiple candidates, you can customize their priority by setting the "boost" column of the item_query table.

Here is an example that customizes the priority for RK search.

Execution example:

load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
[
{"sequence": "1", "time": 1312950805.86059, "item": "日本語", "type": "submit"},
{"sequence": "1", "time": 1312950805.86060, "item": "日本人", "type": "submit"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]
load --table item_query
[
{"_key":"日本語", "kana":"ニホンゴ"},
{"_key":"日本人", "kana":"ニホンジン"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]
suggest --table item_query --column kana --types complete --frequency_threshold 1 --query nihon
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "complete": [
#       [
#         3
#       ],
#       [
#         [
#           "_key",
#           "ShortText"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "日本",
#         2
#       ],
#       [
#         "日本人",
#         2
#       ],
#       [
#         "日本語",
#         2
#       ]
#     ]
#   }
# ]
load --table item_query
[
{"_key":"日本人", "boost": 100}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]
suggest --table item_query --column kana --types complete --frequency_threshold 1 --query nihon
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "complete": [
#       [
#         3
#       ],
#       [
#         [
#           "_key",
#           "ShortText"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "日本人",
#         102
#       ],
#       [
#         "日本",
#         2
#       ],
#       [
#         "日本語",
#         2
#       ]
#     ]
#   }
# ]
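
In the output above, the score of "日本人" changes from 2 to 102 after its boost is set to 100, which is consistent with the boost being added to the base score. The following Python sketch reproduces that ordering under this assumption. The additive formula is inferred from this single example rather than taken from a documented rule, and the function name rank is hypothetical.

```python
def rank(candidates, boosts):
    """Order (key, base_score) candidates by base score plus boost.

    Assumption: the final score is base_score + boost, as the example
    output suggests (2 + 100 = 102).
    """
    scored = [(key, score + boosts.get(key, 0)) for key, score in candidates]
    # sorted() is stable, so candidates with equal scores keep their order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Candidates in the order returned before the boost was set.
candidates = [("日本", 2), ("日本人", 2), ("日本語", 2)]
print(rank(candidates, {"日本人": 100}))
```

With an empty boost mapping the candidates keep their original order, which corresponds to the first suggest result in the example.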
