.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: scorer_tf_idf

``scorer_tf_idf``
=================

.. note::

   This scorer is an experimental feature.

.. versionadded:: 5.0.1

Summary
-------

``scorer_tf_idf`` is a scorer based on the TF-IDF (term frequency-inverse
document frequency) score function.

To put it simply, TF (term frequency) divided by DF (document
frequency) is TF-IDF. "TF" means that "the number of occurrences is
more important". "TF divided by DF" means that "the number of
occurrences of an important term is more important".

The default score function in Groonga is TF (term frequency). It
doesn't take term importance into account, but it is fast.

TF-IDF takes term importance into account, but it is slower than TF.

TF-IDF computes a more suitable score than TF in many cases, but it
is not perfect.

If a document repeats the same keyword many times, such as "They are
keyword, keyword, keyword ... and keyword", its score increases with
both TF and TF-IDF. Search engine spammers may use this technique, and
TF-IDF does not guard against it.

Okapi BM25 can solve this case, but it is slower than TF-IDF and is
not implemented in Groonga yet.

Groonga provides the :doc:`scorer_tf_at_most` scorer, which can also
solve this case.

.. include:: ../scoring_note.rst

Syntax
------

This scorer has only one parameter::

  scorer_tf_idf(column)
  scorer_tf_idf(index)

Usage
-----

This section describes how to use this scorer.

Here are a schema definition and sample data to show usage.

Sample schema:

.. groonga-command
.. include:: ../example/reference/scorers/scorer_tf_idf/usage_setup_schema.log
.. table_create Logs TABLE_NO_KEY
.. column_create Logs message COLUMN_SCALAR Text
..
.. table_create Terms TABLE_PAT_KEY ShortText \
..   --default_tokenizer TokenBigram \
..   --normalizer NormalizerAuto
.. column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message

Sample data:

.. groonga-command
.. include:: ../example/reference/scorers/scorer_tf_idf/usage_setup_data.log
.. load --table Logs
.. [
.. {"message": "Error"},
.. {"message": "Warning"},
.. {"message": "Warning Warning"},
.. {"message": "Warning Warning Warning"},
.. {"message": "Info"},
.. {"message": "Info Info"},
.. {"message": "Info Info Info"},
.. {"message": "Info Info Info Info"},
.. {"message": "Notice"},
.. {"message": "Notice Notice"},
.. {"message": "Notice Notice Notice"},
.. {"message": "Notice Notice Notice Notice"},
.. {"message": "Notice Notice Notice Notice Notice"}
.. ]

You specify ``scorer_tf_idf`` in :ref:`select-match-columns` like the
following:

.. groonga-command
.. include:: ../example/reference/scorers/scorer_tf_idf/usage_no_weight.log
.. select Logs \
..   --match_columns "scorer_tf_idf(message)" \
..   --query "Error OR Info" \
..   --output_columns "message, _score" \
..   --sortby "-_score"

Both the score of ``Info Info Info`` and the score of ``Error`` are
``2``, even though ``Info Info Info`` contains three ``Info`` terms.
This is because ``Error`` is a more important term than ``Info``. The
number of documents that include ``Info`` is ``4``. The number of
documents that include ``Error`` is ``1``. A term that appears in
fewer documents is a more characteristic term, and a characteristic
term is an important term.

Parameters
----------

This section describes all parameters.

Required parameters
^^^^^^^^^^^^^^^^^^^

There is only one required parameter. Specify either ``column`` or
``index``.

``column``
""""""""""

The data column that is the match target. The data column must be
indexed.

``index``
"""""""""

The index column to be used for search.

Optional parameters
^^^^^^^^^^^^^^^^^^^

There are no optional parameters.

Return value
------------

This scorer returns the score as :ref:`builtin-type-float`.

:doc:`/reference/commands/select` returns ``_score`` as ``Int32``, not
``Float``, because it casts the score from ``Float`` to ``Int32`` to
keep backward compatibility.

The score is computed by a TF-IDF based algorithm.

See also
--------

* :doc:`../scorer`
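
The usage example above passes the data column ``message`` to the
scorer. As a minimal sketch of the ``scorer_tf_idf(index)`` form listed
in the Syntax section, and assuming that the index column created by
the sample schema is referenced as ``Terms.message_index``, the same
search could be written as::

  select Logs \
    --match_columns "scorer_tf_idf(Terms.message_index)" \
    --query "Error OR Info" \
    --output_columns "message, _score" \
    --sortby "-_score"

Only the argument to ``scorer_tf_idf`` changes. The other options are
the same as in the column form.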