groonga - An open-source fulltext search engine and column store.

7.12.8. html_untag

7.12.8.1. Summary

html_untag strips HTML tags from HTML text and outputs the remaining plain text.

html_untag is used in the --output_columns parameter, which is described at output_columns.

7.12.8.2. Syntax

html_untag requires a single argument, html.

html_untag(html)

7.12.8.3. Requirements

html_untag requires Groonga 3.0.5 or later.

html_untag requires command version 2 or later.

7.12.8.4. Usage

Here are a schema definition and sample data to show usage.

Sample schema:

Execution example:

table_create WebClips TABLE_HASH_KEY ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create WebClips content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]

Sample data:

Execution example:

load --table WebClips
[
{"_key": "http://groonga.org", "content": "groonga is <span class='emphasize'>fast</span>"},
{"_key": "http://mroonga.org", "content": "mroonga is <span class=\"emphasize\">fast</span>"},
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]

Here is a simple usage of the html_untag function, which strips HTML tags from the value of the content column.

Execution example:

select WebClips --output_columns "html_untag(content)" --command_version 2
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "html_untag",
#           "null"
#         ]
#       ],
#       [
#         "groonga is fast"
#       ],
#       [
#         "mroonga is fast"
#       ]
#     ]
#   ]
# ]

The result of the above query shows that the "span" tags, including their "class" attribute, are stripped. Note that you must specify --command_version 2 to use the html_untag function.
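
To make the effect on the sample data concrete, here is a minimal Python sketch that models the behaviour seen above. It is an illustration only, not Groonga's implementation: the strip_tags helper is a hypothetical name, and the sketch assumes that a tag is everything from "<" up to the next ">". It does not decode character entities or treat comments and scripts specially; whether html_untag does so is not covered here.

```python
def strip_tags(html: str) -> str:
    """Hypothetical helper: drop text between '<' and the next '>'."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            # Start of a tag: skip characters until the closing '>'.
            in_tag = True
        elif ch == ">" and in_tag:
            # End of the tag: resume copying characters.
            in_tag = False
        elif not in_tag:
            # Ordinary text outside any tag is kept as-is.
            out.append(ch)
    return "".join(out)


# The same two content values that are loaded into WebClips above.
samples = [
    "groonga is <span class='emphasize'>fast</span>",
    'mroonga is <span class="emphasize">fast</span>',
]
results = [strip_tags(s) for s in samples]
print(results)
```

For the two sample values, the sketch yields the same strings as the select output above: "groonga is fast" and "mroonga is fast".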

7.12.8.5. Parameters

There is one required parameter, html.

7.12.8.5.1. html

It specifies the HTML text to be untagged.

7.12.8.6. Return value

html_untag returns the plain text that remains after HTML tags are stripped from the HTML text.
