:plugin: grok
:type: filter

///////////////////////////////////////////
START - GENERATED VARIABLES, DO NOT EDIT!
///////////////////////////////////////////
:version: %VERSION%
:release_date: %RELEASE_DATE%
:changelog_url: %CHANGELOG_URL%
:include_path: ../../../../logstash/docs/include
///////////////////////////////////////////
END - GENERATED VARIABLES, DO NOT EDIT!
///////////////////////////////////////////

[id="plugins-{type}s-{plugin}"]

=== Grok filter plugin

include::{include_path}/plugin_header.asciidoc[]

==== Description

Parse arbitrary text and structure it.

Grok is a great way to parse unstructured log data into something structured and queryable.

This tool is perfect for syslog logs, apache and other webserver logs, mysql
logs, and in general, any log format that is generally written for humans
and not computer consumption.

Logstash ships with about 120 patterns by default. You can find them here:
<https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns>. You can add
your own trivially. (See the `patterns_dir` setting)

If you need help building patterns to match your logs, you will find the
<http://grokdebug.herokuapp.com> and <http://grokconstructor.appspot.com/> applications quite useful!

===== Grok or Dissect? Or both?

The {logstash-ref}/plugins-filters-dissect.html[`dissect`] filter plugin
is another way to extract unstructured event data into fields using delimiters.

Dissect differs from Grok in that it does not use regular expressions and is faster. 
Dissect works well when data is reliably repeated.
Grok is a better choice when the structure of your text varies from line to line.

You can use both Dissect and Grok for a hybrid use case when a section of the
line is reliably repeated, but the entire line is not. The Dissect filter can
deconstruct the section of the line that is repeated. The Grok filter can process
the remaining field values with more regex predictability.

==== Grok Basics

Grok works by combining text patterns into something that matches your
logs.

The syntax for a grok pattern is `%{SYNTAX:SEMANTIC}`

The `SYNTAX` is the name of the pattern that will match your text. For
example, `3.44` will be matched by the `NUMBER` pattern and `55.3.244.1` will
be matched by the `IP` pattern. The syntax is how you match.

The `SEMANTIC` is the identifier you give to the piece of text being matched.
For example, `3.44` could be the duration of an event, so you could call it
simply `duration`. Further, a string `55.3.244.1` might identify the `client`
making a request.

For the above example, your grok filter would look something like this:
[source,ruby]
%{NUMBER:duration} %{IP:client}

Optionally you can add a data type conversion to your grok pattern. By default
all semantics are saved as strings. If you wish to convert a semantic's data type,
for example change a string to an integer then suffix it with the target data type.
For example `%{NUMBER:num:int}` which converts the `num` semantic from a string to an
integer. Currently the only supported conversions are `int` and `float`.

.Examples:

With that idea of a syntax and semantic, we can pull out useful fields from a
sample log like this fictional http request log:
[source,ruby]
    55.3.244.1 GET /index.html 15824 0.043

The pattern for this could be:
[source,ruby]
    %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

A more realistic example, let's read these logs from a file:
[source,ruby]
    input {
      file {
        path => "/var/log/http.log"
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
      }
    }

After the grok filter, the event will have a few extra fields in it:

* `client: 55.3.244.1`
* `method: GET`
* `request: /index.html`
* `bytes: 15824`
* `duration: 0.043`

==== Regular Expressions

Grok sits on top of regular expressions, so any regular expressions are valid
in grok as well. The regular expression library is Oniguruma, and you can see
the full supported regexp syntax https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma
site].

==== Custom Patterns

Sometimes logstash doesn't have a pattern you need. For this, you have
a few options.

First, you can use the Oniguruma syntax for named capture which will
let you match a piece of text and save it as a field:
[source,ruby]
    (?<field_name>the pattern here)

For example, postfix logs have a `queue id` that is an 10 or 11-character
hexadecimal value. I can capture that easily like this:
[source,ruby]
    (?<queue_id>[0-9A-F]{10,11})

Alternately, you can create a custom patterns file.

* Create a directory called `patterns` with a file in it called `extra`
  (the file name doesn't matter, but name it meaningfully for yourself)
* In that file, write the pattern you need as the pattern name, a space, then
  the regexp for that pattern.

For example, doing the postfix queue id example as above:
[source,ruby]
    # contents of ./patterns/postfix:
    POSTFIX_QUEUEID [0-9A-F]{10,11}

Then use the `patterns_dir` setting in this plugin to tell logstash where
your custom patterns directory is. Here's a full example with a sample log:

[source,ruby]
-----
    Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>
-----

[source,ruby]
-----
    filter {
      grok {
        patterns_dir => ["./patterns"]
        match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
      }
    }
-----

The above will match and result in the following fields:

* `timestamp: Jan  1 06:25:43`
* `logsource: mailserver14`
* `program: postfix/cleanup`
* `pid: 21403`
* `queue_id: BEF25A72965`
* `syslog_message: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>`

The `timestamp`, `logsource`, `program`, and `pid` fields come from the
`SYSLOGBASE` pattern which itself is defined by other patterns.

Another option is to define patterns _inline_ in the filter using `pattern_definitions`.
This is mostly for convenience and allows user to define a pattern which can be used just in that 
filter. This newly defined patterns in `pattern_definitions` will not be available outside of that particular `grok` filter.

[id="plugins-{type}s-{plugin}-ecs"]
==== Migrating to Elastic Common Schema (ECS)

To ease migration to the {ecs-ref}[Elastic Common Schema (ECS)], the filter
plugin offers a new set of ECS-compliant patterns in addition to the existing
patterns. The new ECS pattern definitions capture event field names that are
compliant with the schema.

The ECS pattern set has all of the pattern definitions from the legacy set, and is
a drop-in replacement. Use the <<plugins-{type}s-{plugin}-ecs_compatibility>>
setting to switch modes. 

New features and enhancements will be added to the ECS-compliant files. The
legacy patterns may still receive bug fixes which are backwards compatible.


[id="plugins-{type}s-{plugin}-options"]
==== Grok Filter Configuration Options

This plugin supports the following configuration options plus the <<plugins-{type}s-{plugin}-common-options>> described later.

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Input type|Required
| <<plugins-{type}s-{plugin}-break_on_match>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-ecs_compatibility>> |<<string,string>>|No
| <<plugins-{type}s-{plugin}-keep_empty_captures>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-match>> |<<hash,hash>>|No
| <<plugins-{type}s-{plugin}-named_captures_only>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-overwrite>> |<<array,array>>|No
| <<plugins-{type}s-{plugin}-pattern_definitions>> |<<hash,hash>>|No
| <<plugins-{type}s-{plugin}-patterns_dir>> |<<array,array>>|No
| <<plugins-{type}s-{plugin}-patterns_files_glob>> |<<string,string>>|No
| <<plugins-{type}s-{plugin}-tag_on_failure>> |<<array,array>>|No
| <<plugins-{type}s-{plugin}-tag_on_timeout>> |<<string,string>>|No
| <<plugins-{type}s-{plugin}-timeout_millis>> |<<number,number>>|No
| <<plugins-{type}s-{plugin}-timeout_scope>> |<<string,string>>|No
|=======================================================================

Also see <<plugins-{type}s-{plugin}-common-options>> for a list of options supported by all
filter plugins.

&nbsp;

[id="plugins-{type}s-{plugin}-break_on_match"]
===== `break_on_match` 

  * Value type is <<boolean,boolean>>
  * Default value is `true`

Break on first match. The first successful match by grok will result in the
filter being finished. If you want grok to try all patterns (maybe you are
parsing different things), then set this to false.

[id="plugins-{type}s-{plugin}-ecs_compatibility"]
===== `ecs_compatibility`

* Value type is <<string,string>>
* Supported values are:
** `disabled`: the plugin will load legacy (built-in) pattern definitions
** `v1`: all patterns provided by the plugin will use ECS compliant captures
* Default value depends on which version of Logstash is running:
** When Logstash provides a `pipeline.ecs_compatibility` setting, its value is used as the default
** Otherwise, the default value is `disabled`.

Controls this plugin's compatibility with the {ecs-ref}[Elastic Common Schema (ECS)].
The value of this setting affects extracted event field names when a composite pattern (such as `HTTPD_COMMONLOG`) is matched.

[id="plugins-{type}s-{plugin}-keep_empty_captures"]
===== `keep_empty_captures` 

  * Value type is <<boolean,boolean>>
  * Default value is `false`

If `true`, keep empty captures as event fields.

[id="plugins-{type}s-{plugin}-match"]
===== `match` 

  * Value type is <<hash,hash>>
  * Default value is `{}`

A hash that defines the mapping of _where to look_, and with which patterns.

For example, the following will match an existing value in the `message` field for the given pattern, and if a match is found will add the field `duration` to the event with the captured value:
[source,ruby]
    filter {
      grok {
        match => {
          "message" => "Duration: %{NUMBER:duration}"
        }
      }
    }

If you need to match multiple patterns against a single field, the value can be an array of patterns:
[source,ruby]
    filter {
      grok {
        match => {
          "message" => [
            "Duration: %{NUMBER:duration}",
            "Speed: %{NUMBER:speed}"
          ]
        }
      }
    }

[id="plugins-{type}s-{plugin}-named_captures_only"]
===== `named_captures_only` 

  * Value type is <<boolean,boolean>>
  * Default value is `true`

If `true`, only store named captures from grok.

[id="plugins-{type}s-{plugin}-overwrite"]
===== `overwrite` 

  * Value type is <<array,array>>
  * Default value is `[]`

The fields to overwrite.

This allows you to overwrite a value in a field that already exists.

For example, if you have a syslog line in the `message` field, you can
overwrite the `message` field with part of the match like so:
[source,ruby]
    filter {
      grok {
        match => { "message" => "%{SYSLOGBASE} %{DATA:message}" }
        overwrite => [ "message" ]
      }
    }

In this case, a line like `May 29 16:37:11 sadness logger: hello world`
will be parsed and `hello world` will overwrite the original message.

[id="plugins-{type}s-{plugin}-pattern_definitions"]
===== `pattern_definitions` 

  * Value type is <<hash,hash>>
  * Default value is `{}`

A hash of pattern-name and pattern tuples defining custom patterns to be used by 
the current filter. Patterns matching existing names will override the pre-existing 
definition. Think of this as inline patterns available just for this definition of 
grok

[id="plugins-{type}s-{plugin}-patterns_dir"]
===== `patterns_dir` 

  * Value type is <<array,array>>
  * Default value is `[]`


Logstash ships by default with a bunch of patterns, so you don't
necessarily need to define this yourself unless you are adding additional
patterns. You can point to multiple pattern directories using this setting.
Note that Grok will read all files in the directory matching the patterns_files_glob
and assume it's a pattern file (including any tilde backup files). 
[source,ruby]
    patterns_dir => ["/opt/logstash/patterns", "/opt/logstash/extra_patterns"]

Pattern files are plain text with format:
[source,ruby]
    NAME PATTERN

For example:
[source,ruby]
    NUMBER \d+

The patterns are loaded when the pipeline is created.

[id="plugins-{type}s-{plugin}-patterns_files_glob"]
===== `patterns_files_glob` 

  * Value type is <<string,string>>
  * Default value is `"*"`

Glob pattern, used to select the pattern files in the directories
specified by patterns_dir

[id="plugins-{type}s-{plugin}-tag_on_failure"]
===== `tag_on_failure` 

  * Value type is <<array,array>>
  * Default value is `["_grokparsefailure"]`

Append values to the `tags` field when there has been no
successful match

[id="plugins-{type}s-{plugin}-tag_on_timeout"]
===== `tag_on_timeout` 

  * Value type is <<string,string>>
  * Default value is `"_groktimeout"`

Tag to apply if a grok regexp times out.

[id="plugins-{type}s-{plugin}-target"]
===== `target`

  * Value type is <<string,string>>
  * There is no default value for this setting

Define target namespace for placing matches.

[id="plugins-{type}s-{plugin}-timeout_millis"]
===== `timeout_millis` 

  * Value type is <<number,number>>
  * Default value is `30000`

Attempt to terminate regexps after this amount of time.
This applies per pattern if multiple patterns are applied
This will never timeout early, but may take a little longer to timeout.
Actual timeout is approximate based on a 250ms quantization.
Set to 0 to disable timeouts

[id="plugins-{type}s-{plugin}-timeout_scope"]
===== `timeout_scope`

  * Value type is <<string,string>>
  * Default value is `"pattern"`
  * Supported values are `"pattern"` and `"event"`

When multiple patterns are provided to <<plugins-{type}s-{plugin}-match>>,
the timeout has historically applied to _each_ pattern, incurring overhead
for each and every pattern that is attempted; when the grok filter is
configured with `timeout_scope => event`, the plugin instead enforces
a single timeout across all attempted matches on the event, so it can
achieve similar safeguard against runaway matchers with significantly
less overhead.

It's usually better to scope the timeout for the whole event.


[id="plugins-{type}s-{plugin}-common-options"]
include::{include_path}/{type}.asciidoc[]