README.md in fluent-plugin-bigquery-1.2.0 vs README.md in fluent-plugin-bigquery-2.0.0.beta

- old
+ new

@@ -1,15 +1,19 @@
 # fluent-plugin-bigquery

+**This README is for v2.0.0.beta, which has not been released yet.**
+
 [Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.

-- **Plugin type**: BufferedOutput
+- **Plugin type**: Output

 * insert data over streaming inserts
+  * plugin type is `bigquery_insert`
   * for continuous real-time insertions
   * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
 * load data
+  * plugin type is `bigquery_load`
   * for data loading as batch jobs, for large amounts of data
   * https://developers.google.com/bigquery/loading-data-into-bigquery

 The current version of this plugin supports Google API with Service Account Authentication, but does not support the OAuth flow for installed applications.

@@ -29,59 +33,64 @@
 ## Configuration

 ### Options

-| name | type | required? | placeholder? | default | description |
-| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
-| method | string | no | no | insert | `insert` (Streaming Insert) or `load` (load job) |
-| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
-| email | string | yes (private_key) | no | nil | GCP Service Account Email |
-| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
-| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
-| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
-| project | string | yes | yes | nil | |
-| dataset | string | yes | yes | nil | |
-| table | string | yes (either `tables`) | yes | nil | |
-| tables | array(string) | yes (either `table`) | yes | nil | can set multiple table names split by `,` |
-| template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
-| auto_create_table | bool | no | no | false | If true, creates the table automatically |
-| skip_invalid_rows | bool | no | no | false | Only `insert` method. |
-| max_bad_records | integer | no | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
-| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
-| schema | array | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema definition, formatted as JSON. |
-| schema_path | string | yes (either `fetch_schema`) | no | nil | Schema definition file path, formatted as JSON. |
-| fetch_schema | bool | yes (either `schema_path`) | no | false | If true, fetch the table schema definition from the BigQuery table automatically. |
-| fetch_schema_table | string | no | yes | nil | If set, fetch the table schema definition from this table. Ignored if `fetch_schema` is false. |
-| schema_cache_expire | integer | no | no | 600 | Value in seconds. If the current time is past the expiration interval, the table schema definition is re-fetched. |
-| insert_id_field | string | no | no | nil | Use the value of this field as the `insertId` of the Streaming Insert API. |
-| add_insert_timestamp | string | no | no | nil | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
-| allow_retry_insert_errors | bool | no | no | false | Retry inserting rows when `insertErrors` occur. Rows may be inserted in duplicate. |
-| request_timeout_sec | integer | no | no | nil | BigQuery API response timeout |
-| request_open_timeout_sec | integer | no | no | 60 | BigQuery API connection and request timeout. If you send large data to BigQuery, set a larger value. |
-| time_partitioning_type | enum | no (only `day`) | no | nil | Type of the BigQuery time partitioning feature (experimental feature on BigQuery). |
-| time_partitioning_expiration | time | no | no | nil | Expiration time in milliseconds for BigQuery time partitioning (experimental feature on BigQuery). |
+#### common

-### Deprecated
-
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
+| email | string | yes (private_key) | no | nil | GCP Service Account Email |
+| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
+| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
+| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
+| project | string | yes | yes | nil | |
+| dataset | string | yes | yes | nil | |
+| table | string | yes (either `tables`) | yes | nil | |
+| tables | array(string) | yes (either `table`) | yes | nil | can set multiple table names split by `,` |
+| auto_create_table | bool | no | no | false | If true, creates the table automatically |
+| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+| schema | array | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema definition, formatted as JSON. |
+| schema_path | string | yes (either `fetch_schema`) | no | nil | Schema definition file path, formatted as JSON. |
+| fetch_schema | bool | yes (either `schema_path`) | no | false | If true, fetch the table schema definition from the BigQuery table automatically. |
+| fetch_schema_table | string | no | yes | nil | If set, fetch the table schema definition from this table. Ignored if `fetch_schema` is false. |
+| schema_cache_expire | integer | no | no | 600 | Value in seconds. If the current time is past the expiration interval, the table schema definition is re-fetched. |
+| request_timeout_sec | integer | no | no | nil | BigQuery API response timeout |
+| request_open_timeout_sec | integer | no | no | 60 | BigQuery API connection and request timeout. If you send large data to BigQuery, set a larger value. |
+| time_partitioning_type | enum | no (only `day`) | no | nil | Type of the BigQuery time partitioning feature (experimental feature on BigQuery). |
+| time_partitioning_expiration | time | no | no | nil | Expiration time in milliseconds for BigQuery time partitioning (experimental feature on BigQuery). |

-| name | type | required? | placeholder? | default | description |
-| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
-| replace_record_key | bool | no | no | false | Use other filter plugin. |
-| replace_record_key_regexp{1-10} | string | no | no | nil | |
+#### bigquery_insert

+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
+| skip_invalid_rows | bool | no | no | false | |
+| insert_id_field | string | no | no | nil | Use the value of this field as the `insertId` of the Streaming Insert API. See https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor |
+| add_insert_timestamp | string | no | no | nil | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
+| allow_retry_insert_errors | bool | no | no | false | Retry inserting rows when `insertErrors` occur. Rows may be inserted in duplicate. |
+
+#### bigquery_load
+
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| source_format | enum | no | no | json | Specify the source format: `json`, `csv` or `avro`. If you change this parameter, you must also change the formatter plugin via the `<format>` config section. |
+| max_bad_records | integer | no | no | 0 | If the number of bad records exceeds this value, an invalid error is returned in the job result. |
+
 ### Buffer section

 | name | type | required? | default | description |
 | :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
 | @type | string | no | memory (insert) or file (load) | |
 | chunk_limit_size | integer | no | 1MB (insert) or 1GB (load) | |
 | total_limit_size | integer | no | 1GB (insert) or 32GB (load) | |
 | chunk_records_limit | integer | no | 500 (insert) or nil (load) | |
 | flush_mode | enum | no | interval | default, lazy, interval, immediate |
-| flush_interval | float | no | 0.25 (insert) or nil (load) | |
-| flush_thread_interval | float | no | 0.05 (insert) or nil (load) | |
-| flush_thread_burst_interval | float | no | 0.05 (insert) or nil (load) | |
+| flush_interval | float | no | 1.0 (insert) or 3600 (load) | |
+| flush_thread_interval | float | no | 0.05 (insert) or 5 (load) | |
+| flush_thread_burst_interval | float | no | 0.05 (insert) or 5 (load) | |

 Other parameters (defined by the base class) are also available;
 see https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb
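To make the new option layout concrete, the sketch below combines the common options with `bigquery_insert`-specific options and a `<buffer>` section. It is only an illustration of how the pieces fit together: the project, dataset, table and key path are placeholders, and the buffer values simply restate the documented defaults.

```apache
<match dummy>
  @type bigquery_insert

  # common options
  auth_method json_key
  json_key /path/to/your-service-account.json   # placeholder

  project yourproject_id
  dataset yourdataset_id
  table   yourtable_id
  fetch_schema true

  # bigquery_insert options
  skip_invalid_rows true

  <buffer>
    flush_interval 1.0      # default for insert
    total_limit_size 1g     # default for insert
  </buffer>
</match>
```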
@@ -140,14 +149,12 @@
 Configure insert specifications with target table schema, with your credentials. This is the minimum configuration:

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

-  method insert        # default
-
   auth_method private_key   # default
   email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
   private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
   # private_key_passphrase notasecret # default
@@ -179,18 +186,16 @@
 For high-rate inserts over streaming inserts, you should specify flush intervals and buffer chunk options:

 ```apache
 <match dummy>
-  @type bigquery
-
-  method insert        # default
+  @type bigquery_insert

   <buffer>
     flush_interval 0.1  # flush as frequently as possible
-    buffer_queue_limit 10240 # 1MB * 10240 -> 10GB!
+    total_limit_size 10g
     flush_thread_count 16
   </buffer>

   auth_method private_key   # default
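Read without the diff markers, the resulting v2 high-rate configuration would look roughly like the sketch below; the 10g buffer limit and 16 flush threads are the illustrative values from the example above, not recommendations.

```apache
<match dummy>
  @type bigquery_insert

  <buffer>
    flush_interval 0.1      # flush as frequently as possible
    total_limit_size 10g    # replaces the old buffer_queue_limit-based sizing
    flush_thread_count 16
  </buffer>

  auth_method private_key   # default
  ...
</match>
```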
@@ -254,20 +259,16 @@
 section in the Google BigQuery document.

 ### Load

 ```apache
 <match bigquery>
-  @type bigquery
+  @type bigquery_load

-  method load
-
   <buffer>
-    @type file
-    path bigquery.*.buffer
-    flush_interval 1800
-    flush_at_shutdown true
-    timekey_use_utc
+    path bigquery.*.buffer
+    flush_at_shutdown true
+    timekey_use_utc
   </buffer>

   auth_method json_key
   json_key json_key_path.json
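For readability, here is the load example with the diff applied, as a sketch only: the project, dataset and table values are placeholders (the hunk above does not show them), and `source_format` is included just to note its default.

```apache
<match bigquery>
  @type bigquery_load

  <buffer>
    path bigquery.*.buffer    # @type file is the default buffer for bigquery_load
    flush_at_shutdown true
    timekey_use_utc
  </buffer>

  auth_method json_key
  json_key json_key_path.json

  project yourproject_id      # placeholder
  dataset yourdataset_id      # placeholder
  table   yourtable_id        # placeholder

  source_format json          # default; `csv` or `avro` need a matching <format> section
</match>
```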
@@ -300,11 +301,11 @@
 You first need to create a service account (client ID), download its JSON key and deploy the key with fluentd.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   auth_method json_key
   json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json

   project yourproject_id
@@ -317,11 +318,11 @@
 You can also provide `json_key` as an embedded JSON string like this. You only need to include the `private_key` and `client_email` keys from the JSON key file.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   auth_method json_key
   json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}

   project yourproject_id
@@ -338,11 +339,11 @@
 In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your Compute Engine instance, then you can configure fluentd like this.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   auth_method compute_engine

   project yourproject_id
   dataset yourdataset_id
@@ -380,11 +381,11 @@
 For example, with the configuration below, data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   project yourproject_id
   dataset yourdataset_id
@@ -428,11 +429,11 @@
 Use placeholder.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...
   table accesslog$%Y%m%d

   <buffer time>
@@ -451,11 +452,11 @@
 NOTE: The `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance when using `fetch_schema`.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   auto_create_table true
   table accesslog_%Y_%m
@@ -475,11 +476,11 @@
 The examples above use the first method. In this method, you can also specify nested fields by prefixing their belonging record fields.

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   schema [
     {"name": "time", "type": "INTEGER"},
@@ -526,11 +527,11 @@
 The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   schema_path /path/to/httpd.schema
 </match>
@@ -539,11 +540,11 @@
 The third method is to set `fetch_schema` to `true` to fetch the table schema using the BigQuery API. In this case, your fluent.conf looks like:

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   fetch_schema true
   # fetch_schema_table other_table # if you want to fetch schema from other table
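When the target table name contains placeholders, the schema can be fetched from a fixed table via `fetch_schema_table`. The sketch below is only illustrative; the `accesslog` schema table name is an assumption, not part of the README.

```apache
<match dummy>
  @type bigquery_insert

  ...

  table accesslog_%Y_%m          # time-sliced target tables
  fetch_schema true
  fetch_schema_table accesslog   # assumed fixed table that holds the schema
  schema_cache_expire 600        # re-fetch the schema every 600 seconds (default)
</match>
```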
@@ -557,13 +558,15 @@
 ### Specifying insertId property

 BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documents).
 You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
+`insert_id_field` accepts the fluentd record_accessor format like `$['key1'][0]['key2']`.
+(details: https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor)

 ```apache
 <match dummy>
-  @type bigquery
+  @type bigquery_insert

   ...

   insert_id_field uuid
   schema [{"name": "uuid", "type": "STRING"}]