README.md in fluent-plugin-bigquery-0.3.4 vs README.md in fluent-plugin-bigquery-0.4.0

- old
+ new

````diff
@@ -19,51 +19,52 @@ ## Configuration

 ### Options

-| name | type | required? | default | description |
-| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
-| method | string | no | insert | `insert` (Streaming Insert) or `load` (load job) |
-| buffer_type | string | no | lightening (insert) or file (load) | |
-| buffer_chunk_limit | integer | no | 1MB (insert) or 1GB (load) | |
-| buffer_queue_limit | integer | no | 1024 (insert) or 32 (load) | |
-| buffer_chunk_records_limit | integer | no | 500 | |
-| flush_interval | float | no | 0.25 (*insert) or default of time sliced output (load) | |
-| try_flush_interval | float | no | 0.05 (*insert) or default of time sliced output (load) | |
-| auth_method | enum | yes | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
-| email | string | yes (private_key) | nil | GCP Service Account Email |
-| private_key_path | string | yes (private_key) | nil | GCP Private Key file path |
-| private_key_passphrase | string | yes (private_key) | nil | GCP Private Key Passphrase |
-| json_key | string | yes (json_key) | nil | GCP JSON Key file path or JSON Key string |
-| project | string | yes | nil | |
-| table | string | yes (either `tables`) | nil | |
-| tables | string | yes (either `table`) | nil | can set multi table names splitted by `,` |
-| template_suffix | string | no | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
-| auto_create_table | bool | no | false | If true, creates table automatically |
-| skip_invalid_rows | bool | no | false | Only `insert` method. |
-| max_bad_records | integer | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
-| ignore_unknown_values | bool | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
-| schema_path | string | yes (either `fetch_schema`) | nil | Schema Definition file path. It is formatted by JSON. |
-| fetch_schema | bool | yes (either `schema_path`) | false | If true, fetch table schema definition from Bigquery table automatically. |
-| fetch_schema_table | string | no | nil | If set, fetch table schema definition from this table, If fetch_schema is false, this param is ignored |
-| schema_cache_expire | integer | no | 600 | Value is second. If current time is after expiration interval, re-fetch table schema definition. |
-| field_string | string | no | nil | see examples. |
-| field_integer | string | no | nil | see examples. |
-| field_float | string | no | nil | see examples. |
-| field_boolean | string | no | nil | see examples. |
-| field_timestamp | string | no | nil | see examples. |
-| time_field | string | no | nil | If this param is set, plugin set formatted time string to this field. |
-| time_format | string | no | nil | ex. `%s`, `%Y/%m%d %H:%M:%S` |
-| replace_record_key | bool | no | false | see examples. |
-| replace_record_key_regexp{1-10} | string | no | nil | see examples. |
-| convert_hash_to_json | bool | no | false | If true, converts Hash value of record to JSON String. |
-| insert_id_field | string | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
-| request_timeout_sec | integer | no | nil | Bigquery API response timeout |
-| request_open_timeout_sec | integer | no | 60 | Bigquery API connection, and request timeout. If you send big data to Bigquery, set large value. |
-| time_partitioning_type | enum | no (either day) | nil | Type of bigquery time partitioning feature(experimental feature on BigQuery). |
-| time_partitioning_expiration | time | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |
+| name | type | required? | default | description |
+| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
+| method | string | no | insert | `insert` (Streaming Insert) or `load` (load job) |
+| buffer_type | string | no | lightening (insert) or file (load) | |
+| buffer_chunk_limit | integer | no | 1MB (insert) or 1GB (load) | |
+| buffer_queue_limit | integer | no | 1024 (insert) or 32 (load) | |
+| buffer_chunk_records_limit | integer | no | 500 | |
+| flush_interval | float | no | 0.25 (*insert) or default of time sliced output (load) | |
+| try_flush_interval | float | no | 0.05 (*insert) or default of time sliced output (load) | |
+| auth_method | enum | yes | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
+| email | string | yes (private_key) | nil | GCP Service Account Email |
+| private_key_path | string | yes (private_key) | nil | GCP Private Key file path |
+| private_key_passphrase | string | yes (private_key) | nil | GCP Private Key Passphrase |
+| json_key | string | yes (json_key) | nil | GCP JSON Key file path or JSON Key string |
+| project | string | yes | nil | |
+| table | string | yes (either `tables`) | nil | |
+| tables | string | yes (either `table`) | nil | can set multi table names splitted by `,` |
+| template_suffix | string | no | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
+| auto_create_table | bool | no | false | If true, creates table automatically |
+| skip_invalid_rows | bool | no | false | Only `insert` method. |
+| max_bad_records | integer | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
+| ignore_unknown_values | bool | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+| schema | array | yes (either `fetch_schema` or `schema_path`) | nil | Schema Definition. It is formatted by JSON. |
+| schema_path | string | yes (either `fetch_schema`) | nil | Schema Definition file path. It is formatted by JSON. |
+| fetch_schema | bool | yes (either `schema_path`) | false | If true, fetch table schema definition from Bigquery table automatically. |
+| fetch_schema_table | string | no | nil | If set, fetch table schema definition from this table, If fetch_schema is false, this param is ignored |
+| schema_cache_expire | integer | no | 600 | Value is second. If current time is after expiration interval, re-fetch table schema definition. |
+| field_string (deprecated) | string | no | nil | see examples. |
+| field_integer (deprecated) | string | no | nil | see examples. |
+| field_float (deprecated) | string | no | nil | see examples. |
+| field_boolean (deprecated) | string | no | nil | see examples. |
+| field_timestamp (deprecated) | string | no | nil | see examples. |
+| time_field | string | no | nil | If this param is set, plugin set formatted time string to this field. |
+| time_format | string | no | nil | ex. `%s`, `%Y/%m%d %H:%M:%S` |
+| replace_record_key | bool | no | false | see examples. |
+| replace_record_key_regexp{1-10} | string | no | nil | see examples. |
+| convert_hash_to_json (deprecated) | bool | no | false | If true, converts Hash value of record to JSON String. |
+| insert_id_field | string | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
+| request_timeout_sec | integer | no | nil | Bigquery API response timeout |
+| request_open_timeout_sec | integer | no | 60 | Bigquery API connection, and request timeout. If you send big data to Bigquery, set large value. |
+| time_partitioning_type | enum | no (either day) | nil | Type of bigquery time partitioning feature(experimental feature on BigQuery). |
+| time_partitioning_expiration | time | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |

 ### Standard Options

 | name | type | required? | default | description |
 | :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
````
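The `time_partitioning_type` and `time_partitioning_expiration` options at the bottom of the new table are not covered by any example in this diff. A minimal sketch of how they could be combined with the usual options — the values below are purely illustrative, and both options are marked experimental in the table — might look like:

```apache
<match dummy>
  @type bigquery
  # ... auth, project, dataset and table options as in the examples below ...

  # illustrative values only: `day` is the only type listed in the table above,
  # and the expiration is a time value (see the table row for its meaning)
  time_partitioning_type day
  time_partitioning_expiration 1d
</match>
```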
````diff
@@ -94,14 +95,29 @@
   table tablename
   time_format %s
   time_field time
-  field_integer time,status,bytes
-  field_string rhost,vhost,path,method,protocol,agent,referer
-  field_float requesttime
-  field_boolean bot_access,loginsession
+  schema [
+    {"name": "time", "type": "INTEGER"},
+    {"name": "status", "type": "INTEGER"},
+    {"name": "bytes", "type": "INTEGER"},
+    {"name": "vhost", "type": "STRING"},
+    {"name": "path", "type": "STRING"},
+    {"name": "method", "type": "STRING"},
+    {"name": "protocol", "type": "STRING"},
+    {"name": "agent", "type": "STRING"},
+    {"name": "referer", "type": "STRING"},
+    {"name": "remote", "type": "RECORD", "fields": [
+      {"name": "host", "type": "STRING"},
+      {"name": "ip", "type": "STRING"},
+      {"name": "user", "type": "STRING"}
+    ]},
+    {"name": "requesttime", "type": "FLOAT"},
+    {"name": "bot_access", "type": "BOOLEAN"},
+    {"name": "loginsession", "type": "BOOLEAN"}
+  ]
 </match>
 ```
````

For high rate inserts over streaming inserts, you should specify flush intervals and buffer chunk options:

````diff
@@ -128,14 +144,29 @@
   tables accesslog1,accesslog2,accesslog3
   time_format %s
   time_field time
-  field_integer time,status,bytes
-  field_string rhost,vhost,path,method,protocol,agent,referer
-  field_float requesttime
-  field_boolean bot_access,loginsession
+  schema [
+    {"name": "time", "type": "INTEGER"},
+    {"name": "status", "type": "INTEGER"},
+    {"name": "bytes", "type": "INTEGER"},
+    {"name": "vhost", "type": "STRING"},
+    {"name": "path", "type": "STRING"},
+    {"name": "method", "type": "STRING"},
+    {"name": "protocol", "type": "STRING"},
+    {"name": "agent", "type": "STRING"},
+    {"name": "referer", "type": "STRING"},
+    {"name": "remote", "type": "RECORD", "fields": [
+      {"name": "host", "type": "STRING"},
+      {"name": "ip", "type": "STRING"},
+      {"name": "user", "type": "STRING"}
+    ]},
+    {"name": "requesttime", "type": "FLOAT"},
+    {"name": "bot_access", "type": "BOOLEAN"},
+    {"name": "loginsession", "type": "BOOLEAN"}
+  ]
 </match>
 ```
````

Important options for high rate events are:

````diff
@@ -264,15 +295,11 @@
   dataset yourdataset_id
   table tablename
   time_format %s
   time_field time
-
-  field_integer time,status,bytes
-  field_string rhost,vhost,path,method,protocol,agent,referer
-  field_float requesttime
-  field_boolean bot_access,loginsession
+  ...
 </match>
 ```
````

#### Application default credentials

````diff
@@ -417,14 +444,29 @@
   ...
   time_format %s
   time_field time
-  field_integer time,response.status,response.bytes
-  field_string request.vhost,request.path,request.method,request.protocol,request.agent,request.referer,remote.host,remote.ip,remote.user
-  field_float request.time
-  field_boolean request.bot_access,request.loginsession
+  schema [
+    {"name": "time", "type": "INTEGER"},
+    {"name": "status", "type": "INTEGER"},
+    {"name": "bytes", "type": "INTEGER"},
+    {"name": "vhost", "type": "STRING"},
+    {"name": "path", "type": "STRING"},
+    {"name": "method", "type": "STRING"},
+    {"name": "protocol", "type": "STRING"},
+    {"name": "agent", "type": "STRING"},
+    {"name": "referer", "type": "STRING"},
+    {"name": "remote", "type": "RECORD", "fields": [
+      {"name": "host", "type": "STRING"},
+      {"name": "ip", "type": "STRING"},
+      {"name": "user", "type": "STRING"}
+    ]},
+    {"name": "requesttime", "type": "FLOAT"},
+    {"name": "bot_access", "type": "BOOLEAN"},
+    {"name": "loginsession", "type": "BOOLEAN"}
+  ]
 </match>
 ```
````

This schema accepts structured JSON data like:

````diff
@@ -457,14 +499,13 @@
   time_format %s
   time_field time
   schema_path /path/to/httpd.schema
-  field_integer time
 </match>
 ```

-where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery.
+where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery. By using an external schema file you are able to write a full schema that supports NULLABLE/REQUIRED/REPEATED; this feature is really useful and adds full flexibility.
````
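The diff above does not show what such a schema file contains. As a sketch only — reusing field names from the earlier examples, with the `tags` field and all `mode` values added hypothetically to illustrate NULLABLE/REQUIRED/REPEATED — `/path/to/httpd.schema` could hold BigQuery's standard schema JSON:

```json
[
  {"name": "time", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "status", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "vhost", "type": "STRING", "mode": "NULLABLE"},
  {"name": "path", "type": "STRING", "mode": "NULLABLE"},
  {"name": "remote", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "host", "type": "STRING", "mode": "NULLABLE"},
    {"name": "ip", "type": "STRING", "mode": "NULLABLE"},
    {"name": "user", "type": "STRING", "mode": "NULLABLE"}
  ]},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"}
]
```

Each entry uses the same `name`/`type` keys as the inline schema examples above, plus an optional `mode`.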
The third method is to set `fetch_schema` to `true` so that the plugin fetches the table schema using the BigQuery API. In this case, your fluent.conf looks like:

````diff
 ```apache
 <match dummy>
@@ -475,11 +516,10 @@
   time_format %s
   time_field time

   fetch_schema true
   # fetch_schema_table other_table # if you want to fetch schema from other table
-  field_integer time
 </match>
 ```
````

If you specify multiple tables in the configuration file, the plugin fetches the schema of every table from BigQuery and merges them.

````diff
@@ -496,20 +536,17 @@
   @type bigquery

   ...

   insert_id_field uuid
-  field_string uuid
+  schema [{"name": "uuid", "type": "STRING"}]
 </match>
 ```

 ## TODO

-* support optional data fields
-* support NULLABLE/REQUIRED/REPEATED field options in field list style of configuration
 * OAuth installed application credentials support
 * Google API discovery expiration
-* Error classes
 * check row size limits

 ## Authors

 * @tagomoris: First author, original version
````
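One footnote to the `insert_id_field` hunk above: with `insert_id_field uuid`, the value of the record's `uuid` key is sent as the streaming insert's `insertId`, which BigQuery uses for best-effort deduplication. A record matching the one-column schema in that example would look like this (the UUID value is made up):

```json
{"uuid": "9ABFF756-0267-4247-847F-0895B65F0938"}
```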