README.md in fluent-plugin-bigquery-0.4.4 vs README.md in fluent-plugin-bigquery-0.5.0.beta1

- old
+ new

@@ -26,63 +26,91 @@ ## Configuration

### Options

-| name | type | required? | default | description |
-| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
-| method | string | no | insert | `insert` (Streaming Insert) or `load` (load job) |
-| buffer_type | string | no | lightening (insert) or file (load) | |
-| buffer_chunk_limit | integer | no | 1MB (insert) or 1GB (load) | |
-| buffer_queue_limit | integer | no | 1024 (insert) or 32 (load) | |
-| buffer_chunk_records_limit | integer | no | 500 | |
-| flush_interval | float | no | 0.25 (*insert) or default of time sliced output (load) | |
-| try_flush_interval | float | no | 0.05 (*insert) or default of time sliced output (load) | |
-| auth_method | enum | yes | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
-| email | string | yes (private_key) | nil | GCP Service Account Email |
-| private_key_path | string | yes (private_key) | nil | GCP Private Key file path |
-| private_key_passphrase | string | yes (private_key) | nil | GCP Private Key Passphrase |
-| json_key | string | yes (json_key) | nil | GCP JSON Key file path or JSON Key string |
-| project | string | yes | nil | |
-| table | string | yes (either `tables`) | nil | |
-| tables | string | yes (either `table`) | nil | can set multiple table names split by `,` |
-| template_suffix | string | no | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
-| auto_create_table | bool | no | false | If true, creates table automatically |
-| skip_invalid_rows | bool | no | false | Only `insert` method. |
-| max_bad_records | integer | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
-| ignore_unknown_values | bool | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
-| schema | array | yes (either `fetch_schema` or `schema_path`) | nil | Schema Definition. It is formatted by JSON. |
-| schema_path | string | yes (either `fetch_schema`) | nil | Schema Definition file path. It is formatted by JSON. |
-| fetch_schema | bool | yes (either `schema_path`) | false | If true, fetch table schema definition from Bigquery table automatically. |
-| fetch_schema_table | string | no | nil | If set, fetch table schema definition from this table; if fetch_schema is false, this param is ignored |
-| schema_cache_expire | integer | no | 600 | Value is in seconds. If current time is after expiration interval, re-fetch table schema definition. |
-| field_string (deprecated) | string | no | nil | see examples. |
-| field_integer (deprecated) | string | no | nil | see examples. |
-| field_float (deprecated) | string | no | nil | see examples. |
-| field_boolean (deprecated) | string | no | nil | see examples. |
-| field_timestamp (deprecated) | string | no | nil | see examples. |
-| time_field | string | no | nil | If this param is set, plugin set formatted time string to this field. |
-| time_format | string | no | nil | ex. `%s`, `%Y/%m%d %H:%M:%S` |
-| replace_record_key | bool | no | false | see examples. |
-| replace_record_key_regexp{1-10} | string | no | nil | see examples. |
-| convert_hash_to_json (deprecated) | bool | no | false | If true, converts Hash value of record to JSON String. |
-| insert_id_field | string | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
-| add_insert_timestamp | string | no | nil | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
-| allow_retry_insert_errors | bool | no | false | Retry to insert rows when an insertErrors occurs. There is a possibility that rows are inserted in duplicate. |
-| request_timeout_sec | integer | no | nil | Bigquery API response timeout |
-| request_open_timeout_sec | integer | no | 60 | Bigquery API connection and request timeout. If you send big data to Bigquery, set a large value. |
-| time_partitioning_type | enum | no (either day) | nil | Type of bigquery time partitioning feature (experimental feature on BigQuery). |
-| time_partitioning_expiration | time | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| method | string | no | no | insert | `insert` (Streaming Insert) or `load` (load job) |
+| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
+| email | string | yes (private_key) | no | nil | GCP Service Account Email |
+| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
+| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
+| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
+| project | string | yes | yes | nil | |
+| dataset | string | yes | yes | nil | |
+| table | string | yes (either `tables`) | yes | nil | |
+| tables | array(string) | yes (either `table`) | yes | nil | can set multiple table names split by `,` |
+| template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
+| auto_create_table | bool | no | no | false | If true, creates table automatically |
+| skip_invalid_rows | bool | no | no | false | Only `insert` method. |
+| max_bad_records | integer | no | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
+| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+| schema | array | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema Definition. It is formatted by JSON. |
+| schema_path | string | yes (either `fetch_schema`) | no | nil | Schema Definition file path. It is formatted by JSON. |
+| fetch_schema | bool | yes (either `schema_path`) | no | false | If true, fetch table schema definition from Bigquery table automatically. |
+| fetch_schema_table | string | no | yes | nil | If set, fetch table schema definition from this table; if fetch_schema is false, this param is ignored |
+| schema_cache_expire | integer | no | no | 600 | Value is in seconds. If current time is after expiration interval, re-fetch table schema definition. |
+| field_string | string | no | no | nil | see examples. |
+| field_integer | string | no | no | nil | see examples. |
+| field_float | string | no | no | nil | see examples. |
+| field_boolean | string | no | no | nil | see examples. |
+| field_timestamp | string | no | no | nil | see examples. |
+| replace_record_key | bool | no | no | false | see examples. |
+| replace_record_key_regexp{1-10} | string | no | no | nil | see examples. |
+| convert_hash_to_json | bool | no | no | false | If true, converts Hash value of record to JSON String. |
+| insert_id_field | string | no | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
+| allow_retry_insert_errors | bool | no | no | false | Retry to insert rows when an insertErrors occurs. There is a possibility that rows are inserted in duplicate. |
+| request_timeout_sec | integer | no | no | nil | Bigquery API response timeout |
+| request_open_timeout_sec | integer | no | no | 60 | Bigquery API connection and request timeout. If you send big data to Bigquery, set a large value. |
+| time_partitioning_type | enum | no (either day) | no | nil | Type of bigquery time partitioning feature (experimental feature on BigQuery). |
+| time_partitioning_expiration | time | no | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |

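To make the new options concrete, here is a minimal sketch that wires the required parameters together; the key path and the project, dataset, and table names are placeholders, and any of the schema options (`schema`, `schema_path`, `fetch_schema`) could be used in place of `schema_path`:

```apache
<match dummy>
  @type bigquery

  method insert                     # Streaming Insert (default)

  auth_method json_key              # or private_key / compute_engine / application_default
  json_key /path/to/keyfile.json    # placeholder path

  project yourproject_id
  dataset yourdataset_id
  table tablename

  schema_path /path/to/schema.json  # placeholder; see the schema options above
</match>
```
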
-### Standard Options
+### Buffer section

+| name | type | required? | default | description |
+| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
+| @type | string | no | memory (insert) or file (load) | |
+| chunk_limit_size | integer | no | 1MB (insert) or 1GB (load) | |
+| total_limit_size | integer | no | 1GB (insert) or 32GB (load) | |
+| chunk_records_limit | integer | no | 500 (insert) or nil (load) | |
+| flush_mode | enum | no | interval | default, lazy, interval, immediate |
+| flush_interval | float | no | 0.25 (insert) or nil (load) | |
+| flush_thread_interval | float | no | 0.05 (insert) or nil (load) | |
+| flush_thread_burst_interval | float | no | 0.05 (insert) or nil (load) | |
+
+Other params (defined by the base class) are also available.
+
+See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb
+
+### Inject section
+
+It is a replacement for the previous version's `time_field` and `time_format`.
+
+For example:
+
+```
+<inject>
+  time_key time_field_name
+  time_type string
+  time_format %Y-%m-%d %H:%M:%S
+</inject>
+```
+
| name | type | required? | default | description |
| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
-| localtime | bool | no | nil | Use localtime |
-| utc | bool | no | nil | Use utc |
+| hostname_key | string | no | nil | |
+| hostname | string | no | nil | |
+| tag_key | string | no | nil | |
+| time_key | string | no | nil | |
+| time_type | string | no | nil | |
+| time_format | string | no | nil | |
+| localtime | bool | no | true | |
+| utc | bool | no | false | |
+| timezone | string | no | nil | |

-And see http://docs.fluentd.org/articles/output-plugin-overview#time-sliced-output-parameters
+See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin_helper/inject.rb

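As a rough sketch of how the new buffer and inject sections sit inside a match block (the parameter values below are only illustrative, not recommendations):

```apache
<match dummy>
  @type bigquery
  method insert

  <buffer>
    @type memory            # default buffer for insert
    chunk_limit_size 1m
    flush_interval 0.25
    flush_thread_count 4    # parallel insert API calls (from the base output class)
  </buffer>

  <inject>
    time_key time           # adds a formatted time field to each record
    time_type string
    time_format %s
  </inject>

  # auth, project, dataset, table and schema options as described in the Options table
</match>
```
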
## Examples

### Streaming inserts

@@ -101,13 +129,10 @@
  project yourproject_id
  dataset yourdataset_id
  table tablename
-  time_format %s
-  time_field time
-
  schema [
    {"name": "time", "type": "INTEGER"},
    {"name": "status", "type": "INTEGER"},
    {"name": "bytes", "type": "INTEGER"},
    {"name": "vhost", "type": "STRING"},

@@ -133,30 +158,28 @@
```apache
<match dummy>
  @type bigquery

  method insert    # default
-
-  flush_interval 1  # flush as frequent as possible
-
-  buffer_chunk_records_limit 300  # default rate limit for users is 100
-  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!
-
-  num_threads 16
-
+
+  <buffer>
+    flush_interval 0.1  # flush as frequent as possible
+
+    buffer_queue_limit 10240  # 1MB * 10240 -> 10GB!
+
+    flush_thread_count 16
+  </buffer>
+
  auth_method private_key   # default
  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  tables accesslog1,accesslog2,accesslog3
-  time_format %s
-  time_field time
-
  schema [
    {"name": "time", "type": "INTEGER"},
    {"name": "status", "type": "INTEGER"},
    {"name": "bytes", "type": "INTEGER"},
    {"name": "vhost", "type": "STRING"},

@@ -181,27 +204,27 @@
* `tables`
  * 2 or more tables are available with ',' separator
  * `out_bigquery` uses these tables for Table Sharding inserts
  * these must have same schema
-* `buffer_chunk_limit`
+* `buffer/chunk_limit_size`
  * max size of an insert or chunk (default 1000000 or 1MB)
  * the max size is limited to 1MB on BigQuery
-* `buffer_chunk_records_limit`
+* `buffer/chunk_records_limit`
  * number of records per streaming inserts API call is limited to 500, per insert or chunk
  * `out_bigquery` flushes buffer with 500 records for 1 inserts API call
-* `buffer_queue_limit`
+* `buffer/queue_length_limit`
  * BigQuery streaming inserts needs very small buffer chunks
  * for high-rate events, `buffer_queue_limit` should be configured with a big number
  * Max 1GB memory may be used under network problem in default configuration
-    * `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
-* `num_threads`
+    * `chunk_limit_size (default 1MB)` x `queue_length_limit (default 1024)`
+* `buffer/flush_thread_count`
  * threads for insert api calls in parallel
  * specify this option for 100 or more records per second
  * 10 or more threads seems good for inserts over internet
  * fewer threads may be good for Google Compute Engine instances (with low latency for BigQuery)
-* `flush_interval`
+* `buffer/flush_interval`
  * interval between data flushes (default 0.25)
  * you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later

See [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota) section in the Google BigQuery document.

@@ -210,35 +233,32 @@
```apache
<match bigquery>
  @type bigquery

  method load
-  buffer_type file
-  buffer_path bigquery.*.buffer
+
+  <buffer>
+    @type file
+    path bigquery.*.buffer
    flush_interval 1800
    flush_at_shutdown true
-  try_flush_interval 1
-  utc
+    timekey_use_utc
+  </buffer>

  auth_method json_key
  json_key json_key_path.json

-  time_format %s
-  time_field time
-
  project yourproject_id
  dataset yourdataset_id
  auto_create_table true
  table yourtable%{time_slice}
  schema_path bq_schema.json
</match>
```

I recommend using a file buffer and a long flush interval.

-__CAUTION: `flush_interval` default is still `0.25` even if `method` is `load` on current version.__
-
### Authentication

There are four methods supported to fetch an access token for the service account.

1. Public-Private key pair of GCP(Google Cloud Platform)'s service account

@@ -302,12 +322,10 @@
  project yourproject_id
  dataset yourdataset_id
  table tablename
-  time_format %s
-  time_field time
  ...
</match>
```

#### Application default credentials

@@ -323,16 +341,20 @@
5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
6. If none of these conditions is true, an error will occur.

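For instance, a configuration relying on application default credentials might look like the following sketch (project, dataset, and table names are placeholders):

```apache
<match dummy>
  @type bigquery

  auth_method application_default   # no key material in the config; resolved as described above

  project yourproject_id
  dataset yourdataset_id
  table tablename

  fetch_schema true
</match>
```
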
### Table id formatting

+This plugin supports fluentd-0.14 style placeholders.
+
#### strftime formatting

`table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
format to construct table ids.
Table ids are formatted at runtime
-using the local time of the fluentd server.
+using the chunk key time.
+See http://docs.fluentd.org/v0.14/articles/output-plugin-overview
+
For example, with the configuration below,
data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.

```apache
<match dummy>

@@ -342,75 +364,58 @@
  project yourproject_id
  dataset yourdataset_id
  table accesslog_%Y_%m

+  <buffer time>
+    timekey 1d
+  </buffer>
  ...
</match>
```

#### record attribute formatting

The format can be suffixed with attribute name.

-__NOTE: This feature is available only if `method` is `insert`. Because it makes performance impact. Use `%{time_slice}` instead of it.__
+__CAUTION: the format is different from the previous version__

```apache
<match dummy>
  ...
-  table accesslog_%Y_%m@timestamp
+  table accesslog_${status_code}
+
+  <buffer status_code>
+  </buffer>
  ...
</match>
```

If attribute name is given, the time to be used for formatting is the value of each row.
The value for the time should be a UNIX time.

#### time_slice_key formatting

-Or, the options can use `%{time_slice}` placeholder.
-`%{time_slice}` is replaced by formatted time slice key at runtime.
-```apache
-<match dummy>
-  @type bigquery
+Instead, use strftime formatting.
-
-  ...
-  table accesslog%{time_slice}
-  ...
-</match>
-```
+strftime formatting in the current version is based on the chunk key.
+That is the same as the previous time_slice_key formatting.

-#### record attribute value formatting
-Or, `${attr_name}` placeholder is available to use value of attribute as part of table id.
-`${attr_name}` is replaced by string value of the attribute specified by `attr_name`.
-
-__NOTE: This feature is available only if `method` is `insert`.__
-
-```apache
-<match dummy>
-  ...
-  table accesslog_%Y_%m_${subdomain}
-  ...
-</match>
-```
-
-For example value of `subdomain` attribute is `"bq.fluent"`, table id will be like "accesslog_2016_03_bqfluent".
-
-- any type of attribute is allowed because stringified value will be used as replacement.
-- acceptable characters are alphabets, digits and `_`. All other characters will be removed.
-
### Date partitioned table support

This plugin can insert (load) into a date partitioned table.
-Use `%{time_slice}`.
+Use a placeholder.

```apache
<match dummy>
  @type bigquery

  ...
-  time_slice_format %Y%m%d
-  table accesslog$%{time_slice}
+  table accesslog$%Y%m%d
+
+  <buffer time>
+    timekey 1d
+  </buffer>
  ...
</match>
```

But dynamic table creation doesn't support date partitioned tables yet.

@@ -450,13 +455,10 @@
<match dummy>
  @type bigquery

  ...
-  time_format %s
-  time_field time
-
  schema [
    {"name": "time", "type": "INTEGER"},
    {"name": "status", "type": "INTEGER"},
    {"name": "bytes", "type": "INTEGER"},
    {"name": "vhost", "type": "STRING"},

@@ -503,14 +505,11 @@
```apache
<match dummy>
  @type bigquery

  ...
-
-  time_format %s
-  time_field time
-
+
  schema_path /path/to/httpd.schema
</match>
```

where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery.
By using an external schema file you are able to write a full schema that supports NULLABLE/REQUIRED/REPEATED; this feature is really useful and adds full flexibility.

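As a sketch, such a schema file (here the `/path/to/httpd.schema` from the example above) might look like the following; the field list simply mirrors the inline schema used earlier, and `mode` is BigQuery's standard schema attribute for NULLABLE/REQUIRED/REPEATED:

```json
[
  {"name": "time",   "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "status", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "bytes",  "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "vhost",  "type": "STRING",  "mode": "NULLABLE"}
]
```
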
@@ -519,13 +518,10 @@
```apache
<match dummy>
  @type bigquery

  ...
-
-  time_format %s
-  time_field time
-
+
  fetch_schema true
  # fetch_schema_table other_table # if you want to fetch schema from other table
</match>
```