README.md in fluent-plugin-bigquery-0.4.4 vs README.md in fluent-plugin-bigquery-0.5.0.beta1
- old
+ new
@@ -26,63 +26,91 @@
## Configuration
### Options
-| name | type | required? | default | description |
-| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
-| method | string | no | insert | `insert` (Streaming Insert) or `load` (load job) |
-| buffer_type | string | no | lightening (insert) or file (load) | |
-| buffer_chunk_limit | integer | no | 1MB (insert) or 1GB (load) | |
-| buffer_queue_limit | integer | no | 1024 (insert) or 32 (load) | |
-| buffer_chunk_records_limit | integer | no | 500 | |
-| flush_interval | float | no | 0.25 (*insert) or default of time sliced output (load) | |
-| try_flush_interval | float | no | 0.05 (*insert) or default of time sliced output (load) | |
-| auth_method | enum | yes | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
-| email | string | yes (private_key) | nil | GCP Service Account Email |
-| private_key_path | string | yes (private_key) | nil | GCP Private Key file path |
-| private_key_passphrase | string | yes (private_key) | nil | GCP Private Key Passphrase |
-| json_key | string | yes (json_key) | nil | GCP JSON Key file path or JSON Key string |
-| project | string | yes | nil | |
-| table | string | yes (either `tables`) | nil | |
-| tables | string | yes (either `table`) | nil | can set multi table names splitted by `,` |
-| template_suffix | string | no | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
-| auto_create_table | bool | no | false | If true, creates table automatically |
-| skip_invalid_rows | bool | no | false | Only `insert` method. |
-| max_bad_records | integer | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
-| ignore_unknown_values | bool | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
-| schema | array | yes (either `fetch_schema` or `schema_path`) | nil | Schema Definition. It is formatted by JSON. |
-| schema_path | string | yes (either `fetch_schema`) | nil | Schema Definition file path. It is formatted by JSON. |
-| fetch_schema | bool | yes (either `schema_path`) | false | If true, fetch table schema definition from Bigquery table automatically. |
-| fetch_schema_table | string | no | nil | If set, fetch table schema definition from this table, If fetch_schema is false, this param is ignored |
-| schema_cache_expire | integer | no | 600 | Value is second. If current time is after expiration interval, re-fetch table schema definition. |
-| field_string (deprecated) | string | no | nil | see examples. |
-| field_integer (deprecated) | string | no | nil | see examples. |
-| field_float (deprecated) | string | no | nil | see examples. |
-| field_boolean (deprecated) | string | no | nil | see examples. |
-| field_timestamp (deprecated) | string | no | nil | see examples. |
-| time_field | string | no | nil | If this param is set, plugin set formatted time string to this field. |
-| time_format | string | no | nil | ex. `%s`, `%Y/%m%d %H:%M:%S` |
-| replace_record_key | bool | no | false | see examples. |
-| replace_record_key_regexp{1-10} | string | no | nil | see examples. |
-| convert_hash_to_json (deprecated) | bool | no | false | If true, converts Hash value of record to JSON String. |
-| insert_id_field | string | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
-| add_insert_timestamp                   | string        | no           | nil                        | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
-| allow_retry_insert_errors | bool | no | false | Retry to insert rows when an insertErrors occurs. There is a possibility that rows are inserted in duplicate. |
-| request_timeout_sec | integer | no | nil | Bigquery API response timeout |
-| request_open_timeout_sec | integer | no | 60 | Bigquery API connection, and request timeout. If you send big data to Bigquery, set large value. |
-| time_partitioning_type | enum | no (either day) | nil | Type of bigquery time partitioning feature(experimental feature on BigQuery). |
-| time_partitioning_expiration | time | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| method | string | no | no | insert | `insert` (Streaming Insert) or `load` (load job) |
+| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
+| email | string | yes (private_key) | no | nil | GCP Service Account Email |
+| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
+| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
+| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
+| project | string | yes | yes | nil | |
+| dataset | string | yes | yes | nil | |
+| table | string | yes (either `tables`) | yes | nil | |
+| tables                                  | array(string) | yes (either `table`) | yes | nil | can set multiple table names separated by `,` |
+| template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
+| auto_create_table | bool | no | no | false | If true, creates table automatically |
+| skip_invalid_rows | bool | no | no | false | Only `insert` method. |
+| max_bad_records | integer | no | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
+| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+| schema                                  | array         | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema definition, formatted as JSON. |
+| schema_path                             | string        | yes (either `fetch_schema`) | no | nil | Path to a schema definition file, formatted as JSON. |
+| fetch_schema                            | bool          | yes (either `schema_path`) | no | false | If true, fetch the table schema definition from the BigQuery table automatically. |
+| fetch_schema_table                      | string        | no           | yes | nil | If set, fetch the table schema definition from this table. If fetch_schema is false, this param is ignored. |
+| schema_cache_expire                     | integer       | no           | no | 600 | Value in seconds. If the current time is past the expiration interval, the table schema definition is re-fetched. |
+| field_string | string | no | no | nil | see examples. |
+| field_integer | string | no | no | nil | see examples. |
+| field_float | string | no | no | nil | see examples. |
+| field_boolean | string | no | no | nil | see examples. |
+| field_timestamp | string | no | no | nil | see examples. |
+| replace_record_key | bool | no | no | false | see examples. |
+| replace_record_key_regexp{1-10} | string | no | no | nil | see examples. |
+| convert_hash_to_json | bool | no | no | false | If true, converts Hash value of record to JSON String. |
+| insert_id_field | string | no | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
+| allow_retry_insert_errors               | bool          | no           | no | false | Retry inserting rows when insertErrors occur. Note that rows may be inserted in duplicate. |
+| request_timeout_sec                     | integer       | no           | no | nil | BigQuery API response timeout |
+| request_open_timeout_sec                | integer       | no           | no | 60 | BigQuery API connection and request timeout. If you send a large amount of data to BigQuery, set a larger value. |
+| time_partitioning_type                  | enum          | no (`day` only) | no | nil | Type of BigQuery time partitioning feature (experimental feature on BigQuery). |
+| time_partitioning_expiration            | time          | no           | no | nil | Expiration milliseconds for BigQuery time partitioning. (experimental feature on BigQuery) |
-### Standard Options
+### Buffer section
+| name | type | required? | default | description |
+| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
+| @type | string | no | memory (insert) or file (load) | |
+| chunk_limit_size | integer | no | 1MB (insert) or 1GB (load) | |
+| total_limit_size | integer | no | 1GB (insert) or 32GB (load) | |
+| chunk_limit_records                     | integer       | no           | 500 (insert) or nil (load) | |
+| flush_mode | enum | no | interval | default, lazy, interval, immediate |
+| flush_interval | float | no | 0.25 (insert) or nil (load) | |
+| flush_thread_interval | float | no | 0.05 (insert) or nil (load) | |
+| flush_thread_burst_interval | float | no | 0.05 (insert) or nil (load) | |
+
+And other parameters defined by the base output plugin class are also available.
+
+See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb
+
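For illustration, a minimal buffer configuration for the `insert` method might look like the sketch below (the values are examples based on the defaults in the table above, not tuned recommendations):

```apache
<match dummy>
  @type bigquery
  method insert
  ...

  <buffer>
    @type memory             # default buffer type for insert
    chunk_limit_size 1m      # streaming inserts are limited to 1MB per request
    chunk_limit_records 500  # streaming inserts are limited to 500 rows per request
    flush_interval 0.25
    flush_thread_count 4     # example value; tune for your event rate
  </buffer>
</match>
```
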
+### Inject section
+
+It replaces `time_field` and `time_format` from previous versions.
+
+For example:
+
+```
+<inject>
+ time_key time_field_name
+ time_type string
+ time_format %Y-%m-%d %H:%M:%S
+</inject>
+```
+
| name | type | required? | default | description |
| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
-| localtime | bool | no | nil | Use localtime |
-| utc | bool | no | nil | Use utc |
+| hostname_key | string | no | nil | |
+| hostname | string | no | nil | |
+| tag_key | string | no | nil | |
+| time_key | string | no | nil | |
+| time_type | string | no | nil | |
+| time_format | string | no | nil | |
+| localtime | bool | no | true | |
+| utc | bool | no | false | |
+| timezone | string | no | nil | |
-And see http://docs.fluentd.org/articles/output-plugin-overview#time-sliced-output-parameters
+See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin_helper/inject.rb
## Examples
### Streaming inserts
@@ -101,13 +129,10 @@
project yourproject_id
dataset yourdataset_id
table tablename
- time_format %s
- time_field time
-
schema [
{"name": "time", "type": "INTEGER"},
{"name": "status", "type": "INTEGER"},
{"name": "bytes", "type": "INTEGER"},
{"name": "vhost", "type": "STRING"},
@@ -133,30 +158,28 @@
```apache
<match dummy>
@type bigquery
method insert # default
-
- flush_interval 1 # flush as frequent as possible
-
- buffer_chunk_records_limit 300 # default rate limit for users is 100
- buffer_queue_limit 10240 # 1MB * 10240 -> 10GB!
-
- num_threads 16
-
+
+ <buffer>
+    flush_interval 0.1  # flush as frequently as possible
+
+    total_limit_size 10g               # 1MB chunks x 10240 -> 10GB!
+
+ flush_thread_count 16
+ </buffer>
+
auth_method private_key # default
email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
# private_key_passphrase notasecret # default
project yourproject_id
dataset yourdataset_id
tables accesslog1,accesslog2,accesslog3
- time_format %s
- time_field time
-
schema [
{"name": "time", "type": "INTEGER"},
{"name": "status", "type": "INTEGER"},
{"name": "bytes", "type": "INTEGER"},
{"name": "vhost", "type": "STRING"},
@@ -181,27 +204,27 @@
* `tables`
* 2 or more tables are available with ',' separator
* `out_bigquery` uses these tables for Table Sharding inserts
* these must have same schema
- * `buffer_chunk_limit`
+ * `buffer/chunk_limit_size`
* max size of an insert or chunk (default 1000000 or 1MB)
* the max size is limited to 1MB on BigQuery
- * `buffer_chunk_records_limit`
+  * `buffer/chunk_limit_records`
    * number of records per streaming inserts API call is limited to 500, per insert or chunk
    * `out_bigquery` flushes the buffer with 500 records per insert API call
- * `buffer_queue_limit`
+  * `buffer/total_limit_size`
    * BigQuery streaming inserts need very small buffer chunks
    * for high-rate events, `total_limit_size` should be configured with a large value
    * Max 1GB memory may be used under network problems in the default configuration
- * `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
- * `num_threads`
+      * the default `total_limit_size` for `insert` is 1GB (about 1024 chunks of `chunk_limit_size` 1MB)
+ * `buffer/flush_thread_count`
    * threads for insert API calls in parallel
    * specify this option for 100 or more records per second
    * 10 or more threads seem good for inserts over the internet
    * fewer threads may be good for Google Compute Engine instances (with low latency to BigQuery)
- * `flush_interval`
+ * `buffer/flush_interval`
* interval between data flushes (default 0.25)
* you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later
See [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
section in the Google BigQuery document.
@@ -210,35 +233,32 @@
```apache
<match bigquery>
@type bigquery
method load
- buffer_type file
- buffer_path bigquery.*.buffer
+
+ <buffer>
+ @type file
+ path bigquery.*.buffer
flush_interval 1800
flush_at_shutdown true
- try_flush_interval 1
- utc
+ timekey_use_utc
+ </buffer>
auth_method json_key
json_key json_key_path.json
- time_format %s
- time_field time
-
project yourproject_id
dataset yourdataset_id
auto_create_table true
table yourtable%{time_slice}
schema_path bq_schema.json
</match>
```
I recommend using a file buffer and a long flush interval.
-__CAUTION: `flush_interval` default is still `0.25` even if `method` is `load` on current version.__
-
### Authentication
There are four supported methods to fetch an access token for the service account.
1. Public-Private key pair of GCP(Google Cloud Platform)'s service account
@@ -302,12 +322,10 @@
project yourproject_id
dataset yourdataset_id
table tablename
- time_format %s
- time_field time
...
</match>
```
#### Application default credentials
@@ -323,16 +341,20 @@
5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
6. If none of these conditions is true, an error will occur.
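For reference, a minimal configuration using this auth method might look like the following sketch (the project, dataset and table values are placeholders):

```apache
<match dummy>
  @type bigquery

  auth_method application_default

  project yourproject_id
  dataset yourdataset_id
  table tablename
  ...
</match>
```
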
### Table id formatting
+This plugin supports Fluentd 0.14 style placeholders.
+
#### strftime formatting
`table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
format to construct table ids.
Table ids are formatted at runtime
-using the local time of the fluentd server.
+using the chunk key time.
+See http://docs.fluentd.org/v0.14/articles/output-plugin-overview
+
For example, with the configuration below,
data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
```apache
<match dummy>
@@ -342,75 +364,58 @@
project yourproject_id
dataset yourdataset_id
table accesslog_%Y_%m
+ <buffer time>
+ timekey 1d
+ </buffer>
...
</match>
```
#### record attribute formatting
The format can be suffixed with attribute name.
-__NOTE: This feature is available only if `method` is `insert`. Because it makes performance impact. Use `%{time_slice}` instead of it.__
+__CAUTION: the format differs from previous versions__
```apache
<match dummy>
...
- table accesslog_%Y_%m@timestamp
+ table accesslog_${status_code}
+
+ <buffer status_code>
+ </buffer>
...
</match>
```
If an attribute name is given, the time used for formatting is the value of each row.
The value for the time should be a UNIX time.
#### time_slice_key formatting
-Or, the options can use `%{time_slice}` placeholder.
-`%{time_slice}` is replaced by formatted time slice key at runtime.
-```apache
-<match dummy>
- @type bigquery
+Use strftime formatting instead.
- ...
- table accesslog%{time_slice}
- ...
-</match>
-```
+In the current version, strftime formatting is based on the chunk key time,
+which behaves the same as the previous time_slice_key formatting.
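As an illustration, a table id that previously used `%{time_slice}` with `time_slice_format %Y%m%d` could be expressed with strftime and a time chunk key roughly as follows (a sketch; the table name and timekey are example values):

```apache
<match dummy>
  @type bigquery
  ...
  table accesslog_%Y%m%d

  <buffer time>
    timekey 1d
  </buffer>
  ...
</match>
```
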
-#### record attribute value formatting
-Or, `${attr_name}` placeholder is available to use value of attribute as part of table id.
-`${attr_name}` is replaced by string value of the attribute specified by `attr_name`.
-
-__NOTE: This feature is available only if `method` is `insert`.__
-
-```apache
-<match dummy>
- ...
- table accesslog_%Y_%m_${subdomain}
- ...
-</match>
-```
-
-For example value of `subdomain` attribute is `"bq.fluent"`, table id will be like "accesslog_2016_03_bqfluent".
-
-- any type of attribute is allowed because stringified value will be used as replacement.
-- acceptable characters are alphabets, digits and `_`. All other characters will be removed.
-
### Date partitioned table support
This plugin can insert (load) into date-partitioned tables.
-Use `%{time_slice}`.
+Use a placeholder.
```apache
<match dummy>
@type bigquery
...
- time_slice_format %Y%m%d
- table accesslog$%{time_slice}
+ table accesslog$%Y%m%d
+
+ <buffer time>
+ timekey 1d
+ </buffer>
...
</match>
```
However, dynamic table creation does not support date-partitioned tables yet.
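For completeness, the related `time_partitioning_*` options from the options table can also be set on the plugin; a minimal sketch (the values are examples):

```apache
<match dummy>
  @type bigquery
  ...
  time_partitioning_type day          # only `day` is supported
  time_partitioning_expiration 90d    # partition expiration (example value)
  ...
</match>
```
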
@@ -450,13 +455,10 @@
<match dummy>
@type bigquery
...
- time_format %s
- time_field time
-
schema [
{"name": "time", "type": "INTEGER"},
{"name": "status", "type": "INTEGER"},
{"name": "bytes", "type": "INTEGER"},
{"name": "vhost", "type": "STRING"},
@@ -503,14 +505,11 @@
```apache
<match dummy>
@type bigquery
...
-
- time_format %s
- time_field time
-
+
schema_path /path/to/httpd.schema
</match>
```
where /path/to/httpd.schema is a path to the JSON-encoded schema file which you used for creating the table on BigQuery. By using an external schema file you are able to write a full schema that supports NULLABLE/REQUIRED/REPEATED; this feature is really useful and adds full flexibility.
@@ -519,13 +518,10 @@
```apache
<match dummy>
@type bigquery
...
-
- time_format %s
- time_field time
-
+
fetch_schema true
# fetch_schema_table other_table # if you want to fetch schema from other table
</match>
```