README.md in fluent-plugin-bigquery-1.2.0 vs README.md in fluent-plugin-bigquery-2.0.0.beta
- old
+ new
@@ -1,15 +1,19 @@
# fluent-plugin-bigquery
+**This README is for v2.0.0.beta, which has not been released yet.**
+
[Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.
-- **Plugin type**: BufferedOutput
+- **Plugin type**: Output
* insert data over streaming inserts
+ * plugin type is `bigquery_insert`
* for continuous real-time insertions
* https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
* load data
+ * plugin type is `bigquery_load`
* for data loading as batch jobs, for big amount of data
* https://developers.google.com/bigquery/loading-data-into-bigquery
The current version of this plugin supports the Google API with Service Account Authentication, but does not support
the OAuth flow for installed applications.
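
Both plugin types take the same common options (see the tables below) and differ mainly in how data reaches BigQuery. A rough sketch of the choice (the match tags and elided option bodies here are illustrative only):

```apache
# streaming inserts: low latency, row-by-row delivery
<match dummy.realtime>
  @type bigquery_insert
  ...
</match>

# load jobs: batched delivery, better suited to large volumes
<match dummy.batch>
  @type bigquery_load
  ...
</match>
```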
@@ -29,59 +33,64 @@
## Configuration
### Options
-| name | type | required? | placeholder? | default | description |
-| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
-| method | string | no | no | insert | `insert` (Streaming Insert) or `load` (load job) |
-| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
-| email | string | yes (private_key) | no | nil | GCP Service Account Email |
-| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
-| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
-| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
-| project | string | yes | yes | nil | |
-| dataset | string | yes | yes | nil | |
-| table | string | yes (either `tables`) | yes | nil | |
-| tables | array(string) | yes (either `table`) | yes | nil | can set multi table names splitted by `,` |
-| template_suffix | string | no | yes | nil | can use `%{time_slice}` placeholder replaced by `time_slice_format` |
-| auto_create_table | bool | no | no | false | If true, creates table automatically |
-| skip_invalid_rows | bool | no | no | false | Only `insert` method. |
-| max_bad_records | integer | no | no | 0 | Only `load` method. If the number of bad records exceeds this value, an invalid error is returned in the job result. |
-| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
-| schema | array | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema Definition. It is formatted by JSON. |
-| schema_path | string | yes (either `fetch_schema`) | no | nil | Schema Definition file path. It is formatted by JSON. |
-| fetch_schema | bool | yes (either `schema_path`) | no | false | If true, fetch table schema definition from Bigquery table automatically. |
-| fetch_schema_table | string | no | yes | nil | If set, fetch table schema definition from this table, If fetch_schema is false, this param is ignored |
-| schema_cache_expire | integer | no | no | 600 | Value is second. If current time is after expiration interval, re-fetch table schema definition. |
-| insert_id_field | string | no | no | nil | Use key as `insert_id` of Streaming Insert API parameter. |
-| add_insert_timestamp | string | no | no | nil | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
-| allow_retry_insert_errors | bool | no | no | false | Retry to insert rows when an insertErrors occurs. There is a possibility that rows are inserted in duplicate. |
-| request_timeout_sec | integer | no | no | nil | Bigquery API response timeout |
-| request_open_timeout_sec | integer | no | no | 60 | Bigquery API connection, and request timeout. If you send big data to Bigquery, set large value. |
-| time_partitioning_type | enum | no (either day) | no | nil | Type of bigquery time partitioning feature(experimental feature on BigQuery). |
-| time_partitioning_expiration | time | no | no | nil | Expiration milliseconds for bigquery time partitioning. (experimental feature on BigQuery) |
+#### common
-### Deprecated
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| auth_method | enum | yes | no | private_key | `private_key` or `json_key` or `compute_engine` or `application_default` |
+| email | string | yes (private_key) | no | nil | GCP Service Account Email |
+| private_key_path | string | yes (private_key) | no | nil | GCP Private Key file path |
+| private_key_passphrase | string | yes (private_key) | no | nil | GCP Private Key Passphrase |
+| json_key | string | yes (json_key) | no | nil | GCP JSON Key file path or JSON Key string |
+| project | string | yes | yes | nil | |
+| dataset | string | yes | yes | nil | |
+| table | string | yes (either `tables`) | yes | nil | |
+| tables                                  | array(string) | yes (either `table`) | yes | nil | Can set multiple table names separated by `,` |
+| auto_create_table | bool | no | no | false | If true, creates table automatically |
+| ignore_unknown_values | bool | no | no | false | Accept rows that contain values that do not match the schema. The unknown values are ignored. |
+| schema                                  | array         | yes (either `fetch_schema` or `schema_path`) | no | nil | Schema definition. It is formatted as JSON. |
+| schema_path                             | string        | yes (either `fetch_schema`) | no | nil | Schema definition file path. It is formatted as JSON. |
+| fetch_schema                            | bool          | yes (either `schema_path`) | no | false | If true, fetch the table schema definition from the BigQuery table automatically. |
+| fetch_schema_table                      | string        | no           | yes         | nil                        | If set, fetch the table schema definition from this table. If `fetch_schema` is false, this param is ignored. |
+| schema_cache_expire                     | integer       | no           | no          | 600                        | Value in seconds. If the current time is past the expiration interval, the table schema definition is re-fetched. |
+| request_timeout_sec                     | integer       | no           | no          | nil                        | BigQuery API response timeout |
+| request_open_timeout_sec                | integer       | no           | no          | 60                         | BigQuery API connection and request timeout. If you send a large amount of data to BigQuery, set a larger value. |
+| time_partitioning_type                  | enum          | no           | no          | nil                        | Type of the BigQuery time partitioning feature (experimental feature on BigQuery); only `day` is available. |
+| time_partitioning_expiration            | time          | no           | no          | nil                        | Expiration milliseconds for BigQuery time partitioning (experimental feature on BigQuery). |
-| name | type | required? | placeholder? | default | description |
-| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
-| replace_record_key | bool | no | no | false | Use other filter plugin. |
-| replace_record_key_regexp{1-10} | string | no | no | nil | |
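
These common options are shared by both `bigquery_insert` and `bigquery_load`. A minimal sketch combining a few of them (the key path and project/dataset/table names are placeholders):

```apache
<match dummy>
  @type bigquery_insert            # or bigquery_load

  auth_method json_key
  json_key /path/to/keyfile.json   # placeholder path

  project yourproject_id
  dataset yourdataset_id
  table   tablename

  fetch_schema true                # or provide schema / schema_path instead
</match>
```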
+#### bigquery_insert
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| template_suffix                         | string        | no           | yes         | nil                        | Can use the `%{time_slice}` placeholder, which is replaced according to `time_slice_format`. |
+| skip_invalid_rows                       | bool          | no           | no          | false                      | If true, skip rows with invalid data and insert the remaining valid rows; otherwise the whole request fails when any row is invalid. |
+| insert_id_field                         | string        | no           | no          | nil                        | Use the value of this key as the `insertId` of the Streaming Insert API. Accepts the record_accessor format; see https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor |
+| add_insert_timestamp                    | string        | no           | no          | nil                        | Adds a timestamp column just before sending the rows to BigQuery, so that buffering time is not taken into account. Gives a field in BigQuery which represents the insert time of the row. |
+| allow_retry_insert_errors               | bool          | no           | no          | false                      | Retry inserting rows when an insertErrors response occurs. Rows may be inserted in duplicate. |
+
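
A sketch of how the insert-specific options above might be combined (the field names `uuid` and `insert_time` are illustrative, assuming matching record fields and table columns):

```apache
<match dummy>
  @type bigquery_insert
  ...

  skip_invalid_rows true            # skip bad rows instead of failing the whole request
  insert_id_field uuid              # assumes each record carries a "uuid" field
  add_insert_timestamp insert_time  # adds an "insert_time" column holding the insert time
  allow_retry_insert_errors false   # enabling this may insert duplicate rows
</match>
```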
+#### bigquery_load
+
+| name | type | required? | placeholder? | default | description |
+| :------------------------------------- | :------------ | :----------- | :---------- | :------------------------- | :----------------------- |
+| source_format                           | enum          | no           | no          | json                       | Specify the source format: `json`, `csv`, or `avro`. If you change this parameter, you must also change the formatter plugin via the `<format>` config section. |
+| max_bad_records | integer | no | no | 0 | If the number of bad records exceeds this value, an invalid error is returned in the job result. |
+
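For example, a load configuration that tolerates a handful of bad records might look like this sketch (with the default `json` source format, no `<format>` section is required):

```apache
<match bigquery>
  @type bigquery_load
  ...

  source_format json    # default; csv/avro require a matching <format> section
  max_bad_records 10    # the load job fails only if more than 10 rows are rejected
</match>
```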
### Buffer section
| name | type | required? | default | description |
| :------------------------------------- | :------------ | :----------- | :------------------------- | :----------------------- |
| @type | string | no | memory (insert) or file (load) | |
| chunk_limit_size | integer | no | 1MB (insert) or 1GB (load) | |
| total_limit_size | integer | no | 1GB (insert) or 32GB (load) | |
| chunk_records_limit | integer | no | 500 (insert) or nil (load) | |
| flush_mode | enum | no | interval | default, lazy, interval, immediate |
-| flush_interval | float | no | 0.25 (insert) or nil (load) | |
-| flush_thread_interval | float | no | 0.05 (insert) or nil (load) | |
-| flush_thread_burst_interval | float | no | 0.05 (insert) or nil (load) | |
+| flush_interval | float | no | 1.0 (insert) or 3600 (load) | |
+| flush_thread_interval | float | no | 0.05 (insert) or 5 (load) | |
+| flush_thread_burst_interval | float | no | 0.05 (insert) or 5 (load) | |
Other parameters (defined by the base class) are also available.
See https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/output.rb
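
For instance, a buffer section tuned for streaming inserts might look like the following sketch (the values shown are illustrative, not recommendations):

```apache
<match dummy>
  @type bigquery_insert
  ...

  <buffer>
    @type memory              # default for insert
    flush_mode interval
    flush_interval 1.0        # default for insert
    chunk_limit_size 1m       # default for insert
    total_limit_size 1g       # default for insert
    chunk_records_limit 500   # default for insert
    flush_thread_count 4
  </buffer>
</match>
```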
@@ -140,14 +149,12 @@
Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
- method insert # default
-
auth_method private_key # default
email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
# private_key_passphrase notasecret # default
@@ -179,18 +186,16 @@
For high-rate inserts over streaming inserts, you should specify flush intervals and buffer chunk options:
```apache
<match dummy>
- @type bigquery
-
- method insert # default
+ @type bigquery_insert
<buffer>
flush_interval 0.1 # flush as frequent as possible
- buffer_queue_limit 10240 # 1MB * 10240 -> 10GB!
+ total_limit_size 10g
flush_thread_count 16
</buffer>
auth_method private_key # default
@@ -254,20 +259,16 @@
section in the Google BigQuery document.
### Load
```apache
<match bigquery>
- @type bigquery
+ @type bigquery_load
- method load
-
<buffer>
- @type file
- path bigquery.*.buffer
- flush_interval 1800
- flush_at_shutdown true
- timekey_use_utc
+ path bigquery.*.buffer
+ flush_at_shutdown true
+ timekey_use_utc
</buffer>
auth_method json_key
json_key json_key_path.json
@@ -300,11 +301,11 @@
You first need to create a service account (client ID),
download its JSON key and deploy the key with fluentd.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
auth_method json_key
json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json
project yourproject_id
@@ -317,11 +318,11 @@
You can also provide `json_key` as an embedded JSON string like this.
You only need to include the `private_key` and `client_email` keys from the JSON key file.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
auth_method json_key
json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}
project yourproject_id
@@ -338,11 +339,11 @@
In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
Compute Engine instance, then you can configure fluentd like this.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
auth_method compute_engine
project yourproject_id
dataset yourdataset_id
@@ -380,11 +381,11 @@
For example, with the configuration below,
data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
project yourproject_id
dataset yourdataset_id
@@ -428,11 +429,11 @@
Use a placeholder.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
table accesslog$%Y%m%d
<buffer time>
@@ -451,11 +452,11 @@
NOTE: The `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance to use `fetch_schema`.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
auto_create_table true
table accesslog_%Y_%m
@@ -475,11 +476,11 @@
The examples above use the first method. In this method,
you can also specify nested fields by prefixing them with the names of the record fields they belong to.
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
schema [
{"name": "time", "type": "INTEGER"},
@@ -526,11 +527,11 @@
The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
schema_path /path/to/httpd.schema
</match>
@@ -539,11 +540,11 @@
The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
fetch_schema true
# fetch_schema_table other_table # if you want to fetch schema from other table
@@ -557,13 +558,15 @@
### Specifying insertId property
BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documentation).
You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
+`insert_id_field` accepts the fluentd record_accessor format, such as `$['key1'][0]['key2']`
+(see https://docs.fluentd.org/v1.0/articles/api-plugin-helper-record_accessor for details).
```apache
<match dummy>
- @type bigquery
+ @type bigquery_insert
...
insert_id_field uuid
schema [{"name": "uuid", "type": "STRING"}]
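  # Note (illustrative): insert_id_field also accepts the record_accessor syntax for nested keys,
  # e.g. insert_id_field $['data']['uuid'] if the uuid were nested under a "data" field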