README.md in fluent-plugin-bigquery-0.2.15 vs README.md in fluent-plugin-bigquery-0.2.16

`-` old (0.2.15)
`+` new (0.2.16)

@@ -3,11 +3,11 @@
 [Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.

 * insert data over streaming inserts
   * for continuous real-time insertions
   * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
-* (NOT IMPLEMENTED) load data
+* load data
   * for data loading as batch jobs, for big amounts of data
   * https://developers.google.com/bigquery/loading-data-into-bigquery

 The current version of this plugin supports the Google API with Service Account Authentication, but does not support the OAuth flow for installed applications.

@@ -18,11 +18,11 @@
 Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   method insert    # default

   auth_method private_key   # default
   email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com

@@ -45,11 +45,11 @@
 For high-rate streaming inserts, you should specify flush intervals and buffer chunk options:

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   method insert    # default

   flush_interval 1  # flush as frequently as possible

@@ -104,10 +104,41 @@
   * you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later

 See the [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota) section in the Google BigQuery documentation.

+### Load
+```apache
+<match bigquery>
+  @type bigquery
+
+  method load
+  buffer_type file
+  buffer_path bigquery.*.buffer
+  flush_interval 1800
+  flush_at_shutdown true
+  try_flush_interval 1
+  utc
+
+  auth_method json_key
+  json_key json_key_path.json
+
+  time_format %s
+  time_field time
+
+  project yourproject_id
+  dataset yourdataset_id
+  auto_create_table true
+  table yourtable%{time_slice}
+  schema_path bq_schema.json
+</match>
+```
+
+I recommend using a file buffer and a long flush interval.
+
+__CAUTION: the default `flush_interval` is still `0.25` in the current version, even when `method` is `load`.__
+
 ### Authentication

 There are two supported methods for fetching an access token for the service account.

 1. Public-Private key pair of GCP (Google Cloud Platform)'s service account

@@ -125,11 +156,11 @@
 You first need to create a service account (client ID), download its JSON key and deploy the key with fluentd.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   auth_method json_key
   json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json

   project yourproject_id

@@ -142,11 +173,11 @@
 You can also provide `json_key` as an embedded JSON string, like this. You only need to include the `private_key` and `client_email` keys from the JSON key file.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   auth_method json_key
   json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}

   project yourproject_id

@@ -163,11 +194,11 @@
 In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your Compute Engine instance; then you can configure fluentd like this.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   auth_method compute_engine

   project yourproject_id
   dataset yourdataset_id
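
> Editor's note (illustrative, not part of the diff): the new Load example above points `schema_path` at a `bq_schema.json` file. That file is expected to contain a standard BigQuery table schema definition, i.e. a JSON array of field descriptors. The field names below are assumptions chosen only for illustration.

```json
[
  {"name": "time",   "type": "TIMESTAMP"},
  {"name": "status", "type": "INTEGER"},
  {"name": "path",   "type": "STRING"},
  {"name": "agent",  "type": "STRING"}
]
```

> Nested fields can be expressed in the same format with `"type": "RECORD"` and a nested `"fields"` array.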
@@ -196,21 +227,22 @@
 5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
 6. If none of these conditions is true, an error will occur.

 ### Table id formatting

+#### strftime formatting
 The `table` and `tables` options accept a [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime) format to construct table ids.
 Table ids are formatted at runtime using the local time of the fluentd server.

 For example, with the configuration below, data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   project yourproject_id
   dataset yourdataset_id

@@ -218,12 +250,15 @@
   ...
 </match>
 ```

+#### record attribute formatting
 The format can be suffixed with an attribute name.

+__NOTE: This feature is available only if `method` is `insert`, because it has a performance impact. Use `%{time_slice}` instead.__
+
 ```apache
 <match dummy>
   ...
   table accesslog_%Y_%m@timestamp
   ...

@@ -231,37 +266,53 @@
 ```

 If an attribute name is given, the time used for formatting is the value in each row. That value should be a UNIX time.

+#### time_slice_key formatting
 Or, the options can use the `%{time_slice}` placeholder.
 `%{time_slice}` is replaced by the formatted time slice key at runtime.

 ```apache
 <match dummy>
-  type bigquery
-
+  @type bigquery
+  ...
-
-  project yourproject_id
-  dataset yourdataset_id
   table accesslog%{time_slice}
-  ...
 </match>
 ```

+#### record attribute value formatting
+Or, the `${attr_name}` placeholder can be used to embed an attribute's value in the table id.
+`${attr_name}` is replaced by the string value of the attribute specified by `attr_name`.
+
+__NOTE: This feature is available only if `method` is `insert`.__
+
+```apache
+<match dummy>
+  ...
+  table accesslog_%Y_%m_${subdomain}
+  ...
+</match>
+```
+
+For example, if the value of the `subdomain` attribute is `"bq.fluent"`, the table id will be something like "accesslog_2016_03_bqfluent".
+
+- any attribute type is allowed, because the stringified value is used as the replacement.
+- acceptable characters are letters, digits and `_`; all other characters are removed.
+
 ### Dynamic table creating

 When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
 The next insertion retry is expected to succeed.

 NOTE: the `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance to use `fetch_schema`.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   auto_create_table true
   table accesslog_%Y_%m

@@ -281,11 +332,11 @@
 The examples above use the first method. In this method, you can also specify nested fields by prefixing them with the record field they belong to.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   time_format %s
   time_field time

@@ -320,11 +371,11 @@
 The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   time_format %s
   time_field time

@@ -337,11 +388,11 @@
 The third method is to set `fetch_schema` to `true` to fetch the schema using the BigQuery API. In this case, your fluent.conf looks like:

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   time_format %s
   time_field time
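
> Editor's note (illustrative, not part of the diff): the hunks for the first schema method above are cut off before the actual field list. For orientation, listing fields with the first method looks roughly like the sketch below. `field_string` appears later in this diff; the other `field_<type>` option names and the field names are recalled from the plugin's README and should be treated as assumptions.

```apache
<match dummy>
  @type bigquery
  ...
  time_format %s
  time_field  time

  # one option per BigQuery column type, each taking a comma-separated list of fields
  field_integer time,status,bytes
  field_string  vhost,path,method,protocol,agent,referer
  field_float   requesttime
  field_boolean bot_access,loginsession
</match>
```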
@@ -361,10 +412,10 @@
 BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documents).
 You can set the `insert_id_field` option to specify the field to use as the `insertId` property.

 ```apache
 <match dummy>
-  type bigquery
+  @type bigquery

   ...

   insert_id_field uuid
   field_string uuid
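
> Editor's note (illustrative, not part of the diff): with `insert_id_field uuid` as configured above, the value of each record's `uuid` attribute is sent as that row's `insertId`, which BigQuery uses to drop duplicates when an insert is retried. Field names and values below are made up for illustration; a record like this would be deduplicated against any other insert carrying the same `uuid`:

```json
{
  "time": 1457959200,
  "uuid": "0c9f2a4e-5b7d-4e1a-9c3b-8f6d2e7a1b45",
  "path": "/index.html",
  "status": 200
}
```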