README.md in fluent-plugin-bigquery-0.2.15 vs README.md in fluent-plugin-bigquery-0.2.16
- old
+ new
@@ -3,11 +3,11 @@
[Fluentd](http://fluentd.org) output plugin to load/insert data into Google BigQuery.
* insert data over streaming inserts
* for continuous real-time insertions
* https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
-* (NOT IMPLEMENTED) load data
+* load data
  * for loading large amounts of data as batch jobs
* https://developers.google.com/bigquery/loading-data-into-bigquery
The current version of this plugin supports the Google API with Service Account Authentication, but does not support
the OAuth flow for installed applications.
@@ -18,11 +18,11 @@
Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:
```apache
<match dummy>
- type bigquery
+ @type bigquery
method insert # default
auth_method private_key # default
email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
@@ -45,11 +45,11 @@
For a high rate of inserts over streaming inserts, you should specify flush intervals and buffer chunk options:
```apache
<match dummy>
- type bigquery
+ @type bigquery
method insert # default
flush_interval 1 # flush as frequent as possible
@@ -104,10 +104,41 @@
* you can set subsecond values such as `0.15` on Fluentd v0.10.42 or later
See the [Quota policy](https://cloud.google.com/bigquery/streaming-data-into-bigquery#quota)
section in the Google BigQuery documentation.
+### Load
+```apache
+<match bigquery>
+ @type bigquery
+
+ method load
+ buffer_type file
+ buffer_path bigquery.*.buffer
+ flush_interval 1800
+ flush_at_shutdown true
+ try_flush_interval 1
+ utc
+
+ auth_method json_key
+ json_key json_key_path.json
+
+ time_format %s
+ time_field time
+
+ project yourproject_id
+ dataset yourdataset_id
+ auto_create_table true
+ table yourtable%{time_slice}
+ schema_path bq_schema.json
+</match>
+```
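+
+The file referenced by `schema_path` is a BigQuery table schema definition in JSON (an array of `name`/`type` field objects, optionally with `mode` and nested `fields`). As a rough sketch, with illustrative field names that are not required by the plugin, `bq_schema.json` could look like:
+
+```json
+[
+  {"name": "time",   "type": "INTEGER"},
+  {"name": "status", "type": "INTEGER"},
+  {"name": "path",   "type": "STRING"},
+  {"name": "bytes",  "type": "INTEGER"}
+]
+```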
+
+I recommend using a file buffer and a long flush interval for load jobs.
+
+__CAUTION: the default `flush_interval` is still `0.25` even when `method` is `load` in the current version, so set it explicitly.__
+
### Authentication
There are two methods supported to fetch an access token for the service account.
1. Public-Private key pair of GCP(Google Cloud Platform)'s service account
@@ -125,11 +156,11 @@
You first need to create a service account (client ID),
download its JSON key and deploy the key with fluentd.
```apache
<match dummy>
- type bigquery
+ @type bigquery
auth_method json_key
json_key /home/username/.keys/00000000000000000000000000000000-jsonkey.json
project yourproject_id
@@ -142,11 +173,11 @@
You can also provide `json_key` as an embedded JSON string like this.
You only need to include the `private_key` and `client_email` keys from the JSON key file.
```apache
<match dummy>
- type bigquery
+ @type bigquery
auth_method json_key
json_key {"private_key": "-----BEGIN PRIVATE KEY-----\n...", "client_email": "xxx@developer.gserviceaccount.com"}
project yourproject_id
@@ -163,11 +194,11 @@
In this authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your
Compute Engine instance, then you can configure fluentd like this.
```apache
<match dummy>
- type bigquery
+ @type bigquery
auth_method compute_engine
project yourproject_id
dataset yourdataset_id
@@ -196,21 +227,22 @@
5. If you are running in Google Compute Engine production, the built-in service account associated with the virtual machine instance will be used.
6. If none of these conditions is true, an error will occur.
### Table id formatting
+#### strftime formatting
The `table` and `tables` options accept [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime)
format to construct table ids.
Table ids are formatted at runtime
using the local time of the fluentd server.
For example, with the configuration below,
data is inserted into tables `accesslog_2014_08`, `accesslog_2014_09` and so on.
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
project yourproject_id
dataset yourdataset_id
@@ -218,12 +250,15 @@
...
</match>
```
+#### record attribute formatting
The format can be suffixed with an attribute name.
+__NOTE: This feature is available only if `method` is `insert`, because it has a performance impact. Use `%{time_slice}` instead where possible.__
+
```apache
<match dummy>
...
table accesslog_%Y_%m@timestamp
...
@@ -231,37 +266,53 @@
```
If an attribute name is given, the time used for formatting is the value of that attribute in each row.
The value should be a UNIX time.
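
For example, with `table accesslog_%Y_%m@timestamp`, a record like the one below (a hypothetical row; `1408000000` is a UNIX time in August 2014) would be routed to the table `accesslog_2014_08`:

```json
{"timestamp": 1408000000, "status": 200, "path": "/index.html"}
```
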
+#### time_slice_key formatting
Or, the options can use the `%{time_slice}` placeholder.
`%{time_slice}` is replaced by the formatted time slice key at runtime.
```apache
<match dummy>
- type bigquery
-
+ @type bigquery
+
...
-
- project yourproject_id
- dataset yourdataset_id
table accesslog%{time_slice}
-
...
</match>
```
+#### record attribute value formatting
+Or, the `${attr_name}` placeholder is available to use the value of an attribute as part of the table id.
+`${attr_name}` is replaced by the string value of the attribute specified by `attr_name`.
+
+__NOTE: This feature is available only if `method` is `insert`.__
+
+```apache
+<match dummy>
+ ...
+ table accesslog_%Y_%m_${subdomain}
+ ...
+</match>
+```
+
+For example, if the value of the `subdomain` attribute is `"bq.fluent"`, the table id will be like `accesslog_2016_03_bqfluent`.
+
+- Any type of attribute is allowed, because the stringified value is used as the replacement.
+- Acceptable characters are alphabets, digits, and `_`; all other characters are removed.
+
### Dynamic table creation
When `auto_create_table` is set to `true`, the plugin tries to create the table using the BigQuery API when an insertion fails with code=404 "Not Found: Table ...".
The next retry of the insertion is then expected to succeed.
NOTE: The `auto_create_table` option cannot be used with `fetch_schema`. You should create the table in advance to use `fetch_schema`.
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
auto_create_table true
table accesslog_%Y_%m
@@ -281,11 +332,11 @@
The examples above use the first method. In this method,
you can also specify nested fields by prefixing them with the name of their parent record field.
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
time_format %s
time_field time
@@ -320,11 +371,11 @@
The second method is to specify a path to a BigQuery schema file instead of listing fields. In this case, your fluent.conf looks like:
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
time_format %s
time_field time
@@ -337,11 +388,11 @@
The third method is to set `fetch_schema` to `true` to fetch the table schema using the BigQuery API. In this case, your fluent.conf looks like:
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
time_format %s
time_field time
@@ -361,10 +412,10 @@
BigQuery uses the `insertId` property to detect duplicate insertion requests (see [data consistency](https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency) in the Google BigQuery documents).
You can set the `insert_id_field` option to specify the field to use as the `insertId` property.
```apache
<match dummy>
- type bigquery
+ @type bigquery
...
insert_id_field uuid
field_string uuid