README.md in embulk-output-bigquery-0.2.3 vs README.md in embulk-output-bigquery-0.3.0.pre1
- old
+ new
@@ -1,6 +1,5 @@
-
# embulk-output-bigquery
[Embulk](https://github.com/embulk/embulk/) output plugin to load/insert data into [Google BigQuery](https://cloud.google.com/bigquery/) using [direct insert](https://cloud.google.com/bigquery/loading-data-into-bigquery#loaddatapostrequest)
## Overview
@@ -24,39 +23,65 @@
## Configuration
#### Original options
-| name | type | required? | default | description |
-|:--------------------------|:------------|:-----------|:-------------|:-----------------------|
-| mode | string | optional | "append" | [See below](#mode) |
-| auth_method | string | optional | "private_key" | `private_key` , `json_key` or `compute_engine`
-| service_account_email | string | required when auth_method is private_key | | Your Google service account email
-| p12_keyfile | string | required when auth_method is private_key | | Fullpath of private key in P12(PKCS12) format |
-| json_keyfile | string | required when auth_method is json_key | | Fullpath of json key |
-| sequence_format | string | optional | %03d.%02d | |
-| file_ext | string | optional | | e.g. ".csv.gz" ".json.gz" |
-| project | string | required | | project_id |
-| dataset | string | required | | dataset |
-| table | string | required | | table name |
-| auto_create_table | boolean | optional | 0 | [See below](#dynamic-table-creating) |
-| schema_file | string | optional | | /path/to/schema.json |
-| template_table | string | optional | | existing_table_name [See below](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | 0 | [See below](#data-consistency) |
-| delete_from_local_when_job_end | boolean | optional | 0 | If set to true, delete local file when job is end |
-| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
-| job_status_max_polling_time | int | optional | 10 sec | Job status polling interval |
-| is_skip_job_result_check | boolean | optional | 0 | |
-| application_name | string | optional | "Embulk BigQuery plugin" | Anything you like |
+| name | type | required? | default | description |
+|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
+| mode | string | optional | "append" | [See below](#mode) |
+| auth_method                           | string      | optional   | "private_key"            | `private_key`, `json_key` or `compute_engine` |
+| service_account_email                 | string      | required when auth_method is private_key | | Your Google service account email |
+| p12_keyfile | string | required when auth_method is private_key | | Fullpath of private key in P12(PKCS12) format |
+| json_keyfile | string | required when auth_method is json_key | | Fullpath of json key |
+| project | string | required if json_keyfile is not given | | project_id |
+| dataset | string | required | | dataset |
+| table | string | required | | table name |
+| auto_create_dataset | boolean | optional | false | automatically create dataset |
+| auto_create_table | boolean | optional | false | [See below](#dynamic-table-creating) |
+| schema_file | string | optional | | /path/to/schema.json |
+| template_table | string | optional | | template table name [See below](#dynamic-table-creating) |
+| prevent_duplicate_insert | boolean | optional | false | [See below](#data-consistency) |
+| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
+| job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
+| is_skip_job_result_check              | boolean     | optional   | false                    | Skip waiting for the load job to finish. Available in append or delete_in_advance mode |
+| with_rehearsal                        | boolean     | optional   | false                    | Load `rehearsal_counts` records as a rehearsal. The rehearsal loads into a temporary REHEARSAL table, which is deleted at the end. Use this option to detect data errors as early as possible (see the sketch after this table) |
+| rehearsal_counts                      | integer     | optional   | 1000                     | Number of records to load in a rehearsal |
+| column_options | hash | optional | | [See below](#column-options) |
+| default_timezone | string | optional | UTC | |
+| default_timestamp_format | string | optional | %Y-%m-%d %H:%M:%S.%6N | |
+| payload_column | string | optional | nil | [See below](#formatter-performance-issue) |
+| payload_column_index | integer | optional | nil | [See below](#formatter-performance-issue) |
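+
+For example, a rehearsal run could be configured like this (a minimal sketch; the record count is illustrative):
+
+```yaml
+out:
+  type: bigquery
+  with_rehearsal: true
+  rehearsal_counts: 1000
+```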
+
+Client or request options
+
+| name | type | required? | default | description |
+|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
+| timeout_sec | integer | optional | 300 | Seconds to wait for one block to be read |
+| open_timeout_sec | integer | optional | 300 | Seconds to wait for the connection to open |
+| retries | integer | optional | 5 | Number of retries |
+| application_name | string | optional | "Embulk BigQuery plugin" | User-Agent |
+
+Options for intermediate local files
+
+| name | type | required? | default | description |
+|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
+| path_prefix                           | string      | optional   |                          | Path prefix of local files such as "/tmp/prefix_". Default is randomly generated with [tempfile](http://ruby-doc.org/stdlib-2.2.3/libdoc/tempfile/rdoc/Tempfile.html) |
+| sequence_format                       | string      | optional   | .%d.%03d                 | Sequence format for pid, task index |
+| file_ext                              | string      | optional   |                          | The file extension of local files such as ".csv.gz" ".json.gz". Default is automatically generated from `source_format` and `compression` |
+| skip_file_generation                  | boolean     | optional   |                          | Load already-generated local files into BigQuery if available. Specify the correct path_prefix and file_ext |
+| delete_from_local_when_job_end        | boolean     | optional   | false                    | If set to true, delete the local file when the job ends |
+| compression | string | optional | "NONE" | Compression of local files (`GZIP` or `NONE`) |
+
+`source_format` is also used to determine the formatter (csv or jsonl).
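+
+For example, a sketch combining these local-file options (the prefix and extension are illustrative):
+
+```yaml
+out:
+  type: bigquery
+  path_prefix: /tmp/embulk_bigquery_
+  file_ext: .jsonl.gz
+  compression: GZIP
+  source_format: NEWLINE_DELIMITED_JSON
+  delete_from_local_when_job_end: true
+```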
+
#### Same options as bq command-line tools or BigQuery job's property
The following options are the same as those of [bq command-line tools](https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile) or BigQuery [job's property](https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource).
| name | type | required? | default | description |
|:--------------------------|:------------|:-----------|:-------------|:-----------------------|
-| source_format | string | required | "CSV" | File type (`NEWLINE_DELIMITED_JSON` or `CSV`) |
+| source_format | string | required | "CSV" | File type (`NEWLINE_DELIMITED_JSON` or `CSV`) |
| max_bad_records | int | optional | 0 | |
| field_delimiter | char | optional | "," | |
| encoding | string | optional | "UTF-8" | `UTF-8` or `ISO-8859-1` |
| ignore_unknown_values | boolean | optional | 0 | |
| allow_quoted_newlines | boolean | optional | 0 | Set to true if data contains newline characters. It may cause slow processing |
@@ -68,53 +93,51 @@
type: bigquery
mode: append
auth_method: private_key # default
service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
p12_keyfile: /path/to/p12_keyfile.p12
- path_prefix: /path/to/output
- file_ext: csv.gz
- source_format: CSV
project: your-project-000
dataset: your_dataset_name
table: your_table_name
- formatter:
- type: csv
- header_line: false
- encoders:
- - {type: gzip}
+ compression: GZIP
+ source_format: NEWLINE_DELIMITED_JSON
```
### mode
-4 modes are provided.
+5 modes are provided.
##### append
-default. When append mode, plugin will insert data into existing table.
+1. Load to temporary table.
+2. Copy temporary table to destination table. (WRITE_APPEND)
+##### append_direct
+
+Insert data into existing table directly.
+This is not transactional, i.e., if the load fails, the target table may be left with some rows already inserted.
+
##### replace
1. Load to temporary table.
2. Copy temporary table to destination table. (WRITE_TRUNCATE)
```is_skip_job_result_check``` must be false in replace mode
##### replace_backup
1. Load to temporary table.
-2. Copy destination table to backup table. (table_name_old)
+2. Copy destination table to backup table. (dataset_old, table_old)
3. Copy temporary table to destination table. (WRITE_TRUNCATE)
```is_skip_job_result_check``` must be false in replace_backup mode.
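
A sketch of a replace_backup configuration, assuming the backup destination can be set with `dataset_old` and `table_old` options (names taken from the copy step above; all values are placeholders):

```yaml
out:
  type: bigquery
  mode: replace_backup
  dataset: your_dataset_name
  table: your_table_name
  dataset_old: your_dataset_name_backup  # assumed option name, see the copy step above
  table_old: your_table_name_backup      # assumed option name
```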
##### delete_in_advance
1. Delete destination table, if it exists.
2. Load to destination table.
-```auto_create_table``` must be true when delete_in_advance mode.
-
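
A minimal delete_in_advance sketch (dataset and table names are placeholders):

```yaml
out:
  type: bigquery
  mode: delete_in_advance
  dataset: your_dataset_name
  table: your_table_name
```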
### Authentication
There are three supported methods to fetch an access token for the service account.
1. Public-Private key pair of GCP (Google Cloud Platform)'s service account
@@ -194,11 +217,11 @@
When `auto_create_table` is set to true, the plugin tries to create the table using the BigQuery API.
If the table already exists, it inserts into it.
-There are 2 ways to set schema.
+There are 3 ways to set schema.
#### Set schema.json
Please set the file path of schema.json.
@@ -220,10 +243,82 @@
auto_create_table: true
table: table_%Y_%m
template_table: existing_table_name
```
+#### Guess from Embulk Schema
+
+The plugin will try to guess the BigQuery schema from the Embulk schema. It is also configurable with `column_options`. See [Column Options](#column-options).
+
+### Column Options
+
+Column options are used to aid guessing the BigQuery schema, or to define conversions of values:
+
+- **column_options**: advanced: an array of options for columns
+ - **name**: column name
+  - **type**: BigQuery type such as `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, and `RECORD`. See below for the supported conversion types.
+ - boolean: `BOOLEAN`, `STRING` (default: `BOOLEAN`)
+ - long: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `INTEGER`)
+ - double: `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `FLOAT`)
+ - string: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `RECORD` (default: `STRING`)
+ - timestamp: `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `TIMESTAMP`)
+ - json: `STRING`, `RECORD` (default: `STRING`)
+ - **mode**: BigQuery mode such as `NULLABLE`, `REQUIRED`, and `REPEATED` (string, default: `NULLABLE`)
+  - **fields**: Describes the nested schema fields if the type property is set to RECORD. Please note that this is **required** for a `RECORD` column.
+ - **timestamp_format**: timestamp format to convert into/from `timestamp` (string, default is `default_timestamp_format`)
+ - **timezone**: timezone to convert into/from `timestamp` (string, default is `default_timezone`).
+- **default_timestamp_format**: default timestamp format for column_options (string, default is "%Y-%m-%d %H:%M:%S.%6N")
+- **default_timezone**: default timezone for column_options (string, default is "UTC")
+
+Example)
+
+```yaml
+out:
+ type: bigquery
+ auto_create_table: true
+ column_options:
+  - {name: date, type: STRING, timestamp_format: "%Y-%m-%d", timezone: "Asia/Tokyo"}
+ - name: json_column
+ type: RECORD
+ fields:
+ - {name: key1, type: STRING}
+ - {name: key2, type: STRING}
+```
+
+NOTE: Type conversion is done in this JRuby plugin and could be slow. See [Formatter Performance Issue](#formatter-performance-issue) to improve performance.
+
+### Formatter Performance Issue
+
+embulk-output-bigquery supports formatting records into CSV or JSON (and also formatting timestamp columns).
+However, this plugin is written in JRuby, and JRuby plugins are generally slower than Java plugins.
+
+Therefore, it is recommended to format records with filter plugins written in Java such as [embulk-filter-to_json](https://github.com/civitaspo/embulk-filter-to_json) as:
+
+```yaml
+filters:
+ - type: to_json
+ column: {name: payload, type: string}
+  default_format: "%Y-%m-%d %H:%M:%S.%6N"
+out:
+ type: bigquery
+ payload_column_index: 0 # or, payload_column: payload
+```
+
+Furthermore, if your files are already jsonl or csv files, you can even skip the parser with [embulk-parser-none](https://github.com/sonots/embulk-parser-none) as:
+
+```yaml
+in:
+ type: file
+ path_prefix: example/example.jsonl
+ parser:
+ type: none
+ column_name: payload
+out:
+ type: bigquery
+ payload_column_index: 0 # or, payload_column: payload
+```
+
### Data Consistency
When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generates a job ID from the md5 hash of the file and other options to prevent duplicate data insertion.
`job ID = md5(md5(file) + dataset + table + schema + source_format + file_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`
@@ -236,10 +331,41 @@
out:
type: bigquery
prevent_duplicate_insert: true
```
-## Build
+## Development
+### Run example:
+
+Prepare a json\_keyfile at /tmp/your-project-000.json, then
+
```
-$ ./gradlew gem
+$ embulk bundle install --path vendor/bundle
+$ embulk run -X page_size=1 -b . -l trace example/example.yml
```
+
+### Run test:
+
+```
+$ bundle exec rake test
+```
+
+To run tests which actually connect to BigQuery, such as test/test\_bigquery\_client.rb,
+prepare a json\_keyfile at /tmp/your-project-000.json, then
+
+```
+$ CONNECT=1 bundle exec ruby test/test_bigquery_client.rb
+$ CONNECT=1 bundle exec ruby test/test_example.rb
+```
+
+### Release gem:
+
+Fix gemspec, then
+
+```
+$ bundle exec rake release
+```
+
+## ChangeLog
+
+[CHANGELOG.md](CHANGELOG.md)