README.md in embulk-output-bigquery-0.4.14 vs README.md in embulk-output-bigquery-0.5.0
- old
+ new
@@ -21,18 +21,10 @@
* https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
The current version of this plugin supports the Google API with Service Account Authentication, but does not support the
OAuth flow for installed applications.
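As a minimal sketch of service-account configuration (the `auth_method` value names have changed between plugin versions, so verify both option names against the configuration table of your installed version):

```yaml
out:
  type: bigquery
  auth_method: json_key   # value name may differ by plugin version
  json_keyfile: /path/to/service_account_credential.json
  project: your-project-id
  dataset: your_dataset
  table: your_table
```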
-### INCOMPATIBILITY CHANGES
-
-v0.3.x has incompatibility changes with v0.2.x. Please see [CHANGELOG.md](CHANGELOG.md) for details.
-
-* `formatter` option (formatter plugin support) is dropped. Use `source_format` option instead. (it already exists in v0.2.x too)
-* `encoders` option (encoder plugin support) is dropped. Use `compression` option instead (it already exists in v0.2.x too).
-* `mode: append` mode now expresses a transactional append, and `mode: append_direct` is one which is not transactional.
-
## Configuration
#### Original options
| name | type | required? | default | description |
@@ -45,14 +37,13 @@
| project | string | required if json_keyfile is not given | | project_id |
| dataset | string | required | | dataset |
| location | string | optional | nil | geographic location of dataset. See [Location](#location) |
| table | string | required | | table name, or table name with a partition decorator such as `table_name$20160929`|
| auto_create_dataset | boolean | optional | false | automatically create dataset |
-| auto_create_table | boolean | optional | false | See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
+| auto_create_table | boolean | optional | true | `false` is available only for `append_direct` mode. Other modes require `true`. See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
| schema_file | string | optional | | /path/to/schema.json |
| template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
| job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
| is_skip_job_result_check | boolean | optional | false | Skip waiting until the load job finishes. Available for `append` or `delete_in_advance` mode |
| with_rehearsal | boolean | optional | false | Load `rehearsal_counts` records as a rehearsal. The rehearsal loads into a REHEARSAL temporary table, which is deleted at the end. You may use this option to investigate data errors at as early a stage as possible |
| rehearsal_counts | integer | optional | 1000 | Specify number of records to load in a rehearsal |
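The rehearsal options above can be combined as, for example (a minimal sketch using only options from the table):

```yaml
out:
  type: bigquery
  with_rehearsal: true     # load a sample into a REHEARSAL temporary table first
  rehearsal_counts: 1000   # number of records to load in the rehearsal
```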
@@ -105,11 +96,10 @@
| allow_quoted_newlines | boolean | optional | false | Set true if data contains newline characters. It may slow processing |
| time_partitioning | hash | optional | `{"type":"DAY"}` if `table` parameter has a partition decorator, otherwise nil | See [Time Partitioning](#time-partitioning) |
| time_partitioning.type | string | required | nil | The only type supported is DAY, which will generate one partition per day based on data loading time. |
| time_partitioning.expiration_ms | int | optional | nil | Number of milliseconds for which to keep the storage for a partition. |
| time_partitioning.field | string | optional | nil | `DATE` or `TIMESTAMP` column used for partitioning |
-| time_partitioning.require_partition_filter | boolean | optional | nil | If true, valid partition filter is required when query |
| clustering | hash | optional | nil | Currently, clustering is supported for partitioned tables, so must be used with `time_partitioning` option. See [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables) |
| clustering.fields | array | required | nil | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
| schema_update_options | array | optional | nil | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE on the current status: `schema_update_options` does not work for `copy` jobs, so it is not effective for most modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes the origin table, so it does not need to update the schema. Only `append_direct` can utilize schema update. |
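For example, to let an `append_direct` load add new columns and relax existing ones (a sketch using only the options documented above):

```yaml
out:
  type: bigquery
  mode: append_direct
  schema_update_options:
    - ALLOW_FIELD_ADDITION
    - ALLOW_FIELD_RELAXATION
```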
### Example
@@ -250,15 +240,10 @@
table: table_%Y_%m
```
### Dynamic table creating
-This plugin tries to create a table using BigQuery API when
-
-* mode is either of `delete_in_advance`, `replace`, `replace_backup`, `append`.
-* mode is `append_direct` and `auto_create_table` is true.
-
There are 3 ways to set the schema.
#### Set schema.json
Please set the file path of schema.json.
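A minimal sketch of the `schema_file` configuration (the path is illustrative; the file itself should follow BigQuery's standard JSON schema format):

```yaml
out:
  type: bigquery
  schema_file: /path/to/schema.json
```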
@@ -353,26 +338,10 @@
out:
type: bigquery
payload_column_index: 0 # or, payload_column: payload
```
-### Prevent Duplication
-
-`prevent_duplicate_insert` option is used to prevent inserting same data for modes `append` or `append_direct`.
-
-When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generate job ID from md5 hash of file and other options.
-
-`job ID = md5(md5(file) + dataset + table + schema + source_format + file_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`
-
-[job ID must be unique(including failures)](https://cloud.google.com/bigquery/loading-data-into-bigquery#consistency) so that same data can't be inserted with same settings repeatedly.
-
-```yaml
-out:
- type: bigquery
- prevent_duplicate_insert: true
-```
-
### GCS Bucket
This is useful to reduce the number of consumed jobs, which is limited to [100,000 jobs per project per day](https://cloud.google.com/bigquery/quotas#load_jobs).
This plugin normally loads local files into BigQuery in parallel, that is, it consumes a number of jobs, say 24 jobs on a 24 CPU core machine for example (this depends on embulk parameters such as `min_output_tasks` and `max_threads`).
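Assuming this plugin's `gcs_bucket` and `auto_create_gcs_bucket` options (verify the exact option names against the configuration table of your installed version), loading via GCS can be sketched as:

```yaml
out:
  type: bigquery
  gcs_bucket: bucket_name        # stage files in GCS and load with a single job
  auto_create_gcs_bucket: true   # create the bucket if it does not exist
```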
@@ -399,35 +368,34 @@
```yaml
out:
type: bigquery
table: table_name$20160929
- auto_create_table: true
```
-You may configure `time_partitioning` parameter together to create table via `auto_create_table: true` option as:
+You may configure the `time_partitioning` parameter together as:
```yaml
out:
type: bigquery
table: table_name$20160929
- auto_create_table: true
time_partitioning:
type: DAY
expiration_ms: 259200000
```
You can also create a column-based partitioning table as:
+
```yaml
out:
type: bigquery
mode: replace
- auto_create_table: true
table: table_name
time_partitioning:
type: DAY
field: timestamp
```
+
Note that `time_partitioning.field` should be a top-level `DATE` or `TIMESTAMP` column.
To update the schema of the partitioned table, use the [Tables: patch](https://cloud.google.com/bigquery/docs/reference/v2/tables/patch) API; embulk-output-bigquery itself does not support it, though.
Note that only adding a new column and relaxing non-required columns to `NULLABLE` are supported now. Deleting columns and renaming columns are not supported.