README.md in embulk-output-bigquery-0.4.14 vs README.md in embulk-output-bigquery-0.5.0

- old (v0.4.14)
+ new (v0.5.0)

````diff
@@ -21,18 +21,10 @@
 * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases

 Current version of this plugin supports Google API with Service Account Authentication, but does not support the OAuth flow for installed applications.

-### INCOMPATIBILITY CHANGES
-
-v0.3.x has incompatibility changes with v0.2.x. Please see [CHANGELOG.md](CHANGELOG.md) for details.
-
-* `formatter` option (formatter plugin support) is dropped. Use `source_format` option instead (it already exists in v0.2.x too).
-* `encoders` option (encoder plugin support) is dropped. Use `compression` option instead (it already exists in v0.2.x too).
-* `mode: append` now expresses a transactional append, and `mode: append_direct` is one which is not transactional.
-
 ## Configuration

 #### Original options

 | name | type | required? | default | description |
@@ -45,14 +37,13 @@
 | project | string | required if json_keyfile is not given | | project_id |
 | dataset | string | required | | dataset |
 | location | string | optional | nil | geographic location of dataset. See [Location](#location) |
 | table | string | required | | table name, or table name with a partition decorator such as `table_name$20160929` |
 | auto_create_dataset | boolean | optional | false | automatically create dataset |
-| auto_create_table | boolean | optional | false | See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
+| auto_create_table | boolean | optional | true | `false` is available only for `append_direct` mode. Other modes require `true`. See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
 | schema_file | string | optional | | /path/to/schema.json |
 | template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
 | job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
 | job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
 | is_skip_job_result_check | boolean | optional | false | Skip waiting until the Load job finishes. Available for `append` or `delete_in_advance` mode |
 | with_rehearsal | boolean | optional | false | Load `rehearsal_counts` records as a rehearsal. The rehearsal loads into a REHEARSAL temporary table, which is finally deleted. You may use this option to investigate data errors at as early a stage as possible |
 | rehearsal_counts | integer | optional | 1000 | Specify number of records to load in a rehearsal |
@@ -105,11 +96,10 @@
 | allow_quoted_newlines | boolean | optional | false | Set true if data contains newline characters. It may cause slow processing |
 | time_partitioning | hash | optional | `{"type":"DAY"}` if `table` parameter has a partition decorator, otherwise nil | See [Time Partitioning](#time-partitioning) |
 | time_partitioning.type | string | required | nil | The only type supported is DAY, which will generate one partition per day based on data loading time. |
 | time_partitioning.expiration_ms | int | optional | nil | Number of milliseconds for which to keep the storage for a partition. |
 | time_partitioning.field | string | optional | nil | `DATE` or `TIMESTAMP` column used for partitioning |
-| time_partitioning.require_partition_filter | boolean | optional | nil | If true, a valid partition filter is required when querying |
 | clustering | hash | optional | nil | Currently, clustering is supported only for partitioned tables, so it must be used with the `time_partitioning` option. See [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables) |
 | clustering.fields | array | required | nil | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
 | schema_update_options | array | optional | nil | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE for the current status: `schema_update_options` does not work for the `copy` job, that is, it is not effective for most modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes the origin table, so it does not need a schema update. Only `append_direct` can utilize schema update. |

 ### Example
@@ -250,15 +240,10 @@
   table: table_%Y_%m
 ```

 ### Dynamic table creating

-This plugin tries to create a table using the BigQuery API when
-
-* mode is one of `delete_in_advance`, `replace`, `replace_backup`, `append`.
-* mode is `append_direct` and `auto_create_table` is true.
-
 There are 3 ways to set the schema.

 #### Set schema.json

 Please set the file path of schema.json.
@@ -353,26 +338,10 @@
 out:
   type: bigquery
   payload_column_index: 0 # or, payload_column: payload
 ```

-### Prevent Duplication
-
-The `prevent_duplicate_insert` option is used to prevent inserting the same data for modes `append` or `append_direct`.
-
-When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generates a job ID from the md5 hash of the file and other options:
-
-`job ID = md5(md5(file) + dataset + table + schema + source_format + file_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`
-
-[A job ID must be unique (including failures)](https://cloud.google.com/bigquery/loading-data-into-bigquery#consistency), so the same data can't be inserted with the same settings repeatedly.
-
-```yaml
-out:
-  type: bigquery
-  prevent_duplicate_insert: true
-```
-
 ### GCS Bucket

 This is useful to reduce the number of consumed jobs, which is limited by [100,000 jobs per project per day](https://cloud.google.com/bigquery/quotas#load_jobs).

 This plugin originally loads local files into BigQuery in parallel, that is, it consumes a number of jobs, say 24 jobs on a 24-CPU-core machine for example (this depends on embulk parameters such as `min_output_tasks` and `max_threads`).
@@ -399,35 +368,34 @@
 ```yaml
 out:
   type: bigquery
   table: table_name$20160929
-  auto_create_table: true
 ```

-You may configure the `time_partitioning` parameter together to create a table via the `auto_create_table: true` option as:
+You may configure the `time_partitioning` parameter together as:

 ```yaml
 out:
   type: bigquery
   table: table_name$20160929
-  auto_create_table: true
   time_partitioning:
     type: DAY
     expiration_ms: 259200000
 ```

 You can also create a column-based partitioning table as:
+
 ```yaml
 out:
   type: bigquery
   mode: replace
-  auto_create_table: true
   table: table_name
   time_partitioning:
     type: DAY
     field: timestamp
 ```
+
 Note the `time_partitioning.field` should be a top-level `DATE` or `TIMESTAMP` column.

 Use the [Tables: patch](https://cloud.google.com/bigquery/docs/reference/v2/tables/patch) API to update the schema of the partitioned table; embulk-output-bigquery itself does not support it, though. Note that only adding a new column, and relaxing non-necessary columns to be `NULLABLE`, are supported now. Deleting columns and renaming columns are not supported.
````