README.md in embulk-output-bigquery-0.3.3 vs README.md in embulk-output-bigquery-0.3.4

- old
+ new

@@ -35,34 +35,36 @@ #### Original options

| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
-| mode | string | optional | "append" | [See below](#mode) |
+| mode | string | optional | "append" | See [Mode](#mode) |
| auth_method | string | optional | "private_key" | `private_key`, `json_key` or `compute_engine` |
| service_account_email | string | required when auth_method is private_key | | Your Google service account email |
| p12_keyfile | string | required when auth_method is private_key | | Full path of private key in P12 (PKCS12) format |
| json_keyfile | string | required when auth_method is json_key | | Full path of JSON key |
| project | string | required if json_keyfile is not given | | project_id |
| dataset | string | required | | dataset |
| table | string | required | | table name |
| auto_create_dataset | boolean | optional | false | automatically create dataset |
-| auto_create_table | boolean | optional | false | [See below](#dynamic-table-creating) |
+| auto_create_table | boolean | optional | false | See [Dynamic Table Creating](#dynamic-table-creating) |
| schema_file | string | optional | | /path/to/schema.json |
-| template_table | string | optional | | template table name [See below](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | false | [See below](#prevent-duplication) |
+| template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
+| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
| job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
| is_skip_job_result_check | boolean | optional | false | Skip waiting for the load job to finish. Available for append or delete_in_advance mode |
| with_rehearsal | boolean | optional | false | Load `rehearsal_counts` records as a rehearsal. The rehearsal loads into a temporary REHEARSAL table, which is deleted afterward. Use this option to catch data errors at as early a stage as possible |
| rehearsal_counts | integer | optional | 1000 | Number of records to load in a rehearsal |
| abort_on_error | boolean | optional | true if max_bad_records is 0, otherwise false | Raise an error if the number of input rows does not match the number of output rows |
-| column_options | hash | optional | | [See below](#column-options) |
+| column_options | hash | optional | | See [Column Options](#column-options) |
| default_timezone | string | optional | UTC | |
| default_timestamp_format | string | optional | %Y-%m-%d %H:%M:%S.%6N | |
-| payload_column | string | optional | nil | [See below](#formatter-performance-issue) |
-| payload_column_index | integer | optional | nil | [See below](#formatter-performance-issue) |
+| payload_column | string | optional | nil | See [Formatter Performance Issue](#formatter-performance-issue) |
+| payload_column_index | integer | optional | nil | See [Formatter Performance Issue](#formatter-performance-issue) |
+| gcs_bucket | string | optional | nil | See [GCS Bucket](#gcs-bucket) |
+| auto_create_gcs_bucket | boolean | optional | false | See [GCS Bucket](#gcs-bucket) |
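For reference, a minimal sketch of a configuration exercising these options; the keyfile path, project, dataset, and table names below are hypothetical:

```yaml
out:
  type: bigquery
  mode: append                         # default; see Mode
  auth_method: json_key
  json_keyfile: /path/to/keyfile.json  # hypothetical path
  project: your-project-id             # hypothetical project_id
  dataset: your_dataset                # hypothetical dataset
  table: your_table                    # hypothetical table name
  auto_create_dataset: true
  auto_create_table: true
  schema_file: /path/to/schema.json    # hypothetical schema path
```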
#### Client or request options

| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|

@@ -342,9 +344,28 @@
```yaml
out:
  type: bigquery
  prevent_duplicate_insert: true
```
+
+### GCS Bucket
+
+This option is useful for reducing the number of consumed load jobs, which are limited to [10,000 per project per day](https://cloud.google.com/bigquery/quota-policy#import).
+
+Normally, this plugin loads local files into BigQuery in parallel, consuming one load job per task: say, 24 jobs on a 24-CPU-core machine (the exact number depends on Embulk parameters such as `min_output_tasks` and `max_threads`).
+
+BigQuery supports loading multiple files from GCS with a single job (but not from local files, sigh). Therefore, uploading the local files to GCS and then loading them from GCS into BigQuery reduces the number of consumed jobs: the 24 jobs in the example above become a single load job.
+
+Setting the `gcs_bucket` option enables this strategy. You may also set `auto_create_gcs_bucket` to create the specified GCS bucket automatically.
+
+```yaml
+out:
+  type: bigquery
+  gcs_bucket: bucket_name
+  auto_create_gcs_bucket: false
+```
+
+ToDo: Use https://cloud.google.com/storage/docs/streaming if google-api-ruby-client supports streaming transfers into GCS.

## Development

### Run example: