README.md in embulk-output-bigquery-0.4.10 vs README.md in embulk-output-bigquery-0.4.11

- old
+ new

@@ -12,11 +12,11 @@

* **Plugin type**: output
* **Resume supported**: no
* **Cleanup supported**: no
* **Dynamic table creating**: yes

-### NOT IMPLEMENTED
+### NOT IMPLEMENTED
* insert data over streaming inserts
  * for continuous real-time insertions
  * Please use other product, like [fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery)
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases

@@ -33,11 +33,11 @@

## Configuration

#### Original options

-| name | type | required? | default | description |
+| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
| mode | string | optional | "append" | See [Mode](#mode) |
| auth_method | string | optional | "private_key" | `private_key`, `json_key` or `compute_engine` |
| service_account_email | string | required when auth_method is private_key | | Your Google service account email |
| p12_keyfile | string | required when auth_method is private_key | | Fullpath of private key in P12(PKCS12) format |

@@ -51,11 +51,11 @@

| schema_file | string | optional | | /path/to/schema.json |
| template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
| job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
| job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
-| is_skip_job_result_check | boolean | optional | false | Skip waiting Load job finishes. Available for append, or delete_in_advance mode |
+| is_skip_job_result_check | boolean | optional | false | Skip waiting Load job finishes. Available for append, or delete_in_advance mode |
| with_rehearsal | boolean | optional | false | Load `rehearsal_counts` records as a rehearsal. Rehearsal loads into REHEARSAL temporary table, and delete finally. You may use this option to investigate data errors at as early a stage as possible |
| rehearsal_counts | integer | optional | 1000 | Specify number of records to load in a rehearsal |
| abort_on_error | boolean | optional | true if max_bad_records is 0, otherwise false | Raise an error if number of input rows and number of output rows does not match |
| column_options | hash | optional | | See [Column Options](#column-options) |
| default_timezone | string | optional | UTC | |

@@ -78,11 +78,11 @@

| application_name | string | optional | "Embulk BigQuery plugin" | User-Agent |
| sdk_log_level | string | optional | nil (WARN) | Log level of google api client library |

Options for intermediate local files

-| name | type | required? | default | description |
+| name | type | required? | default | description |
|:-------------------------------------|:------------|:-----------|:-------------------------|:-----------------------|
| path_prefix | string | optional | | Path prefix of local files such as "/tmp/prefix_". Default randomly generates with [tempfile](http://ruby-doc.org/stdlib-2.2.3/libdoc/tempfile/rdoc/Tempfile.html) |
| sequence_format | string | optional | .%d.%d | Sequence format for pid, thread id |
| file_ext | string | optional | | The file extension of local files such as ".csv.gz" ".json.gz". Default automatically generates from `source_format` and `compression` |
| skip_file_generation | boolean | optional | | Load already generated local files into BigQuery if available. Specify correct path_prefix and file_ext. |

@@ -105,11 +105,11 @@

| allow_quoted_newlines | boolean | optional | false | Set true, if data contains newline characters. It may cause slow processing |
| time_partitioning | hash | optional | `{"type":"DAY"}` if `table` parameter has a partition decorator, otherwise nil | See [Time Partitioning](#time-partitioning) |
| time_partitioning.type | string | required | nil | The only type supported is DAY, which will generate one partition per day based on data loading time. |
| time_partitioning.expiration_ms | int | optional | nil | Number of milliseconds for which to keep the storage for a partition. |
| time_partitioning.field | string | optional | nil | `DATE` or `TIMESTAMP` column used for partitioning |
-| time_partitioning.requirePartitionFilter | boolean | optional | nil | If ture, valid partition filter is required when query |
+| time_partitioning.requirePartitionFilter | boolean | optional | nil | If true, valid partition filter is required when query |
| schema_update_options | array | optional | nil | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE for the current status: `schema_update_options` does not work for `copy` job, that is, is not effective for most of modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes origin table so does not need to update schema. Only `append_direct` can utilize schema update. |

### Example

```yaml
...
```

@@ -370,11 +370,11 @@

This is useful to reduce number of consumed jobs, which is limited by [50,000 jobs per project per day](https://cloud.google.com/bigquery/quota-policy#import).

This plugin originally loads local files into BigQuery in parallel, that is, consumes a number of jobs, say 24 jobs on 24 CPU core machine for example (this depends on embulk parameters such as `min_output_tasks` and `max_threads`).
-BigQuery supports loading multiple files from GCS with one job (but not from local files, sigh), therefore, uploading local files to GCS and then loading from GCS into BigQuery reduces number of consumed jobs.
+BigQuery supports loading multiple files from GCS with one job, therefore, uploading local files to GCS in parallel and then loading from GCS into BigQuery reduces number of consumed jobs to 1.

Using `gcs_bucket` option, such strategy is enabled. You may also use `auto_create_gcs_bucket` to create the specified GCS bucket automatically.

```yaml
out:
  ...
```

@@ -442,19 +442,27 @@

```
$ embulk run -X page_size=1 -b . -l trace example/example.yml
```

### Run test:

+Place your embulk with `.jar` extension:
+
```
-$ bundle exec rake test
+$ cp -a $(which embulk) embulk.jar
```

+Run tests with `env RUBYOPT="-r ./embulk.jar"`:
+
+```
+$ bundle exec env RUBYOPT="-r ./embulk.jar" rake test
+```
+
To run tests which actually connect to BigQuery such as test/test\_bigquery\_client.rb, prepare a json\_keyfile at example/your-project-000.json, then

```
-$ bundle exec ruby test/test_bigquery_client.rb
-$ bundle exec ruby test/test_example.rb
+$ bundle exec env RUBYOPT="-r ./embulk.jar" ruby test/test_bigquery_client.rb
+$ bundle exec env RUBYOPT="-r ./embulk.jar" ruby test/test_example.rb
```

### Release gem:

Fix gemspec, then
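Putting the changed pieces together — the GCS load strategy and the `time_partitioning` options from the tables above — a configuration might look like the following sketch. This is an illustrative assumption, not taken from the README: the project, dataset, table, and bucket names and the keyfile path are hypothetical placeholders.

```yaml
out:
  type: bigquery
  mode: append
  auth_method: json_key
  json_keyfile: /path/to/your-project-000.json   # hypothetical path
  project: your-project-000                      # hypothetical project id
  dataset: your_dataset_name
  table: your_table_name
  # Upload local files to GCS in parallel, then load them into BigQuery with one job
  gcs_bucket: your_bucket_name
  auto_create_gcs_bucket: true
  # Daily partitioning; requirePartitionFilter forces queries to filter on the partition
  time_partitioning:
    type: DAY
    expiration_ms: 8640000000
    requirePartitionFilter: true
```

Collapsing the parallel loads into a single GCS-backed load job matters mainly because of the 50,000 jobs per project per day quota mentioned above.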