# embulk-output-bigquery

[Embulk](https://github.com/embulk/embulk/) output plugin to load/insert data into [Google BigQuery](https://cloud.google.com/bigquery/) using [direct insert](https://cloud.google.com/bigquery/loading-data-into-bigquery#loaddatapostrequest)

## Overview

load data into Google BigQuery as batch jobs, suitable for large amounts of data
https://developers.google.com/bigquery/loading-data-into-bigquery

* **Plugin type**: output
* **Resume supported**: no
* **Cleanup supported**: no
* **Dynamic table creating**: yes

### NOT IMPLEMENTED
* insert data over streaming inserts
  * for continuous real-time insertions
  * Please use another product, such as [fluent-plugin-bigquery](https://github.com/kaizenplatform/fluent-plugin-bigquery)
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases

The current version of this plugin supports the Google API with Service Account Authentication, but does not support the OAuth flow for installed applications.

## Configuration

#### Original options

| name                            | type    | required?                                | default                  | description            |
|:--------------------------------|:--------|:-----------------------------------------|:-------------------------|:-----------------------|
| auth_method                     | string  | optional                                 | "private_key"            | `private_key`, `json_key` or `compute_engine` |
| service_account_email           | string  | required when auth_method is private_key |                          | Your Google service account email |
| p12_keyfile                     | string  | required when auth_method is private_key |                          | Fullpath of private key in P12(PKCS12) format |
| json_keyfile                    | string  | required when auth_method is json_key    |                          | Fullpath of json key |
| sequence_format                 | string  | optional                                 | %03d.%02d                | |
| file_ext                        | string  | optional                                 |                          | e.g. ".csv.gz" ".json.gz" |
| project                         | string  | required                                 |                          | project_id |
| dataset                         | string  | required                                 |                          | dataset |
| table                           | string  | required                                 |                          | table name |
| auto_create_table               | boolean | optional                                 | 0                        | [See below](#dynamic-table-creating) |
| schema_file                     | string  | optional                                 |                          | /path/to/schema.json |
| prevent_duplicate_insert        | boolean | optional                                 | 0                        | [See below](#data-consistency) |
| delete_from_local_when_job_end  | boolean | optional                                 | 0                        | If set to true, delete the local file when the job ends |
| job_status_max_polling_time     | int     | optional                                 | 3600 sec                 | Max job status polling time |
| job_status_polling_interval     | int     | optional                                 | 10 sec                   | Job status polling interval |
| is_skip_job_result_check        | boolean | optional                                 | 0                        | |
| application_name                | string  | optional                                 | "Embulk BigQuery plugin" | Anything you like |

#### Same options as bq command-line tools or BigQuery job's property

The following options are the same as for the [bq command-line tools](https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile) or the BigQuery [job's property](https://cloud.google.com/bigquery/docs/reference/v2/jobs#resource).

| name                    | type    | required? | default | description            |
|:------------------------|:--------|:----------|:--------|:-----------------------|
| source_format           | string  | required  | "CSV"   | File type (`NEWLINE_DELIMITED_JSON` or `CSV`) |
| max_bad_records         | int     | optional  | 0       | |
| field_delimiter         | char    | optional  | ","     | |
| encoding                | string  | optional  | "UTF-8" | `UTF-8` or `ISO-8859-1` |
| ignore_unknown_values   | boolean | optional  | 0       | |
| allow_quoted_newlines   | boolean | optional  | 0       | Set to true if your data contains newline characters. It may cause slow processing |
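The `schema_file` option above points to a JSON file describing the target table's columns, which is also used by `auto_create_table` (see [Dynamic table creating](#dynamic-table-creating) below). A minimal sketch, assuming the standard BigQuery schema JSON format; the column names and types here are illustrative only, not part of this plugin:

```json
[
  {"name": "id",         "type": "INTEGER"},
  {"name": "name",       "type": "STRING"},
  {"name": "score",      "type": "FLOAT"},
  {"name": "created_at", "type": "TIMESTAMP"}
]
```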
### Example

```yaml
out:
  type: bigquery
  auth_method: private_key   # default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
  path_prefix: /path/to/output
  file_ext: csv.gz
  source_format: CSV
  project: your-project-000
  dataset: your_dataset_name
  table: your_table_name
  formatter:
    type: csv
    header_line: false
  encoders:
  - {type: gzip}
```

### Authentication

There are three supported methods to fetch an access token for the service account.

1. Public-Private key pair of GCP(Google Cloud Platform)'s service account
2. JSON key of GCP(Google Cloud Platform)'s service account
3. Pre-defined access token (Google Compute Engine only)

#### Public-Private key pair of GCP's service account

You first need to create a service account (client ID), download its private key and deploy the key with embulk.

```yaml
out:
  type: bigquery
  auth_method: private_key   # default
  service_account_email: ABCXYZ123ABCXYZ123.gserviceaccount.com
  p12_keyfile: /path/to/p12_keyfile.p12
```

#### JSON key of GCP's service account

You first need to create a service account (client ID), download its json key and deploy the key with embulk.

```yaml
out:
  type: bigquery
  auth_method: json_key
  json_keyfile: /path/to/json_keyfile.json
```

You can also embed the contents of json_keyfile in config.yml.

```yaml
out:
  type: bigquery
  auth_method: json_key
  json_keyfile:
    content: |
      {
          "private_key_id": "123456789",
          "private_key": "-----BEGIN PRIVATE KEY-----\nABCDEF",
          "client_email": "..."
      }
```

#### Pre-defined access token (GCE only)

On the other hand, you don't need to explicitly create a service account for embulk when you run embulk in Google Compute Engine. In this third authentication method, you need to add the API scope "https://www.googleapis.com/auth/bigquery" to the scope list of your Compute Engine VM instance, then you can configure embulk like this.

```yaml
out:
  type: bigquery
  auth_method: compute_engine
```

### Table id formatting

The `table` option accepts [Time#strftime](http://ruby-doc.org/core-1.9.3/Time.html#method-i-strftime) format to construct table ids.
Table ids are formatted at runtime using the local time of the embulk server.

For example, with the configuration below, data is inserted into tables `table_2015_04`, `table_2015_05` and so on.

```yaml
out:
  type: bigquery
  table: table_%Y_%m
```

### Dynamic table creating

When `auto_create_table` is set to true, the plugin tries to create the table using the BigQuery API. If the table already exists, data is inserted into it.

To describe the schema of the target table, specify the path to a schema file with `schema_file`.

```yaml
out:
  type: bigquery
  auto_create_table: true
  table: table_%Y_%m
  schema_file: /path/to/schema.json
```

### Data Consistency

When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generates a job ID from the md5 hash of the file and other options to prevent duplicate data insertion.

`job ID = md5(md5(file) + dataset + table + schema + source_format + field_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`

A [job ID must be unique (including failures)](https://cloud.google.com/bigquery/loading-data-into-bigquery#consistency), so the same data can't be inserted twice with the same settings. In other words, you can retry as many times as you like in case a bad error (such as a network error) happens before the job insertion.

```yaml
out:
  type: bigquery
  prevent_duplicate_insert: true
```

## Build

```
$ ./gradlew gem
```