# Google Cloud BigQuery extract file input plugin for Embulk

An Embulk file input plugin.

- Embulk: http://www.embulk.org/docs/
- Embulk plugins: http://www.embulk.org/plugins/

Reads files stored in Google Cloud Storage that were extracted from a Google Cloud BigQuery table or query result.

## Overview

* **Plugin type**: file input
* **Resume supported**: no
* **Cleanup supported**: yes

### Detail

Reads files stored in Google Cloud Storage that were exported from a Google Cloud BigQuery table or query result. This can be a practical approach for very large data in BigQuery.

If you set the **table** config without the **query** config, the plugin simply extracts the table to Google Cloud Storage (see the minimal sketch after the configuration list below). If you set the **query** config, the query result is saved to a temporary table, and that temporary table is then extracted to the Google Cloud Storage URI.

See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract

## Usage

### Install plugin

```bash
embulk gem install embulk-input-bigquery_extract_files
```

* rubygem url: https://rubygems.org/profiles/jo8937

## Configuration

- **project**: Google Cloud Platform (GCP) project id (string, required)
- **json_keyfile**: GCP service account's private key as JSON (string, required)
- **location**: location of the BigQuery dataset and temp_dataset. See: https://cloud.google.com/bigquery/docs/locations (Optional) (string, default: `US`)
- **gcs_uri**: GCS URI where the BigQuery result is saved. The bucket and path names are parsed from this URI. (string, required)
- **temp_local_path**: local directory where the extracted files are downloaded (string, required)
- **dataset**: source dataset (string, default: `null`)
- **table**: source table. Either **query** or **table** is required. (string, default: `null`)
- **query**: source query. Either **query** or **table** is required. (string, default: `null`)
- **temp_dataset**: if you use the **query** param, the query result is saved here (string, default: `null`)
- **temp_table**: if you use the **query** param, the query result is saved here. If not set, the plugin generates a temporary name. (string, default: `null`)
- **use_legacy_sql**: if you use the **query** param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useLegacySql (string, default: `false`)
- **cache**: if you use the **query** param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.useQueryCache (string, default: `true`)
- **create_disposition**: if you use the **query** param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.createDisposition (string, default: `CREATE_IF_NEEDED`)
- **write_disposition**: if you use the **query** param, see: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.writeDisposition (string, default: `WRITE_APPEND`)
- **file_format**: table extract file format. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.destinationFormat (string, default: `CSV`)
- **compression**: table extract file compression setting. See: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract.compression (string, default: `GZIP`)
- **temp_schema_file_path**: path where the BigQuery result schema is written for the parser (Optional) (string, default: `null`)
- **bigquery_job_wait_second**: how many seconds to wait for the BigQuery job (Optional) (string, default: `600`)
- **cleanup_gcs_before_executing**: delete all files in the GCS temp path before the process starts (Optional) (string, default: `true`)
- **cleanup_gcs_files**: delete all files in the GCS temp path after the process ends (Optional) (string, default: `false`)
- **cleanup_temp_table**: delete the query result table after the process ends (Optional) (string, default: `true`)
- **cleanup_local_temp_files**: delete all files in the local temp directory (Optional) (string, default: `true`)
- **direct_download_enabled**: GCS download option. See: https://developers.google.com/api-client-library/java/google-api-java-client/reference/1.19.1/com/google/api/client/googleapis/media/MediaHttpDownloader#setDirectDownloadEnabled(boolean) (Optional) (string, default: `false`)
- **decoders**: standard attribute of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#gzip-decoder-plugin
- **parser**: standard attribute of Embulk file input plugins. See: http://www.embulk.org/docs/built-in.html#csv-parser-plugin
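For instance, table mode (setting **table** and no **query**) needs no **temp_dataset**, since the table is extracted directly. The sketch below is a minimal table-mode config; the project, key file, dataset, table, bucket, and local path are all placeholder values:

```yaml
in:
  type: bigquery_extract_files
  project: my-google-project            # placeholder GCP project id
  json_keyfile: /path/to/keyfile.json   # placeholder service account key
  dataset: my_dataset                   # placeholder dataset holding the table
  table: my_table                       # table mode: extract this table directly
  gcs_uri: gs://my-bucket/embulk-temp   # placeholder extract destination in GCS
  temp_local_path: /tmp/embulk/data     # placeholder local download directory
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout
```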
## Example

```yaml
in:
  type: bigquery_extract_files
  project: googlecloudplatformproject
  json_keyfile: gcp-service-account-private-key.json
  dataset: target_dataset
  #table: target_table
  query: 'select a,b,c from target_table'
  gcs_uri: gs://bucket/subdir
  temp_dataset: temp_dataset
  temp_local_path: C:\Temp
  file_format: 'NEWLINE_DELIMITED_JSON'
  compression: 'GZIP'
  decoders:
  - {type: gzip}
  parser:
    type: json
out:
  type: stdout
```

### Advanced Example

#### BigQuery to MySQL with auto-schema

I have to batch-load a BigQuery table into MySQL every day for my job, so I want an automatically generated schema for this file input plugin.

- See also:
  - https://github.com/jo8937/embulk-parser-csv_with_schema_file
  - https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-mysql

This is my best-practice config for a BigQuery-to-MySQL batch:

```yaml
in:
  type: bigquery_extract_files
  project: my-google-project
  json_keyfile: /tmp/embulk/google_service_account.json
  query: 'select * from dataset.t_nitocris'
  temp_dataset: temp_dataset
  gcs_uri: gs://bucket/embulktemp/t_nitocris_*
  temp_local_path: /tmp/embulk/data
  file_format: 'CSV'
  compression: 'GZIP'
  temp_schema_file_path: /tmp/embulk/schema/csv_schema_nitocris.json
  decoders:
  - {type: gzip}
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
out:
  type: mysql
  host: host
  user: user
  password: password
  port: 3306
  database: MY_DATABASE
  table:
  options: {connectTimeout: 0, waitTimeout: 0, enableQueryTimeouts: false, autoReconnect: true}
  mode: insert_direct
  retry_limit: 60
  retry_wait: 3000
  batch_size: 4096000
```

* BigQuery's TIMESTAMP values do not always use a single format within one CSV export. You can define the optional 'columns' param in the 'csv_with_schema_file' parser to handle another format, like below:

```yml
...
  parser:
    type: csv_with_schema_file
    default_timestamp_format: '%Y-%m-%d %H:%M:%S %z'
    schema_path: /tmp/embulk/schema/csv_schema_nitocris.json
    columns:
    - {name: Date2, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N %z'}
```

## Another choice

This plugin is useful when you need a file-input type, but it can be complicated to use. For a more portable option, see embulk-input-bigquery: https://github.com/medjed/embulk-input-bigquery

## Build

```
$ ./gradlew gem  # -t to watch change of files and rebuild continuously
```
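After building, one way to try the plugin locally is to install the built gem into Embulk and run a config that uses it. This is only a sketch: the `pkg/` output directory and the `<version>` placeholder are assumptions about the Gradle setup, so check the actual build output path.

```bash
# Install the locally built gem into Embulk.
# pkg/ and <version> are assumptions; use the path your build actually produced.
embulk gem install pkg/embulk-input-bigquery_extract_files-<version>.gem

# Run an Embulk config that uses the plugin, e.g. one of the examples above.
embulk run config.yml
```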