# Column filter plugin for Embulk

[![Build Status](https://secure.travis-ci.org/sonots/embulk-filter-column.png?branch=master)](http://travis-ci.org/sonots/embulk-filter-column)

A filter plugin for Embulk to filter out columns

## Configuration

- **columns**: columns to retain (array of hash)
  - **name**: name of column (required)
  - **src**: src column name to be copied (optional, default is `name`)
  - **default**: default value used if input is null (optional)
  - **type**: type of the default value (required for `default`)
  - **format**: special option for timestamp column, specify the format of the default timestamp (string, default is `default_timestamp_format`)
  - **timezone**: special option for timestamp column, specify the timezone of the default timestamp (string, default is `default_timezone`)
- **add_columns**: columns to add (array of hash)
  - **name**: name of column (required)
  - **src**: src column name to be copied (either of `src` or `default` is required)
  - **default**: value of column (either of `src` or `default` is required)
  - **type**: type of the default value (required for `default`)
  - **format**: special option for timestamp column, specify the format of the default timestamp (string, default is `default_timestamp_format`)
  - **timezone**: special option for timestamp column, specify the timezone of the default timestamp (string, default is `default_timezone`)
- **drop_columns**: columns to drop (array of hash)
  - **name**: name of column (required)
- **default_timestamp_format**: default timestamp format for timestamp columns (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)
- **default_timezone**: default timezone for timestamp columns (string, default is `UTC`)

## Example (columns)

Say input.csv is as follows:

```
time,id,key,score
2015-07-13,0,Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY,1370
2015-07-13,1,VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ,3962
2015-07-13,2,C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE,7323
```

```yaml
filters:
  - type: column
    columns:
      - {name: time, default: "2015-07-13", format: "%Y-%m-%d"}
      - {name: id}
      - {name: key, default: "foo"}
```

reduces columns to only `time`, `id`, and `key` columns as:

```
2015-07-13,0,Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY
2015-07-13,1,VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ
2015-07-13,2,C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE
```

Note that column types are automatically retrieved from input data (inputSchema).

## Example (add_columns)

Say input.csv is as follows:

```
time,id,key,score
2015-07-13,0,Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY,1370
2015-07-13,1,VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ,3962
2015-07-13,2,C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE,7323
```

```yaml
filters:
  - type: column
    add_columns:
      - {name: d, type: timestamp, default: "2015-07-13", format: "%Y-%m-%d"}
      - {name: copy_id, src: id}
```

add `d` column, and `copy_id` column which is a copy of `id` column as:

```
2015-07-13,0,Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY,1370,2015-07-13,0
2015-07-13,1,VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ,3962,2015-07-13,1
2015-07-13,2,C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE,7323,2015-07,13,2
```

## Example (drop_columns)

Say input.csv is as follows:

```
time,id,key,score
2015-07-13,0,Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY,1370
2015-07-13,1,VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ,3962
2015-07-13,2,C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE,7323
```

```yaml
filters:
  - type: column
    drop_columns:
      - {name: time}
      - {name: id}
```

drop `time` and `id` columns as:

```
Vqjht6YEUBsMPXmoW1iOGFROZF27pBzz0TUkOKeDXEY,1370
VmjbjAA0tOoSEPv_vKAGMtD_0aXZji0abGe7_VXHmUQ,3962
C40P5H1WcBx-aWFDJCI8th6QPEI2DOUgupt_gB8UutE,7323
```

## JSONPath (like) name

For type: json column, you can specify [JSONPath](http://goessner.net/articles/JsonPath/) for column's name as:

```
$.payload.key1
$.payload.array[0]
$.payload.array[*]
```

EXAMPLE:

* [example/json_columns.yml](example/json_columns.yml)
* [example/json_add_columns.yml](example/json_add_columns.yml)
* [example/json_drop_columns.yml](example/json_drop_columns.yml)

NOTE:

* JSONPath syntax is not fully supported
* Embulk's type: json cannot have timestamp column, so `type: timesatmp` for `add_columns` or `columns` with default is not available
* `src` (to rename or copy columns) for `add_columns` or `columns` is only partially supported yet
  * the json path directory must be same, for example, `{name: $.foo.copy, src: $foo.bar}` works, but `{name: $foo.copy, src: $.bar.baz}` does not work

## ToDo

* Write test

## Development

Run example:

```
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
```

Run test:

```
$ ./gradlew test
```

Run test with coverage reports:

```
$ ./gradlew test jacocoTestReport
```

open build/reports/jacoco/test/html/index.html

Run checkstyle:

```
$ ./gradlew check
```

Release gem:

```
$ ./gradlew gemPush
```