# Row filter plugin for Embulk [![Build Status](https://secure.travis-ci.org/sonots/embulk-filter-row.png?branch=master)](http://travis-ci.org/sonots/embulk-filter-row) A filter plugin for Embulk to filter out rows ## Configuration Versions >= 0.3.0 has `where` option to support SQL-like syntax. * **where**: Select only rows which matches with conditions written in SQL-like syntax. See [SQL-like Syntax](#sql-like-syntax) ## Example ```yaml filters: - type: row where: column1 = 'str' ``` ```yaml filters: - type: row where: |- ( string START_WITH 'str' AND number > 1.0 ) OR ( time = TIMESTAMP '2016-01-01 +0900' AND "true" = true ) ``` See [SQL-like Syntax](#sql-like-syntax) for more details # SQL-like Syntax Versions >= 0.3.0 suppors SQL-like syntax as: ```sql where: |- ( string START_WITH 'str' AND number > 1.0 ) OR ( time = TIMESTAMP '2016-01-01 +0900' AND "true" = true AND string REGEXP '^reg' ) ``` ## Literals ### Boolean Literal `true` or `TRUE` or `false` or `FALSE` are considered as a boolean literal ### Number Literal Characters matching with a regular expression `-?[0-9]+(\.[0-9]+)?` is considered as a number literal ### String Literal Characters surrounded by `'` such as `'foo'` is considered as a string literal ### Timestamp Literal NOTE: It became possible to omit `TIMESTAMP` keyword on comparing with `timestamp` identifier (column) from version >= 0.3.3. `TIMESTAMP ( NumberLiteral | StringLiteral )` such as `TIMESTAMP 1470433087.747123` or `TIMESTAMP '2016-08-06 06:38:07.747123 +0900'` is considered as a timestamp literal Number is a epoch time since 1970-01-01 UTC with nano time resolution. String is a timestamp string which matches with one of following format: * `%Y-%m-%d %H:%M:%S.%N %z` * `%Y-%m-%d %H:%M:%S.%N` * `%Y-%m-%d %H:%M:%S %z` * `%Y-%m-%d %H:%M:%S` * `%Y-%m-%d %z` * `%Y-%m-%d` The time zone for formats without `%z` is UTC, and the time resolution is micro second (caused by limitation of Embulk TimestampParser). ### Json Literal Not supported yet ### Identifier Literal Characters matching with a regular expression `[a-zA-Z_][a-zA-z0-9_]*` such as `foobar`, and characters surrounded by `"` such as `"foo-bar"`, `"foo.bar"`, and `"foo\"bar"` are considred as an identifier literal, that is, embulk's column name. ## Operators ### Boolean Operator * `=` * `!=` ### Number Operator (Long and Double) * `=` * `!=` * `>` * `>=` * `<=` * `<` ### String Operator * `=` * `!=` * `START_WITH` * `END_WITH` * `INCLUDE` * `REGEXP` ### Timestamp Operator * `=` * `!=` * `>` * `>=` * `<=` * `<` ### Json Operator Not supported yet ### unary operator * "xxx IS NULL" * "xxx IS NOT NULL" * "NOT xxx" ## Old Configuration Versions >= 0.3.0 has `where` option to supports SQL-like syntax. I recommend to use it. Following options are **deprecated**, and **will be removed someday**. * **condition**: AND or OR (string, default: AND). * **conditions**: select only rows which matches with conditions. * **column**: column name (string, required) * **operator** operator (string, optional, default: ==) * boolean operator * `==` * `!=` * numeric operator (long, double, Timestamp) * `==` * `!=` * `>` * `>=` * `<=` * `<` * string operator * `==` * `!=` * `start_with` (or `startsWith`) * `end_with` (or `endsWith`) * `include` (or `contains`) * unary operator * `IS NULL` * `IS NOT NULL` * **argument**: argument for the operation (string, required for non-unary operators) * **not**: not (boolean, optional, default: false) * **format**: special option for timestamp column, specify the format of timestamp argument, parsed argument is compared with the column value as Timestamp object (string, default is `%Y-%m-%d %H:%M:%S.%N %z`) * **timezone**: special option for timestamp column, specify the timezone of timestamp argument (string, default is `UTC`) NOTE: column type is automatically retrieved from input data (inputSchema) ## Example (AND) **Deprecated** ```yaml filters: - type: row condition: AND conditions: - {column: foo, operator: "IS NOT NULL"} - {column: id, operator: ">=", argument: 10} - {column: id, operator: "<", argument: 20} - {column: name, opeartor: "include", argument: foo, not: true} - {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"} ``` ## Example (OR) **Deprecated** ```yaml filters: - type: row condition: OR conditions: - {column: a, operator: "IS NOT NULL"} - {column: b, operator: "IS NOT NULL"} ``` ## Example (AND of OR) **Deprecated** You can express a condition such as `(A OR B) AND (C OR D)` by combining multiple filters like ```yaml filters: - type: row condition: OR conditions: - {column: a, operator: "IS NOT NULL"} - {column: b, operator: "IS NOT NULL"} - type: row condition: OR conditions: - {column: c, operator: "IS NOT NULL"} - {column: d, operator: "IS NOT NULL"} ``` ## Comparisions * [embulk-filter-calcite](https://github.com/muga/embulk-filter-calcite) * embulk-filter-calcite is a pretty nice plugin which enables us to write SQL query to filter embulk records. * However, based on [my benchmark (Japanese)](http://qiita.com/sonots/items/a70482d29862de87624d), embulk-filter-calcite was 1471 times slower than embulk-filter-row in iterms of string `=` operator. Use `embulk-filter-row` (this plugin) when you need performance. ## ToDo * Support filtering by values of `type: json` with JSONPath * Support IN operator ## ChangeLog [CHANGELOG.md](./CHANGELOG.md) ## Development Run example: ``` $ ./gradlew classpath $ embulk preview -I lib example/example.yml ``` Run test: ``` $ ./gradlew test ``` Run checkstyle: ``` $ ./gradlew check ``` Release gem: ``` $ ./gradlew gemPush ``` ## Development of SQL-like Syntax Read the article [Supported SQL-like Syntax with embulk-filter-row using BYACC/J and JFlex](http://blog.livedoor.jp/sonots/archives/48172830.html). To download BYACC/J and JFlex and run them, you can use: ``` $ script/byaccj.sh ``` or ``` $ ./gradlew byaccj # this runs script/byaccj.sh internally ``` This generates `src/main/java/org/embulk/filter/row/where/{Parser,ParserVal,Yylex}.java`. The `byaccj` task of gradle is ran before `compileJava` task (which means to be ran before `classpath` or `test` task also) automatically.