Scheduled bulk data loading to Elasticsearch + Kibana 4 from CSV files
======================================================================

.. contents::
   :local:
   :depth: 2

This article shows how to:

* Bulk load CSV files into Elasticsearch.
* Visualize the data interactively with Kibana.
* Schedule the data loading every hour using cron.

This guide assumes you are using Ubuntu 12.04 (Precise) or Mac OS X.

Set up Elasticsearch and Kibana 4
---------------------------------
Step 1. Download and start Elasticsearch.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can find releases on the `Elasticsearch website `_.

For a minimal setup, unzip the package and run the ``./bin/elasticsearch`` command:

.. code-block:: console

    $ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.zip
    $ unzip elasticsearch-1.4.4.zip
    $ cd elasticsearch-1.4.4
    $ ./bin/elasticsearch

Step 2. Download and unzip Kibana:
~~~~~~~~~~~~~~~~~~

You can find releases on the `Kibana website `_. Open a new console and run the following commands:

.. code-block:: console

    $ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-linux-x64.tar.gz
    $ tar zxvf kibana-4.0.0-linux-x64.tar.gz
    $ cd kibana-4.0.0-linux-x64
    $ ./bin/kibana

Note: If you're using Mac OS X, download https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-darwin-x64.tar.gz instead.

Elasticsearch and Kibana are now running. Open http://localhost:5601/ in your browser to see Kibana's graphical interface.

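
Before continuing, it can help to confirm from the command line that both servers respond. The following is a minimal sketch, assuming ``curl`` is installed and Elasticsearch listens on its default HTTP port 9200 (the port is not stated above, so adjust it if you changed the configuration). ``check_url`` is a hypothetical helper name used only for this example:

.. code-block:: bash

    # Report whether an HTTP endpoint answers within 5 seconds.
    # $1: label to print, $2: URL to request.
    check_url() {
        if curl -s -o /dev/null --max-time 5 "$2"; then
            echo "$1 is reachable at $2"
        else
            echo "$1 is NOT reachable at $2"
            return 1
        fi
    }

    # Port 9200 is an assumed default; 5601 is the Kibana URL used above.
    check_url "Elasticsearch" "http://localhost:9200/"
    check_url "Kibana" "http://localhost:5601/"

If either line reports that the server is not reachable, check the console where you started it for error messages.
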
Set up Embulk
-------------
Step 1. Download Embulk binary:
~~~~~~~~~~~~~~~~~~

You can find the latest Embulk binary in the `releases `_. Because Embulk is a single executable file, you can simply download it to the ``/usr/local/bin`` directory and set the executable flag as follows:

.. code-block:: console

    $ sudo wget http://dl.embulk.org/embulk-latest.jar -O /usr/local/bin/embulk
    $ sudo chmod +x /usr/local/bin/embulk

Step 2. Install Elasticsearch plugin
~~~~~~~~~~~~~~~~~~

You also need the Elasticsearch output plugin for Embulk. You can install it with this command:

.. code-block:: console

    $ embulk gem install embulk-output-elasticsearch

Embulk has a built-in CSV file reader, so everything is now ready to use.

Loading a CSV file
------------------

This section assumes you have CSV files in the ``./mydata/csv/`` directory. If you don't have any CSV files, you can create some using the ``embulk example ./mydata`` command.

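
If you would rather build a small input file by hand, the following sketch writes one gzip-compressed CSV file. The rows are made-up sample values, and the column names, CRLF line endings, and ``.csv.gz`` suffix are assumptions chosen to match the guessed configuration shown later in this article. Note that it overwrites ``./mydata/csv/sample_01.csv.gz`` if that file already exists:

.. code-block:: bash

    # Create the input directory used by the configuration in this article.
    mkdir -p ./mydata/csv

    # Write a header line and two made-up rows with CRLF line endings,
    # then compress the result with gzip.
    printf '%s\r\n' \
        'id,account,time,purchase,comment' \
        '1,1001,2015-03-01 10:15:00,20150301,first row' \
        '2,1002,2015-03-01 11:30:00,20150301,second row' \
        | gzip > ./mydata/csv/sample_01.csv.gz

    # Show the decompressed content to confirm the file is readable.
    gzip -dc ./mydata/csv/sample_01.csv.gz
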
Create the following configuration file and save it as ``config.yml``:

.. code-block:: yaml

    in:
      type: file
      path_prefix: ./mydata/csv/
    out:
      type: elasticsearch
      index: embulk
      index_type: embulk
      nodes:
      - host: localhost

This configuration still lacks some important information, such as the file format and the column definitions. Embulk can guess the missing parts from the data itself, so the next step is to ask it to do so:

.. code-block:: console

    $ embulk guess config.yml -o config-complete.yml

The generated ``config-complete.yml`` file should contain the complete configuration, like the following:

.. code-block:: yaml

    in:
      type: file
      path_prefix: ./mydata/csv/
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        escape: ''
        null_string: 'NULL'
        skip_header_lines: 1
        columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
    out:
      type: elasticsearch
      index: embulk
      index_type: embulk
      nodes:
      - {host: localhost}

Now, you can run the bulk loading:

.. code-block:: console

    $ embulk run config-complete.yml -o next-config.yml

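
To check that the documents arrived, you can ask Elasticsearch how many documents the index holds. This is a minimal sketch, assuming ``curl`` is installed and Elasticsearch is reachable on its default HTTP port 9200 (an assumption, since the port is not stated above). ``ES_URL``, ``INDEX``, and ``COUNT_URL`` are names chosen for this example only:

.. code-block:: bash

    ES_URL="http://localhost:9200"   # assumed default Elasticsearch HTTP endpoint
    INDEX="embulk"                   # index name from the configuration above

    # The _count API returns the number of documents in the index.
    COUNT_URL="$ES_URL/$INDEX/_count?pretty"
    curl -s --max-time 5 "$COUNT_URL" || echo "could not reach $COUNT_URL" >&2

The ``count`` field in the JSON response should match the number of data rows in your CSV files.
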
Scheduling loading by cron
--------------------------

In the last step, you ran the embulk command with the ``-o next-config.yml`` option. The generated ``next-config.yml`` file should include a parameter named ``last_path``:

.. code-block:: yaml

    last_path: mydata/csv/sample_01.csv.gz

With this setting, embulk loads only the files that come after this file in alphabetical order.

For example, if you create ``./mydata/csv/sample_02.csv.gz``, the next run skips ``sample_01.csv.gz`` and loads only ``sample_02.csv.gz``. The ``next-config.yml`` file written by that run then contains ``last_path: mydata/csv/sample_02.csv.gz`` for the run after it.

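
The selection rule described above can be imitated in a few lines of shell. This is only an illustrative sketch of the alphabetical comparison, not Embulk's actual implementation, and ``select_new_files`` is a hypothetical helper name:

.. code-block:: bash

    # Print the files matching a path prefix that sort after last_path.
    # $1: path prefix, $2: last_path from the previous run.
    select_new_files() {
        for f in "$1"*; do
            [ -f "$f" ] && printf '%s\n' "$f"
        done | LC_ALL=C sort | LC_ALL=C awk -v last="$2" '($0 "") > (last "")'
    }

    # Demonstration with two empty files in a temporary directory.
    demo_dir="$(mktemp -d)"
    touch "$demo_dir/sample_01.csv.gz" "$demo_dir/sample_02.csv.gz"

    # Prints only the path ending in sample_02.csv.gz.
    select_new_files "$demo_dir/sample_" "$demo_dir/sample_01.csv.gz"

Because the comparison is purely alphabetical, new files need names that sort after the existing ones (for example, a zero-padded sequence number or a timestamp in the file name), or they will be skipped.
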
So, if you want to load newly created files every hour, you can set up this cron schedule:

.. code-block:: cron

    0 * * * * embulk run /path/to/next-config.yml -o /path/to/next-config.yml

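
Cron usually runs jobs with a minimal environment, so ``embulk`` (and the ``java`` command it needs) may not be found on the default ``PATH``. One possible approach is a small wrapper script that calls Embulk by its full path and replaces the configuration file only when the run succeeds. The following is a sketch under those assumptions. The script name ``run-embulk.sh``, the ``run_load`` function, and the ``/path/to/`` locations are hypothetical placeholders:

.. code-block:: bash

    #!/bin/sh
    # Hypothetical wrapper for cron; adjust both paths for your system.
    EMBULK="${EMBULK:-/usr/local/bin/embulk}"
    CONFIG="${CONFIG:-/path/to/next-config.yml}"

    run_load() {
        # Write the next configuration to a temporary file first, so that
        # a failed run leaves the current configuration untouched.
        tmp="${CONFIG}.tmp"
        if "$EMBULK" run "$CONFIG" -o "$tmp"; then
            mv "$tmp" "$CONFIG"
        else
            rm -f "$tmp"
            echo "embulk run failed; keeping $CONFIG unchanged" >&2
            return 1
        fi
    }

    if [ -f "$CONFIG" ]; then
        run_load
    else
        echo "config file not found: $CONFIG" >&2
    fi

With the script saved as ``/path/to/run-embulk.sh`` and made executable, the cron entry could call it and append its output to a log file (the log path is also a placeholder):

.. code-block:: cron

    0 * * * * /path/to/run-embulk.sh >> /path/to/embulk-cron.log 2>&1
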