Scheduled bulk data loading to Elasticsearch + Kibana 4 from CSV files
======================================================================

.. contents::
   :local:
   :depth: 2

This article shows how to:

* Bulk load CSV files into Elasticsearch.
* Visualize the data interactively with Kibana.
* Schedule the data loading every hour using cron.

This guide assumes you are using Ubuntu 12.04 Precise or Mac OS X.

Set up Elasticsearch and Kibana 4
---------------------------------

Step 1. Download and start Elasticsearch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can find releases on the Elasticsearch website. For the smallest setup, unzip the package and run the ``./bin/elasticsearch`` command:

.. code-block:: console

    $ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.zip
    $ unzip elasticsearch-1.4.4.zip
    $ cd elasticsearch-1.4.4
    $ ./bin/elasticsearch

Step 2. Download and start Kibana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can find releases on the Kibana website. Open a new console and run the following commands:

.. code-block:: console

    $ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-linux-x64.tar.gz
    $ tar zxvf kibana-4.0.0-linux-x64.tar.gz
    $ cd kibana-4.0.0-linux-x64
    $ ./bin/kibana

Note: If you're using Mac OS X, download https://download.elasticsearch.org/kibana/kibana/kibana-4.0.0-darwin-x64.tar.gz instead.

Elasticsearch and Kibana are now running. Open http://localhost:5601/ in your browser to see Kibana's graphical interface.

Set up Embulk
-------------

Step 1. Download the Embulk binary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can find the latest Embulk binary on the releases page. Because Embulk is a single executable binary, you can simply download it to the ``/usr/local/bin`` directory and set the executable flag as follows:

.. code-block:: console

    $ sudo wget http://dl.embulk.org/embulk-latest.jar -O /usr/local/bin/embulk
    $ sudo chmod +x /usr/local/bin/embulk

Step 2. Install the Elasticsearch plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You also need the Elasticsearch plugin for Embulk.
You can install the plugin with this command:

.. code-block:: console

    $ embulk gem install embulk-output-elasticsearch

Embulk has a built-in CSV file reader, so everything is now ready to use.

Loading a CSV file
------------------

This section assumes you have CSV files in the ``./mydata/csv/`` directory. If you don't have any, you can create sample files with the ``embulk example ./mydata`` command.

Create this configuration file and save it as ``config.yml``:

.. code-block:: yaml

    in:
      type: file
      path_prefix: ./mydata/csv/
    out:
      type: elasticsearch
      index: embulk
      index_type: embulk
      nodes:
        - host: localhost

This configuration is missing some important information, such as the file format and the column definitions, but Embulk can guess the rest. The next step is to have Embulk guess it:

.. code-block:: console

    $ embulk guess config.yml -o config-complete.yml

The generated ``config-complete.yml`` file should contain the complete information, as follows:

.. code-block:: yaml

    in:
      type: file
      path_prefix: ./mydata/csv/
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        escape: ''
        null_string: 'NULL'
        skip_header_lines: 1
        columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
    out:
      type: elasticsearch
      index: embulk
      index_type: embulk
      nodes:
      - {host: localhost}

Now you can run the bulk loading:

.. code-block:: console

    $ embulk run config-complete.yml -o next-config.yml

Scheduling loading by cron
--------------------------

In the previous step, you ran the ``embulk`` command with the ``-o next-config.yml`` option. The ``next-config.yml`` file should include a parameter named ``last_path``:

.. code-block:: yaml

    last_path: mydata/csv/sample_01.csv.gz

With this configuration, Embulk loads only the files that come after this file in alphabetical order.
For example, if you create a ``./mydata/csv/sample_02.csv.gz`` file, the next run skips ``sample_01.csv.gz`` and loads only ``sample_02.csv.gz``. The ``next-config.yml`` file written by that run then contains ``last_path: mydata/csv/sample_02.csv.gz`` for the run after it.

So, if you want to load newly created files every hour, you can set up this cron schedule:

.. code-block:: cron

    0 * * * * embulk run /path/to/next-config.yml -o /path/to/next-config.yml
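
The ``last_path`` rule described above can be pictured with a small shell sketch. This is not Embulk's implementation. It only assumes that a file is picked up when its path sorts after ``last_path`` in a plain string comparison, and the function name ``select_new_files`` is hypothetical:

.. code-block:: bash

    #!/usr/bin/env bash
    # Illustration only: mimics the "load files that sort after last_path" rule.
    # select_new_files is a hypothetical helper, not part of Embulk.
    select_new_files() {
      local last_path="$1"   # value recorded in next-config.yml
      shift                  # remaining arguments are candidate file paths
      local f
      for f in "$@"; do
        # Keep only paths that sort strictly after last_path.
        if [[ "$f" > "$last_path" ]]; then
          printf '%s\n' "$f"
        fi
      done
    }

    select_new_files "mydata/csv/sample_01.csv.gz" \
      mydata/csv/sample_01.csv.gz \
      mydata/csv/sample_02.csv.gz

With ``last_path`` set to ``mydata/csv/sample_01.csv.gz``, the sketch prints only ``mydata/csv/sample_02.csv.gz``, which matches the example in this section. One practical consequence is that new files should be given names that sort after the files already loaded, for example by using a zero-padded sequence number or a timestamp in the file name.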