= Apache Arrow

: author
   須藤功平
: institution
   クリアコード
: content-source
   データ分析用次世代データフォーマットApache Arrow勉強会
: date
   2017-06-13
: allotted-time
   60m
: theme
   clear-code

= ハッシュタグ

(('#tokyo_arrow'))\n
(('note:今日はいろんなURLを参照するのでそれらを共有したい'))

= 流れ

  (1) Apache Arrowの概要を知る
  (2) Apache Arrowの詳細を知る
  (3) Apache Arrow関連の開発に\n
      参加する方法を知る

= 概要

DataScience.rbワークショップ\n
の資料で紹介

  * RubyもApache Arrowで\n
    データ処理言語の仲間入り\n
    (('tag:small:https://slide.rabbit-shocker.org/authors/kou/data-science-rb/'))

= 詳細

  * 最新情報はWes McKinneyさんのスライドを見るのがよい
    * https://www.slideshare.net/wesm/
  * 例：
    * (('tag:xx-small'))
      https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning
    * (('tag:xx-small'))
      https://www.slideshare.net/wesm/nextgeneration-python-big-data-tools-powered-by-apache-arrow

= 紹介

  * (('tag:xx-small'))
    https://www.slideshare.net/MapR_Japan/apache-arrow-value-vectors-tokyo-apache-drill-meetup-20160322
  * (('tag:xx-small'))
    https://www.slideshare.net/HadoopSummit/the-columnar-era-leveraging-parquet-arrow-and-kudu-for-highperformance-analytics
  * (('tag:xx-small'))
    https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning

= 開発に参加

  * Apache Arrowの旨味がでる状態
    * みんながApache Arrowを使う
  * 早く↑の状態にするには
    * Apache Arrow関連の開発に参加！\n
      (('note:待っていることもできるけど一緒にやろうよ！'))

= Apache Arrowの開発に参加

  * JIRA:(('tag:xx-small:https://issues.apache.org/jira/browse/ARROW/'))
    * コミットはすべてチケットに紐づく
    * こういうのやりたいねー！も\n
      チケットになる
  * メーリングリスト:(('tag:xx-small:dev@arrow.apache.org'))\n
    (('note:dev-subscribe@arrow.apache.orgにメールを送ればOK'))
    * 基本的にここでディスカッション
    * JIRAの新規チケットも流れる

= Apache Arrowの開発に参加

  * バグレポート
    * JIRAにチケット作成
  * バグ修正・機能追加
    * JIRAにチケット作成→GitHubでPR\n
      (('note:Pull Requestタイトルにルールあり（後述）'))
  * 相談
    * メーリングリスト

= PRのタイトル

  フォーマット：
  ARROW-XXX: [YYY] ...
  例：
  ARROW-897: [GLib] Extract ...

  ARROW-XXX: JIRAのissue ID
  [YYY]: モジュール名

= モジュール

  * Java: Java実装
  * C++: C++実装
  * GLib: C++実装のCラッパー\n
    (('note:（各種言語バインディング向け）'))
    * GLibを使用
  * JS: JavaScript実装
    * TypeScriptを使用

= WANTED: モジュール

(('tag:center'))
↓は未着手なはずなので\n
ここから開発に参加もあり

  * R: C++実装のR(('note:cpp'))ラッパー
  * Julia: Juliaネイティブ実装
  * Go: Goネイティブ実装\n
    (('note:GLib経由で使えるけどネイティブ実装の方がいいかも？'))
  * Rust: Rustネイティブ実装

= Apache Arrow関連の開発

  * 大量のデータ交換が必要な\n
    プロダクトをArrowに対応させる
    * 例：Apache Spark\n
      (('note:（PySparkはすでに進んでいる：SPARK-13534）'))

= 対応プロダクト

  * Groonga: http://groonga.org/
    * 全文検索エンジン
  * Ray: (('tag:x-small:https://github.com/ray-project/ray'))
    * 分散タスク実行エンジン
  * Turbodbc:\n
    (('tag:x-small:https://github.com/blue-yonder/turbodbc'))
    * ODBCでDB内の分析用データにアクセスするためのPythonモジュール

= Red Data Tools

(('tag:center'))
(('tag:small'))
https://red-data-tools.github.io/

  * Ruby用データ分析ツールを\n
    揃えよう！プロジェクト
    * Apache Arrowベース
  * ただし！できるだけRuby以外でも使えるようにしたい！

= Ruby以外でも使える？

  * GLibバインディングとして開発\n
    (('note:（Ruby専用バインディングとして開発しない）'))
    * Luaとかでも使えるようになる
    * 例：parquet-glib\n
      (('tag:xx-small:https://github.com/red-data-tools/parquet-glib'))
    * 例：xtensor-glib\n
      (('tag:xx-small:https://github.com/red-data-tools/xtensor-glib'))

= Ruby以外でも使える？

  * データも似たような感じで
    * どうすればいろんな言語から\n
      使いやすくなるかは要検討

= 開発に参加しよう！

  * Apache Arrow
    * dev@arrow.apache.org
  * Red Data Tools
    * https://gitter.im/red-data-tools
  * Arrowが嬉しそうなプロダクト