= Apache Arrow : subtitle A cross-language development platform\nfor in-memory data : author Kouhei Sutou : institution ClearCode Inc. : content-source SciPy Japan Conference 2019 : date 2019-04-23 : start-time 2019-04-23T14:30:00+09:00 : end-time 2019-04-23T14:55:00+09:00 : theme . = Me Ruby committer since 2004\n (('note:2004年からRubyコミッター')) = Why do I talk at SciPy?\n(('note:なぜSciPyで話しているのか？')) To introduce\n ((*Apache Arrow*))\n (('note:((*Apache Arrow*))を紹介するため')) = Apache Arrow # blockquote # title = https://arrow.apache.org/ A cross-language development platform for in-memory data\n (('note:インメモリーデータ向け多言語対応開発プラットフォーム')) = Cross-language\n(('note:多言語対応')) * C, C++, C#, Go, Java, * JavaScript, MATLAB, ((*Python*)), * R, ((*Ruby*)) and Rust = Development platform\n(('note:開発プラットフォーム')) Apache Arrow ... * specifies standards and\n (('note:標準化')) * provides implementations\n (('note:実装')) to advance cooperation by many people\n (('note:多くの人が協力できるように')) = For in-memory data\n(('note:インメモリーデータ')) Apache Arrow focuses on ↓ ((*for now*))\n (('note:Apache Arrowは((*今のところ*))は↓に注力')) * sharing columnar/tensor data\n (('note:カラムナーデータ・テンソルデータの共有')) * analyzing columnar data\n (('note:カラムナーデータの分析')) * RPC for columnar data\n (('note:カラムナーデータのRPC')) = Apache Arrow and Python\n(('note:Apache ArrowとPython')) * As pickle replacement\n (('note:pickleの代替')) * PySpark does\n (('note:PySparkはすでにやっている')) * As dataframe library\n (('note:データフレームライブラリー')) * pandas and Vaes use Apache Arrow a bit\n (('note:pandasとVaesはApache Arrowを少し使っている')) = Apache Arrow and me\n(('note:Apache Arrowと私')) # image # src = images/apache-arrow-and-me.svg # align = right # vertical-align = bottom # relative-width = 65 # relative-margin-right = 20 # relative-padding-bottom = -15 * A release manager(('note:（リリースマネージャー）')) * 0.11.0 and 0.13.0 (('note:(the latest release/最新リリース)')) * An active developer(('note:（アクティブな開発者）')) (('tag:margin-top * 10')) = Feature (1)\n(('note:機能（1）')) Effective serialization\n (('note:効率的なシリアライズ')) = Why effective?\n(('note:なぜ効率的なのか')) * Don't parse data\n (('note:データをパースしないから')) * Use data directly\n (('note:データをそのまま使うから')) = Data format: Number\n(('note:データフォーマット：数値')) Contiguous data (Same as C array) 連続データ（Cの配列と同じ） 32bit integer: [1, 2, 3] 0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ... = Compare to JSON\n(('note:JSONと比較')) 　　　　"[1, 2, 3]" ↓ "1" → 1 (String → Number) "2" → 2 (String → Number) "3" → 3 (String → Number) = Merit of direct data use\n(('note:データを直接使うことのメリット')) * Zero copy cost\n (('note:コピーコストをなくせる')) * Copy is costly for large data\n (('note:大きなデータではコピーはコストが高い')) * (Nearly) zero parse cost\n (('note:（ほぼ）パースコストをなくせる')) * Only need to parse metadata\n (('note:メタデータをパースするだけでよい')) = Performance\n(('note:性能')) # image # src = images/serialize-list.png # relative_height = 77 (('note:(())')) (('note:Apache License 2.0: (c) 2016-2019 The Apache Software Foundation')) = Zero copy and large data\n(('note:ゼロコピーと大きなデータ')) * pandas can't process large data\n (('note:pandasは大きなデータを扱えない')) * Because it needs to allocate memory\n (('note:メモリーを確保する必要があるから')) * Apache Arrow supports memory mapping\n (('note:Apache Arrowはメモリーマッピング対応')) * Can use data in file directly without copy\n (('note:ファイル内のデータをコピーせずに使える')) = Effective string representation\n(('note:効率的な文字列表現')) * pandas: Array of strings\n (('note:pandas：文字列の配列')) * Use discontiguous memory\n (('note:非連続なメモリーを使う')) * Apache Arrow: Data and array of lengths\n (('note:Apache Arrow：データと長さの配列')) * Use contiguous memory: Fast\n (('note:連続したメモリーを使う：速い')) = Data format: String\n(('note:データフォーマット：文字列')) Data bytes + length array UTT-8 string: ["Hello", "", "!"] Data bytes: "Hello!" Length array: [0, 5, 5, 6] i-th length: lengths[i+1] - lengths[i] i-th data： data[lengths[i]:lengths[i+1]] = Feature(?) (2)\n(('note:機能（？）（2）')) Specify data format\n (('note:データフォーマットを仕様化')) = Why do Arrow specify?\n(('note:なぜArrowは仕様化するのか')) Effective\n data exchange\n (('note:効率的なデータ交換のため')) = Effective data exchange\n(('note:効率的なデータ交換')) * Use common format widely\n (('note:みんなが同じフォーマットを使うこと')) * No format conversion reduces resource usage\n (('note:フォーマットを変換しなくてよいならリソース使用量を減らせる')) * Use low {,de}serialize cost format\n (('note:シリアライズコストが低いフォーマットを使うこと')) * Fast = Who uses Arrow format?\n(('note:Arrowフォーマットをだれが使っているか')) * (()): For NVIDIA GPU * (()), (()): For FPGA * (()): For interprocess data exchange\n (('note:Spark：プロセス間のデータ交換のために')) = CPU and GPU * Can't share data on memory\n (('note:メモリー上のデータを共有できない')) * Need to copy between CPU and GPU\n (('note:CPUとGPU間でコピーする必要がある')) * Effective data exchange improves performance\n (('note:データ交換を効率化することで高速化')) = CPU and FPGA * Can't share data on memory\n (('note:メモリー上のデータを共有できない')) * Need to copy between CPU and FPGA\n (('note:CPUとFPGA間でコピーする必要がある')) * Effective data exchange improves performance\n (('note:データ交換を効率化することで高速化')) = Spark * Process large data\n (('note:大きなデータを処理')) * Need to pass data to worker processes\n (('note:ワーカープロセスにデータを渡す必要がある')) * Effective data exchange improves performance\n (('note:データ交換を効率化することで高速化')) = PySpark * Worker by Python\n (('note:ワーカーはPython')) * Use pikcle to exchange data\n (('note:データ交換にpickleを使用')) * Spark supports Arrow for data exchange\n (('note:Arrowを使ったデータ交換をサポート')) * Disabled by default\n (('note:デフォルトでは無効')) = PySpark with Arrow In [2]: %time pdf = df.toPandas() CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s Wall time: 20.7 s In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true") In [4]: %time pdf = df.toPandas() CPU times: user 40 ms, sys: 32 ms, total: 72 ms Wall time: 737 ms (('note:((< Speeding up PySpark with Apache Arrow|URL:https://arrow.apache.org/blog/2017/07/26/spark-arrow/>))')) = Feature (3)\n(('note:機能（3）')) Optimized\n data processing\n modules\n (('note:最適化されたデータ処理モジュール')) = Optimized data processing\n(('note:最適化されたデータ処理モジュール')) * Apache Arrow targets large data\n (('note:Apache Arrowは大きなデータを対象にしている')) * Performance is important\n (('note:性能は重要')) * How to get high performance...?\n (('note:どうすれば速くできる。。。？')) = High performance (1)\n(('note:高速化（1）')) Data locality\n (('note:データを局所化')) = Data locality\n(('note:データを局所化')) * Minimize cache misses\n (('note:キャッシュミスを減らす')) * Storage is very slow\n (('note:ストレージはすごく遅い')) * Memory is slow\n (('note:メモリーは遅い')) * CPU cache is fast\n (('note:CPUキャッシュは速い')) = High performance (2)\n(('note:高速化（2）')) SIMD\n (('note:Single Instruction Multi Data'))\n (('note:一気に複数のデータを処理する方法')) = SIMD * Data must be contiguous and aligned\n (('note:データは連続していてアラインされていないといけない')) * Arrow format is SIMD ready\n (('note:ArrowフォーマットはSIMDを使える')) * No condition branch\n (('note:条件分岐がないこと')) * Use bitmap instead of "missing" for null\n (('note:nullを表現するために「欠損値」ではなく別途ビットマップを使う'))\n (('note:((<"Is it time to stop using sentinel values for null / NA values?"|URL:http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/>))')) = No condition branch\n(('note:条件分岐なし')) # img # src = images/simd-null.svg # relative_height = 100 == Slide property : enable-title-on-image false = FYI: null\n(('note:参考情報：null')) * All data types support null in Arrow\n (('note:Arrowはすべての型でnullをサポート')) * Some types only support null in NumPy\n (('note:NumPyは一部の型でnullをサポート'))\n (('note:((<欠損値の制約 - PythonとApache Arrow|URL:https://speakerdeck.com/sinhrks/pythontoapache-arrow-eaf72479-ce30-4161-8c73-15b555cc56c7?slide=11>))')) = High performance (3)\n(('note:高速化（3）')) Thread = Thread * Use multi-cores in single process\n (('note:シングルプロセスで複数コアを使う')) * Minimize resource conflict\n (('note:リソースの競合をなくすこと')) * Locking to avoid conflict reduces performance\n (('note:競合を避けるためにロックすると性能劣化')) * Approaches(('note:（アプローチ）')) * Read only or copy (shared nothing)\n (('note:リードオンリーにするかコピー（なにも共有しない）')) = Apache Arrow and thread\n(('note:Apache Arrowとスレッド')) * Data is read only\n (('note:データはリードオンリー')) * Share data in threads without lock overhead\n (('note:ロックのオーバーヘッドなしでスレッド間でデータを共有')) * Avoid both locking and copying\n (('note:ロックもコピーも避ける')) * They reduce performance\n (('note:どちらも性能劣化するから')) = High performance (4)\n(('note:高速化（4）')) Compute kernels\n (('note:計算カーネル')) = Compute kernels\n(('note:計算カーネル')) * SIMD ready primitive operations\n (('note:SIMDを使ったプリミティブな演算')) * Projection, Filter, Aggregation, ...\n (('note:射影とかフィルターとか集計とかとか')) * compare, take, mean, ...\n (('note:比較とか行選択とか平均とか')) = High performance (5)\n(('note:高速化（5）')) Subgraph compiler\n (('note:サブグラフコンパイラー')) = Subgraph compiler: Gandiva\n(('note:サブグラフコンパイラー：Gandiva')) * Compile operator graphs at run-time\n (('note:実行時に演算グラフをコンパイル')) * Operator graph: combined multiple operations\n (('note:演算グラフ：演算のまとまり')) * (({table.a + table.b < table.c && ...})) * Usable for query engine backend\n (('note:クエリーエンジンのバックエンドとして使える')) = High performance (6)\n(('note:高速化（6）')) Query engine\n (('note:クエリーエンジン')) = Query engine\n(('note:クエリーエンジン')) * For single node\n (('note:シングルノード向け')) * Dataflow-style operator execution\n (('note:データが流れるように演算を実行')) * scan → project → filter → aggregate → ...\n (('note:データ取得→射影→フィルター→集計→…')) (('note:(())')) = Query engine from Python\n(('note:Pythonからクエリーエンジンを使う')) * With pandas(('note:（pandasと使う）')) * Large data → execute → (({to_pandas()}))\n (('note:大きなデータ→実行→(({to_pandas()}))')) * With Dask(('note:（Daskと使う）')) * Dask will be able to use this as backend\n (('note:Daskのバックエンドで使えるかも？')) = High performance (7)\n(('note:高速化（7）')) Datasets\n (('note:データセット')) = Datasets\n(('note:データセット')) * Scan data from storage/database\n (('note:ストレージ・データベースからデータ取得')) * File systems: local, HDFS, ... * Formats: CSV, Parquet, ... * Databases: MySQL, PostgreSQL, ... (('note:(())')) = Fast datasets\n(('note:高速なデータセット')) * Predicate pushdown\n (('note:条件のプッシュダウン')) * Scan only needed data\n (('note:必要なデータのみ取得')) * Parallel scan\n (('note:並列取得')) = Feature (4)\n(('note:機能（4）')) RPC = RPC: Arrow Flight * Fast RPC framework for Arrow\n (('note:Arrow用の高速なRPC')) * Based on gRPC with low-level extensions\n (('note:gRPCベースでいくつか低レベルの拡張をしている')) (('note:(())')) = Wrap up\n(('note:まとめ')) * Arrow is useful for SciPy community\n (('note:SciPyコミュニティーにArrowは有用')) * in not only Python but also other languages\n (('note:Pythonだけでなく他の言語でも有用')) * Join Apache Arrow development!\n (('note:Apache Arrowの開発に参加しよう！')) * Ask me how to start\n (('note:なにから始めればよいかは私に相談してね')) = Next step\n(('note:次の一歩')) * (()): dev@arrow.apache.org * Chat in Japanese: * (()) * Apache Arrow Tokyo Meetup 2019 (('note:this summer?')) * See also: (())