= Ruby + ADBC : subtitle A single API between Ruby and DBs : author Sutou Kouhei : institution ClearCode Inc. : content-source RubyKaigi 2023 : date 2023-05-13 : start-time 2023-05-13T13:30:00+09:00 : end-time 2023-05-13T14:00:00+09:00 : theme clear-code = Sutou Kouhei\nA president Ruby committer The president of ClearCode Inc.\n (('note:クリアコードの社長')) # img # src = images/clear-code-rubykaigi-2023-silver-sponsor.png # relative_width = 90 # reflect_ratio = 0.1 = Sutou Kouhei\nThe 3rd Apache Arrow PMC chair * PMC: Project Management Committee\n (('note:Apache Arrowのプロジェクト管理委員会の3代目代表')) * #2 commits(('note:(コミット数2位)')) # img # src = images/apache-arrow-commits-kou-with-mark.png # relative_height = 120 # reflect_ratio = 0.1 = Sutou Kouhei\nThe pioneer in Ruby and ADBC * A Ruby committer * Maintain some standard libraries/default gems\n (('note:標準ライブラリーとかデフォルトgemのメンテナンスをしている')) * The author of Red ADBC * The official ADBC library for Ruby\n (('note:公式のRuby用のADBCライブラリー')) * ADBC is developed by Arrow project\n (('note:ADBCはApache Arrowプロジェクトが開発している')) = Sutou Kouhei\nThe founder of Red Data Tools * Provides data processing tools for Ruby\n (('note:Ruby用のデータ処理ツールを提供するプロジェクト'))\n (('note:(())'))\n (('note:(())')) * Policies\n (('note:ポリシー')) * 5. Ignore criticism from outsiders\n (('note:部外者からの非難は気にしない'))\n (('note:Ignore "I use XXX for it instead of Ruby because..."')) * 6. Fun!\n (('note:楽しくやろう!')) = Topic\n(('note:話すこと')) Let's use ((*Ruby*)) to extract and load large data!\n (('note:大量データの読み書きにもRubyを使おうぜ!')) = Embulk? * Bulk data loader implemented in Java\n (('note:Javaで実装されたバルクデータローダー')) * JRuby supported!\n (('note:JRubyサポート!')) = "Embulk v0.11 is coming soon"\n(('note:「Embulk v0.11 がまもなく出ます」')) (('note:(())')) # blockquote we plan to gradually shrink our support on (J)Ruby (('note:Embulk の (J)Ruby サポートは徐々に縮小していく計画です。')) = Another approach: ADBC\n(('note:別のアプローチ:ADBC')) * ((*A*))rrow ((*D*))ata((*b*))ase ((*C*))onnectivity * A single API for accessing many DBs\n (('note:各種DBにアクセスするための共通API')) * Like Active Record/Sequel in Ruby\n (('note:Rubyで言えばActive RecordやSequelみたいなもの')) = ADBC:Features\n(('note:ADBC:特徴')) * Cross-language\n (('note:多言語対応')) * Active Record needs adapters impl-ed in Ruby\n (('note:Active RecordではRubyでアダプターを実装しないといけない')) * ADBC can use adapters impl-ed in other langs\n (('note:ADBCでは他の言語で実装されたアダプターも使える')) * Optimized for large columnar data\n (('note:大きな列指向データに最適化')) = Large column-oriented data\n(('note:大きな列指向データ')) * Large: >= 1M records with 1 column\n (('note:大きな:1カラムなら100万レコード以上')) * Column-oriented\n (('note:列指向')) (('tag:margin-bottom * 20')) # image # src = images/column-oriented.svg # align = right # vertical-align = top # relative-width = 60 # relative-margin-right = -10 # relative-margin-top = 30 = Optimized for large columnar data\n(('note:大きな列指向データに最適化')) * Apache Arrow data format:\n Minimize data interchange cost!\n (('note:Apache Arrowデータフォーマット:データ交換コストがめっちゃ安い!')) * Partitioned result sets:\n Fast data extract\n (('note:結果セットの分割:高速なデータ読み込み')) * Bulk insert:\n Fast data load\n (('note:バルクインサート:高速なデータ書き込み')) # img # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf # align = right # vertical-align = top # relative-width = 40 # relative-margin-right = -5 # relative-margin-top = 65 # caption = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/ # caption-font-size = 1 = How fast is ADBC?\n(('note:ADBCはどのくらい速いの?')) * 1 integer column\n (('note:数値カラム1つだけ')) * (({SELECT * FROM x})) * Lower is faster\n (('note:低いほど速い')) * About 2x faster\n with 10M records\n (('note:1000万レコードで2倍ほど速い')) # charty # backend = pyplot # align = right # vertical-align = top # relative-margin-right = -10 # relative_height = 90 # type = bar # x = N records # y = Elapsed time (sec) # color = Approach Approach,N records,Elapsed time (sec) ADBC,100K,0.009 libpq,100K,0.010 ADBC,1M,0.066 libpq,1M,0.113 ADBC,10M,0.598 libpq,10M,1.122 = Architecture\n(('note:アーキテクチャー')) * Single API\n (('note:同じAPIで使える')) * Driver per\n protocol\n (('note:プロトコルごとに'))\n (('note:ドライバーを用意')) * API returns\n Arrow data\n (('note:レスポンスはArrowデータ')) # img # src = images/ADBCFlow2.svg # align = right # vertical-align = top # relative-width = 60 # relative-margin-top = 20 # relative-margin-right = -5 # caption = https://arrow.apache.org/img/ADBCFlow2.svg Apache-2.0 © 2016-2023 The Apache Software Foundation # caption-font-size = 1 = API * C API * Bindings: GLib, Python, R, ((*Ruby*)) * Go API * Java API * Rust API (WIP) (('note:See also: (())')) = C API * (({AdbcDatabase})): It holds state shared by multiple connections\n (('note:複数の接続を管理')) * (({AdbcConnection})): It's a single, logical connection to a database\n (('note:1つの接続を管理')) * (({AdbcStatement})): It holds state related to query execution\n (('note:クエリーの実行を管理')) (('note:See also: (())')) = Ruby API: Extract # rouge ruby require "adbc" options = { driver: "adbc_driver_postgresql", uri: "postgresql://127.0.0.1:5432/db", } ADBC::Database.open(**options) do |database| database.connect do |connection| connection.open_statement do |statement| query = "SELECT * FROM data" table, = statement.query(query) p table end end end = Ruby API: Load # rouge ruby require "adbc" options = { driver: "adbc_driver_postgresql", uri: "postgresql://127.0.0.1:5432/db", } ADBC::Database.open(**options) do |database| database.connect do |connection| connection.open_statement do |statement| input = Arrow::Table.load("in.arrow") statement.ingest("table", input) # ... end end end = Ruby API - Active Record (('tag:center'))(('tag:xx-large'))WIP (('tag:center')) (('note:(())'))\n (('note:Join us! We need to improve drivers too.')) = Available drivers\n(('note:利用可能なドライバー')) # RT DB, Status DuckDB, Beta Flight SQL, Beta PostgreSQL, Experimental SQLite, Beta = How to implement a driver\n(('note:ドライバーの実装方法')) * Choose C, C++ or Go * See the following implementations: * C: (('note:(())')) * C++: (('note:(())')) * Go (Go API): (('note:(())')) * Go (C API): (('note:(())')) = Current ADBC\n(('note:現時点のADBC')) * 1 integer column\n (('note:数値カラム1つだけ')) * (({SELECT * FROM x})) * Lower is faster\n (('note:低いほど速い')) * libpq driver\n is slow for now...\n (('note:実は現時点ではlibqpドライバーは遅い…')) # charty # backend = pyplot # align = right # vertical-align = top # relative-margin-right = -10 # relative_height = 90 # type = bar # x = N records # y = Elapsed time (sec) # color = Approach Approach,N records,Elapsed time (sec) ADBC (Flight SQL),100K,0.009 libpq,100K,0.010 ADBC (libpq),100K,0.02 ADBC (Flight SQL),1M,0.066 libpq,1M,0.113 ADBC (libpq),1M,0.217 ADBC (Flight SQL),10M,0.598 libpq,10M,1.122 ADBC (libpq),10M,2.266 = Flight SQL? SQL\n on\n Apache Arrow Flight\n (('note:Apache Arrow Flightの上でSQLを使えるようにしたもの')) = Apache Arrow Flight? * Arrow format based fast RPC framework\n (('note:Arrowフォーマットを使った高速RPCフレームワーク')) * Minimum data interchange cost!\n (('note:データ交換コストがめっちゃ安い!')) * Parallel transfers\n (('note:並列転送')) * Stream processing\n (('note:ストリーム処理')) # img # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-2021/apache-arrow-flight.pdf # align = right # vertical-align = top # relative-width = 40 # relative-margin-right = -5 # relative-margin-top = 65 # caption = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-2021/ # caption-font-size = 1 = Simple usage\n(('note:簡単な使い方')) # image # src = https://arrow.apache.org/img/20191014_flight_simple.png # relative-height = 75 (('tag:center')) (('note:(())'))\n (('note:Apache License 2.0 - © 2016-2021 The Apache Software Foundation')) = GetFlightInfo * Client→Server\n (('note:クライアント→サーバー')) * Server returns\n how to get data\n (('note:サーバーはデータの取得方法を返す')) * FlightInfo: How to get data\n (('note:FlightInfo: データの取得方法')) * Metadata: Schema, # of records, ...\n (('note:メタデータ:スキーマ・総レコード数…')) * 1+ Endpoints: Data may be distributed!\n (('note:複数エンドポイント:データは複数ヶ所に分散しているかもしれない!')) # image # src = https://arrow.apache.org/img/20191014_flight_simple.png # relative-width = 35 # align = right # vertical-align = top # relative-margin-right = -10 # draw0 = [rectangle, false, 0.28, 0.15, 0.39, 0.33, {color: "#0041ab", line_width: 3}] = DoGet * Client→Server\n (('note:クライアント→サーバー')) * Server returns data\n (('note:サーバーはデータを返す')) * Data: Record batch stream\n (('note:データ:レコードバッチのストリーム')) * Called as FlightData in protocol\n (('note:プロトコルレベルではFlightDataと呼んでいる')) * Record batch: 0+ records\n (('note:レコードバッチ:0個以上のレコードの集まり')) # image # src = https://arrow.apache.org/img/20191014_flight_simple.png # relative-width = 35 # align = right # vertical-align = top # relative-margin-right = -10 # draw0 = [rectangle, false, 0.28, 0.48, 0.39, 0.48, {color: "#0041ab", line_width: 3}] # = Scale out architecture example\n(('note:スケールアウト構成例')) # # # image # # src = https://arrow.apache.org/img/20191014_flight_complex.png # # relative-height = 80 # # (('tag:right')) # (('note:(())'))\n # (('note:Apache License 2.0 - © 2016-2021 The Apache Software Foundation')) # # = Point: Avoid inefficient forwards\n(('note:ポイント:ムダな転送を回避可能')) # # * A distributed system # * Communicate via "coordinator"\n # (('note:「コーディネーター」経由で通信')) # * Node→Coodinator→Client # * Apache Arrow Flight # * Client can fetch data from nodes directly\n # (('note:直接クライアントがデータ取得可能')) # * Node→Client # # # image # # src = https://arrow.apache.org/img/20191014_flight_complex.png # # relative-width = 30 # # align = right # # vertical-align = top # # relative-margin-right = -10 = Apache Arrow Flight SQL # mermaid # relative_height = 75 sequenceDiagram Client->>Server: GetFlightInfo(CommandStatementQuery: SQL) Server->>Client: FlightInfo{..., Ticket, ...} Client->>Server: DoGet(Ticket) Server->>Client: query results as Apache Arrow data (('tag:center')) (('note:(())'))\n (('note:Apache License 2.0 - © 2016-2023 The Apache Software Foundation')) == Slide properties : enable-title-on-image false = Current ADBC\n(('note:現時点のADBC')) * 1 integer column\n (('note:数値カラム1つだけ')) * (({SELECT * FROM x})) * Lower is faster\n (('note:低いほど速い')) * libpq driver\n is slow for now...\n (('note:実は現時点ではlibqpドライバーは遅い…')) # charty # backend = pyplot # align = right # vertical-align = top # relative-margin-right = -10 # relative_height = 90 # type = bar # x = N records # y = Elapsed time (sec) # color = Approach Approach,N records,Elapsed time (sec) ADBC (Flight SQL),100K,0.009 libpq,100K,0.010 ADBC (libpq),100K,0.02 ADBC (Flight SQL),1M,0.066 libpq,1M,0.113 ADBC (libpq),1M,0.217 ADBC (Flight SQL),10M,0.598 libpq,10M,1.122 ADBC (libpq),10M,2.266 = But can PostgreSQL talk Flight SQL?\n(('note:でもPostgreSQLはFlight SQLをしゃべれるの?')) Flight SQL adapter\n (('note:(())'))\n (('note:I'm the author')) = Architecture # mermaid # relative_height = 100 # background_color = black sequenceDiagram participant C as Client participant M as PG(master) participant FSM as PG(Flight SQL main) participant FSS as PG(Flight SQL server) participant FSE as PG(Flight SQL executor) M->>FSM: Spawn FSM->>FSS: Spawn note right of FSS: Listen gRPC socket (multi-threading) C->>FSS: Connect with Flight SQL protocol FSS->>FSM: Allocate an executor for this session FSM->>FSE: Spawn C->>FSS: Send a query FSS->>FSE: Pass the given query via shared memory note right of FSE: Run the given query with SPI note right of FSE: Convert a result to Apache Arrow data FSE->>FSS: Pass the result via shared memory FSS->>C: Return the result with Flight SQL protocol == Slide properties : enable-title-on-image false = Wrap up\n(('note:まとめ')) * We can use Ruby to extract and load large data by ADBC! (('note:(in a few years...)'))\n (('note:ADBCを使うとRubyで大量データを読み書きできるよ!(近いうちに。。。)')) * PostgreSQL'll be Flight SQL ready soon!\n (('note:すぐにPostgreSQLでFlight SQLを使えるようになるよ!')) * We can use ADBC via Active Record soon\n (('note:すぐにActive Record経由でADBCを使えるようになるよ!')) = Join us!\n(('note:一緒に開発しようぜ!')) * Red Data Tools: A project that provides data processing tools for Ruby\n (('note:Red Data Tools:Ruby用のデータ処理ツールを提供するプロジェクト'))\n (())\n (()) * You can implement something with us!\n (('note:一緒になにか作ろうぜ!'))\n (())\n (()) = Sponsor us?\n(('note:資金援助しない?')) * Provide XX% work time to your employee to work on Red Data Tools\n (('note:業務時間のXX%をRed Data Toolsの作業をできるようにする')) * (('tag:small'))Employ a full-time Red Data Tools developer\n (('note:フルタイムのRed Data Tools開発者を雇用する')) * Pay Red Data Tools continuously\n (('note:Red Data Toolsに継続的に資金を提供する'))\n (('note:Red Data Toolsのだれかがお金で時間を確保して作業する')) * Or contact me!\n (('note:相談して!'))