= Ruby + ADBC
: subtitle
A single API between Ruby and DBs
: author
Sutou Kouhei
: institution
ClearCode Inc.
: content-source
RubyKaigi 2023
: date
2023-05-13
: start-time
2023-05-13T13:30:00+09:00
: end-time
2023-05-13T14:00:00+09:00
: theme
clear-code
= Sutou Kouhei\nA president Ruby committer
The president of ClearCode Inc.\n
(('note:クリアコードの社長'))
# img
# src = images/clear-code-rubykaigi-2023-silver-sponsor.png
# relative_width = 90
# reflect_ratio = 0.1
= Sutou Kouhei\nThe 3rd Apache Arrow PMC chair
* PMC: Project Management Committee\n
(('note:Apache Arrowのプロジェクト管理委員会の3代目代表'))
* #2 commits(('note:(コミット数2位)'))
# img
# src = images/apache-arrow-commits-kou-with-mark.png
# relative_height = 120
# reflect_ratio = 0.1
= Sutou Kouhei\nThe pioneer in Ruby and ADBC
* A Ruby committer
* Maintain some standard libraries/default gems\n
(('note:標準ライブラリーとかデフォルトgemのメンテナンスをしている'))
* The author of Red ADBC
* The official ADBC library for Ruby\n
(('note:公式のRuby用のADBCライブラリー'))
* ADBC is developed by Arrow project\n
(('note:ADBCはApache Arrowプロジェクトが開発している'))
= Sutou Kouhei\nThe founder of Red Data Tools
* Provides data processing tools for Ruby\n
(('note:Ruby用のデータ処理ツールを提供するプロジェクト'))\n
(('note:(())'))\n
(('note:(())'))
* Policies\n
(('note:ポリシー'))
* 5. Ignore criticism from outsiders\n
(('note:部外者からの非難は気にしない'))\n
(('note:Ignore "I use XXX for it instead of Ruby because..."'))
* 6. Fun!\n
(('note:楽しくやろう!'))
= Topic\n(('note:話すこと'))
Let's use ((*Ruby*)) to extract and load large data!\n
(('note:大量データの読み書きにもRubyを使おうぜ!'))
= Embulk?
* Bulk data loader implemented in Java\n
(('note:Javaで実装されたバルクデータローダー'))
* JRuby supported!\n
(('note:JRubyサポート!'))
= "Embulk v0.11 is coming soon"\n(('note:「Embulk v0.11 がまもなく出ます」'))
(('note:(())'))
# blockquote
we plan to gradually shrink our support on (J)Ruby
(('note:Embulk の (J)Ruby サポートは徐々に縮小していく計画です。'))
= Another approach: ADBC\n(('note:別のアプローチ:ADBC'))
* ((*A*))rrow ((*D*))ata((*b*))ase ((*C*))onnectivity
* A single API for accessing many DBs\n
(('note:各種DBにアクセスするための共通API'))
* Like Active Record/Sequel in Ruby\n
(('note:Rubyで言えばActive RecordやSequelみたいなもの'))
= ADBC:Features\n(('note:ADBC:特徴'))
* Cross-language\n
(('note:多言語対応'))
* Active Record needs adapters impl-ed in Ruby\n
(('note:Active RecordではRubyでアダプターを実装しないといけない'))
* ADBC can use adapters impl-ed in other langs\n
(('note:ADBCでは他の言語で実装されたアダプターも使える'))
* Optimized for large columnar data\n
(('note:大きな列指向データに最適化'))
= Large column-oriented data\n(('note:大きな列指向データ'))
* Large: >= 1M records with 1 column\n
(('note:大きな:1カラムなら100万レコード以上'))
* Column-oriented\n
(('note:列指向'))
(('tag:margin-bottom * 20'))
# image
# src = images/column-oriented.svg
# align = right
# vertical-align = top
# relative-width = 60
# relative-margin-right = -10
# relative-margin-top = 30
= Optimized for large columnar data\n(('note:大きな列指向データに最適化'))
* Apache Arrow data format:\n
Minimize data interchange cost!\n
(('note:Apache Arrowデータフォーマット:データ交換コストがめっちゃ安い!'))
* Partitioned result sets:\n
Fast data extract\n
(('note:結果セットの分割:高速なデータ読み込み'))
* Bulk insert:\n
Fast data load\n
(('note:バルクインサート:高速なデータ書き込み'))
# img
# src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf
# align = right
# vertical-align = top
# relative-width = 40
# relative-margin-right = -5
# relative-margin-top = 65
# caption = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
# caption-font-size = 1
= How fast is ADBC?\n(('note:ADBCはどのくらい速いの?'))
* 1 integer column\n
(('note:数値カラム1つだけ'))
* (({SELECT * FROM x}))
* Lower is faster\n
(('note:低いほど速い'))
* About 2x faster\n
with 10M records\n
(('note:1000万レコードで2倍ほど速い'))
# charty
# backend = pyplot
# align = right
# vertical-align = top
# relative-margin-right = -10
# relative_height = 90
# type = bar
# x = N records
# y = Elapsed time (sec)
# color = Approach
Approach,N records,Elapsed time (sec)
ADBC,100K,0.009
libpq,100K,0.010
ADBC,1M,0.066
libpq,1M,0.113
ADBC,10M,0.598
libpq,10M,1.122
= Architecture\n(('note:アーキテクチャー'))
* Single API\n
(('note:同じAPIで使える'))
* Driver per\n
protocol\n
(('note:プロトコルごとに'))\n
(('note:ドライバーを用意'))
* API returns\n
Arrow data\n
(('note:レスポンスはArrowデータ'))
# img
# src = images/ADBCFlow2.svg
# align = right
# vertical-align = top
# relative-width = 60
# relative-margin-top = 20
# relative-margin-right = -5
# caption = https://arrow.apache.org/img/ADBCFlow2.svg Apache-2.0 © 2016-2023 The Apache Software Foundation
# caption-font-size = 1
= API
* C API
* Bindings: GLib, Python, R, ((*Ruby*))
* Go API
* Java API
* Rust API (WIP)
(('note:See also: (())'))
= C API
* (({AdbcDatabase})): It holds state shared by multiple connections\n
(('note:複数の接続を管理'))
* (({AdbcConnection})): It's a single, logical connection to a database\n
(('note:1つの接続を管理'))
* (({AdbcStatement})): It holds state related to query execution\n
(('note:クエリーの実行を管理'))
(('note:See also: (())'))
= Ruby API: Extract
# rouge ruby
require "adbc"
options = {
driver: "adbc_driver_postgresql",
uri: "postgresql://127.0.0.1:5432/db",
}
ADBC::Database.open(**options) do |database|
database.connect do |connection|
connection.open_statement do |statement|
query = "SELECT * FROM data"
table, = statement.query(query)
p table
end
end
end
= Ruby API: Load
# rouge ruby
require "adbc"
options = {
driver: "adbc_driver_postgresql",
uri: "postgresql://127.0.0.1:5432/db",
}
ADBC::Database.open(**options) do |database|
database.connect do |connection|
connection.open_statement do |statement|
input = Arrow::Table.load("in.arrow")
statement.ingest("table", input)
# ...
end
end
end
= Ruby API - Active Record
(('tag:center'))(('tag:xx-large'))WIP
(('tag:center'))
(('note:(())'))\n
(('note:Join us! We need to improve drivers too.'))
= Available drivers\n(('note:利用可能なドライバー'))
# RT
DB, Status
DuckDB, Beta
Flight SQL, Beta
PostgreSQL, Experimental
SQLite, Beta
= How to implement a driver\n(('note:ドライバーの実装方法'))
* Choose C, C++ or Go
* See the following implementations:
* C: (('note:(())'))
* C++: (('note:(())'))
* Go (Go API): (('note:(())'))
* Go (C API): (('note:(())'))
= Current ADBC\n(('note:現時点のADBC'))
* 1 integer column\n
(('note:数値カラム1つだけ'))
* (({SELECT * FROM x}))
* Lower is faster\n
(('note:低いほど速い'))
* libpq driver\n
is slow for now...\n
(('note:実は現時点ではlibqpドライバーは遅い…'))
# charty
# backend = pyplot
# align = right
# vertical-align = top
# relative-margin-right = -10
# relative_height = 90
# type = bar
# x = N records
# y = Elapsed time (sec)
# color = Approach
Approach,N records,Elapsed time (sec)
ADBC (Flight SQL),100K,0.009
libpq,100K,0.010
ADBC (libpq),100K,0.02
ADBC (Flight SQL),1M,0.066
libpq,1M,0.113
ADBC (libpq),1M,0.217
ADBC (Flight SQL),10M,0.598
libpq,10M,1.122
ADBC (libpq),10M,2.266
= Flight SQL?
SQL\n
on\n
Apache Arrow Flight\n
(('note:Apache Arrow Flightの上でSQLを使えるようにしたもの'))
= Apache Arrow Flight?
* Arrow format based fast RPC framework\n
(('note:Arrowフォーマットを使った高速RPCフレームワーク'))
* Minimum data interchange cost!\n
(('note:データ交換コストがめっちゃ安い!'))
* Parallel transfers\n
(('note:並列転送'))
* Stream processing\n
(('note:ストリーム処理'))
# img
# src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-2021/apache-arrow-flight.pdf
# align = right
# vertical-align = top
# relative-width = 40
# relative-margin-right = -5
# relative-margin-top = 65
# caption = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-2021/
# caption-font-size = 1
= Simple usage\n(('note:簡単な使い方'))
# image
# src = https://arrow.apache.org/img/20191014_flight_simple.png
# relative-height = 75
(('tag:center'))
(('note:(())'))\n
(('note:Apache License 2.0 - © 2016-2021 The Apache Software Foundation'))
= GetFlightInfo
* Client→Server\n
(('note:クライアント→サーバー'))
* Server returns\n
how to get data\n
(('note:サーバーはデータの取得方法を返す'))
* FlightInfo: How to get data\n
(('note:FlightInfo: データの取得方法'))
* Metadata: Schema, # of records, ...\n
(('note:メタデータ:スキーマ・総レコード数…'))
* 1+ Endpoints: Data may be distributed!\n
(('note:複数エンドポイント:データは複数ヶ所に分散しているかもしれない!'))
# image
# src = https://arrow.apache.org/img/20191014_flight_simple.png
# relative-width = 35
# align = right
# vertical-align = top
# relative-margin-right = -10
# draw0 = [rectangle, false, 0.28, 0.15, 0.39, 0.33, {color: "#0041ab", line_width: 3}]
= DoGet
* Client→Server\n
(('note:クライアント→サーバー'))
* Server returns data\n
(('note:サーバーはデータを返す'))
* Data: Record batch stream\n
(('note:データ:レコードバッチのストリーム'))
* Called as FlightData in protocol\n
(('note:プロトコルレベルではFlightDataと呼んでいる'))
* Record batch: 0+ records\n
(('note:レコードバッチ:0個以上のレコードの集まり'))
# image
# src = https://arrow.apache.org/img/20191014_flight_simple.png
# relative-width = 35
# align = right
# vertical-align = top
# relative-margin-right = -10
# draw0 = [rectangle, false, 0.28, 0.48, 0.39, 0.48, {color: "#0041ab", line_width: 3}]
# = Scale out architecture example\n(('note:スケールアウト構成例'))
#
# # image
# # src = https://arrow.apache.org/img/20191014_flight_complex.png
# # relative-height = 80
#
# (('tag:right'))
# (('note:(())'))\n
# (('note:Apache License 2.0 - © 2016-2021 The Apache Software Foundation'))
#
# = Point: Avoid inefficient forwards\n(('note:ポイント:ムダな転送を回避可能'))
#
# * A distributed system
# * Communicate via "coordinator"\n
# (('note:「コーディネーター」経由で通信'))
# * Node→Coodinator→Client
# * Apache Arrow Flight
# * Client can fetch data from nodes directly\n
# (('note:直接クライアントがデータ取得可能'))
# * Node→Client
#
# # image
# # src = https://arrow.apache.org/img/20191014_flight_complex.png
# # relative-width = 30
# # align = right
# # vertical-align = top
# # relative-margin-right = -10
= Apache Arrow Flight SQL
# mermaid
# relative_height = 75
sequenceDiagram
Client->>Server: GetFlightInfo(CommandStatementQuery: SQL)
Server->>Client: FlightInfo{..., Ticket, ...}
Client->>Server: DoGet(Ticket)
Server->>Client: query results as Apache Arrow data
(('tag:center'))
(('note:(())'))\n
(('note:Apache License 2.0 - © 2016-2023 The Apache Software Foundation'))
== Slide properties
: enable-title-on-image
false
= Current ADBC\n(('note:現時点のADBC'))
* 1 integer column\n
(('note:数値カラム1つだけ'))
* (({SELECT * FROM x}))
* Lower is faster\n
(('note:低いほど速い'))
* libpq driver\n
is slow for now...\n
(('note:実は現時点ではlibqpドライバーは遅い…'))
# charty
# backend = pyplot
# align = right
# vertical-align = top
# relative-margin-right = -10
# relative_height = 90
# type = bar
# x = N records
# y = Elapsed time (sec)
# color = Approach
Approach,N records,Elapsed time (sec)
ADBC (Flight SQL),100K,0.009
libpq,100K,0.010
ADBC (libpq),100K,0.02
ADBC (Flight SQL),1M,0.066
libpq,1M,0.113
ADBC (libpq),1M,0.217
ADBC (Flight SQL),10M,0.598
libpq,10M,1.122
ADBC (libpq),10M,2.266
= But can PostgreSQL talk Flight SQL?\n(('note:でもPostgreSQLはFlight SQLをしゃべれるの?'))
Flight SQL adapter\n
(('note:(())'))\n
(('note:I'm the author'))
= Architecture
# mermaid
# relative_height = 100
# background_color = black
sequenceDiagram
participant C as Client
participant M as PG(master)
participant FSM as PG(Flight SQL main)
participant FSS as PG(Flight SQL server)
participant FSE as PG(Flight SQL executor)
M->>FSM: Spawn
FSM->>FSS: Spawn
note right of FSS: Listen gRPC socket (multi-threading)
C->>FSS: Connect with Flight SQL protocol
FSS->>FSM: Allocate an executor for this session
FSM->>FSE: Spawn
C->>FSS: Send a query
FSS->>FSE: Pass the given query via shared memory
note right of FSE: Run the given query with SPI
note right of FSE: Convert a result to Apache Arrow data
FSE->>FSS: Pass the result via shared memory
FSS->>C: Return the result with Flight SQL protocol
== Slide properties
: enable-title-on-image
false
= Wrap up\n(('note:まとめ'))
* We can use Ruby to extract and load large data by ADBC! (('note:(in a few years...)'))\n
(('note:ADBCを使うとRubyで大量データを読み書きできるよ!(近いうちに。。。)'))
* PostgreSQL'll be Flight SQL ready soon!\n
(('note:すぐにPostgreSQLでFlight SQLを使えるようになるよ!'))
* We can use ADBC via Active Record soon\n
(('note:すぐにActive Record経由でADBCを使えるようになるよ!'))
= Join us!\n(('note:一緒に開発しようぜ!'))
* Red Data Tools: A project that provides data processing tools for Ruby\n
(('note:Red Data Tools:Ruby用のデータ処理ツールを提供するプロジェクト'))\n
(())\n
(())
* You can implement something with us!\n
(('note:一緒になにか作ろうぜ!'))\n
(())\n
(())
= Sponsor us?\n(('note:資金援助しない?'))
* Provide XX% work time to your employee to work on Red Data Tools\n
(('note:業務時間のXX%をRed Data Toolsの作業をできるようにする'))
* (('tag:small'))Employ a full-time Red Data Tools developer\n
(('note:フルタイムのRed Data Tools開発者を雇用する'))
* Pay Red Data Tools continuously\n
(('note:Red Data Toolsに継続的に資金を提供する'))\n
(('note:Red Data Toolsのだれかがお金で時間を確保して作業する'))
* Or contact me!\n
(('note:相談して!'))