Using the Columnar Apache Arrow Format with PG-Strom / pgstrom_with_apache_arrow_file

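Using the Columnar Apache Arrow Format with PG-Strom
2024/05/20
Kei Sakai (坂井恵), 有限会社アートライ
Event: "Introduction to Speeding Up DB Performance: From the Basics to Columnar Storage and GPU Use" (「DB性能高速化入門〜基礎から列指向、GPU活用まで〜」)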

PG-Strom & Arrow (conceptual image)

About the speaker
• Kei Sakai (坂井恵)
• 有限会社アートライ
• Currently a member of the "爆速DB Powered by PG-Strom" team at 日本仮想化技術株式会社
Keywords: databases, geographic information systems (GIS)

This session
Theme: using Apache Arrow files with PG-Strom is blazingly fast
Contents:
• What is an Apache Arrow file?
• How to use Arrow files with PG-Strom
• Where PG-Strom + Arrow files are most effective
• How to create Arrow files from PostgreSQL data

What is an Apache Arrow file?
From https://arrow.apache.org/: "a cross-language development platform for in-memory analytics"
Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
• A top-level Apache project

What is an Apache Arrow file?
• A column-oriented (columnar) data format
(Figure: row-oriented storage vs. column-oriented storage)

What is an Apache Arrow file?
For the kind of audience here today, an explanation with more concrete data is in order. We convert this table to Arrow.

What is an Apache Arrow file?
Contents of the Arrow file (hexdump, excerpt)
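
The hexdump itself is not reproduced in this transcript. One thing such a dump makes visible is the framing of the Arrow IPC file format, which begins and ends with the magic bytes "ARROW1". The following is a minimal sketch in Python (standard library only) that checks just that framing. The function names are hypothetical, and the check does not validate the schema, record batches, or footer.

```python
import os
import tempfile

ARROW_MAGIC = b"ARROW1"  # magic bytes of the Arrow IPC file format


def looks_like_arrow_file(path):
    """Return True if the file starts and ends with the Arrow magic bytes."""
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < 2 * len(ARROW_MAGIC):
            return False  # too short to carry both magic markers
        f.seek(size - len(ARROW_MAGIC))
        tail = f.read(len(ARROW_MAGIC))
    return head == ARROW_MAGIC and tail == ARROW_MAGIC


def write_bytes_to_temp(data):
    """Demo helper (hypothetical): write bytes to a temp file, return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    # Not a real Arrow file: only the framing bytes around a dummy payload.
    fake = write_bytes_to_temp(ARROW_MAGIC + b"\x00\x00" + b"payload" + ARROW_MAGIC)
    print(looks_like_arrow_file(fake))
    os.remove(fake)
```

On a real file, the same check is `looks_like_arrow_file("/path/to/mytable.arrow")`.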

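What is an Apache Arrow file? (Summary)
• It is a column-oriented format
• What "column-oriented" means:
  • A row-oriented format stores data row by row
  • A column-oriented format stores data column by column
• Why is column-oriented storage useful?
  • A query that references only some columns does not need to read entire rows
  • Likewise, the values of the referenced columns sit close together in storage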
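
The difference between the two layouts can be sketched with a few bytes. The example below is an illustration in Python (standard library only), not the real Arrow layout, which also carries metadata and validity bitmaps. The table contents and function names are made up for the demonstration. It packs the same three-column table both ways and sums one column from each buffer.

```python
import struct

# A tiny three-column table: (id, quantity, price), all 32-bit integers.
rows = [(1, 10, 500), (2, 20, 700), (3, 30, 900), (4, 40, 1100)]
INT = struct.Struct("<i")  # one little-endian 32-bit integer


def pack_row_oriented(rows):
    """Store row by row: id, quantity, price, id, quantity, price, ..."""
    return b"".join(INT.pack(v) for row in rows for v in row)


def pack_column_oriented(rows):
    """Store column by column: all ids, then all quantities, then all prices."""
    columns = zip(*rows)
    return b"".join(INT.pack(v) for col in columns for v in col)


def sum_price_row_oriented(buf, n_rows, n_cols=3, col=2):
    """Summing one column has to visit every row: strided access."""
    total = 0
    for r in range(n_rows):
        offset = (r * n_cols + col) * INT.size
        total += INT.unpack_from(buf, offset)[0]
    return total


def sum_price_column_oriented(buf, n_rows, col=2):
    """The column is one contiguous slice; the other columns are never read."""
    start = col * n_rows * INT.size
    chunk = buf[start:start + n_rows * INT.size]
    return sum(v for (v,) in INT.iter_unpack(chunk))


if __name__ == "__main__":
    print(sum_price_row_oriented(pack_row_oriented(rows), len(rows)))
    print(sum_price_column_oriented(pack_column_oriented(rows), len(rows)))
```

Both functions return the same sum, but the column-oriented version reads a single contiguous slice. That is the property the bullets above describe.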

How to use Arrow files with PG-Strom
• Using Apache Arrow files with PG-Strom is very easy!
• PG-Strom includes "arrow_fdw", which lets an Arrow file be handled as a table
• You only specify the Arrow file and the table name used to access it
• The data is not imported; the Arrow file is referenced directly

How to use Arrow files with PG-Strom
1. Enable the PG-Strom extension.
2. Run IMPORT, specifying the Arrow file, the table name (the name you want to access it by), and the schema.
  CREATE EXTENSION pg_strom;
  IMPORT FOREIGN SCHEMA mytable_arrow FROM SERVER arrow_fdw INTO public OPTIONS (file '/path/to/mytable.arrow');
Note: there are other ways, but this IMPORT approach is the easiest and is recommended.
Note: this example specifies a file, but a directory can also be specified.
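
When many Arrow files need to be registered, the two statements above can be generated from a script. The following is a minimal sketch in Python that only builds the SQL text. The helper names are hypothetical, the `IF NOT EXISTS` clause is an addition for re-runnability, and actually running the statements (with psql or a driver) is left out.

```python
def quote_ident(name):
    """Quote an SQL identifier (double any embedded double quote)."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Quote an SQL string literal (double any embedded single quote)."""
    return "'" + value.replace("'", "''") + "'"


def arrow_import_statements(table_name, arrow_path, schema="public"):
    """Return the two statements shown above for one Arrow file.

    table_name: the name the foreign table should get
    arrow_path: path of the Arrow file as seen by the PostgreSQL server
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS pg_strom;",
        "IMPORT FOREIGN SCHEMA {} FROM SERVER arrow_fdw INTO {} "
        "OPTIONS (file {});".format(
            quote_ident(table_name), quote_ident(schema), quote_literal(arrow_path)
        ),
    ]


if __name__ == "__main__":
    for stmt in arrow_import_statements("mytable_arrow", "/path/to/mytable.arrow"):
        print(stmt)
```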

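Where PG-Strom + Arrow files are most effective
• Key property of Arrow files
  • They are column-oriented, that is, values of the same column are laid out together
    → When a query does not need the whole row, it can quickly access only the columns it needs
Plus the power of the GPU!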

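PG-Strom + Arrow files: worked example
The table used here: a table with a very large number of columns and a very large number of rows
→ Typical of "daifukucho" (大福帳, all-in-one ledger) data handed down for generations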

PG-Strom + Arrow files: worked example
• Reuses the Star Schema Benchmark (SSB) data introduced at the previous event
• For this test, all SSB tables were joined into one large ledger-style (flat) table

PG-Strom + Arrow files: worked example
This is how the table was created. The comparison is between this table and an Arrow version of it.
  db=# CREATE TABLE lineorder_flat_pgtable AS
       SELECT lo_orderkey,lo_linenumber,lo_custkey,lo_partkey,lo_suppkey,lo_orderdate,lo_orderpriority,lo_shippriority,
              lo_quantity,lo_extendedprice,lo_ordertotalprice,lo_discount,lo_revenue,lo_supplycost,lo_tax,lo_commit_date,lo_shipmode,
              c_name,c_address,c_city,c_nation,c_region,c_phone,c_mktsegment,
              s_name,s_address,s_city,s_nation,s_region,s_phone,
              p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container,
              d_date,d_dayofweek,d_month,d_year,d_yearmonthnum,d_yearmonth,d_daynuminweek,d_daynuminmonth,d_daynuminyear,
              d_monthnuminyear,d_weeknuminyear,d_sellingseason,d_lastdayinweekfl,d_lastdayinmonthfl,d_holidayfl,d_weekdayfl
       FROM lineorder l
       JOIN customer c ON (lo_custkey=c_custkey)
       JOIN date1 d ON (lo_orderdate=d_datekey)
       JOIN part p ON (lo_partkey=p_partkey)
       JOIN supplier ON (lo_suppkey=s_suppkey);
  SELECT 2400012063
  Time: 10354688.367 ms (02:52:34.688)

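(Reference) Row counts by scale parameter given when generating the SSB data

 TABLE     |        s10 |         s20 |         s50 |        s100 |          s200 |          s400
-----------+------------+-------------+-------------+-------------+---------------+---------------
 customer  |    300,000 |     600,000 |   1,500,000 |   3,000,000 |     6,000,000 |    12,000,000
 date1     |      2,556 |       2,556 |       2,556 |       2,556 |         2,556 |         2,556
 part      |    800,000 |   1,000,000 |   1,200,000 |   1,400,000 |     1,600,000 |     1,800,000
 supplier  |    100,000 |     200,000 |     500,000 |   1,000,000 |     2,000,000 |     4,000,000
 lineorder | 59,986,052 | 119,994,608 | 300,005,811 | 600,037,902 | 1,200,018,434 | 2,400,012,063

Row counts used in this test: s=400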

PG-Strom + Arrow files: worked example (sizes)

Tables:
 Schema | Name                   | Type              | Owner    | Persistence | Access method | Size       |
--------+------------------------+-------------------+----------+-------------+---------------+------------+
 public | lineorder              | table             | postgres | permanent   | heap          | 350 GB     |
 public | customer               | table             | postgres | permanent   | heap          | 1624 MB    |
 public | date1                  | table             | postgres | permanent   | heap          | 416 kB     |
 public | part                   | table             | postgres | permanent   | heap          | 206 MB     |
 public | supplier               | table             | postgres | permanent   | heap          | 527 MB     |
 public | lineorder_flat         | foreign table     | postgres | permanent   |               | 0 bytes    |
 public | lineorder_flat_pgtable | table             | postgres | permanent   | heap          | 1526 GB    |

Arrow file:
 -rw-r--r--. 1 sakaik sakaik 1598636977622 Mar 12 04:35 lineorder_flat.arrow
(-rw-r--r--. 1 sakaik sakaik 1.5T Mar 12 04:35 lineorder_flat.arrow)

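PG-Strom + Arrow files: worked example
Processing times were measured for plain PostgreSQL and PG-Strom on the PG table, and for PG-Strom on the Arrow table.
2.4 billion rows, 1.5 TB:
- PostgreSQL heap data (PG table)
  - Query execution with PostgreSQL
  - Query execution with PG-Strom
- Table backed by the Arrow file (Arrow table)
  - Query execution with PG-Strom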

Test environment specs
• CPU: AMD EPYC 7443 24-Core Processor (24 cores / 48 processors)
• OS: Red Hat Enterprise Linux release 8.8 (Ootpa)
• Memory: 131,330,728 kB
• GPU: NVIDIA A100 80GB PCIe (6,912 CUDA cores)
• CUDA Version: 12.3
• PostgreSQL Version: 16.1
• NVMe SSD: 14TB

PG-Strom + Arrow files: worked example (1)
(%TABLE% stands for the table being measured.)
  SELECT lo_orderpriority, COUNT(*)
  FROM %TABLE%
  GROUP BY lo_orderpriority
  ORDER BY lo_orderpriority;

   lo_orderpriority |   count
  ------------------+-----------
   1-URGENT         | 479992014
   2-HIGH           | 480037805
   3-MEDIUM         | 479957940
   4-NOT SPECI      | 479967889
   5-LOW            | 480056415
  (5 rows)

PostgreSQL       05:14
PG-Strom         01:20      (x3.9)
PG-Strom+Arrow   00:05.165  (x62.8)

PG-Strom + Arrow files: worked example (2)
  SELECT d_year, COUNT(*)
  FROM %TABLE%
  GROUP BY d_year;

   d_year |   count
  --------+-----------
     1995 | 364053784
     1992 | 365083082
     1994 | 364096209
     1996 | 365067240
     1993 | 364083099
     1997 | 364136802
     1998 | 213491847
  (7 rows)

PostgreSQL       06:39
PG-Strom         01:22      (x4.9)
PG-Strom+Arrow   00:10.639  (x39.9)

PG-Strom + Arrow files: worked example (3)
  SELECT lo_orderpriority, COUNT(*)
  FROM %TABLE%
  WHERE d_year=1995
  GROUP BY lo_orderpriority
  ORDER BY lo_orderpriority;

   lo_orderpriority |  count
  ------------------+----------
   1-URGENT         | 72803951
   2-HIGH           | 72816667
   3-MEDIUM         | 72799284
   4-NOT SPECI      | 72802326
   5-LOW            | 72831556
  (5 rows)

PostgreSQL       06:07
PG-Strom         01:21      (x4.5)
PG-Strom+Arrow   00:03.958  (x91.8)

PG-Strom + Arrow files: worked example (4)
  --04: double grouping
  SELECT d_year, lo_orderpriority, COUNT(*)
  FROM %TABLE%
  GROUP BY d_year, lo_orderpriority
  ORDER BY d_year, lo_orderpriority;

   d_year | lo_orderpriority |  count
  --------+------------------+----------
     1992 | 1-URGENT         | 73013665
     1992 | 2-HIGH           | 72992015
     1992 | 3-MEDIUM         | 73025142
     1992 | 4-NOT SPECI      | 73028365
     1992 | 5-LOW            | 73023895
     1993 | 1-URGENT         | 72803890
     1993 | 2-HIGH           | 72838788
     1993 | 3-MEDIUM         | 72807545
     :

PostgreSQL       07:10
PG-Strom         01:24      (x5.1)
PG-Strom+Arrow   00:07.631  (x61.4)

PG-Strom + Arrow files: worked example (5)
  --05: double grouping and extract and sum
  SELECT d_year, lo_orderpriority, sum(lo_extendedprice)
  FROM %TABLE%
  WHERE s_region='ASIA'
  GROUP BY d_year, lo_orderpriority
  ORDER BY d_year, lo_orderpriority;

   d_year | lo_orderpriority |      sum
  --------+------------------+----------------
     1992 | 1-URGENT         | 55754054433397
     1992 | 2-HIGH           | 55720071150222
     1992 | 3-MEDIUM         | 55737456621929
     1992 | 4-NOT SPECI      | 55745409996470
     1992 | 5-LOW            | 55747692515383
     1993 | 1-URGENT         | 55582507698342
     1993 | 2-HIGH           | 55617539949435
     :

PostgreSQL       05:58
PG-Strom         01:23      (x4.3)
PG-Strom+Arrow   00:11.253  (x32.5)

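PG-Strom + Arrow files: worked example (results overview)

 No. | Summary                              | PostgreSQL | PG-Strom     | Strom+Arrow
-----+--------------------------------------+------------+--------------+-------------------
 1   | Group by 1 column + COUNT            | 05:14      | 01:20 (x3.9) | 00:05.165 (x62.8)
 2   | Group by 1 column + COUNT            | 06:39      | 01:22 (x4.9) | 00:10.639 (x39.9)
 3   | Group by 1 column + COUNT (filtered) | 06:07      | 01:21 (x4.5) | 00:03.958 (x91.8)
 4   | Group by 2 columns + COUNT           | 07:10      | 01:24 (x5.1) | 00:07.631 (x61.4)
 5   | Group by 2 columns + SUM             | 05:58      | 01:23 (x4.3) | 00:11.253 (x32.5)

Note: values in parentheses are ratios relative to the "PostgreSQL" measurement.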

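PG-Strom + Arrow files: worked example (interim summary)
• Arrow files are effective for aggregating tables with many columns
• Switching from plain Postgres to PG-Strom already gives a speedup, and moving to an Arrow table makes queries dramatically faster still
• Columnar layout plus GPU parallel processing is highly effective. Please try it on your own data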

How to create Arrow files from PostgreSQL data (using pg2arrow)

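How to create Arrow files from PostgreSQL data
• Use the pg2arrow program bundled with PG-Strom
  • It is in the arrow_tools directory
  • It can be built by passing the PG_CONFIG parameter to make, for example: make pg2arrow PG_CONFIG=/usr/pgsql-16/bin/pg_config
• There is also mysql2arrow, which converts MySQL data to Arrow files
• There is also pcap2arrow, which creates Arrow files from a network interface or from already captured PCAP files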

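How to create Arrow files from PostgreSQL data (using pg2arrow)
• It supports two main purposes
  • Convert a whole table to an Arrow file
  • Run a given query and convert its result to an Arrow file
• Basic usage (arguments)
  • -u PostgreSQL user name for the connection
  • -d PostgreSQL database name
  • -o output file name
  • -t table name (when specifying a table)
  • -c SELECT query (when specifying a query)
Example:
  pg2arrow -upostgres -dssb -t 'lineorder_flat' -o /opt/nvme/lineorder_flat.arrow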
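
For scripted conversions, the argument list can be assembled programmatically. The following is a minimal sketch in Python that uses only the five options listed above. The function name and its parameters are hypothetical, and the sketch only builds the command; running it requires pg2arrow to be installed.

```python
import shlex


def pg2arrow_command(user, database, output, table=None, query=None):
    """Build a pg2arrow argument list from the options listed above.

    Exactly one of `table` (-t) or `query` (-c) must be given, matching
    the two purposes of the tool: whole table or query result.
    """
    if (table is None) == (query is None):
        raise ValueError("specify exactly one of table or query")
    argv = ["pg2arrow", "-u", user, "-d", database]
    if table is not None:
        argv += ["-t", table]
    else:
        argv += ["-c", query]
    argv += ["-o", output]
    return argv


if __name__ == "__main__":
    argv = pg2arrow_command("postgres", "ssb",
                            "/opt/nvme/lineorder_flat.arrow",
                            table="lineorder_flat")
    # shlex.join produces a shell-safe string for logging or copy and paste.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting problems when the `-c` query contains spaces or quotes.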

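How to create Arrow files from PostgreSQL data
pg2arrow processing time and useful options
Example 1: converting a table of about 350 million rows (58 GB) to Arrow → about 6 min 02 s
Example 2: converting a table of about 2.4 billion rows (350 GB) to Arrow → about 65 min
A way to make this much faster is introduced next!
  pg2arrow -upostgres -d mydb -c 'SELECT * FROM mytable' -o /opt/nvme/mydata.arrow
  pg2arrow -upostgres -d mydb -t mytable -o /opt/nvme/mydata.arrow
  pg2arrow -u postgres -d ssb -t lineorder -o /opt/nvme/lineorder2.arrow --progress
Note: adding --progress displays detailed information during execution.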

How to create Arrow files from PostgreSQL data
pg2arrow processing time and useful options
Parallel execution (the -n option) was implemented in pg2arrow in April 2024, and it is extremely fast! There are two ways to use it.
(1) Specify the split rule yourself: write the rule with $(N_WORKERS) and $(WORKER_ID)
  pg2arrow -u postgres -d mydb -c 'SELECT * FROM mytable WHERE id % $(N_WORKERS) = $(WORKER_ID)' -n5 -o /opt/nvme/mytable.arrow
(2) Leave it to pg2arrow and specify only the number of splits → by far the recommended way for a whole table!
  pg2arrow -u postgres -d ssb -t lineorder -n12 -o /opt/nvme/lineorder.arrow --progress
Note: --progress is recommended because it shows the split conditions.
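
To see why a rule such as `id % $(N_WORKERS) = $(WORKER_ID)` gives every row to exactly one worker, the substitution can be simulated. The following is a minimal sketch in Python. It assumes that worker IDs run from 0 to N_WORKERS - 1 (the only range for which the modulo rule covers all rows) and that the placeholders are replaced textually. Both are assumptions for illustration, not a description of pg2arrow internals.

```python
def expand_worker_query(template, n_workers, worker_id):
    """Fill in the two placeholders for one worker (textual substitution)."""
    return (template
            .replace("$(N_WORKERS)", str(n_workers))
            .replace("$(WORKER_ID)", str(worker_id)))


def assign_rows(ids, n_workers):
    """Simulate 'id % N_WORKERS = WORKER_ID' for every worker ID 0..N-1."""
    return {w: [i for i in ids if i % n_workers == w]
            for w in range(n_workers)}


if __name__ == "__main__":
    template = "SELECT * FROM mytable WHERE id % $(N_WORKERS) = $(WORKER_ID)"
    for w in range(5):
        print(expand_worker_query(template, 5, w))
    # Every id from 1 to 20 lands in exactly one of the five partitions.
    print(assign_rows(range(1, 21), 5))
```

The partitions are disjoint and complete, but each worker still has to scan the whole table to find its rows.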

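How to create Arrow files from PostgreSQL data
pg2arrow processing time and useful options
Example 1: converting a table of about 350 million rows (58 GB) to Arrow: about 6 min 02 s → about 58 s with n=10
Example 2: converting a table of about 2.4 billion rows (350 GB) to Arrow: about 65 min → about 5 min 30 s with n=30
  pg2arrow -u postgres -d mydb -c 'SELECT * FROM mytable WHERE id % $(N_WORKERS) = $(WORKER_ID)' -n10 -o /opt/nvme/mytable.arrow
  pg2arrow -u postgres -d ssb -t lineorder -n30 -o /opt/nvme/lineorder.arrow --progress
A little explanation: parallel processing with automatic splitting divides the table into areas using the ctid of each row in the PostgreSQL table. Compared with splitting by hash or remainder, the areas each worker reads on storage overlap less, so the job can be run with a higher degree of concurrency.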
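
The idea of splitting by area can be sketched as contiguous block ranges. A ctid has the form (block number, offset), so a range of block numbers corresponds to one contiguous region of the table. The Python sketch below is an illustration only. The talk does not show the conditions pg2arrow actually generates, so the WHERE-clause text, the function names, and the total block count are assumptions.

```python
def ctid_ranges(total_blocks, n_workers):
    """Split block numbers [0, total_blocks) into n contiguous ranges."""
    base, extra = divmod(total_blocks, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional block each.
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def ctid_where_clause(start, end):
    """Condition selecting the rows stored in blocks [start, end)."""
    return "ctid >= '({},0)'::tid AND ctid < '({},0)'::tid".format(start, end)


if __name__ == "__main__":
    # total_blocks=10 is a made-up figure; a real table has far more blocks.
    for start, end in ctid_ranges(10, 3):
        print(ctid_where_clause(start, end))
```

Unlike the remainder rule, each worker here touches only its own block range, which is the non-overlapping read pattern the explanation above refers to.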

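Summary of today's talk
Theme: using Apache Arrow files with PG-Strom is blazingly fast
Contents:
• There is such a thing as an Apache Arrow file
• Using Arrow files with PG-Strom is easy
  • CREATE EXTENSION + IMPORT FOREIGN SCHEMA
• Worked examples of how fast PG-Strom + Arrow files are
• How to create Arrow files from PostgreSQL data
  • Parallel execution makes pg2arrow dramatically faster!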

Contact
By email: [email protected]
Feel free to get in touch, for example if you would like to evaluate it.
