Delta Lake - What Is Liquid Clustering?

2024/03/01 Saki Kitaoka
[Delta Lake] What Is Liquid Clustering?

Agenda
1 (Background) What is Delta Lake?
2 Delta Lake performance tuning strategies
3 What is Liquid Clustering?

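About me: Saki Kitaoka (@ktksq)
• Hobbies: watching musicals and anime
• Recently published a book with a group of volunteers at the 技術書典 tech book fair. Please pick up a copy! (I lowered the price, lol)
• Blog: https://ktksq.hatenablog.com/
• Twitter: @ktksq
• Plug: @DatabricksJP (I am the person behind the account!)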

(Background) What is Delta Lake?

©2022 Databricks Inc. — All rights reserved
What Delta Lake actually is [figure]

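How Delta Lake came about: the differences between a data lake and Delta Lake

Strengths of a data lake:
• Separation of compute and storage
• Unlimited storage capacity
• Low storage cost
• Stores raw data of any kind (e.g. unstructured data, structured data, video, audio, text)

Challenges of a data lake:
• ACID transactions are not guaranteed, so a partially completed transaction leaves data in a corrupted state and requires complex recovery
• Data quality is not guaranteed, so inconsistent, unusable data gets created
• There is no consistency or isolation, so it is difficult to append and read data at the same time, or to run batch and streaming workloads concurrently
• Many small files exist, so file I/O takes a long time
• Cloud storage throughput is low (S3 delivers 20-50 MB/s per core, versus 300 MB/s per core for a local NVMe SSD)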

Features of Delta Lake [figure]

Delta Lake performance tuning strategies

(Background) Challenges with Parquet
[Figure: four panels, each showing tasks reading the files p1 ... pn of a customers dataset, with the per-task and total processing time]
• Ideal state: every file is processed evenly
• Data skew problem
• Small files problem
• Bad data problem: a broken schema or a corrupt file makes the tasks FAIL
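The first three panels can be approximated with a toy cost model. This is only a sketch under stated assumptions, not a measurement: one task per file, a fixed number of parallel slots, a stage that ends when the slowest slot finishes, and a fixed per-file overhead. The function name `stage_time` and every constant below are hypothetical.

```python
import heapq

def stage_time(file_sizes_mb, slots=4, mb_per_sec=100.0, per_file_overhead_sec=0.5):
    """Total stage time when each file is one task, scheduled greedily onto `slots`."""
    free_at = [0.0] * slots            # time at which each slot becomes free
    heapq.heapify(free_at)
    for size in file_sizes_mb:
        start = heapq.heappop(free_at)  # next task goes to the earliest free slot
        heapq.heappush(free_at, start + per_file_overhead_sec + size / mb_per_sec)
    return max(free_at)

# The same 1000 MB laid out three ways (assumed sizes, for illustration only)
balanced = [250.0] * 4                   # ideal state: four equal files
skewed = [700.0, 100.0, 100.0, 100.0]    # data skew: one task dominates
small = [1.0] * 1000                     # small files: overhead dominates

for name, layout in [("balanced", balanced), ("skewed", skewed), ("small", small)]:
    print(name, round(stage_time(layout), 2))
```

Under these assumptions the balanced layout finishes first, the skewed layout is held up by its largest file, and the small-file layout spends most of its time on per-file overhead.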

(Background) The existing partitioning strategy - Hive Style Partitioning
➔ A common way to improve query performance on large datasets stored in a data lake
➔ The data is split into smaller partitions, and the partition information is stored as part of each file's path
➔ Data can be skipped during a scan, which can speed up queries considerably
/transactions/date=2023-02-05/customer=customerA/{1.parquet, 2.parquet,...}
/transactions/date=2023-02-05/customer=customerB/{1.parquet, 2.parquet,...}
/transactions/date=2023-02-05/customer=customerC/{1.parquet, 2.parquet,...}
/transactions/date=2023-02-06/customer=customerA/{1.parquet, 2.parquet,...}
/transactions/date=2023-02-06/customer=customerB/{1.parquet, 2.parquet,...}
…
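Because the partition values live in the path, a reader can decide which files to skip without opening any of them. The following is a minimal sketch of that idea; `parse_partition` and `prune` are hypothetical helper names, and real engines do this inside their query planners rather than with string handling like this.

```python
def parse_partition(path):
    """Extract the key=value partition columns from a Hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **filters):
    """Keep only the paths whose partition values match every filter."""
    return [p for p in paths
            if all(parse_partition(p).get(k) == v for k, v in filters.items())]

paths = [
    "/transactions/date=2023-02-05/customer=customerA/1.parquet",
    "/transactions/date=2023-02-05/customer=customerB/1.parquet",
    "/transactions/date=2023-02-06/customer=customerA/1.parquet",
]

# Only one of the three files has to be read for this predicate
to_scan = prune(paths, date="2023-02-05", customer="customerA")
print(to_scan)
```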

[Figure: Hive Style Partitioning shown as a grid of dates (2023-02-05, 2023-02-06, 2023-02-07) by customers (Customer A to Customer F)]

Challenge with Hive Style Partitioning: small files are created
➔ Metadata operations carry a large overhead
➔ Read operations are slow

Hive Style Partitioning + Compaction (Optimize)
➔ Run the OPTIMIZE command to optimize file sizes
➔ File sizes are optimized within partition boundaries, up to the target file size
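Compaction within partition boundaries can be pictured as bin packing that never crosses a partition. The sketch below is an illustration only, not how OPTIMIZE is implemented: `compact`, the greedy packing rule, and the 128 MB target are all assumptions.

```python
from collections import defaultdict

def compact(files, target_mb=128):
    """files: list of (partition, size_mb). Returns {partition: [output file sizes]}.

    Files are packed greedily up to target_mb, but only with other files
    from the same partition.
    """
    by_partition = defaultdict(list)
    for partition, size in files:
        by_partition[partition].append(size)

    result = {}
    for partition, sizes in by_partition.items():
        outputs, current = [], 0
        for size in sorted(sizes, reverse=True):
            if current and current + size > target_mb:
                outputs.append(current)   # close the current output file
                current = 0
            current += size
        if current:
            outputs.append(current)
        result[partition] = outputs
    return result

files = ([("date=2023-02-05/customer=customerA", 10)] * 20
         + [("date=2023-02-05/customer=customerB", 10)] * 3)
print(compact(files))
```

In this example customerB still ends up with a single 30 MB file, because there is nothing else inside its partition to merge with. That leftover imbalance is the skew that remains after compaction.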

Challenges with Hive Style Partitioning + Compaction (Optimize)
Small files are created
➔ Metadata operations carry a large overhead
➔ Read operations are slow
Data size skew occurs
➔ File sizes are inconsistent across partitions

Z-Order + Compaction (Optimize)
OPTIMIZE my_table ZORDER BY (date, customer_id)

Z-Order (illustration): a file layout technique that places related information together in the same set of files

Z-Order (illustration): searching for X = 2 or Y = 3
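The effect on a search for X = 2 or Y = 3 can be reproduced with a small model. This is a sketch under assumptions that are not in the transcript: an 8 x 8 grid of (x, y) points, four points per file, and file skipping based only on per-file min/max values. `interleave` computes a Morton code, which is the usual way to build a Z-order curve.

```python
def interleave(x, y, bits=3):
    """Morton code: interleave the bits of x and y (x takes the lower bit of each pair)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def files_scanned(points, key, per_file=4):
    """Sort by `key`, cut into files, and count the files whose min/max
    statistics cannot rule out the predicate x == 2 OR y == 3."""
    ordered = sorted(points, key=key)
    files = [ordered[i:i + per_file] for i in range(0, len(ordered), per_file)]

    def may_match(f):
        xs = [p[0] for p in f]
        ys = [p[1] for p in f]
        return min(xs) <= 2 <= max(xs) or min(ys) <= 3 <= max(ys)

    return sum(may_match(f) for f in files)

points = [(x, y) for x in range(8) for y in range(8)]
linear = files_scanned(points, key=lambda p: (p[0], p[1]))   # sorted by x, then y
zorder = files_scanned(points, key=lambda p: interleave(*p))  # sorted along the Z curve
print(linear, zorder)
```

With a plain sort on x then y, the Y = 3 part of the predicate touches a file for every x value. Along the Z curve each file covers a 2 x 2 block, so both halves of the predicate skip files and fewer files are read overall.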


Z-Order + Compaction (Optimize): optimized file sizes
➔ Large numbers of small files are not created
➔ Data skew does not occur

Challenge with Z-Order + Compaction (Optimize): it is not applied to new files right away
➔ Newly ingested data is not clustered
➔ Files cannot be merged dynamically

What is Liquid Clustering?

Liquid cluster by customer ID and date
CREATE TABLE my_liquid_table … CLUSTER BY (customer_id, date) AS SELECT …

Col 1: date, Col 2: customer_id
[Figure: a binary tree over the clustering columns, built up one split at a time. The root splits on Col 1 (Col 1 <= 2023-02-06 / Col 1 > 2023-02-06). Further nodes split on Col 1 (Col 1 <= 2023-02-05 / Col 1 > 2023-02-05) and on Col 2 (Col 2 <= B / Col 2 > B, Col 2 <= C / Col 2 > C, Col 2 <= D / Col 2 > D). The tree ends in the leaves Leaf1 to Leaf7.]
The leaves are optimized according to the target file size.
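The tree in the figure resembles a k-d tree: split on one column, then on the next, until each leaf is small enough. The slides do not say how split points are chosen, so the sketch below is an assumption-heavy illustration and not Databricks' algorithm. Median splits, alternating columns, and the names `build_tree`, `leaves`, and `target_rows` are all hypothetical.

```python
def build_tree(rows, target_rows, depth=0, columns=("date", "customer_id")):
    """Recursively split rows on alternating columns until a node fits target_rows."""
    if len(rows) <= target_rows:
        return {"leaf": rows}
    col = columns[depth % len(columns)]
    values = sorted(r[col] for r in rows)
    split = values[(len(values) - 1) // 2]   # median value; the left side takes <= split
    left = [r for r in rows if r[col] <= split]
    right = [r for r in rows if r[col] > split]
    if not right:                            # every value is equal: stop splitting here
        return {"leaf": rows}
    return {"col": col, "split": split,
            "left": build_tree(left, target_rows, depth + 1, columns),
            "right": build_tree(right, target_rows, depth + 1, columns)}

def leaves(tree):
    """Collect the row lists held by every leaf."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["left"]) + leaves(tree["right"])

# One row per (date, customer) cell of the grid used in the slides
rows = [{"date": d, "customer_id": c}
        for d in ("2023-02-05", "2023-02-06", "2023-02-07")
        for c in "ABCDEF"]
tree = build_tree(rows, target_rows=4)
print(tree["col"], tree["split"], [len(leaf) for leaf in leaves(tree)])
```

Here `target_rows` stands in for the target file size: a node keeps splitting only while it holds more data than one file should.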

Liquid clustered delta table: write new data
[Figure: new data is written into the tree of the liquid clustered table (leaf1 to leaf7)]
Write ➔ Insert more data . . .

OPTIMIZE my_table
[Figure: the date by customer grid (2023-02-05 to 2023-02-07, Customer A to Customer F) shown alongside the tree of leaves]

A node's files are optimized when:
• the number of small files is greater than the files_number threshold
• the node size is smaller than the node_size threshold

A leaf node is expanded when:
• the node size is larger than the node_size threshold
[Figure: the former leaf7 becomes a subtree that splits on Col 1 and then Col 2, ending in leaf 7, leaf 8, leaf 9, and leaf 10]
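The two rules for a node, optimizing its files and expanding a leaf, can be written as one decision function. `files_number` and `node_size` are the threshold names used on the slides. Their values here, the `small_file_mb` cutoff, and the function name `plan_node` are assumptions made for the sake of a runnable sketch.

```python
def plan_node(file_sizes_mb, files_number=4, node_size=1024, small_file_mb=32):
    """Return the action for one leaf node, following the two rules on the slides."""
    total = sum(file_sizes_mb)
    small = sum(1 for s in file_sizes_mb if s < small_file_mb)
    if total > node_size:
        return "expand"    # the leaf outgrew node_size: split it into child leaves
    if small > files_number and total < node_size:
        return "compact"   # many small files in a node that still fits: merge them
    return "keep"

print(plan_node([8] * 10))      # many small files, small node
print(plan_node([256] * 5))     # node larger than node_size
print(plan_node([256, 256]))    # nothing to do
```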

Lazy Clustering
➔ Files can be merged and split dynamically, so the dataset stays balanced with an ideal number and size of files
➔ Liquid clustering is stateful, so the clustering is not recomputed every time the OPTIMIZE command runs
➔ Newly ingested data is clustered as needed, and data that was already clustered is ignored
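The stateful behaviour can be sketched as a table that remembers which files have already been clustered. This is a toy model with hypothetical names (`LazyClusteredTable`, `write`, `optimize`). It flags files in place instead of rewriting them, which is a simplification.

```python
class LazyClusteredTable:
    """Toy table that remembers which files are already clustered."""

    def __init__(self):
        self.files = {}      # file name -> True once the file has been clustered
        self._next_id = 0

    def write(self, n_files=1):
        """Ingest new files; they start out unclustered."""
        for _ in range(n_files):
            self.files[f"f{self._next_id}"] = False
            self._next_id += 1

    def optimize(self):
        """Cluster only the files not clustered yet; return the files touched."""
        todo = [name for name, done in self.files.items() if not done]
        for name in todo:
            self.files[name] = True
        return todo

table = LazyClusteredTable()
table.write(3)
first = table.optimize()     # touches the three new files
second = table.optimize()    # nothing new, so nothing is recomputed
table.write(2)
third = table.optimize()     # touches only the two files written since
print(first, second, third)
```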

Thank you!
