
Tomohiro Tanaka
September 02, 2024

Apache Iceberg: The Definitive Guide Reading Group - Chapter 6: Apache Spark

This deck is organized into the following sections:
- A quick recap of Spark's key characteristics, then the settings and prerequisites for using Iceberg
- Next, a walkthrough of the operations you can run against Iceberg from Spark
- Finally (not covered in the book itself), a deep dive into the Spark configs for Iceberg that we have so far written as boilerplate incantations


Transcript

  1. Tomohiro Tanaka, 2024 Sep. 2
     Chapter 6. Apache Spark - Apache Iceberg: The Definitive Guide Reading Group
  2. Tomohiro Tanaka
     - Hobby: Iceberg. I contribute to Iceberg from time to time
     - Co-author of Serverless ETL and Analytics with AWS Glue
     - GitHub: tomtongue, LinkedIn: in/ttomtan
     This presentation represents my personal views and not those of my employer.
  3. Today's agenda:
     - Use Iceberg with Spark
     - Iceberg operations for Spark
     - (Advanced) Dive deep into Iceberg configurations for Spark
     Based on Iceberg version 1.6.1.
  4. What is Apache Spark?
     - A unified analytics engine for large-scale data processing
     - Enables applications to easily distribute in-memory processing
     - Well suited to iterative computations
     - Enables developers to easily implement high-level control flow in distributed data processing applications
  5. What is Apache Spark? Example: ETL

     from pyspark.sql import SparkSession

     spark = SparkSession.builder.getOrCreate()
     df = spark.read.json('s3://src-bucket/path/')
     df.createOrReplaceTempView('tmp_tbl')
     df_2 = spark.sql("""
         SELECT category, count(*) AS cnt
         FROM tmp_tbl
         GROUP BY category
     """)
     df_2.write.parquet('s3://dst-bucket/path/')

     +--------+---+
     |category|cnt|
     +--------+---+
     |   drink| 35|
     |    book| 12|
     |  health|  9|
     +--------+---+
  6. SparkSQL & DataFrame APIs
     - Enable data processing with SQL syntax
     - Provide: 1. external data source connectivity through friendly APIs, 2. high performance drawn from database techniques, and 3. support for new data sources such as semi-structured data
     - The DataFrame API is the main abstraction of SparkSQL.
  7. SparkSQL & DataFrame APIs: query optimization by Catalyst
     - Your queries are optimized by Spark's query optimizer, called Catalyst
     - Catalyst provides rule-based, cost-based, and runtime (Adaptive Query Execution) optimizations for your queries
     (Diagram: a driver program running plain Spark RDD code without Catalyst, vs. a driver program running SparkSQL with Catalyst query optimization)
  8. Using Iceberg with Spark: prerequisites for Iceberg 1.6.x
     - Spark 3.3+
     - Java (8), 11, 17 or (21)
       - Java 8 will (most likely) not be supported in Iceberg 1.7.0; a community vote was held until just recently
         - (Community thread subject) [VOTE] Drop Java 8 support in Iceberg 1.7.0
         - PR #10518 - Core: Drop support for Java 8
         - The README already no longer mentions Java 8
       - Note that Spark 4.0 (in preview as of 2024-09-01) moves to Java 17/21, so be careful
     - JARs:
       - iceberg-spark-runtime-<SPARK_VERSION>_<SCALA_VERSION>_<ICEBERG_VERSION>.jar
       - Plus, if needed, additional JARs like iceberg-aws-bundle-<ICEBERG_VERSION>.jar (1.4.0+)
         - Before 1.4.0, the following two JARs are required (for AWS):
           - url-connection-client-<AWS_SDK_VERSION>.jar
           - bundle-<AWS_SDK_VERSION>.jar
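     As a quick sanity check before adding the Iceberg JARs, you can confirm the runtime versions against these prerequisites. A minimal sketch (reading the JVM version through the Py4J gateway is an internal-API assumption):

         from pyspark.sql import SparkSession

         spark = SparkSession.builder.getOrCreate()
         print(spark.version)  # expect 3.3+ for Iceberg 1.6.x
         # JVM version Spark runs on (accessed via the internal Py4J gateway):
         print(spark.sparkContext._jvm.System.getProperty("java.version"))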
  9. Using Iceberg with Spark: add the Iceberg JARs

     # Option 1: pass local JARs
     $ spark-shell (spark-sql/spark-submit/pyspark) \
         --jars /path/iceberg-spark-runtime.jar,/path/iceberg-aws-bundle.jar \
         --master yarn \
         --conf <CONF_A_KEY>=<CONF_A_VALUE> --conf <CONF_B_KEY>=<CONF_B_VALUE>

     # Option 2: pull the runtime from Maven with --packages
     $ spark-shell (spark-sql/spark-submit/pyspark) \
         --packages org.apache.iceberg:iceberg-spark-runtime-<SPARK_VERSION>_<SCALA_VERSION>:<ICEBERG_VERSION> \
         --master yarn \
         --conf <CONF_A_KEY>=<CONF_A_VALUE> --conf <CONF_B_KEY>=<CONF_B_VALUE>

     # Or configure the packages from within PySpark:
     from pyspark.sql import SparkSession

     spark = SparkSession.builder \
         .master("yarn") \
         .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-<VERSIONS>:<ICEBERG_VERSION>") \
         .config("<CONF_A_KEY>", "<CONF_A_VALUE>") \
         .config("<CONF_B_KEY>", "<CONF_B_VALUE>") \
         .getOrCreate()
  10. Walkthrough: Iceberg with Spark. Scenario:
      1. Configure Iceberg on the SparkSession
      2. Create an Iceberg table
      3. Write data to the created Iceberg table
      4. Read data from the Iceberg table
  11. Walkthrough: Iceberg with Spark - PySpark script

      from pyspark.sql import SparkSession

      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.type", "glue") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()

      spark.sql("""
          CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg
      """)
      spark.sql("INSERT INTO my_catalog.db.tbl VALUES (1, 'Alice'), (2, 'Bob')")
      spark.sql("SELECT * FROM my_catalog.db.tbl").show(n=2, truncate=False)
  12. Walkthrough: the same PySpark script, annotated. The builder's .config(...) calls are the Iceberg configuration, CREATE TABLE creates the Iceberg table, INSERT INTO writes the data, and SELECT reads it back.
  13. Walkthrough: the Iceberg configuration. The four .config(...) lines in the script above are the Iceberg-specific settings: the Spark catalog class, the catalog type, the warehouse location, and the SQL extensions.
  14. Walkthrough: example Iceberg configuration for the Glue Data Catalog and S3

      # For Iceberg 1.5.0+ (the "glue" shorthand)
      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.type", "glue") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()

      # Before Iceberg 1.5.0 (explicit implementation classes)
      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
          .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()
  15. Walkthrough: example Iceberg configuration for the Hive catalog and S3

      # Hive catalog only (HDFS is used as storage by default)
      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.type", "hive") \
          .config("spark.sql.catalog.my_catalog.uri", "thrift://<metastore-host>:<port>") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()

      # Hive catalog + S3
      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.type", "hive") \
          .config("spark.sql.catalog.my_catalog.uri", "thrift://<metastore-host>:<port>") \
          .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()
  16. Walkthrough: example Iceberg configuration for the Snowflake catalog

      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.snowflake.SnowflakeCatalog") \
          .config("spark.sql.catalog.my_catalog.uri", "jdbc:snowflake://<ACCOUNT_LOCATOR>.snowflakecomputing.com") \
          .config("spark.sql.catalog.my_catalog.jdbc.role", "<ROLE>") \
          .config("spark.sql.catalog.my_catalog.jdbc.user", "<USER_NAME>") \
          .config("spark.sql.catalog.my_catalog.jdbc.password", "<PASSWORD>") \
          .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()

      For the Snowflake catalog, the ResolvingFileIO class is set as the FileIO, so the implementation can also be picked from the location scheme (s3, gs, abfs, etc.). The fallback scheme is HDFS.*
      * See the following source code:
      https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeCatalog.java
      https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/io/ResolvingFileIO.java
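      ResolvingFileIO can also be chosen explicitly for any catalog through io-impl. A sketch (the REST catalog endpoint here is hypothetical):

          spark = SparkSession.builder \
              .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
              .config("spark.sql.catalog.my_catalog.type", "rest") \
              .config("spark.sql.catalog.my_catalog.uri", "http://localhost:8181") \
              .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.io.ResolvingFileIO") \
              .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
              .getOrCreate()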
  17. Walkthrough: running the full PySpark script (the same script as in 11).
  18. Walkthrough: result output of the script

      +---+-----+
      |id |name |
      +---+-----+
      |1  |Alice|
      |2  |Bob  |
      +---+-----+
  19. Iceberg operations: 4 categories
      - DDLs: CREATE TABLE (PARTITIONED BY), CTAS, DROP TABLE; ALTER TABLE RENAME/ALTER COLUMN (schema evolution); ALTER TABLE ADD/DROP/REPLACE PARTITION FIELD (partition evolution); ALTER TABLE CREATE BRANCH (branch and tag); CREATE|ALTER|DROP VIEW (views); etc.
      - Reads: SELECT, SELECT ... AS OF (time travel), SELECT cols FROM history/snapshots/refs (table inspection), etc.
      - Writes: INSERT INTO, UPDATE, DELETE FROM, MERGE INTO (upsert), INSERT OVERWRITE, etc.
      - Procedures: rollback_to_snapshot/timestamp, rewrite_data_files/manifests, rewrite_position_delete_files (compaction), expire_snapshots, remove_orphan_files, fast_forward, publish_changes, etc.
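      One representative statement per category, run through spark.sql (a sketch; assumes the my_catalog.db.tbl table from the walkthrough and the SQL extensions configured in 11):

          spark.sql("ALTER TABLE my_catalog.db.tbl ADD COLUMNS (year int)")        # DDL
          spark.sql("SELECT * FROM my_catalog.db.tbl").show()                      # Read
          spark.sql("UPDATE my_catalog.db.tbl SET year = 2024 WHERE id = 1")       # Write
          spark.sql("CALL my_catalog.system.expire_snapshots(table => 'db.tbl')")  # Procedure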
  20. DDLs: CREATE TABLE with partitioning and bucketing

      CREATE TABLE my_catalog.db.tbl (id int, name string) USING iceberg

      CREATE TABLE my_catalog.db.tbl (id int, name string, year int) USING iceberg
      LOCATION 's3://bucket/path'
      PARTITIONED BY (year)

      s3://bucket/path/db.db/tbl/
      - data/
        - year=2024/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00003.parquet
        - year=2023/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00002.parquet
      - metadata/
        - 00001-00ce06e3-54d3-4f36-94c6-7e277b2aec3f.metadata.json
        - a184a338-62d9-4b67-ba06-5757972f64a6-m0.avro
        - snap-6993945275787927359-1-a184a338-62d9-4b67-ba06-5757972f64a6.avro
  21. DDLs: CREATE TABLE with partitioning and bucketing

      CREATE TABLE my_catalog.db.tbl (id int, name string, ts timestamp) USING iceberg
      LOCATION 's3://bucket/path'
      PARTITIONED BY (year(ts), month(ts), day(ts))

      Partition transforms such as year, month, day, bucket, etc.
      Doc: https://iceberg.apache.org/spec/#partition-transforms
      Src: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/api/src/main/java/org/apache/iceberg/transforms/Transforms.java
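      Because partitioning is defined with transforms on a source column, readers filter on the raw column and Iceberg prunes partitions for them (hidden partitioning). A sketch (the events table name is hypothetical):

          spark.sql("""
              CREATE TABLE my_catalog.db.events (id int, ts timestamp)
              USING iceberg
              PARTITIONED BY (day(ts))
          """)
          # The predicate references ts directly; no separate partition column is
          # exposed, and the day(ts) transform is applied to prune partitions.
          spark.sql("""
              SELECT count(*) FROM my_catalog.db.events
              WHERE ts >= TIMESTAMP '2024-09-01 00:00:00'
          """).show()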
  22. DDLs: CREATE TABLE with partitioning and bucketing

      CREATE TABLE my_catalog.db.tbl (id int, name string, ts timestamp) USING iceberg
      LOCATION 's3://bucket/path'
      PARTITIONED BY (year(ts), month(ts), day(ts), bucket(4, id))

      s3://bucket/path/db.db/tbl/
      - data/
        - year=2024/
          - id_bucket=3/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00003.parquet
          - id_bucket=0/00000-6-887971a6-11e5-4860-a379-4980bda4c85c-0-00002.parquet
      - metadata/
        - 00001-00ce06e3-54d3-4f36-94c6-7e277b2aec3f.metadata.json
        - a184a338-62d9-4b67-ba06-5757972f64a6-m0.avro
        - snap-6993945275787927359-1-a184a338-62d9-4b67-ba06-5757972f64a6.avro
  23. DDLs: ALTER TABLE - does the statement require the SQL extensions?

      No SQL extension required:
      ALTER TABLE my_catalog.db.tbl RENAME TO my_catalog.db.tbl_new_name
      ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES ('KEY'='VALUE')
      ALTER TABLE my_catalog.db.tbl ADD COLUMN(S) (new_col string)
      ALTER TABLE my_catalog.db.tbl RENAME COLUMN id TO user_id
      ALTER TABLE my_catalog.db.tbl ALTER COLUMN id TYPE bigint
      ALTER TABLE my_catalog.db.tbl DROP COLUMN id

      SQL extension required:
      ALTER TABLE my_catalog.db.tbl ADD|DROP PARTITION FIELD day(ts)
      ALTER TABLE my_catalog.db.tbl REPLACE PARTITION FIELD id WITH req_id
      ALTER TABLE my_catalog.db.tbl WRITE ORDERED BY id, year
      ALTER TABLE my_catalog.db.tbl WRITE DISTRIBUTED BY PARTITION year, month
      ALTER TABLE my_catalog.db.tbl SET|DROP IDENTIFIER FIELDS id, year
      ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP BRANCH 'branchname'
      ALTER TABLE my_catalog.db.tbl CREATE|REPLACE|DROP TAG 'tagname'
  24. DDLs: Views (Spark 3.4+; Nessie/JDBC catalogs etc. are supported)
      Ref: https://iceberg.apache.org/docs/latest/spark-ddl/#iceberg-views-in-spark

      CREATE VIEW product_category_view AS
      SELECT category, count(*) AS cnt FROM my_catalog.db.tbl GROUP BY category

      ALTER VIEW product_category_view
      DROP VIEW product_category_view
  25. Reads: time travel by version, timestamp, and branch/tag
      Ref: https://iceberg.apache.org/docs/latest/spark-queries/#time-travel

      SELECT * FROM my_catalog.db.tbl
      SELECT * FROM my_catalog.db.tbl VERSION AS OF <SNAPSHOT_ID>
      SELECT * FROM my_catalog.db.tbl TIMESTAMP AS OF '<TIMESTAMP>'
      SELECT * FROM my_catalog.db.tbl VERSION AS OF '<TAG_NAME | BRANCH_NAME>'

      Current table:
      +---+-------+----+
      |id |name   |year|
      +---+-------+----+
      |1  |Alice  |2024|
      |2  |Bob    |2023|
      |3  |Charlie|2022|
      |4  |Dave   |2021|
      |5  |Elly   |2022|
      +---+-------+----+

      SELECT * FROM my_catalog.db.tbl VERSION AS OF 7394805868859573195
      +---+-------+----+
      |id |name   |year|
      +---+-------+----+
      |1  |Alice  |2024|
      |2  |Bob    |2023|
      |3  |Charlie|2022|
      +---+-------+----+
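      Time travel is also available through the DataFrame read options (snapshot-id, as-of-timestamp, branch, tag). A sketch reusing the snapshot id above:

          df = spark.read \
              .option("snapshot-id", 7394805868859573195) \
              .format("iceberg") \
              .load("my_catalog.db.tbl")
          df.show()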
  26. Reads: table inspection
      Ref: https://iceberg.apache.org/docs/latest/spark-queries/#inspecting-tables

      SELECT * FROM my_catalog.db.tbl.<KEYWORD, e.g. history, snapshots, refs, etc.>

      SELECT * FROM my_catalog.db.tbl.history
      +-----------------------+-------------------+-------------------+-------------------+
      |made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
      +-----------------------+-------------------+-------------------+-------------------+
      |2024-09-02 07:45:31.031|7394805868859573195|NULL               |true               |
      |2024-09-02 08:23:56.313|7217724094206392902|7394805868859573195|true               |
      +-----------------------+-------------------+-------------------+-------------------+
  27. Writes: MERGE INTO (= upsert)

      MERGE INTO my_catalog.db.tbl t
      USING (SELECT * FROM tmp) s
      ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.year = s.year
      WHEN NOT MATCHED THEN INSERT *

      Target:                 tmp (source):           Result:
      +---+-------+----+      +---+-------+----+      +---+-------+----+
      |id |name   |year|      |id |name   |year|      |id |name   |year|
      +---+-------+----+      +---+-------+----+      +---+-------+----+
      |1  |Alice  |2024|      |1  |Alice  |1999|      |1  |Alice  |1999|
      |2  |Bob    |2023|      |2  |Bob    |1998|      |2  |Bob    |1998|
      |3  |Charlie|2022|      |8  |Tommy  |2024|      |3  |Charlie|2022|
      |4  |Dave   |2021|      +---+-------+----+      |4  |Dave   |2021|
      |5  |Elly   |2022|                              |5  |Elly   |2022|
      +---+-------+----+                              |8  |Tommy  |2024|
                                                      +---+-------+----+
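      For completeness, the source view tmp used above could be registered like this (a sketch reproducing the rows shown in the example):

          updates = spark.createDataFrame(
              [(1, "Alice", 1999), (2, "Bob", 1998), (8, "Tommy", 2024)],
              ["id", "name", "year"])
          updates.createOrReplaceTempView("tmp")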
  28. Procedures by category (all invoked as CALL my_catalog.system.<procedure>)
      - Snapshot management: rollback_to_snapshot, rollback_to_timestamp, set_current_snapshot, cherrypick_snapshot
      - Branch and tag: publish_changes, fast_forward
      - Data lifecycle management: expire_snapshots, remove_orphan_files
      - Compaction: rewrite_data_files, rewrite_manifests, rewrite_position_delete_files
      - Table migration: snapshot, migrate, add_files
      - Table registration: register_table
      - CDC for an Iceberg table: create_changelog_view
      - Metadata information: ancestors_of
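      A few example invocations (a sketch; argument names follow the Iceberg Spark procedures documentation, and the cutoff timestamp is illustrative):

          # Compaction: rewrite small data files into larger ones
          spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.tbl')")
          # Data lifecycle: drop old snapshots, then clean up unreferenced files
          spark.sql("""
              CALL my_catalog.system.expire_snapshots(
                  table => 'db.tbl',
                  older_than => TIMESTAMP '2024-08-01 00:00:00')
          """)
          spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.tbl')")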
  29. Recap: Iceberg configurations for Spark - the two equivalent Glue configurations from 14: the type = "glue" shorthand (Iceberg 1.5.0+), or the explicit catalog-impl/io-impl classes (before 1.5.0).
  30. Recap: Iceberg configurations for Spark

      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
          .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()
  31. Recap: the same configuration, grouped: the catalog settings (SparkCatalog and catalog-impl), the storage settings (io-impl and warehouse), and the extensions (spark.sql.extensions).
  32. Initialization of the SparkSession with Iceberg configurations
      - Spark catalog: my_catalog = org.apache.iceberg.spark.SparkCatalog
      - Catalog implementation class: org.apache.iceberg.aws.glue.GlueCatalog
      - Storage implementation class and location: org.apache.iceberg.aws.s3.S3FileIO, warehouse = s3://bucket/path
      - SparkSQL extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  33. "Spark Catalog" ͱ͸ • Spark 3 Ͱ Catalog plugin API

    ͱ Multiple catalog support ͕௥Ճ͞Εͨ • SPARK-27066 - SPIP: Identi fi ers for multi-catalog support • SPARK-27067 - SPIP: Catalog API for table metadata • ຊػೳʹΑΓ: • Spark ͰϢʔβʔ͕ಠࣗʹ࣮૷ͨ͠ "catalog" Λར༻͢Δ͜ͱ͕Ͱ͖Δ • Catalog ͱ͍͏৽ͨͳ໊લۭؒΛઃ͚Δ͜ͱͰɺ1 SparkSession Ͱෳ਺ͷΧλϩά࣮૷Λར༻Ͱ͖Δ (e.g. SELECT * FROM catalog.db.tbl) • Default catalog name: spark_catalog (spark.sql.defaultCatalog=<CATALOG_NAME> ͰมߋՄೳ) • Iceberg Spark Ͱ͸ɺຊ Catalog API Λར༻͠ɺ֤Χλϩά࣮૷ʹΞΫηε͍ͯ͠Δ
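      A sketch of multiple catalogs in a single SparkSession (the catalog names and the metastore URI here are illustrative):

          from pyspark.sql import SparkSession

          spark = SparkSession.builder \
              .config("spark.sql.catalog.glue_cat", "org.apache.iceberg.spark.SparkCatalog") \
              .config("spark.sql.catalog.glue_cat.type", "glue") \
              .config("spark.sql.catalog.glue_cat.warehouse", "s3://bucket/glue-path") \
              .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog") \
              .config("spark.sql.catalog.hive_cat.type", "hive") \
              .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083") \
              .config("spark.sql.defaultCatalog", "glue_cat") \
              .getOrCreate()

          spark.sql("SELECT * FROM db.tbl")           # resolved in glue_cat (the default)
          spark.sql("SELECT * FROM hive_cat.db.tbl")  # explicitly addresses the second catalog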
  34. How the Spark catalog and its implementation class are resolved. There are two layers:
      1. SparkCatalog is first resolved in the "Spark" layer:
         spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
      2. The Iceberg catalog implementation (e.g. HiveCatalog, GlueCatalog) is then loaded in the "Iceberg" layer:
         spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  35. How the Spark catalog and its implementation class are resolved: the Catalyst query optimization process. Before looking at how SparkCatalog is resolved in the Spark layer, let's look at the Catalyst plan: SQLParser -> Analyzer -> Optimizer -> Planner. For details, see https://github.com/apache/spark/blob/v3.5.2/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
  36. The Catalyst query optimization process:
      Unresolved LogicalPlan -> (rule-based analysis by the Analyzer) -> Analyzed LogicalPlan -> (rule-based and cost-based optimization by the Optimizer) -> Optimized LogicalPlan -> WithCachedData LogicalPlan -> (converted to a physical plan by the SparkPlanner) -> PhysicalPlan (sparkPlan) -> Executable PhysicalPlan (executedPlan) -> CodeGen -> RDD
      Runtime optimization: if needed, Adaptive Query Execution re-optimizes the logical plan.
      For details, see https://github.com/apache/spark/blob/v3.5.2/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
  37. The same Catalyst pipeline, with the catalog highlighted: the catalog comes into play during the Analyzer phase.
  38. Resolution in the "Spark" layer (during analysis):
      Parse SQL with SparkSqlParser -> resolve in the SparkAnalyzer (e.g. ResolveCatalogs, LookupCatalog) -> CatalogManager.catalog -> Catalogs.load -> initialize org.apache.iceberg.spark.SparkCatalog (SparkCatalog.initialize) via the CatalogPlugin API.
      With the DataFrame APIs and DataSourceV2, the SQL parse phase is skipped and the Iceberg catalog is loaded by Spark's CatalogV2Util class.
  39. Resolution in the "Iceberg" layer: after the Spark-layer steps above (parse, analyze, load, initialize SparkCatalog), SparkCatalog.buildIcebergCatalog calls CatalogUtil.buildIcebergCatalog.
  40. Iceberg catalog types loaded by CatalogUtil.buildIcebergCatalog. Iceberg catalogs fall into two categories: 1. catalogs with a shorthand (settable via type), 2. custom catalogs.

      Catalogs settable via type (catalog type -> class name -> default storage class):
      - hive (default) -> org.apache.iceberg.hive.HiveCatalog -> org.apache.iceberg.hadoop.HadoopFileIO
      - hadoop -> org.apache.iceberg.hadoop.HadoopCatalog -> org.apache.iceberg.hadoop.HadoopFileIO
      - jdbc -> org.apache.iceberg.jdbc.JdbcCatalog -> org.apache.iceberg.hadoop.HadoopFileIO
      - rest -> org.apache.iceberg.rest.RESTCatalog -> org.apache.iceberg.io.ResolvingFileIO
      - glue -> org.apache.iceberg.aws.glue.GlueCatalog -> org.apache.iceberg.aws.s3.S3FileIO
      - nessie -> org.apache.iceberg.nessie.NessieCatalog -> org.apache.iceberg.hadoop.HadoopFileIO

      Custom catalogs (the implementation class must be set in catalog-impl; some ship in the Iceberg packages; default storage class: n/a):
      - org.apache.iceberg.inmemory.InMemoryCatalog
      - org.apache.iceberg.snowflake.SnowflakeCatalog
      - org.apache.iceberg.dell.ecs.EcsCatalog
      - etc.
  41. Iceberg catalog types loaded by CatalogUtil.buildIcebergCatalog: for Glue, the type = "glue" shorthand and the explicit catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog / io-impl = org.apache.iceberg.aws.s3.S3FileIO configuration (14) resolve to the same catalog, so either configuration can be used.
  42. Loading the FileIO class, after CatalogUtil.buildIcebergCatalog:
      - CatalogUtil.buildIcebergCatalog sets the catalog implementation: for type=hive, org.apache.iceberg.hive.HiveCatalog; or the class given in catalog-impl, e.g. org.apache.iceberg.snowflake.SnowflakeCatalog
      - CatalogUtil.loadCatalog: the implementation class is loaded by dynamic construction (the DynConstructors class)
        Ref: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L233
      - CatalogUtil.loadFileIO: the catalog is initialized from the loaded implementation class, and the FileIO is loaded at this initialization
        Refs:
        https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L256
        (HiveCatalog) https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L111
  43. Summary: catalog resolution
      Spark layer: parse SQL with SparkSqlParser -> resolve in the SparkAnalyzer (ResolveCatalogs, LookupCatalog) -> CatalogManager.catalog -> Catalogs.load -> initialize org.apache.iceberg.spark.SparkCatalog via CatalogPlugin
      Iceberg layer: SparkCatalog.buildIcebergCatalog -> CatalogUtil.buildIcebergCatalog (set the catalog impl) -> CatalogUtil.loadCatalog (load the catalog impl) -> CatalogUtil.loadFileIO (load the FileIO impl) -> ready to use
  44. "IcebergSparkSessionExtensions" ͱ͸ • Spark catalyst ʹ͓͚Δϧʔϧ͸ SparkSessionExtensions class ܦ༝Ͱ֦ு ͢Δ͜ͱ͕Ͱ͖Δ

    • Developer ͸ಠࣗϧʔϧΛ࣮૷ͯ͠ɺͦΕΛ Spark catalyst ʹ௥ՃͰ͖Δ • User ͸ɺ௥Ճ͞ΕͨϧʔϧΛ spark.sql.extensions ʹઃఆ͢Δ͜ͱͰ࢖༻ Ͱ͖Δ Ref: https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/spark/v3.5/spark-extensions/src/main/scala/org/apache/iceberg/spark/extensions/IcebergSparkSessionExtensions.scala spark = SparkSession.builder\ .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")\ .config("spark.sql.catalog.my_catalog.type", "glue")\ .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path")\ .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkExtensions")\ .getOrCreate()
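      A small illustration of why this setting matters (a sketch; assumes the walkthrough session and table): Iceberg-specific syntax such as CALL is handled by the parser injected through IcebergSparkSessionExtensions, so without spark.sql.extensions set, the statement below fails with a parse error.

          # Requires spark.sql.extensions to include IcebergSparkSessionExtensions;
          # the snapshot id reuses the example id from the table-inspection slide.
          spark.sql("CALL my_catalog.system.rollback_to_snapshot('db.tbl', 7394805868859573195)")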
  45. Iceberg Catalyst rule injection by IcebergSparkSessionExtensions (Spark 3.5 & Iceberg 1.6.1):
      - SQLParser: injectParser
      - Analyzer (Unresolved LogicalPlan => Analyzed LogicalPlan): injectResolutionRule (ResolveProcedures, ResolveViews, ProcedureArgumentCoercion), injectCheckRule (CheckViews)
      - Optimizer (Analyzed LogicalPlan => Optimized LogicalPlan): injectOptimizerRule (ReplaceStaticInvoke), injectPreCBORule
      - Planner (Optimized LogicalPlan => SparkPlan, plus AQE): injectPlannerStrategy (ExtendedDataSourceV2Strategy)
  46. Many Iceberg Catalyst rules were added to Spark itself - see the Iceberg 1.4.0 release notes and the Tabular blog post of 2023 Oct. 5: https://tabular.io/blog/iceberg-1-4-0/
  47. Many Iceberg Catalyst rules were added to Spark (phase / method / classes for Spark 3.4 vs. Spark 3.5):
      - Parse / injectParser: Spark 3.4: IcebergSparkSqlExtensionsParser; Spark 3.5: IcebergSparkSqlExtensionsParser
      - Analyzer / injectResolutionRule: Spark 3.4: ResolveProcedures, ResolveViews, ResolveMergeIntoTableReferences, CheckMergeIntoTableConditions, ProcedureArgumentCoercion, AlignRowLevelCommandAssignments, RewriteUpdateTable, RewriteMergeIntoTable; Spark 3.5: ResolveProcedures, ResolveViews, ProcedureArgumentCoercion
      - Analyzer / injectCheckRule: Spark 3.4: CheckViews, MergeIntoIcebergTableResolutionCheck, AlignedRowLevelIcebergCommandCheck; Spark 3.5: CheckViews
      - Optimizer / injectOptimizerRule: Spark 3.4: ExtendedSimplifyConditionalsInPredicate, ExtendedReplaceNullWithFalseInPredicate, ReplaceStaticInvoke; Spark 3.5: ReplaceStaticInvoke
      - Optimizer / injectPreCBORule: Spark 3.4: RowLevelCommandScanRelationPushdown, ExtendedV2Writes, RowLevelCommandDynamicPruning, ReplaceRewrittenRowLevelCommand; Spark 3.5: n/a
      - Planner / injectPlannerStrategy: Spark 3.4: ExtendedDataSourceV2Strategy; Spark 3.5: ExtendedDataSourceV2Strategy
      (e.g.) https://github.com/apache/spark/blob/v3.5.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteUpdateTable.scala
  48. "SparkSessionCatalog" (for Iceberg) ͱ͸? • Iceberg Catalog ࣮૷ʹՃ͑ɺSpark ͷϏϧτΠϯΧλϩά (spark_catalog)

    ΋ ؚΜͰ͍Δɻͭ·Γ Spark (Iceberg Ͱͳ͍) ςʔϒϧΦϖϨʔγϣϯ΋࣮ߦͰ ͖Δ • "Fallback" catalog ͱݴΘΕɺ࠷ॳʹ Iceberg Catalog ࣮૷Λ࢖͓͏ͱ͢Δ ͕ɺҧ͏৔߹͸ Spark Catalog Λར༻͢Δ • Spark Catalog ͱ Iceberg Catalog Λར༻͢Δ migrate procedure ࣮ߦ࣌ʹ ࢖༻͢Δ৔߹͕͋Δ
  49. SparkSessionCatalog configuration

      # SparkCatalog: Iceberg tables live under the my_catalog namespace
      spark = SparkSession.builder \
          .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
          .config("spark.sql.catalog.my_catalog.type", "glue") \
          .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()
      spark.sql("SELECT * FROM my_catalog.db.tbl").show(truncate=False)

      # SparkSessionCatalog: replaces the built-in spark_catalog, so no catalog prefix is needed
      spark = SparkSession.builder \
          .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
          .config("spark.sql.catalog.spark_catalog.type", "glue") \
          .config("spark.sql.catalog.spark_catalog.warehouse", "s3://bucket/path") \
          .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
          .getOrCreate()
      spark.sql("SELECT * FROM db.tbl").show(truncate=False)
  50. Summary
      - To use Iceberg with Spark, add the Iceberg packages and set the SparkCatalog-related parameters
      - Spark can run many Iceberg operations, and various maintenance tasks can be executed as procedures
      - The Iceberg catalog is configured and initialized through Spark's CatalogPlugin API, after which the FileIO is initialized; each catalog has a default FileIO class
      - IcebergSparkSessionExtensions makes it possible to run queries that have no implementation in Spark itself (such as procedures)