Zesty journey to adopt Apache Iceberg
Our journey to adopt Apache Iceberg on AWS

Eran Levy
September 08, 2023

Transcript

  1. How can you utilize Iceberg on AWS with no Spark expertise on your team,
     while going all-in on serverless? WIFM (what's in it for me) @levyeran
  2. Why did we choose Apache Iceberg? @levyeran
     - An open table format that is widely adopted and integrates well with the
       AWS ecosystem (Glue catalog, Athena, etc.).
     - Table evolution - mainly schema and partitioning layout (particularly
       hidden partitioning); see the sketch after this list.
     - Integrates well with many processing engines - this supports our
       long-term strategy of applying the right technology to each need.
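
     A minimal sketch of that evolution point in PySpark, assuming a Spark
     session with the Iceberg extensions already configured (as on slide 8);
     the glue_catalog.analytics.events table is hypothetical:

         from pyspark.sql import SparkSession

         # Assumes the Iceberg catalog/extension configs shown on slide 8 are
         # already set (e.g. by Glue via --datalake-formats=iceberg and --conf).
         spark = SparkSession.builder.getOrCreate()

         # Schema evolution: a metadata-only column add; no data is rewritten.
         spark.sql("ALTER TABLE glue_catalog.analytics.events ADD COLUMN country string")

         # Partition evolution: new writes use the new layout. With hidden
         # partitioning, queries keep filtering on event_ts and Iceberg applies
         # the day() transform behind the scenes.
         spark.sql("ALTER TABLE glue_catalog.analytics.events ADD PARTITION FIELD day(event_ts)")
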
  3. While there are many cool things in Iceberg, there are some challenges…
     The main challenge is: Maintenance @levyeran
  4. Table Configuration @levyeran
     - Iceberg v2 table, created with the AWS Glue catalog and Athena engine
       version 3 (preferably in a dedicated WorkGroup); a DDL sketch follows
       this list.
     - Parquet with ZSTD compression - the data format we adopted across our
       data lake.
     - Snapshot age of 2 days (the default is 5 days). Athena only allows a
       predefined set of key-value TBLPROPERTIES.
     - Glue catalog - metadata tracking.
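
     A sketch of that DDL, submitted through Athena with boto3; the database,
     table, bucket, and WorkGroup names are hypothetical, and 172800 seconds
     is the slide's 2-day snapshot age:

         import boto3

         athena = boto3.client("athena")

         # Properties mirror the slide: Iceberg table, Parquet + ZSTD, and the
         # snapshot age lowered from the 5-day default to 2 days.
         ddl = """
         CREATE TABLE analytics.events (
           id       bigint,
           payload  string,
           event_ts timestamp
         )
         PARTITIONED BY (day(event_ts))
         LOCATION 's3://my-datalake-bucket/analytics/events/'
         TBLPROPERTIES (
           'table_type' = 'ICEBERG',
           'format' = 'parquet',
           'write_compression' = 'zstd',
           'vacuum_max_snapshot_age_seconds' = '172800'
         )
         """

         athena.start_query_execution(
             QueryString=ddl,
             WorkGroup="iceberg-maintenance",  # a dedicated WorkGroup, per the slide
         )
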
  5. Table Maintenance @levyeran
     We update our Iceberg table frequently (every minute, 5 GB, insert/update,
     50 columns, 500M records)… So we wanted to VACUUM, but we were hitting
     the Athena query limits.
  6. Table Maintenance @levyeran
     Increasing the limits didn't help much, because we were hitting another
     error: ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round
     of vacuum, but there are more files remaining. Please run another VACUUM
     command to process the remaining files.
     You can try to overcome it by running AWS Step Functions in a loop, as in
     this suggested solution (see the sketch below). Miss several runs and you
     will face another challenge, as increasing the Athena query limits won't
     help you much this time…
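
     A minimal sketch of that retry loop in Python rather than Step Functions;
     the table and WorkGroup names are hypothetical, and it assumes the error
     surfaces in the query's StateChangeReason:

         import time

         import boto3

         athena = boto3.client("athena")

         def run_query(sql: str) -> dict:
             """Submit an Athena query and poll until it reaches a terminal state."""
             qid = athena.start_query_execution(
                 QueryString=sql,
                 WorkGroup="iceberg-maintenance",  # hypothetical WorkGroup
             )["QueryExecutionId"]
             while True:
                 status = athena.get_query_execution(
                     QueryExecutionId=qid
                 )["QueryExecution"]["Status"]
                 if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                     return status
                 time.sleep(5)

         # Each VACUUM round removes a bounded batch of files, so rerun until
         # Athena stops asking for another round.
         while True:
             status = run_query("VACUUM analytics.events")  # hypothetical table
             if status["State"] == "SUCCEEDED":
                 break
             reason = status.get("StateChangeReason", "")
             if "ICEBERG_VACUUM_MORE_RUNS_NEEDED" not in reason:
                 raise RuntimeError(f"VACUUM failed: {reason}")
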
  7. Glue Spark ETL Jobs @levyeran
     To solve it for the long run, we decided to use the Iceberg Spark
     procedures to perform our maintenance jobs:
     - Glue 3.0 and later supports Iceberg integration out of the box.
     - Ad hoc runs & a built-in scheduler.
     - Integrated with our CI/CD pipeline using the AWS SDKs; a sketch follows
       this list.
     A nice AWS blog post and an AWS Glue Developer Guide section are available.
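
     For the CI/CD and ad hoc points, a sketch of triggering a maintenance job
     with the AWS SDK for Python; the job name is hypothetical:

         import boto3

         glue = boto3.client("glue")

         # The same call can run from a CI/CD pipeline step; scheduled runs
         # would use a Glue trigger instead.
         run = glue.start_job_run(JobName="iceberg-table-maintenance")
         print("Started run:", run["JobRunId"])
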
  8. Glue Spark Maintenance Jobs @levyeran
     The most important steps to perform are:
     - Register the Iceberg connector for AWS Glue (not required for Glue 4.0).
     - Create an ETL job or a Jupyter notebook.
     - Provide the necessary configuration to the Spark job/notebook, such as
       --datalake-formats and --conf; a sketch follows this list.
     NOTE: these actions automatically inject the Iceberg Spark SQL extensions.
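
     A sketch of the resulting session configuration, assuming Glue 4.0 with
     --datalake-formats set to iceberg as a job parameter; the catalog name
     and warehouse path are hypothetical:

         from pyspark.sql import SparkSession

         # These values can equally be passed through the --conf job parameter;
         # --datalake-formats=iceberg makes Glue load the Iceberg libraries.
         spark = (
             SparkSession.builder
             .config("spark.sql.extensions",
                     "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
             .config("spark.sql.catalog.glue_catalog",
                     "org.apache.iceberg.spark.SparkCatalog")
             .config("spark.sql.catalog.glue_catalog.catalog-impl",
                     "org.apache.iceberg.aws.glue.GlueCatalog")
             .config("spark.sql.catalog.glue_catalog.io-impl",
                     "org.apache.iceberg.aws.s3.S3FileIO")
             .config("spark.sql.catalog.glue_catalog.warehouse",
                     "s3://my-datalake-bucket/warehouse/")  # hypothetical path
             .getOrCreate()
         )
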
  9. Glue Spark Maintenance Jobs @levyeran
     Main maintenance procedures (sketched below):
     - expire_snapshots
     - rewrite_data_files
     - remove_orphan_files
     - rewrite_manifests
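
     A sketch of calling those procedures from the Glue job, reusing the
     session above; the table name and retain_last value are hypothetical:

         # Standard Iceberg Spark procedures, invoked through the Glue catalog.
         table = "analytics.events"  # hypothetical table

         # Expire snapshots outside the retention window (keep at least the last 5).
         spark.sql(f"CALL glue_catalog.system.expire_snapshots(table => '{table}', retain_last => 5)")

         # Compact the small files produced by frequent one-minute writes.
         spark.sql(f"CALL glue_catalog.system.rewrite_data_files(table => '{table}')")

         # Remove files that no snapshot references anymore.
         spark.sql(f"CALL glue_catalog.system.remove_orphan_files(table => '{table}')")

         # Rewrite manifests to speed up query planning.
         spark.sql(f"CALL glue_catalog.system.rewrite_manifests('{table}')")
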
  10. Summary
      • Apache Iceberg is widely adopted in the industry, and specifically in
        the AWS ecosystem.
      • It's not persist & forget -> take Iceberg maintenance into
        consideration while choosing your architecture.
      • Keep monitoring -> your partitioning strategy, file sizes, query
        latencies, etc. might change, as there are many moving parts that can
        impact your performance.
  11. Next Steps
      • Choosing our data lakehouse platform.
      • Maintenance is an issue as we scale to additional use cases with
        larger data volumes - we might need a managed service to assist us
        here.