Zesty journey to adopt Apache Iceberg
Our journey to adopt Apache Iceberg on AWS

Eran Levy
September 08, 2023

Transcript

  1. How can you utilize Iceberg on AWS with no Spark expertise on your team,
     while going all-in on serverless? WIFM (what's in it for me) @levyeran
  2. Why did we choose Apache Iceberg? @levyeran
     - An open table format that is widely adopted and integrates well with the
       AWS ecosystem (Glue catalog, Athena, etc.).
     - Table evolution - mainly schema and partitioning layout (particularly
       hidden partitioning); see the sketch after this list.
     - Integrates well with many processing engines - this supports our
       long-term strategy of applying the right technology to each need.
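
     A minimal sketch of that evolution point in PySpark, assuming a Spark
     session with the Iceberg extensions already configured (as on slide 8);
     the glue_catalog.analytics.events table is hypothetical:

         from pyspark.sql import SparkSession

         # Assumes the Iceberg catalog/extension configs shown on slide 8 are
         # already set (e.g. by Glue via --datalake-formats=iceberg and --conf).
         spark = SparkSession.builder.getOrCreate()

         # Schema evolution: a metadata-only column add; no data is rewritten.
         spark.sql("ALTER TABLE glue_catalog.analytics.events ADD COLUMN country string")

         # Partition evolution: new writes use the new layout. With hidden
         # partitioning, queries keep filtering on event_ts and Iceberg applies
         # the day() transform behind the scenes.
         spark.sql("ALTER TABLE glue_catalog.analytics.events ADD PARTITION FIELD day(event_ts)")
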
  3. While there are many cool things in Iceberg, there are some challenges…
     The main challenge is: Maintenance @levyeran
  4. Table Configuration @levyeran
     - Iceberg v2 table, created with the AWS Glue catalog and Athena engine
       version 3 (preferably in a dedicated WorkGroup); a DDL sketch follows
       this list.
     - Parquet with ZSTD compression - the data format we adopted across our
       data lake.
     - Snapshot age of 2 days (the default is 5 days). Athena only allows a
       predefined set of key-value TBLPROPERTIES.
     - Glue catalog - metadata tracking.
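
     A sketch of that DDL, submitted through Athena with boto3; the database,
     table, bucket, and WorkGroup names are hypothetical, and 172800 seconds
     is the slide's 2-day snapshot age:

         import boto3

         athena = boto3.client("athena")

         # Properties mirror the slide: Iceberg table, Parquet + ZSTD, and the
         # snapshot age lowered from the 5-day default to 2 days.
         ddl = """
         CREATE TABLE analytics.events (
           id       bigint,
           payload  string,
           event_ts timestamp
         )
         PARTITIONED BY (day(event_ts))
         LOCATION 's3://my-datalake-bucket/analytics/events/'
         TBLPROPERTIES (
           'table_type' = 'ICEBERG',
           'format' = 'parquet',
           'write_compression' = 'zstd',
           'vacuum_max_snapshot_age_seconds' = '172800'
         )
         """

         athena.start_query_execution(
             QueryString=ddl,
             WorkGroup="iceberg-maintenance",  # a dedicated WorkGroup, per the slide
         )
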
  5. Table Maintenance @levyeran
     We update our Iceberg table frequently (every minute, 5 GB, insert/update,
     50 columns, 500M records)… So we wanted to VACUUM, but we were hitting
     the Athena query limits.
  6. Table Maintenance @levyeran
     Increasing the limits didn't help much, because we were hitting another
     error: ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round
     of vacuum, but there are more files remaining. Please run another VACUUM
     command to process the remaining files.
     You can try to overcome it by running AWS Step Functions in a loop, as in
     this suggested solution (see the sketch below). Miss several runs and you
     will face another challenge, as increasing the Athena query limits won't
     help you much this time…
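
     A minimal sketch of that retry loop in Python rather than Step Functions;
     the table and WorkGroup names are hypothetical, and it assumes the error
     surfaces in the query's StateChangeReason:

         import time

         import boto3

         athena = boto3.client("athena")

         def run_query(sql: str) -> dict:
             """Submit an Athena query and poll until it reaches a terminal state."""
             qid = athena.start_query_execution(
                 QueryString=sql,
                 WorkGroup="iceberg-maintenance",  # hypothetical WorkGroup
             )["QueryExecutionId"]
             while True:
                 status = athena.get_query_execution(
                     QueryExecutionId=qid
                 )["QueryExecution"]["Status"]
                 if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
                     return status
                 time.sleep(5)

         # Each VACUUM round removes a bounded batch of files, so rerun until
         # Athena stops asking for another round.
         while True:
             status = run_query("VACUUM analytics.events")  # hypothetical table
             if status["State"] == "SUCCEEDED":
                 break
             reason = status.get("StateChangeReason", "")
             if "ICEBERG_VACUUM_MORE_RUNS_NEEDED" not in reason:
                 raise RuntimeError(f"VACUUM failed: {reason}")
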
  7. Glue Spark ETL Jobs @levyeran
     To solve it for the long run, we decided to use the Iceberg Spark
     procedures to perform our maintenance jobs:
     - Glue 3.0 and later supports Iceberg integration out of the box.
     - Ad hoc runs & a built-in scheduler.
     - Integrated with our CI/CD pipeline using the AWS SDKs; a sketch follows
       this list.
     A nice AWS blog post and an AWS Glue Developer Guide section are available.
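
     For the CI/CD and ad hoc points, a sketch of triggering a maintenance job
     with the AWS SDK for Python; the job name is hypothetical:

         import boto3

         glue = boto3.client("glue")

         # The same call can run from a CI/CD pipeline step; scheduled runs
         # would use a Glue trigger instead.
         run = glue.start_job_run(JobName="iceberg-table-maintenance")
         print("Started run:", run["JobRunId"])
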
  8. Glue Spark Maintenance Jobs @levyeran
     The most important steps to perform are:
     - Register the Iceberg connector for AWS Glue (not required for Glue 4.0).
     - Create an ETL job or a Jupyter notebook.
     - Provide the necessary configuration to the Spark job/notebook, such as
       --datalake-formats and --conf; a sketch follows this list.
     NOTE: these actions automatically inject the Iceberg Spark SQL extensions.
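
     A sketch of the resulting session configuration, assuming Glue 4.0 with
     --datalake-formats set to iceberg as a job parameter; the catalog name
     and warehouse path are hypothetical:

         from pyspark.sql import SparkSession

         # These values can equally be passed through the --conf job parameter;
         # --datalake-formats=iceberg makes Glue load the Iceberg libraries.
         spark = (
             SparkSession.builder
             .config("spark.sql.extensions",
                     "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
             .config("spark.sql.catalog.glue_catalog",
                     "org.apache.iceberg.spark.SparkCatalog")
             .config("spark.sql.catalog.glue_catalog.catalog-impl",
                     "org.apache.iceberg.aws.glue.GlueCatalog")
             .config("spark.sql.catalog.glue_catalog.io-impl",
                     "org.apache.iceberg.aws.s3.S3FileIO")
             .config("spark.sql.catalog.glue_catalog.warehouse",
                     "s3://my-datalake-bucket/warehouse/")  # hypothetical path
             .getOrCreate()
         )
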
  9. Glue Spark Maintenance Jobs @levyeran
     Main maintenance procedures (sketched below):
     - expire_snapshots
     - rewrite_data_files
     - remove_orphan_files
     - rewrite_manifests
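
     A sketch of calling those procedures from the Glue job, reusing the
     session above; the table name and retain_last value are hypothetical:

         # Standard Iceberg Spark procedures, invoked through the Glue catalog.
         table = "analytics.events"  # hypothetical table

         # Expire snapshots outside the retention window (keep at least the last 5).
         spark.sql(f"CALL glue_catalog.system.expire_snapshots(table => '{table}', retain_last => 5)")

         # Compact the small files produced by frequent one-minute writes.
         spark.sql(f"CALL glue_catalog.system.rewrite_data_files(table => '{table}')")

         # Remove files that no snapshot references anymore.
         spark.sql(f"CALL glue_catalog.system.remove_orphan_files(table => '{table}')")

         # Rewrite manifests to speed up query planning.
         spark.sql(f"CALL glue_catalog.system.rewrite_manifests('{table}')")
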
  10. Summary
      • Apache Iceberg is widely adopted in the industry, and specifically in
        the AWS ecosystem.
      • It's not persist & forget -> take Iceberg maintenance into
        consideration while choosing your architecture.
      • Keep monitoring -> your partitioning strategy, file sizes, query
        latencies, etc. might change, as there are many moving parts that can
        impact your performance.
  11. Next Steps
      • Choosing our data lakehouse platform.
      • Maintenance is an issue as we scale to additional use cases with
        larger data volumes - we might need a managed service to assist us
        here.