Agenda (KNOW, DISCOVER, DO)
• Data Warehouse vs Data Lake vs Lakehouse
• Streaming Lakehouse
• Open Table Formats
• Apache Iceberg - Features, Benefits & Challenges
• Iceberg vs other open table formats
• Connecting the dots with Kafka & Flink
• Demo with Kafka, Flink & Iceberg in Action
• Q&A
The building blocks of the stack:
• Storage - storage infra (file system, HDFS, object storage)
• File Format - format of the data (CSV, Avro, Parquet, ORC etc.)
• Metadata Layer (table format) on top of the file format - laying out data, maintenance, optimization etc.
• Compute Engine - runs user workloads to process the data
• Catalog - dictionary to discover table metadata
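As a rough sketch of how these layers fit together, the hypothetical Flink SQL snippet below maps each layer to a concrete setting; the catalog, database, bucket path and table names are illustrative, not from the demo.

-- Catalog: registers tables and points to their metadata (a Hadoop catalog is assumed here)
CREATE CATALOG demo_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3a://demo-bucket/warehouse'   -- Storage: object storage path (illustrative)
);
CREATE DATABASE IF NOT EXISTS demo_catalog.db;

-- Metadata layer: the Iceberg table format sits on top of the file format
CREATE TABLE demo_catalog.db.events (
  id BIGINT,
  payload STRING
) WITH (
  'write.format.default' = 'parquet'            -- File format of the underlying data files
);

-- Compute engine: Flink runs the user workload against the table
SELECT count(*) FROM demo_catalog.db.events;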
Data Warehouse: a database for BI and reporting
• Only structured data
• Schema on write
• ACID guarantees
• Effective data governance
• Storage & compute tightly coupled
• No native support for ML workloads
• High cost
e.g. Teradata, Oracle Exadata (legacy EDW), OLAP Cubes
Cloud Data Warehouses (since 2010) like Redshift, BigQuery and Snowflake separate storage and compute, and also support unstructured data and ML workloads.
Data Lake
• Structured & unstructured data
• Schema on read
• Hive table format
• Storage and compute decoupling
• Open data formats like CSV, Avro, Parquet, ORC
• Lower cost
• Supports ML use cases
• No metadata layer, no ACID support
A Data Lake is often used in conjunction with a Data Warehouse: raw data is stored in the lake and further cleansed and aggregated in the warehouse.
Started with Hadoop MapReduce for compute and HDFS as storage; evolved with cloud object storage (S3, ADLS, GCS) and query engines (Spark, Presto).
Lakehouse: a term made popular by Databricks
• Metadata layer with open table formats like Hudi, Delta Lake, Iceberg
• Cost efficient
• ACID guarantees
• Schema evolution
• Open architecture
• Faster queries
Combines the best of both worlds! A Lakehouse can also double up as both a data lake and a warehouse.
Limitations of the Hive Table Format:
• Invisible specification
• Schema evolution & partition evolution need data rewrites
• Metadata and data often not in sync
• No transactional guarantees
• No time travel & rollback
Lakehouse open table formats - Hudi, Delta Lake and Iceberg - solve most of these limitations.
Apache XTable (incubating) provides cross-table, omni-directional interoperability between the lakehouse table formats.
Apache Paimon is a recent top-level Apache project optimized for stream processing in the Lakehouse.
Apache Iceberg is an open table format purpose-built for large-scale analytics. It brings the reliability and simplicity of SQL tables to big data while making it possible for multiple engines like Spark, Trino, PrestoDB, Flink, Hive etc. to work with the same tables.
• 2017 - Created at Netflix by Ryan Blue and Daniel Weeks
• 2018 - Open-sourced and donated to the Apache Software Foundation
• Overcomes performance, consistency and many other challenges of the Hive table format
Iceberg table metadata components:
• Catalog: points to a table's current metadata file
• Metadata file: defines a table's structure - schema, partition scheme, snapshot list etc.
• Snapshot: the state of the table's data after a write
• Manifest file: contains location, path and metadata about a list of data files
• Manifest list: defines a single snapshot as a list of manifest files along with stats
• Data file: file containing the data of the table (Parquet, ORC, Avro etc.)
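Iceberg exposes this metadata tree through queryable metadata tables. A minimal sketch using Spark SQL syntax follows; the catalog and table names are illustrative (the employee table is the one created on the next slide).

-- Snapshots: one row per snapshot, created after each write
SELECT snapshot_id, committed_at, operation FROM demo_catalog.db.employee.snapshots;

-- Manifest files referenced by the current snapshot
SELECT path, added_data_files_count FROM demo_catalog.db.employee.manifests;

-- Data files, with the per-file stats tracked in the manifests
SELECT file_path, record_count, file_size_in_bytes FROM demo_catalog.db.employee.files;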
Key features:
• Schema evolution
• Partition evolution
• Time travel & rollback
• ACID compliant
• Branching, merging & tagging
• Data compaction
• Hidden partitioning

Create a table with a partition:
CREATE TABLE employee (
  id BIGINT,
  name STRING,
  dept STRING,
  dob DATE
) PARTITIONED BY (dob);

Time travel based on a timestamp:
SELECT * FROM employee /*+ OPTIONS('as-of-timestamp'='1723566414000') */;

Time travel based on a snapshot id:
SELECT * FROM employee /*+ OPTIONS('snapshot-id'='483890958221556534') */;
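A hedged sketch of the evolution and rollback features listed above, assuming Spark SQL with the Iceberg SQL extensions enabled; the catalog name is illustrative and the snapshot id is the one from the time-travel example.

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE employee ADD COLUMN salary DOUBLE;

-- Partition evolution: change the partition spec going forward; old data stays as written
ALTER TABLE employee ADD PARTITION FIELD bucket(8, id);

-- Rollback: restore the table to an earlier snapshot (demo_catalog is illustrative)
CALL demo_catalog.system.rollback_to_snapshot('db.employee', 483890958221556534);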
Benefits:
• No more data silos - interoperability across different data landscapes
• Avoid data duplication - work with different compute engines on the same data
• Bring your own compute engine
• No more data / vendor lock-in
• Seamless DML operations to adhere to regulations such as GDPR
• Optimized cost & performance
• SQL-database-like feel
The catalog wars:
• Snowflake open sources the Polaris Catalog
• Databricks acquires Tabular, the company founded by the original creators of Iceberg
• Databricks open sources Unity Catalog
And the winner is... Iceberg.
Apache Kafka: a pub-sub messaging system to handle, store and distribute data in real time
• Streams data in real time
• Handles huge volumes of data
• High throughput, low latency & fault tolerance

Apache Flink:
• Unified stream and batch processing
• Highly efficient stream processing engine
• Handles large-scale stateful stream processing with low latency and high throughput
• Can work with many different sources and sinks

Kafka and Flink together can transform a Lakehouse into a Streaming Lakehouse, as sketched below.
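To make "connecting the dots" concrete, here is a minimal, hypothetical Flink SQL pipeline that reads events from a Kafka topic and continuously appends them to an Iceberg table. The topic, broker address, catalog, database and table names are assumptions for illustration, not the talk's demo.

-- Kafka source: events arrive as JSON on an assumed topic
CREATE TABLE orders_src (
  order_id BIGINT,
  amount DOUBLE,
  order_ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Iceberg sink table inside the catalog sketched earlier (demo_catalog is illustrative)
CREATE TABLE demo_catalog.db.orders (
  order_id BIGINT,
  amount DOUBLE,
  order_ts TIMESTAMP(3)
);

-- Continuous streaming insert: Flink keeps committing new Iceberg snapshots
INSERT INTO demo_catalog.db.orders
SELECT order_id, amount, order_ts FROM orders_src;

Note that Iceberg commits are tied to Flink checkpoints, so checkpointing must be enabled for the sink to produce new snapshots.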