Streaming Lakehouse with Kafka, Flink & Iceberg - API Days SG 2025 Talk

STREAMING LAKEHOUSE WITH KAFKA, FLINK & ICEBERG
ZABEER FAROOK

HELLO!!
I’m Zabeer Farook, Technical Architect, Credit Agricole CIB
- Passionate about stream data processing, Event Driven Architecture, Cloud & DevOps.
- Love travelling & exploring places
https://sg.linkedin.com/in/zabeer-farook

AGENDA
01 Challenges with Data Warehouse & Data Lake
02 Lakehouse to the rescue
03 High level overview of Apache Iceberg
04 Quick look at Kafka & Flink
05 Building a Streaming Lakehouse
06 Take-home, Demo followed by Q&A

CHALLENGES WITH DATA WAREHOUSE
Data Warehouse (1980s)
• Ingest: Batch, Structured Data
• Store & Process: Data Warehouse, Data Marts
• Consume: BI Applications, Dashboards, Reports
Challenges:
• Closed architecture
• Vendor lock-in
• Data lock-in
• Higher cost

CHALLENGES WITH DATA LAKE
Data Lake (2010)
• Ingest: Batch; Structured, Semi Structured & Unstructured Data
• Store & Process: Data Lake (Raw Data, Cleansed Data)
• Consume: Machine Learning, Batch Analytics, Reports & BI
Challenges:
• Data governance issues
• No unified metadata layer
• Data swamp
• No ACID guarantees (Atomicity, Consistency, Isolation, Durability)

CHALLENGES WITH HYBRID DATA PLATFORMS
• DATA SILOS: Data lives in different platforms without interoperability
• DATA DUPLICATION: Data is copied across platforms so that different engines can process it
• DATA SYNC ISSUES: The copies are not always in sync
• EXPENSIVE: Duplicated storage and processing drive costs higher

LAKEHOUSE - BRIDGING THE GAP
Data Lakehouse (2020)
• Ingest: Batch; Structured, Semi Structured & Unstructured Data
• Store & Process: Data Lake (Raw Data, Cleansed Data)
• Metadata Layer with Data Governance, Indexing and Data Management
• API Layer
• Consume: AI/ML & Data Science, Batch Analytics, Reports & BI
Combines the best of both worlds!
• Reliability
• Performance
• ACID Guarantees
• Open Architecture
• Cost Efficiency
• No Lock-in
• Interoperability
• Data Governance

WHAT ABOUT REAL TIME DATA?
• Why does real time matter?
  ◦ Data Freshness
  ◦ Real time analytics
  ◦ Faster insights
  ◦ Faster decision making

STREAMING LAKEHOUSE
Data Lakehouse
• Ingest: Batch; Structured, Semi Structured & Unstructured Data
• Store & Process: Data Lake (Raw Data, Cleansed Data)
• Metadata Layer with Data Governance, Indexing and Data Management
• API Layer
• Consume: AI/ML & Data Science, Batch Analytics, Reports & BI
Streaming Lakehouse
• Ingest: Stream & Batch; Structured, Semi Structured & Unstructured Data
• Store & Process: Data Lake (Raw Data, Cleansed Data)
• Metadata Layer with Data Governance, Indexing and Data Management
• API Layer
• Consume: AI/ML & Data Science, Batch & Real time Analytics, Reports & BI
What the Streaming Lakehouse adds:
• Real time ingestion layer to ingest data from real time sources like Kafka
• Stream processing capabilities in the lakehouse with engines like Flink
• Real time analytics through distributed query engines
• Real time Machine Learning use cases

BUILDING A STREAMING LAKEHOUSE
Stages: Data Sources, Ingestion, Storage & Processing, Serving, Consumption
• Data: Structured, Semi Structured & Unstructured
• Ingestion: Stream, Batch
• Processing: Clean, Transform, Aggregate (SQL, Declarative)
• Storage, Metadata, Catalog (REST Catalog)
• API

APACHE ICEBERG & ITS ROLE IN A LAKEHOUSE
Apache Iceberg is a high performance open table format purpose-built for large scale analytics. It plays the role of the metadata layer in a Lakehouse architecture.
• Catalog: Tracks the location of a table’s current metadata file
• Metadata file: Defines a table’s schema, partitioning, snapshot list etc.
• Snapshot: The state of the table’s data after a write
• Manifest list: Defines a single snapshot as a list of manifest files, along with stats
• Manifest file: Contains the location and metadata of a list of data files
• Data file: Contains the data of the table (Parquet, ORC, Avro etc.)
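The layers above can be sketched as a toy in-memory model. This is only an illustration of how the pieces point at each other, not Iceberg's API: `ToyCatalog`, `append` and `scan` are hypothetical names, file paths are made up, and real Iceberg stores each layer as files in object storage. For brevity a snapshot holds its manifest list directly as a tuple of manifests.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataFile:
    path: str           # e.g. a Parquet file
    record_count: int


@dataclass(frozen=True)
class ManifestFile:
    path: str
    data_files: tuple   # the data files this manifest tracks


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # manifests that make up the table at this point


@dataclass
class TableMetadata:
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None


class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, schema):
        self._tables[name] = TableMetadata(schema=dict(schema))

    def load(self, name):
        return self._tables[name]

    def append(self, name, files):
        old = self._tables[name]
        previous = ()
        if old.current_snapshot_id is not None:
            current = next(s for s in old.snapshots
                           if s.snapshot_id == old.current_snapshot_id)
            previous = current.manifest_list
        new_id = len(old.snapshots) + 1
        manifest = ManifestFile(f"manifest-{new_id}.avro", tuple(files))
        snapshot = Snapshot(new_id, previous + (manifest,))
        # A write never edits old metadata: it builds new metadata and
        # swaps the catalog pointer in one step.
        self._tables[name] = TableMetadata(
            schema=old.schema,
            snapshots=old.snapshots + [snapshot],
            current_snapshot_id=new_id,
        )
        return new_id


def scan(metadata, snapshot_id=None):
    """List data file paths for the current or a given (older) snapshot."""
    sid = metadata.current_snapshot_id if snapshot_id is None else snapshot_id
    if sid is None:
        return []
    snap = next(s for s in metadata.snapshots if s.snapshot_id == sid)
    return [f.path for m in snap.manifest_list for f in m.data_files]
```

Because old snapshots stay in the metadata, reading an earlier `snapshot_id` with `scan` is the toy equivalent of time travel.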

APACHE ICEBERG - FEATURES & BENEFITS
Image Credits: Starburst
• Expressive SQL
• Open Specification
• Schema Evolution
• Partition Evolution
• Time Travel & Rollback
• ACID Compliant
• Branching, Merging & Tagging
• Data Compaction
• Hidden Partitioning
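Hidden partitioning is the least self-explanatory item in this list, so here is a minimal sketch of the idea: the table derives the partition value from a column via a transform, and readers filter on the column itself. The `day_transform` below mirrors the meaning of Iceberg's `day` transform (days since 1970-01-01); `write`, `plan_scan` and the `event_time` column are hypothetical names for illustration, not Iceberg functions.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Partition value for a UTC timestamp: whole days since the epoch."""
    return (ts - EPOCH).days


def write(rows):
    """Group rows by the derived partition value; writers never set it."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["event_time"])].append(row)
    return partitions


def plan_scan(partitions, start: datetime, end: datetime):
    """Prune partitions using a filter on event_time, not on a partition column."""
    lo, hi = day_transform(start), day_transform(end)
    return [day for day in sorted(partitions) if lo <= day <= hi]
```

A query that filters on `event_time` between two timestamps only touches the partitions returned by `plan_scan`, without the user knowing how the table is partitioned.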

THE ICE WARS
- Snowflake open sourced Polaris, an Iceberg Catalog
- Databricks acquired Tabular, a company founded by the original creators of Iceberg, and also open sourced Unity Catalog
- All major cloud & data platform providers support Iceberg (Confluent Tableflow, AWS S3 Tables, GCP BigQuery Tables etc.)
- Cloudflare announced R2 Data Catalog (an Iceberg REST Catalog) just last week
And the Winner is Iceberg

KAFKA & FLINK IN THE LAKEHOUSE
Kafka
• Distributed pub sub messaging system to handle, store and distribute data in real time
• Streaming of data in real time
• Handles huge volumes of data
• High throughput, low latency & fault tolerance
Flink
• Unified stream and batch processing
• Highly efficient stream processing engine
• Handles large scale stateful stream processing with low latency and high throughput
• Can work with multiple different sources and sinks
Kafka and Flink together can transform a Lakehouse into a Streaming Lakehouse
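To make "stateful stream processing" concrete, here is a minimal plain-Python sketch of a keyed tumbling-window sum, the kind of aggregation a stream processor runs over a topic. It does not use the Flink or Kafka APIs: the event tuples, the one-minute window and the `tumbling_sum` name are assumptions for illustration, and events are assumed to arrive ordered by timestamp (real engines handle out-of-order data with watermarks).

```python
WINDOW_MS = 60_000  # assumed window size: one minute


def window_start(ts_ms: int) -> int:
    """Start of the tumbling window that contains ts_ms."""
    return ts_ms - (ts_ms % WINDOW_MS)


def tumbling_sum(events):
    """Yield (key, window_start, total) for each closed window.

    events: iterable of (key, ts_ms, amount), ordered by ts_ms.
    """
    state = {}  # (key, window_start) -> running total
    for key, ts, amount in events:
        # Emit and drop every window that ended before this event.
        for k, w in sorted(state):
            if w + WINDOW_MS <= ts:
                yield (k, w, state.pop((k, w)))
        slot = (key, window_start(ts))
        state[slot] = state.get(slot, 0) + amount
    # End of input: flush whatever is still open.
    for k, w in sorted(state):
        yield (k, w, state[(k, w)])
```

The `state` dictionary is the part an engine like Flink manages for you at scale, with checkpointing and fault tolerance; the emitted rows are what would be appended to a lakehouse table.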

QUICK RECAP
• DATA LAKEHOUSE: A Data Lakehouse offers a cost effective alternative to a Data Warehouse. It also avoids vendor and data lock-in
• OPEN TABLE FORMATS: Lakehouses are powered by open table formats like Iceberg, which offer consistent performance and an open architecture
• ICEBERG: Iceberg’s out of the box features such as Schema Evolution, Partition Evolution, Time Travel etc. reduce operational & maintenance costs
• STREAMING LAKEHOUSE: A Streaming Lakehouse offers streaming ingestion and processing capabilities to power real time analytics & decision making
• KAFKA & FLINK: Kafka and Flink bring real time capabilities to a Lakehouse
• TRINO: Distributed query engines like Trino help to integrate a Lakehouse with BI & Analytics tools

REMEMBER
STREAMING LAKEHOUSE IS POWERFUL, BUT NOT A MAGIC WAND...
• Data Quality: Data validation checks
• Security & Compliance: Storage layer security & using a catalog with RBAC policies
• Maintenance of Iceberg Tables: Compaction of files, snapshot expiration
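The two maintenance tasks named above come down to simple planning decisions, sketched here in plain Python. Iceberg provides its own maintenance operations for this; the functions below are not that API, just a hedged illustration. The names `plan_compaction` and `expire_snapshots`, the 128 MB target and the tuple shapes are all assumptions.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into rewrite groups of at most target_mb.

    Files already at or above the target are left alone, and groups with a
    single file are dropped because rewriting them gains nothing.
    """
    groups, current = [], []
    for size in sorted(s for s in file_sizes_mb if s < target_mb):
        if current and sum(current) + size > target_mb:
            groups.append(current)
            current = []
        current.append(size)
    groups.append(current)
    return [g for g in groups if len(g) > 1]


def expire_snapshots(snapshots, now_ms, max_age_ms, retain_last=1):
    """Return ids of snapshots older than max_age_ms.

    snapshots: list of (snapshot_id, timestamp_ms).
    The newest retain_last snapshots are always kept.
    """
    ordered = sorted(snapshots, key=lambda s: s[1])
    protected = {sid for sid, _ in ordered[-retain_last:]} if retain_last > 0 else set()
    return [sid for sid, ts in ordered
            if now_ms - ts > max_age_ms and sid not in protected]
```

Streaming writers commit often and produce many small files and snapshots, which is why both tasks matter more in a streaming lakehouse than in a batch one.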

THE FUTURE BELONGS TO OPEN DATA ARCHITECTURE
“What REST did for the web, Apache Iceberg is doing for data architecture - creating open interoperable standards”

DEMO

Q&A