Streaming Lakehouse with Kafka, Flink & Iceberg - API Days SG 2025 Talk

Slide deck from my talk at API Days Singapore 2025, titled "Streaming Lakehouse with Kafka, Flink & Iceberg".

Zabeer Farook

April 18, 2025

Transcript

  1. HELLO!! I’m Zabeer Farook, Technical Architect, Credit Agricole CIB

    - Passionate about stream data processing, event-driven architecture, cloud & DevOps.
    - Love travelling & exploring places.
    - https://sg.linkedin.com/in/zabeer-farook
  2. AGENDA

    - 01 Challenges with Data Warehouse & Data Lake
    - 02 Lakehouse to the rescue
    - 03 High level overview of Apache Iceberg
    - 04 Quick look at Kafka & Flink
    - 05 Building a Streaming Lakehouse
    - 06 Take-home Demo followed by Q&A
  3. CHALLENGES WITH DATA WAREHOUSE

    Data Warehouse (1980s)
    - Ingest: Batch; Structured Data
    - Store & Process: Data Warehouse, Data Marts
    - Consume: BI Applications, Dashboards, Reports
    - Challenges: CLOSED ARCHITECTURE, VENDOR LOCK-IN, DATA LOCK-IN, HIGHER COST
  4. CHALLENGES WITH DATA LAKE

    Data Lake (2010)
    - Ingest: Batch; Structured, Semi Structured & Unstructured Data
    - Store & Process: Data Lake (Raw Data, Cleansed Data)
    - Consume: Machine Learning, Batch Analytics, Reports & BI
    - Challenges: DATA GOVERNANCE ISSUES, NO UNIFIED METADATA LAYER, DATA SWAMP, NO ACID GUARANTEES (Atomicity, Consistency, Isolation, Durability)
  5. CHALLENGES WITH HYBRID DATA PLATFORMS

    - DATA SILOS: Data lives in different platforms without interoperability
    - DATA DUPLICATION: Data is copied across platforms so that different engines can process it
    - DATA SYNC ISSUES: Data copies are not always in sync
    - EXPENSIVE: Drives costs higher
  6. LAKEHOUSE - BRIDGING THE GAP

    Data Lakehouse (2020): combines the best of both worlds!
    - Ingest: Batch; Structured, Semi Structured & Unstructured Data
    - Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management
    - Consume: API Layer; AI/ML & Data Science, Batch Analytics, Reports & BI
    - Benefits: Reliability, Performance, ACID Guarantees, Open Architecture, Cost Efficiency, No Lock-in, Interoperability, Data Governance
  7. WHAT ABOUT REAL TIME DATA?

    • Why does real time matter?
      ◦ Data freshness
      ◦ Real time analytics
      ◦ Faster insights
      ◦ Faster decision making
  8. STREAMING LAKEHOUSE

    Data Lakehouse
    - Ingest: Batch; Structured, Semi Structured & Unstructured Data
    - Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management
    - Consume: API Layer; AI/ML & Data Science, Batch Analytics, Reports & BI
    Streaming Lakehouse
    - Ingest: Stream & Batch; Structured, Semi Structured & Unstructured Data
    - Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management
    - Consume: API Layer; AI/ML & Data Science, Batch & Real time Analytics, Reports & BI
    • Real time ingestion layer to ingest data from real time sources like Kafka
    • Stream processing capabilities in the lakehouse with engines like Flink
    • Real time analytics through distributed query engines
    • Real time machine learning use cases
  9. BUILDING A STREAMING LAKEHOUSE

    - Layers: Data Sources, Ingestion, Storage & Processing, Serving, Consumption
    - Diagram labels: Stream, Batch, Structured, Semi Structured & Unstructured Data, Clean, Transform, Aggregate, SQL, Declarative, Storage, Metadata, Catalog, REST CATALOG, API
  10. APACHE ICEBERG & ITS ROLE IN A LAKEHOUSE

    Apache Iceberg is a high performance open table format purpose-built for large scale analytics. It plays the role of the metadata layer in a Lakehouse architecture.
    - Catalog: Tracks the location of a table’s current metadata file
    - Metadata file: Defines a table’s schema, partitioning, snapshot list etc.
    - Snapshot: The state of the table’s data after a write
    - Manifest list: Defines a single snapshot as a list of manifest files, along with stats
    - Manifest file: Contains the location, path and metadata for a list of data files
    - Data file: File containing the data of the table (Parquet, ORC, Avro etc.)
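
The metadata layers on this slide can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `ToyCatalog`, `append` and `scan` are hypothetical names, not the Iceberg API, and real Iceberg persists each layer as files in object storage. The sketch shows a write producing a new snapshot, the catalog pointer moving to new metadata, and a read of either the current or an older snapshot.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str           # location of a Parquet/ORC/Avro file
    record_count: int


@dataclass(frozen=True)
class Manifest:
    data_files: Tuple[DataFile, ...]     # a manifest tracks a list of data files


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]  # one snapshot = one list of manifests


@dataclass(frozen=True)
class TableMetadata:
    schema: Tuple[Tuple[str, str], ...]
    snapshots: Tuple[Snapshot, ...] = ()
    current_snapshot_id: Optional[int] = None


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata."""

    def __init__(self) -> None:
        self._current: Dict[str, TableMetadata] = {}

    def create_table(self, name, schema) -> None:
        self._current[name] = TableMetadata(schema=tuple(schema))

    def load(self, name) -> TableMetadata:
        return self._current[name]

    def append(self, name, files) -> int:
        """Commit a write: old manifests plus one new manifest form a new snapshot."""
        old = self._current[name]
        previous = old.snapshots[-1].manifest_list if old.snapshots else ()
        snap = Snapshot(len(old.snapshots) + 1, previous + (Manifest(tuple(files)),))
        # Build new metadata, then swap the catalog pointer in a single step.
        self._current[name] = replace(
            old, snapshots=old.snapshots + (snap,), current_snapshot_id=snap.snapshot_id
        )
        return snap.snapshot_id


def scan(metadata, snapshot_id=None):
    """Walk metadata -> snapshot -> manifests -> data files; return file paths."""
    wanted = metadata.current_snapshot_id if snapshot_id is None else snapshot_id
    if wanted is None:
        return []  # empty table, nothing committed yet
    snap = next(s for s in metadata.snapshots if s.snapshot_id == wanted)
    return [f.path for m in snap.manifest_list for f in m.data_files]


catalog = ToyCatalog()
catalog.create_table("trades", [("id", "long"), ("amount", "double")])
first = catalog.append("trades", [DataFile("data/a.parquet", 100)])
catalog.append("trades", [DataFile("data/b.parquet", 50)])
table = catalog.load("trades")
print(scan(table))         # files visible in the current snapshot
print(scan(table, first))  # files visible when reading the first snapshot
```

Because every layer is immutable and only the catalog pointer changes, older snapshots stay readable after later writes.
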
  11. APACHE ICEBERG - FEATURES & BENEFITS

    • Expressive SQL
    • Open Specification
    • Schema Evolution
    • Partition Evolution
    • Time Travel & Rollback
    • ACID Compliant
    • Branching, Merging & Tagging
    • Data Compaction
    • Hidden Partitioning
    Image Credits: Starburst
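
Of the features above, hidden partitioning is perhaps the least self-explanatory, so here is a minimal sketch of the idea. It assumes a table partitioned by a `day` transform on a hypothetical `event_ts` column; the function names are illustrative and not the Iceberg API. The only detail taken from Iceberg itself is that the `day` transform yields days since 1970-01-01. Writers and readers only ever deal with `event_ts`; the partition value is derived for them.

```python
from collections import defaultdict
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Partition value derived from the timestamp: days since 1970-01-01."""
    return (ts.date() - EPOCH).days


def write(rows):
    """Group rows by the derived partition value; writers never supply it."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["event_ts"])].append(row)
    return dict(partitions)


def scan(partitions, start: datetime, end: datetime):
    """Filter on event_ts only (start inclusive, end exclusive).

    The partition range is derived from the predicate, so partitions
    outside it are skipped without being read.
    """
    lo, hi = day_transform(start), day_transform(end)
    touched = [d for d in sorted(partitions) if lo <= d <= hi]
    rows = [r for d in touched for r in partitions[d] if start <= r["event_ts"] < end]
    return touched, rows


table = write([
    {"id": 1, "event_ts": datetime(2025, 4, 16, 9, 30)},
    {"id": 2, "event_ts": datetime(2025, 4, 17, 14, 0)},
    {"id": 3, "event_ts": datetime(2025, 4, 18, 8, 15)},
])
touched, rows = scan(table, datetime(2025, 4, 17, 0, 0), datetime(2025, 4, 17, 23, 59))
print(len(touched), [r["id"] for r in rows])
```

The query never mentions a partition column, which is why a table's partitioning can later change without rewriting queries.
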
  12. THE ICE WARS

    - Snowflake open sourced Polaris, an Iceberg catalog
    - Databricks acquired Tabular, a company founded by the original creators of Iceberg, and also open sourced Unity Catalog
    - All major cloud & data platform providers support Iceberg (Confluent Tableflow, AWS S3 Tables, GCP BigQuery Tables etc.)
    - Cloudflare announced R2 Data Catalog (an Iceberg REST catalog) just last week
    And the winner is Iceberg
  13. KAFKA & FLINK IN THE LAKEHOUSE

    Kafka
    • Distributed pub-sub messaging system to handle, store and distribute data in real time
    • Streaming of data in real time
    • Handles huge volumes of data
    • High throughput, low latency & fault tolerance
    Flink
    • Unified stream and batch processing
    • Highly efficient stream processing engine
    • Handles large scale stateful stream processing with low latency and high throughput
    • Can work with multiple different sources and sinks
    Kafka and Flink together can transform a Lakehouse into a streaming lakehouse
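
To make the division of labour on this slide concrete, here is a toy sketch in plain Python. It uses neither the Kafka nor the Flink API: `TopicLog` and `WindowedCounter` are hypothetical stand-ins for an append-only log read by offset and for a stateful operator that counts events per key in tumbling windows. The window size and the currency keys are assumptions made for illustration.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log read by offset, standing in for one topic partition."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, from_offset=0):
        """Return (offset, record) pairs starting at from_offset."""
        return list(enumerate(self._records))[from_offset:]


class WindowedCounter:
    """Stateful consumer: counts events per (window start, key)."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.counts = defaultdict(int)  # operator state
        self.next_offset = 0            # position to resume from

    def poll(self, log: TopicLog) -> None:
        for offset, (ts_ms, key) in log.read(self.next_offset):
            # Tumbling window: align the timestamp down to a window boundary.
            window_start = ts_ms - (ts_ms % self.window_ms)
            self.counts[(window_start, key)] += 1
            self.next_offset = offset + 1


log = TopicLog()
for event in [(1000, "SGD"), (4000, "SGD"), (6000, "EUR")]:
    log.append(event)

counter = WindowedCounter(window_ms=5000)
counter.poll(log)
log.append((9000, "EUR"))
counter.poll(log)  # resumes from the stored offset, so nothing is counted twice
print(dict(counter.counts))
```

Keeping the read position together with the state is the part that matters: a real engine checkpoints both so that processing can resume after a failure.
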
  14. QUICK RECAP

    - DATA LAKEHOUSE: A Data Lakehouse offers a cost effective alternative to a Data Warehouse. It also avoids vendor and data lock-in
    - OPEN TABLE FORMATS: Lakehouses are powered by open table formats like Iceberg, which offer consistent performance and an open architecture
    - ICEBERG: Iceberg’s out of the box features such as Schema Evolution, Partition Evolution, Time Travel etc. reduce operational & maintenance costs
    - STREAMING LAKEHOUSE: A Streaming Lakehouse offers streaming ingestion and processing capabilities to power real time analytics & decision making
    - KAFKA & FLINK: Kafka and Flink bring real time capabilities to a Lakehouse
    - TRINO: Distributed query engines like Trino help integrate a Lakehouse with BI & analytics tools
  15. REMEMBER STREAMING LAKEHOUSE IS POWERFUL, BUT NOT A MAGIC WAND...

    - Data Quality: Data validation checks
    - Security & Compliance: Storage layer security & using a catalog with RBAC policies
    - Maintenance of Iceberg Tables: Compaction of files, snapshot expiration
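
The two maintenance tasks named on this slide can be sketched as plain functions. This is a simplified illustration, not how Iceberg implements them: the function names, the `(snapshot_id, timestamp_ms)` tuples and the greedy grouping are assumptions. Real engines expose these tasks as table maintenance actions with options along the lines of an age cutoff and a minimum number of snapshots to retain.

```python
def expire_snapshots(snapshots, older_than_ms, retain_last=1):
    """Split snapshots into (kept, expired).

    snapshots: list of (snapshot_id, timestamp_ms), ordered oldest to newest.
    A snapshot expires if it is older than the cutoff and is not among the
    retain_last most recent ones.
    """
    protected = {sid for sid, _ in snapshots[-retain_last:]} if retain_last > 0 else set()
    kept, expired = [], []
    for sid, ts in snapshots:
        if ts < older_than_ms and sid not in protected:
            expired.append((sid, ts))
        else:
            kept.append((sid, ts))
    return kept, expired


def plan_compaction(file_sizes, target_size):
    """Greedily group small files so each group stays within target_size.

    Files already at or above target_size are left alone, and groups with a
    single file are dropped because rewriting them would gain nothing.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(s for s in file_sizes if s < target_size):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]


history = [(1, 100), (2, 200), (3, 300)]
kept, expired = expire_snapshots(history, older_than_ms=250, retain_last=1)
print(kept, expired)

# Sizes in MB: frequent streaming commits tend to leave many small files.
print(plan_compaction([10, 20, 30, 200, 50], target_size=64))
```

Streaming ingestion commits often, so both tasks usually need to run on a schedule rather than on demand.
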
  16. THE FUTURE BELONGS TO OPEN DATA ARCHITECTURE

    “What REST did for the web, Apache Iceberg is doing for data architecture - creating open interoperable standards”
  17. Q&A