Agenda (KNOW, DISCOVER, DO)
• Data Warehouse vs Data Lake vs Lakehouse
• Streaming Lakehouse
• Open Table Formats
• Apache Iceberg - Features, Benefits & Challenges
• Iceberg vs other open table formats
• Connecting the dots with Kafka & Flink
• Demo with Kafka, Flink & Iceberg in Action
• Q&A
The building blocks of the stack:
• Storage - storage infra (file system, HDFS, object storage)
• File Format - format of the data (CSV, Avro, Parquet, ORC etc.)
• Metadata Layer (table format) on top of the file format - laying out data, maintenance, optimization etc.
• Compute Engine - runs user workloads to process the data
• Catalog - dictionary to discover table metadata
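As a rough sketch of how these layers fit together, the hypothetical Flink SQL snippet below maps each layer to a concrete setting; the catalog, database, bucket path and table names are illustrative, not from the demo.

-- Catalog: registers tables and points to their metadata (a Hadoop catalog is assumed here)
CREATE CATALOG demo_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 's3a://demo-bucket/warehouse'   -- Storage: object storage path (illustrative)
);
CREATE DATABASE IF NOT EXISTS demo_catalog.db;

-- Metadata layer: the Iceberg table format sits on top of the file format
CREATE TABLE demo_catalog.db.events (
  id BIGINT,
  payload STRING
) WITH (
  'write.format.default' = 'parquet'            -- File format of the underlying data files
);

-- Compute engine: Flink runs the user workload against the table
SELECT count(*) FROM demo_catalog.db.events;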
Data Warehouse: a database for BI and reporting
• Only structured data
• Schema on write
• ACID guarantees
• Effective data governance
• Storage & compute tightly coupled
• No native support for ML workloads
• High cost
e.g. Teradata, Oracle Exadata (legacy EDW), OLAP Cubes
Cloud Data Warehouses (since 2010) like Redshift, BigQuery and Snowflake separate storage and compute, and also support unstructured data and ML workloads.
Data Lake
• Structured & unstructured data
• Schema on read
• Hive table format
• Storage and compute decoupling
• Open data formats like CSV, Avro, Parquet, ORC
• Lower cost
• Supports ML use cases
• No metadata layer, no ACID support
A Data Lake is often used in conjunction with a Data Warehouse: raw data is stored in the lake and further cleansed and aggregated in the warehouse.
Started with Hadoop MapReduce for compute and HDFS as storage; evolved with cloud object storage (S3, ADLS, GCS) and query engines (Spark, Presto).
Lakehouse: a term made popular by Databricks
• Metadata layer with open table formats like Hudi, Delta Lake, Iceberg
• Cost efficient
• ACID guarantees
• Schema evolution
• Open architecture
• Faster queries
Combines the best of both worlds! A Lakehouse can also double up as both a data lake and a warehouse.
Limitations of the Hive Table Format:
• Invisible specification
• Schema evolution & partition evolution need data rewrites
• Metadata and data often not in sync
• No transactional guarantees
• No time travel & rollback
Lakehouse open table formats - Hudi, Delta Lake and Iceberg - solve most of these limitations.
Apache XTable (incubating) provides cross-table, omni-directional interoperability between the lakehouse table formats.
Apache Paimon is a recent top-level Apache project optimized for stream processing in the Lakehouse.
Apache Iceberg is an open table format purpose-built for large-scale analytics. It brings the reliability and simplicity of SQL tables to big data while making it possible for multiple engines like Spark, Trino, PrestoDB, Flink, Hive etc. to work with the same tables.
• 2017 - Created at Netflix by Ryan Blue and Daniel Weeks
• 2018 - Open-sourced and donated to the Apache Software Foundation
• Overcomes performance, consistency and many other challenges of the Hive table format
Iceberg table metadata components:
• Catalog: points to a table's current metadata file
• Metadata file: defines a table's structure - schema, partition scheme, snapshot list etc.
• Snapshot: the state of the table's data after a write
• Manifest file: contains location, path and metadata about a list of data files
• Manifest list: defines a single snapshot as a list of manifest files along with stats
• Data file: file containing the data of the table (Parquet, ORC, Avro etc.)
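Iceberg exposes this metadata tree through queryable metadata tables. A minimal sketch using Spark SQL syntax follows; the catalog and table names are illustrative (the employee table is the one created on the next slide).

-- Snapshots: one row per snapshot, created after each write
SELECT snapshot_id, committed_at, operation FROM demo_catalog.db.employee.snapshots;

-- Manifest files referenced by the current snapshot
SELECT path, added_data_files_count FROM demo_catalog.db.employee.manifests;

-- Data files, with the per-file stats tracked in the manifests
SELECT file_path, record_count, file_size_in_bytes FROM demo_catalog.db.employee.files;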
Key features:
• Schema evolution
• Partition evolution
• Time travel & rollback
• ACID compliant
• Branching, merging & tagging
• Data compaction
• Hidden partitioning

Create a table with a partition:
CREATE TABLE employee (
  id BIGINT,
  name STRING,
  dept STRING,
  dob DATE
) PARTITIONED BY (dob);

Time travel based on a timestamp:
SELECT * FROM employee /*+ OPTIONS('as-of-timestamp'='1723566414000') */;

Time travel based on a snapshot id:
SELECT * FROM employee /*+ OPTIONS('snapshot-id'='483890958221556534') */;
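A hedged sketch of the evolution and rollback features listed above, assuming Spark SQL with the Iceberg SQL extensions enabled; the catalog name is illustrative and the snapshot id is the one from the time-travel example.

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE employee ADD COLUMN salary DOUBLE;

-- Partition evolution: change the partition spec going forward; old data stays as written
ALTER TABLE employee ADD PARTITION FIELD bucket(8, id);

-- Rollback: restore the table to an earlier snapshot (demo_catalog is illustrative)
CALL demo_catalog.system.rollback_to_snapshot('db.employee', 483890958221556534);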
Benefits:
• No more data silos - interoperability across different data landscapes
• Avoid data duplication - work with different compute engines on the same data
• Bring your own compute engine
• No more data / vendor lock-in
• Seamless DML operations to adhere to regulations such as GDPR
• Optimized cost & performance
• SQL-database-like feel
The catalog wars:
• Snowflake open sources the Polaris Catalog
• Databricks acquires Tabular, the company founded by the original creators of Iceberg
• Databricks open sources Unity Catalog
And the winner is... Iceberg.
Apache Kafka: a pub-sub messaging system to handle, store and distribute data in real time
• Streams data in real time
• Handles huge volumes of data
• High throughput, low latency & fault tolerance

Apache Flink:
• Unified stream and batch processing
• Highly efficient stream processing engine
• Handles large-scale stateful stream processing with low latency and high throughput
• Can work with many different sources and sinks

Kafka and Flink together can transform a Lakehouse into a Streaming Lakehouse, as sketched below.
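To make "connecting the dots" concrete, here is a minimal, hypothetical Flink SQL pipeline that reads events from a Kafka topic and continuously appends them to an Iceberg table. The topic, broker address, catalog, database and table names are assumptions for illustration, not the talk's demo.

-- Kafka source: events arrive as JSON on an assumed topic
CREATE TABLE orders_src (
  order_id BIGINT,
  amount DOUBLE,
  order_ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Iceberg sink table inside the catalog sketched earlier (demo_catalog is illustrative)
CREATE TABLE demo_catalog.db.orders (
  order_id BIGINT,
  amount DOUBLE,
  order_ts TIMESTAMP(3)
);

-- Continuous streaming insert: Flink keeps committing new Iceberg snapshots
INSERT INTO demo_catalog.db.orders
SELECT order_id, amount, order_ts FROM orders_src;

Note that Iceberg commits are tied to Flink checkpoints, so checkpointing must be enabled for the sink to produce new snapshots.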