Life is But a (Data) Stream by Sandon Jacobs (Confluent)

Life is But a (Data) Stream: Building Quality Data Pipelines
Sandon Jacobs, Senior Developer Advocate at Confluent

apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

apidays

May 24, 2025

Transcript

  1. ELT Pipelines Are Brittle, Slow and Inefficient

    Poor decision making with stale data. 5 / 30 / 60 min batch ingestion. Poor lineage and governance and increasing pipeline sprawl. Cascading data pollution and failures. Complex remodelling and reprocessing = $$$. Diagram labels: OPERATIONAL DATA; DATA WAREHOUSE / DATA LAKE; RAW DATA DUMPS; 'JUST-ENOUGH' CLEANSED DATA; READY-TO-USE BUSINESS DATA; ML/AI; Dashboards; Reports. [Figure: timelines showing Batch 1 to Batch 4, each followed by a Process step]
  2. OPERATIONAL ESTATE / ANALYTICAL ESTATE

    Apache Kafka is the standard to connect and organize business data as data streams. Apache Iceberg is the standard for managing tables that feed the analytical estate.
  3. Apache Kafka

    Distributed event streaming platform. Connect to almost anything. Many client libraries. Immutable, replicated data.
  4. Apache Flink

    Stateful processing with windowed operations. Python and JVM libraries. Multiple abstractions (SQL, Table API, DataStream API). Robust connector ecosystem.
  5. Open Table Formats

    Apache Iceberg and Databricks Delta Lake. Standard for storing tabular data. Built on Parquet or ORC. Faster queries. Reliable transactions.
  6. In Summary, Batch Pipelines Pose Significant Challenges

    A giant mess of monolithic point-to-point connections with data fidelity and governance challenges due to batch ingest and duplicative processing at the destination. Challenges: STALE DATA; EXPENSIVE (RE)PROCESSING; MANUAL BREAK FIX; SILOED AND REDUNDANT DATASETS. Diagram labels: Operational Databases and Apps; ELT; ETL; rETL; Raw; Cleansed; Business-ready; Data Warehouse / Data Lake; ML/AI; Reports & Dashboards.
  7. Shift Left to Unlock Faster Data Value for Analytics and AI

    Build your data once, make it trustworthy and use it anywhere by shifting the processing and governance of your data to the source. Benefits: ROI POSITIVE; REAL-TIME; RELIABLE; REUSABLE. Diagram labels: Operational Databases and Apps; CONNECT; STREAM; PROCESS; GOVERN; Universal Data Products; Cleansed; Business-ready; Data Warehouse / Data Lake; Operational Databases, SaaS Apps, Custom Apps, AI Systems…; Microservices; ML/AI; Reports & Dashboards.
  8. Stream Governance: Why Stream Governance Matters

    Kafka: the standard for operational streaming. Flink: the standard for stream processing. Iceberg and Delta Lake: the standard table formats for analytics.
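
Slide 4 mentions "stateful processing with windowed operations" without showing what that means. The following is a minimal sketch in plain Python, not the Flink API and not code from the talk: it counts events per key in fixed, non-overlapping (tumbling) windows, keeping a dictionary as the state a stream processor would maintain between events. The event shape, the function name `tumbling_window_counts`, and the sample data are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape for illustration: (timestamp_in_seconds, key).
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_size: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping windows.

    Each window is identified by its start time. The `counts` dictionary
    plays the role of the state kept between events.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of the window it falls into.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# Made-up sample events, processed one at a time as they "arrive".
events = [
    (0, "orders"),
    (12, "orders"),
    (59, "payments"),
    (60, "orders"),
    (61, "orders"),
    (125, "payments"),
]

result = tumbling_window_counts(events, window_size=60)
for (window_start, key), count in sorted(result.items()):
    print(f"window [{window_start}, {window_start + 60}) {key}: {count}")
```

A real stream processor also has to deal with out-of-order events, state checkpointing, and deciding when a window is complete; this sketch ignores all of that and only shows the windowing arithmetic and the per-key state.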