

Your Data is Late Again! Handling Late-Arriving Data in the Modern Data Stack
Event: MDS Fest 3.0
Date: 9 May 2025
Location: Online

When we say we are processing the "latest" data in the data warehouse, which timestamp are we referring to? Late-arriving data can occur in data warehouses for operational and technical reasons that require data corrections, which poses challenges for accurately reporting point-in-time historical dimensions. In this talk, I explore approaches to designing data systems and history-preserving data models that can handle late-arriving data in the modern data stack.


Ong Chin Hwee

May 09, 2025

Transcript

  1. Your Data is Late Again! Handling Late-Arriving Data in the

    Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 9 May 2025 MDS Fest 3.0, Online
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • Aerospace + business + computation @ongchinhwee
  3. Before we begin, let’s set some expectations: @ongchinhwee • The

    insights in this talk are based on my own technical expertise / “war stories” and do not in any way represent the views of my team or employer. ◦ Any resemblance to actual events is purely coincidental.
  4. “Latest” data in the data warehouse / lake ⇒ which

    timestamp is it? @ongchinhwee When the data was last updated in the source? When the data reached the data warehouse / lake? When the data event last happened?
  5. Imagine this scenario in your data team: @ongchinhwee • You

    have a data pipeline that ingests transactions and customer data from the database systems every hour. • Customer data needs to be integrated with payments data for transactions monitoring throughout the day. • There was an outage that caused the updated customer data to arrive 3 hours later than scheduled.
  6. Your data team’s objectives @ongchinhwee • Ensure timely and accurate

    reporting for business operations • Include late-arriving customer data in your monitoring reports • Maintain audit log of data changes for data governance
  7. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  8. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ Valid Time ▪ Also known as actual time or event time ▪ Time when the event occurred in the real world ▪ Mutable depending on the effective period determined by changes in event state @ongchinhwee
  9. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ System Time ▪ Also known as record time (or “transaction time” in databases) ▪ Time when the data for the event is recorded in the system ▪ Immutable by design based on current state of data @ongchinhwee
  10. Concept of time in data - Modern Data Stack @ongchinhwee

    • Fundamental Concepts: System Time vs Valid Time ◦ System Time ▪ For data platforms, multiple systems’ timestamps to consider • Source system time (database, object storage etc.) • Ingestion system time • Data warehouse / lake processing system time
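
To make the distinction concrete, here is a minimal sketch (with hypothetical table and column names) of the different timestamps a single payment event can carry as it moves through the stack:

    -- A minimal sketch (hypothetical table and column names) of the
    -- timestamps one payment event can carry through the stack.
    select
        payment_id,
        event_time,             -- valid time: when the payment happened in the real world
        source_updated_at,      -- system time in the source database
        ingested_at,            -- system time when the ingestion tool landed the row
        warehouse_processed_at  -- system time when the warehouse processed the row
    from raw.payments;
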
  11. Data Latency in Modern Data Stack @ongchinhwee • Time lag

    between event time and system time for data events • Multiple causes: ◦ Latency in upstream systems ◦ Network delays during ingestion to data warehouse / lake ◦ Out-of-order event generation ⇒ Change data capture for timely change tracking
  12. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  13. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Modern approaches ▪ Data snapshots ▪ Incremental models ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  14. @ongchinhwee Anatomy of incremental materialization Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)

    1. Does the table exist? Full refresh or incremental? 2. Condition for incremental load: where to get the load from? Where to insert the incremental load? How to insert/update the incremental load?
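
As a rough illustration of the anatomy above, a dbt incremental model might look like the following sketch (hypothetical model and column names, not the dbt-dummy code itself). The is_incremental() branch supplies the condition for the incremental load, and unique_key controls how incremental rows are inserted or updated:

    -- models/fct_payments.sql: a minimal dbt incremental model sketch
    -- (hypothetical model and column names).
    {{ config(materialized='incremental', unique_key='payment_id') }}

    select
        payment_id,
        customer_id,
        amount,
        event_time,
        updated_at
    from {{ ref('stg_payments') }}

    {% if is_incremental() %}
    -- Condition for the incremental load: only rows updated after the
    -- current maximum already present in the destination table.
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}
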
  15. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value @ongchinhwee
  16. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data @ongchinhwee
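
A common way to implement Type 2 SCDs in the modern data stack is a dbt snapshot, which records a validity period for each row version via dbt_valid_from / dbt_valid_to columns. A minimal sketch, assuming a hypothetical crm.customers source with an updated_at column:

    -- snapshots/customers_snapshot.sql: a minimal Type 2 SCD sketch using a
    -- dbt snapshot (hypothetical source and column names).
    {% snapshot customers_snapshot %}

    {{
        config(
            target_schema='snapshots',
            unique_key='customer_id',
            strategy='timestamp',
            updated_at='updated_at'
        )
    }}

    -- dbt adds dbt_valid_from / dbt_valid_to to each row version,
    -- implementing the "validity period" for Type 2 SCDs.
    select * from {{ source('crm', 'customers') }}

    {% endsnapshot %}
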
  17. First Line of Defense: Preventing Downstream Impact @ongchinhwee • Objective:

    Stop incomplete / inconsistent data from flowing into destination tables • Preventive approaches we can adopt: ◦ Data reconciliation ◦ Data integrity checkpoints
  18. Data Reconciliation • Comparing data from different sources to ensure:

    ◦ Accuracy: values remain correct ◦ Consistency: data between sources matches ◦ Completeness: all records have been transferred @ongchinhwee
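
As a minimal sketch of a completeness check (hypothetical table names), the following query compares daily row counts between a raw landing table and the transformed destination table, and surfaces the days where records are missing:

    -- A minimal completeness check sketch (hypothetical table names):
    -- compare daily row counts between the raw landing table and the
    -- transformed destination table.
    with source_counts as (
        select date(event_time) as event_date, count(*) as source_rows
        from raw.payments
        group by 1
    ),
    destination_counts as (
        select date(event_time) as event_date, count(*) as destination_rows
        from analytics.fct_payments
        group by 1
    )
    select
        s.event_date,
        s.source_rows,
        coalesce(d.destination_rows, 0) as destination_rows,
        s.source_rows - coalesce(d.destination_rows, 0) as missing_rows
    from source_counts s
    left join destination_counts d using (event_date)
    where s.source_rows <> coalesce(d.destination_rows, 0);
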
  19. Data Reconciliation • Handling late-arriving dimensions in incremental load /

    stream ◦ Streaming Reconciliation Pattern ▪ Identification ▪ Reconciliation ▪ Retryable stream (Ref: “How to Handle Late Arriving Dimensions with a Streaming Reconciliation Pattern” in Databricks blog) @ongchinhwee
  20. @ongchinhwee (Ref: “How to Handle Late Arriving Dimensions with a

    Streaming Reconciliation Pattern” in Databricks blog)
  21. Data Integrity Checkpoints • Data validation checks to detect data

    quality issues from late data: ◦ Referential integrity across different data sources ◦ Accepted values to detect deviation from expected values ◦ Missing or null values for late-arriving / missing dimensions after table joins ◦ Data freshness checks for tables in staging area @ongchinhwee
  22. Data Integrity Checkpoints • How to implement data integrity checkpoints?

    ◦ Use data testing frameworks / tools to define validation criteria ▪ Examples: dbt (data build tool) data tests, Great Expectations ◦ Automate validation checks with data pipeline orchestration ▪ Schedule validation checks after ingestion + transformation ▪ Examples: Airflow, Dagster @ongchinhwee
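
For example, a dbt singular test is simply a query that fails if it returns any rows; the sketch below (hypothetical model and column names) flags transactions whose customer dimension has not arrived yet after the join:

    -- tests/assert_no_missing_customer_dimension.sql: a dbt singular test
    -- sketch (hypothetical model and column names). The test fails if any
    -- transaction is left without a matching customer after the join,
    -- e.g. because the customer dimension arrived late.
    select t.payment_id
    from {{ ref('fct_payments') }} t
    left join {{ ref('dim_customers') }} c
        on t.customer_id = c.customer_id
    where c.customer_id is null
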
  23. Incremental Strategy for Late-arriving Data • Is the table a

    fact or dimension? • When can we expect late data to arrive (in most cases)? • Lookback “window” approach to handle late-arriving data ◦ Setting the maximum delay allowed to include late data in the incremental data stream ◦ Performance vs Accuracy trade-off @ongchinhwee
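
A minimal sketch of the lookback "window" approach in a dbt incremental model, assuming hypothetical names and a 3-day maximum accepted delay; a wider window is more accurate but re-scans more data on each run:

    -- A lookback "window" sketch in a dbt incremental model (hypothetical
    -- names; the 3-day maximum accepted delay is an assumed threshold).
    {{ config(materialized='incremental', unique_key='payment_id') }}

    select *
    from {{ ref('stg_payments') }}

    {% if is_incremental() %}
    -- Re-scan a 3-day lookback window so facts arriving up to 3 days late
    -- are still picked up and upserted on unique_key.
    -- (Date arithmetic syntax varies by warehouse.)
    where event_time >= (select max(event_time) from {{ this }}) - interval '3' day
    {% endif %}
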
  24. Case: Incremental Strategy for Late-arriving Facts • Example: ◦ Payment

    transactions stuck in intermediate state for more than a day ◦ Average traffic: ~100 billion transactions / day ◦ Event resolution time: ~3 days w/ outliers @ongchinhwee
  25. Case: Incremental Strategy for Late-arriving Facts • Design Question to

    Consider ◦ Accuracy vs Performance ▪ Choose accuracy? • Scan a wider date range / more date partitions on the source table to update the data in the destination table ▪ Choose performance? • Small possibility of data loss from late arrivals outside the lookback “window” @ongchinhwee
  26. Case: Incremental Strategy for Late-arriving Dimensions • Example: ◦ Customer

    information for new customers arrives > 1 day later than transactions ◦ 5000 customer records (out of 3 million customers) impacted ◦ We need to join both sets of data for transactions monitoring @ongchinhwee
  27. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ Slowly-Changing Dimensions - which timestamp to use? ▪ Use valid time (event timestamp)? • Dimension data may arrive much later (~1 year?) in system ⇒ longer lookback when scanning dimension table. ▪ Use system time (system updated_at timestamp)? • System time != real world events e.g. late customer onboarding @ongchinhwee
  28. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ How to handle late-arriving dimensions in data streams? ▪ Custom incremental logic • Design objective: performant and “close enough” data • Event lookback “window” on early-arriving facts to handle incremental updates on dimensions @ongchinhwee
  29. Lookback “window” w/ maximum accepted delay of 3 days for

    facts and late-arriving dimensions @ongchinhwee
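
One possible shape for that custom incremental logic, sketched with hypothetical names and the 3-day window above: re-process the last 3 days of early-arriving facts on every run, so that rows which originally landed without a customer match get re-joined once the late dimension arrives.

    -- Custom incremental logic sketch for late-arriving dimensions
    -- (hypothetical names; 3-day window assumed as above).
    {{ config(materialized='incremental', unique_key='payment_id') }}

    select
        p.payment_id,
        p.event_time,
        p.amount,
        p.customer_id,
        c.customer_segment
    from {{ ref('stg_payments') }} p
    left join {{ ref('dim_customers') }} c
        on p.customer_id = c.customer_id

    {% if is_incremental() %}
    -- Re-process the last 3 days of facts so rows that first landed with a
    -- null customer are re-joined once the late dimension arrives.
    where p.event_time >= current_date - interval '3' day
    {% endif %}
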
  30. What if we want our data to be eventually accurate

    in data warehouse / data lake? @ongchinhwee
  31. What if we want our data to be eventually accurate

    in data warehouse / data lake? Do a periodic batch refresh to rebuild your tables! @ongchinhwee
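
In dbt this is commonly a scheduled run with the --full-refresh flag, which makes is_incremental() evaluate to false and rebuilds the table from the full source history. As a warehouse-agnostic sketch (hypothetical names), the periodic rebuild amounts to:

    -- A minimal sketch of a periodic full rebuild (hypothetical names),
    -- scheduled e.g. weekly: rebuild the destination table from the full
    -- source history so rows missed by the lookback window are eventually
    -- included.
    create or replace table analytics.fct_payments as
    select
        payment_id,
        customer_id,
        amount,
        event_time
    from staging.payments;
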
  32. Key Takeaways • Design proactive systems and modeling strategies for

    late-arriving data to ensure data quality in data platforms • Manage accuracy vs performance tradeoffs with lookback “window” thresholds and periodic batch refreshes @ongchinhwee
  33. Reach out to me! : ongchinhwee : ongchinhwee.me : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get these slides here: https://bit.ly/mdsfest-late-arriving-data @ongchinhwee