Your Data is Late Again! Handling Late-Arriving Data in the Modern Data Stack
Event: FOSSASIA Summit 2025
Date: 13 March 2025
Location: True Digital Park West, Bangkok, Thailand

Talk Recording from FOSSASIA: https://youtu.be/zZhmoGs-Biw

When we say we are processing the "latest" data in the data warehouse, which timestamp are we referring to? Late-arriving data can occur in data warehouses for operational and technical reasons that may require data corrections, which poses challenges for accurate reporting of point-in-time historical dimensions. In this talk, I explore approaches to designing data systems and history-preserving data models that can handle late-arriving data in the modern data stack.

Ong Chin Hwee

March 13, 2025

Transcript

  1. Your Data is Late Again! Handling Late-Arriving Data in the

    Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 13 March 2025 FOSSASIA Summit 2025, Bangkok, Thailand
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee
  3. Before we begin, let’s set some expectations: @ongchinhwee • The

    insights in this talk are based on my own technical expertise / “war stories” and do not in any way represent the views of my team and employer. ◦ Any resemblance to actual events is purely coincidental. • Prerequisite: the concept of time in data
  4. “Latest” data in the data warehouse / lake ⇒ which

    timestamp is it? @ongchinhwee When the data was last updated in source? When the data reaches the data warehouse / lake? When the data event last happened?
  5. Imagine this scenario in your data team: @ongchinhwee • You

    have a data pipeline that ingests transactions and customer data from the database systems every hour. • Customer data needs to be integrated with payments data for transactions monitoring throughout the day. • An outage caused the updated customer data to arrive 3 hours later than scheduled.
  6. Your data team’s objectives @ongchinhwee • Ensure timely and accurate

    reporting for business operations • Include late-arriving customer data in your monitoring reports • Maintain audit log of data changes for data governance
  7. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  8. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ Valid Time ▪ Also known as actual time or event time ▪ Time when the event occurred in the real world ▪ Mutable depending on effective period determined by changes in event state @ongchinhwee
  9. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ System Time ▪ Also known as record time (or “transaction time” in database) ▪ Time when the data for the event is recorded in the system ▪ Immutable by design based on current state of data @ongchinhwee
  10. Concept of time in data - Modern Data Stack @ongchinhwee

    • Fundamental Concepts: System Time vs Valid Time ◦ System Time ▪ For data platforms, multiple systems’ timestamps to consider • Source system time (database, object storage etc.) • Ingestion system time • Data warehouse / lake processing system time
  11. Data Latency in Modern Data Stack @ongchinhwee • Time lag

    between event time vs system time for data events • Multiple causes: ◦ Latency in upstream systems ◦ Network delays during ingestion to data warehouse / lake ◦ Out-of-order event generation ⇒ Change data capture for timely change tracking
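As a rough illustration of measuring this lag (not from the talk): the sketch below compares event time against warehouse load time for recent records. The table and column names (raw_payment_events, event_at, loaded_at) are assumptions, and the date_diff() / count_if() syntax shown is Trino/Presto-style and varies by warehouse.

```sql
-- Sketch: quantify event-time vs system-time lag for the latest day of loads.
-- raw_payment_events, event_at and loaded_at are illustrative names;
-- date_diff() / count_if() are Trino/Presto-style functions.
select
    avg(date_diff('minute', event_at, loaded_at)) as avg_lag_minutes,
    max(date_diff('minute', event_at, loaded_at)) as max_lag_minutes,
    count_if(date_diff('hour', event_at, loaded_at) > 3) as rows_later_than_3h
from raw_payment_events
where loaded_at >= current_date - interval '1' day;
```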
  12. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  13. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Modern approaches ▪ Data snapshots ▪ Incremental models ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  14. @ongchinhwee Anatomy of incremental materialization Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)

    1. Table exists? Full-refresh or incremental 2. Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?
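For reference, a minimal dbt incremental model following this anatomy might look like the sketch below; the model, source and column names are illustrative and not taken from the dbt-dummy project.

```sql
-- models/fct_transactions.sql (dbt) -- minimal incremental model sketch.
{{ config(
    materialized='incremental',     -- 1. table exists? dbt picks full-refresh vs incremental
    unique_key='transaction_id',    --    how/where to insert or update the incremental load
    incremental_strategy='merge'
) }}

select
    transaction_id,
    customer_id,
    amount,
    event_at,
    updated_at
from {{ source('payments', 'transactions') }}   -- where to get the load from

{% if is_incremental() %}
  -- 2. condition for incremental load: only rows updated after the latest
  --    record already in the destination table ({{ this }})
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```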
  15. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value @ongchinhwee
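Since a Type 1 update is simply an in-place overwrite, a plain SQL MERGE like the sketch below is one common way to express it; dim_customer and stg_customer_updates are illustrative names.

```sql
-- Type 1 SCD sketch: overwrite the previous attribute values, keeping no history.
merge into dim_customer d
using stg_customer_updates s
    on d.customer_id = s.customer_id
when matched then update set
    customer_status = s.customer_status,
    income_band     = s.income_band,
    updated_at      = s.updated_at
when not matched then insert
    (customer_id, customer_status, income_band, updated_at)
    values (s.customer_id, s.customer_status, s.income_band, s.updated_at);
```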
  16. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator ◦ Commonly implemented using tuple versioning @ongchinhwee
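In the modern data stack, dbt snapshots are one common way to get this tuple-versioning behaviour: dbt maintains dbt_valid_from / dbt_valid_to columns for each row version. A minimal sketch, with illustrative source and key names:

```sql
-- snapshots/dim_customer_snapshot.sql (dbt) -- Type 2 SCD via a dbt snapshot.
-- dbt adds dbt_valid_from / dbt_valid_to (null = current row) for each version.
{% snapshot dim_customer_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```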
  17. First Line of Defense: Preventing Downstream Impact @ongchinhwee • Objective:

    Stop incomplete / inconsistent data from flowing into destination tables • Preventive approaches we can adopt: ◦ Data reconciliation ◦ Data integrity checkpoints
  18. Data Reconciliation • Comparing data from different sources to ensure:

    ◦ Accuracy: values remain correct ◦ Consistency: data between sources matches ◦ Completeness: all records have been transferred @ongchinhwee
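As an example, a completeness check can be as simple as comparing row counts per load date between the source extract and the warehouse table; the sketch below uses illustrative table and column names.

```sql
-- Completeness check sketch: flag load dates where the warehouse row count
-- does not match what the source reported. Names are illustrative.
select
    s.load_date,
    s.source_row_count,
    coalesce(w.row_count, 0) as warehouse_row_count
from source_extract_counts s
left join (
    select cast(ingested_at as date) as load_date, count(*) as row_count
    from fct_transactions
    group by 1
) w
    on w.load_date = s.load_date
where coalesce(w.row_count, 0) <> s.source_row_count;
```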
  19. Data Reconciliation • Handling late-arriving dimensions in incremental load /

    stream ◦ Streaming Reconciliation Pattern ▪ Identification: Separate processing workflows for joined vs unjoined event records in data stream ▪ Reconciliation: Batch job to deduplicate new unjoined event records + process unjoined events into a reconciliation table for retrying (within retry threshold) ▪ Retryable stream: Integrate unjoined records w/ next data stream for processing (Ref: “How to Handle Late Arriving Dimensions with a Streaming Reconciliation Pattern” in Databricks blog) @ongchinhwee
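A batch-SQL sketch of the reconciliation and retry steps is shown below. This is not the code from the referenced Databricks post; the table names and the 3-day retry threshold are assumptions.

```sql
-- Reconciliation batch sketch: retry previously unjoined events against the
-- latest dimension rows, within an assumed 3-day retry threshold.
insert into fct_transactions_enriched
select e.transaction_id, e.event_at, e.amount, c.customer_id, c.customer_segment
from reconciliation_unjoined_events e
join dim_customer c
    on c.customer_id = e.customer_id
where e.first_seen_at >= current_timestamp - interval '3' day;

-- Remove the events that were successfully joined above; anything left stays
-- in the reconciliation table for the next retry.
delete from reconciliation_unjoined_events
where customer_id in (select customer_id from dim_customer);
```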
  20. Data Integrity Checkpoints • Data validation checks to detect data

    quality issues from late data: ◦ Referential integrity across different data sources ◦ Accepted values to detect deviation from expected values ◦ Missing or null values for late-arriving / missing dimensions after table joins ◦ Data freshness checks for tables in staging area @ongchinhwee
  21. Data Integrity Checkpoints • How to implement data integrity checkpoints?

    ◦ Use data testing frameworks / tools to define validation criteria ▪ Examples: dbt (data build tool) data tests, Great Expectations ◦ Automate validation checks with data pipeline orchestration ▪ Schedule validation checks after ingestion + transformation ▪ Examples: Airflow, Dagster @ongchinhwee
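For example, a singular dbt data test is just a SQL file that returns failing rows; the sketch below (illustrative model names) flags transactions whose customer dimension has not arrived yet.

```sql
-- tests/assert_no_missing_customer_dimension.sql (dbt singular data test)
-- The test fails if any rows are returned, i.e. if a transaction still has
-- no matching customer row after the join. Model names are illustrative.
select
    t.transaction_id,
    t.customer_id
from {{ ref('fct_transactions') }} t
left join {{ ref('dim_customer') }} c
    on c.customer_id = t.customer_id
where c.customer_id is null
```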
  22. Incremental Strategy for Late-arriving Data • Is the table a

    fact or dimension? • When can we expect late data to arrive (in most cases)? • Lookback “window” approach to handle late-arriving data ◦ Setting the maximum delay allowed to include late data in the incremental data stream ◦ Performance vs Accuracy trade-off @ongchinhwee
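One way to express the lookback “window” in a dbt incremental model is sketched below: each run reprocesses the last few days of source data so that late rows inside the window are still merged in. The variable name, the 3-day default and the date arithmetic (which varies by warehouse dialect) are assumptions.

```sql
-- models/fct_transactions.sql (dbt) -- incremental model with a lookback window.
-- lookback_days and the date arithmetic style are illustrative assumptions.
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

select *
from {{ source('payments', 'transactions') }}

{% if is_incremental() %}
  -- Reprocess everything inside the lookback window so that late-arriving rows
  -- within the maximum accepted delay are still picked up and merged.
  where event_date >= (
      select max(event_date) - {{ var('lookback_days', 3) }}
      from {{ this }}
  )
{% endif %}
```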
  23. Case: Incremental Strategy for Late-arriving Facts • Example: ◦ Payment

    transactions stuck in intermediate state for more than a day ◦ Average traffic: ~100 billion transactions / day ◦ Event resolution time: ~3 days w/ outliers @ongchinhwee
  24. Case: Incremental Strategy for Late-arriving Facts • Design Question to

    Consider ◦ Accuracy vs Performance ▪ Choose accuracy? • Scan a wider date range / more date partitions on the source table to update the data in the destination table ▪ Choose performance? • Accept a small possibility of data loss from late arrivals outside the lookback “window” @ongchinhwee
  25. Case: Incremental Strategy for Late-arriving Dimensions • Example: ◦ Customer

    information for new customers arrives > 1 day later than transactions ◦ 5000 customer records (out of 3 million customers) impacted ◦ We need to join both sets of data for transactions monitoring @ongchinhwee
  26. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ Slowly-Changing Dimensions - which timestamp to use? ▪ Use valid time (event timestamp)? • Dimension data may arrive much later (~1 year?) in system ⇒ longer lookback when scanning dimension table. ▪ Use system time (system updated_at timestamp)? • System time != real world events e.g. late customer onboarding @ongchinhwee
  27. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ How to handle late-arriving dimensions in data streams? ▪ Custom incremental logic • Design objective: performant and “close enough” data • Event lookback “window” on early-arriving facts to handle incremental updates on dimensions @ongchinhwee
  28. Lookback “window” w/ maximum accepted delay of 3 days for

    facts and late-arriving dimensions @ongchinhwee
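A sketch of what this can look like on the dimension side (illustrative names, Snowflake/Postgres-style date arithmetic): each incremental run re-joins the last 3 days of early-arriving facts against the latest dimension rows, so customer attributes that arrive within the accepted delay are filled in on a later run.

```sql
-- Sketch: re-join the last 3 days of facts against the latest dimension rows
-- on every incremental run, so dimensions arriving within the 3-day window
-- eventually populate the destination table. Names are illustrative.
{{ config(materialized='incremental', unique_key='transaction_id') }}

with recent_facts as (
    select *
    from {{ ref('fct_transactions') }}
    {% if is_incremental() %}
    where event_date >= current_date - 3   -- maximum accepted delay: 3 days
    {% endif %}
)

select
    f.transaction_id,
    f.event_date,
    f.amount,
    f.customer_id,
    c.customer_segment          -- stays null until the dimension row arrives
from recent_facts f
left join {{ ref('dim_customer') }} c
    on c.customer_id = f.customer_id
```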
  29. What if we want our data to be eventually accurate

    in data warehouse / data lake? @ongchinhwee
  30. What if we want our data to be eventually accurate

    in data warehouse / data lake? @ongchinhwee Do a periodic batch refresh to rebuild your tables!
  31. What if we need to rewind history on the state

    of the data warehouse / data lake? @ongchinhwee
  32. Imagine this scenario in your data team: @ongchinhwee • On

    1 August 2024, the audit team requests a review of trade valuations for 30 June 2024 as of 1 July 2024. • Due to late-arriving market information, a retroactive update was made on 15 July 2024 to one of the trade valuation records for 30 June 2024. • However, you need to provide the trade valuations based on what was known on 1 July 2024.
  33. Bitemporal models for system versioning • Why bitemporal models? ◦

    Retroactive changes to valid-time tuple versioning models ▪ Usually requires full-refresh → overwrites event history! ◦ Preserve history over time with time-travel for data audit and governance @ongchinhwee
  34. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models @ongchinhwee
  35. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning @ongchinhwee
  36. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning ◦ Change data capture (CDC) streams on event-aware data ▪ Capture DML changes on data records with “valid time ranges” @ongchinhwee
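A minimal sketch of such a history table, together with an “as of” query for the audit scenario from earlier (table and column names are illustrative): valid time records the business date a valuation applies to, while system time records when the warehouse knew about it.

```sql
-- Bitemporal history table sketch (illustrative names).
-- valid_date              = valid (event) time: the business date the valuation applies to
-- system_from / system_to = system time: when this version was known to the warehouse
create table trade_valuation_history (
    trade_id         varchar,
    valuation_amount decimal(18, 2),
    valid_date       date,
    system_from      timestamp,
    system_to        timestamp      -- null means this is the current version
);

-- Audit query: trade valuations for 30 June 2024 as they were known on 1 July 2024,
-- ignoring the retroactive correction recorded on 15 July 2024.
select trade_id, valuation_amount
from trade_valuation_history
where valid_date = date '2024-06-30'
  and system_from <= timestamp '2024-07-01 00:00:00'
  and (system_to is null or system_to > timestamp '2024-07-01 00:00:00');
```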
  37. Key Takeaways • Design proactive systems and modeling strategies for

    late-arriving data to ensure data quality in data platforms • Manage accuracy vs performance tradeoffs with lookback “window” thresholds and periodic batch refreshes @ongchinhwee
  38. Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get these slides here: https://bit.ly/fossasia-late-arriving-data @ongchinhwee