Your Data is Late Again! Handling Late-Arriving Data in the Modern Data Stack
Event: FOSSASIA Summit 2025
Date: 13 March 2025
Location: True Digital Park West, Bangkok, Thailand

Talk Recording from FOSSASIA: https://youtu.be/zZhmoGs-Biw

When we say we are processing the "latest" data in the data warehouse, which timestamp are we referring to? Late-arriving data can occur in data warehouses for operational and technical reasons that may require data corrections, which poses challenges for accurate reporting of point-in-time historical dimensions. In this talk, I explore approaches to designing data systems and history-preserving data models that can handle late-arriving data in the modern data stack.

Ong Chin Hwee

March 13, 2025

Transcript

  1. Your Data is Late Again! Handling Late-Arriving Data in the

    Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 13 March 2025 FOSSASIA Summit 2025, Bangkok, Thailand
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee
  3. Before we begin, let’s set some expectations: @ongchinhwee • The

    insights in this talk are based on my own technical expertise / “war stories” and do not in any way represent the views of my team and employer. ◦ Any resemblance to actual events is purely coincidental. • Prerequisite: the concept of time in data
  4. “Latest” data in the data warehouse / lake ⇒ which

    timestamp is it? @ongchinhwee When the data was last updated in source? When the data reaches the data warehouse / lake? When the data event last happened?
  5. Imagine this scenario in your data team: @ongchinhwee • You

    have a data pipeline that ingests transactions and customer data from the database systems every hour. • Customer data needs to be integrated with payments data for transactions monitoring throughout the day. • An outage caused the updated customer data to arrive 3 hours later than scheduled.
  6. Your data team’s objectives @ongchinhwee • Ensure timely and accurate

    reporting for business operations • Include late-arriving customer data in your monitoring reports • Maintain audit log of data changes for data governance
  7. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  8. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ Valid Time ▪ Also known as actual time or event time ▪ Time when the event occurred in the real world ▪ Mutable depending on effective period determined by changes in event state @ongchinhwee
  9. Concept of time in data • Fundamental Concepts: System Time

    vs Valid Time ◦ System Time ▪ Also known as record time (or “transaction time” in database) ▪ Time when the data for the event is recorded in the system ▪ Immutable by design based on current state of data @ongchinhwee
  10. Concept of time in data - Modern Data Stack @ongchinhwee

    • Fundamental Concepts: System Time vs Valid Time ◦ System Time ▪ For data platforms, multiple systems’ timestamps to consider • Source system time (database, object storage etc.) • Ingestion system time • Data warehouse / lake processing system time
  11. Data Latency in Modern Data Stack @ongchinhwee • Time lag

    between event time vs system time for data events • Multiple causes: ◦ Latency in upstream systems ◦ Network delays during ingestion to data warehouse / lake ◦ Out-of-order event generation ⇒ Change data capture for timely change tracking
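As a rough illustration of measuring this lag (not from the talk): the sketch below compares event time against warehouse load time for recent records. The table and column names (raw_payment_events, event_at, loaded_at) are assumptions, and the date_diff() / count_if() syntax shown is Trino/Presto-style and varies by warehouse.

```sql
-- Sketch: quantify event-time vs system-time lag for the latest day of loads.
-- raw_payment_events, event_at and loaded_at are illustrative names;
-- date_diff() / count_if() are Trino/Presto-style functions.
select
    avg(date_diff('minute', event_at, loaded_at)) as avg_lag_minutes,
    max(date_diff('minute', event_at, loaded_at)) as max_lag_minutes,
    count_if(date_diff('hour', event_at, loaded_at) > 3) as rows_later_than_3h
from raw_payment_events
where loaded_at >= current_date - interval '1' day;
```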
  12. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  13. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Modern approaches ▪ Data snapshots ▪ Incremental models ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  14. @ongchinhwee Anatomy of incremental materialization Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)

    1. Table exists? Full-refresh or incremental 2. Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?
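For reference, a minimal dbt incremental model following this anatomy might look like the sketch below; the model, source and column names are illustrative and not taken from the dbt-dummy project.

```sql
-- models/fct_transactions.sql (dbt) -- minimal incremental model sketch.
{{ config(
    materialized='incremental',     -- 1. table exists? dbt picks full-refresh vs incremental
    unique_key='transaction_id',    --    how/where to insert or update the incremental load
    incremental_strategy='merge'
) }}

select
    transaction_id,
    customer_id,
    amount,
    event_at,
    updated_at
from {{ source('payments', 'transactions') }}   -- where to get the load from

{% if is_incremental() %}
  -- 2. condition for incremental load: only rows updated after the latest
  --    record already in the destination table ({{ this }})
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```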
  15. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value @ongchinhwee
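Since a Type 1 update is simply an in-place overwrite, a plain SQL MERGE like the sketch below is one common way to express it; dim_customer and stg_customer_updates are illustrative names.

```sql
-- Type 1 SCD sketch: overwrite the previous attribute values, keeping no history.
merge into dim_customer d
using stg_customer_updates s
    on d.customer_id = s.customer_id
when matched then update set
    customer_status = s.customer_status,
    income_band     = s.income_band,
    updated_at      = s.updated_at
when not matched then insert
    (customer_id, customer_status, income_band, updated_at)
    values (s.customer_id, s.customer_status, s.income_band, s.updated_at);
```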
  16. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator ◦ Commonly implemented using tuple versioning @ongchinhwee
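In the modern data stack, dbt snapshots are one common way to get this tuple-versioning behaviour: dbt maintains dbt_valid_from / dbt_valid_to columns for each row version. A minimal sketch, with illustrative source and key names:

```sql
-- snapshots/dim_customer_snapshot.sql (dbt) -- Type 2 SCD via a dbt snapshot.
-- dbt adds dbt_valid_from / dbt_valid_to (null = current row) for each version.
{% snapshot dim_customer_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```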
  17. First Line of Defense: Preventing Downstream Impact @ongchinhwee • Objective:

    Stop incomplete / inconsistent data from flowing into destination tables • Preventive approaches we can adopt: ◦ Data reconciliation ◦ Data integrity checkpoints
  18. Data Reconciliation • Comparing data from different sources to ensure:

    ◦ Accuracy: values remain correct ◦ Consistency: data between sources matches ◦ Completeness: all records have been transferred @ongchinhwee
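As an example, a completeness check can be as simple as comparing row counts per load date between the source extract and the warehouse table; the sketch below uses illustrative table and column names.

```sql
-- Completeness check sketch: flag load dates where the warehouse row count
-- does not match what the source reported. Names are illustrative.
select
    s.load_date,
    s.source_row_count,
    coalesce(w.row_count, 0) as warehouse_row_count
from source_extract_counts s
left join (
    select cast(ingested_at as date) as load_date, count(*) as row_count
    from fct_transactions
    group by 1
) w
    on w.load_date = s.load_date
where coalesce(w.row_count, 0) <> s.source_row_count;
```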
  19. Data Reconciliation • Handling late-arriving dimensions in incremental load /

    stream ◦ Streaming Reconciliation Pattern ▪ Identification: Separate processing workflows for joined vs unjoined event records in data stream ▪ Reconciliation: Batch job to deduplicate new unjoined event records + process unjoined events into a reconciliation table for retrying (within retry threshold) ▪ Retryable stream: Integrate unjoined records w/ next data stream for processing (Ref: “How to Handle Late Arriving Dimensions with a Streaming Reconciliation Pattern” in Databricks blog) @ongchinhwee
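A batch-SQL sketch of the reconciliation and retry steps is shown below. This is not the code from the referenced Databricks post; the table names and the 3-day retry threshold are assumptions.

```sql
-- Reconciliation batch sketch: retry previously unjoined events against the
-- latest dimension rows, within an assumed 3-day retry threshold.
insert into fct_transactions_enriched
select e.transaction_id, e.event_at, e.amount, c.customer_id, c.customer_segment
from reconciliation_unjoined_events e
join dim_customer c
    on c.customer_id = e.customer_id
where e.first_seen_at >= current_timestamp - interval '3' day;

-- Remove the events that were successfully joined above; anything left stays
-- in the reconciliation table for the next retry.
delete from reconciliation_unjoined_events
where customer_id in (select customer_id from dim_customer);
```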
  20. Data Integrity Checkpoints • Data validation checks to detect data

    quality issues from late data: ◦ Referential integrity across different data sources ◦ Accepted values to detect deviation from expected values ◦ Missing or null values for late-arriving / missing dimensions after table joins ◦ Data freshness checks for tables in staging area @ongchinhwee
  21. Data Integrity Checkpoints • How to implement data integrity checkpoints?

    ◦ Use data testing frameworks / tools to define validation criteria ▪ Examples: dbt (data build tool) data tests, Great Expectations ◦ Automate validation checks with data pipeline orchestration ▪ Schedule validation checks after ingestion + transformation ▪ Examples: Airflow, Dagster @ongchinhwee
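For example, a singular dbt data test is just a SQL file that returns failing rows; the sketch below (illustrative model names) flags transactions whose customer dimension has not arrived yet.

```sql
-- tests/assert_no_missing_customer_dimension.sql (dbt singular data test)
-- The test fails if any rows are returned, i.e. if a transaction still has
-- no matching customer row after the join. Model names are illustrative.
select
    t.transaction_id,
    t.customer_id
from {{ ref('fct_transactions') }} t
left join {{ ref('dim_customer') }} c
    on c.customer_id = t.customer_id
where c.customer_id is null
```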
  22. Incremental Strategy for Late-arriving Data • Is the table a

    fact or dimension? • When can we expect late data to arrive (in most cases)? • Lookback “window” approach to handle late-arriving data ◦ Setting the maximum delay allowed to include late data in the incremental data stream ◦ Performance vs Accuracy trade-off @ongchinhwee
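One way to express the lookback “window” in a dbt incremental model is sketched below: each run reprocesses the last few days of source data so that late rows inside the window are still merged in. The variable name, the 3-day default and the date arithmetic (which varies by warehouse dialect) are assumptions.

```sql
-- models/fct_transactions.sql (dbt) -- incremental model with a lookback window.
-- lookback_days and the date arithmetic style are illustrative assumptions.
{{ config(
    materialized='incremental',
    unique_key='transaction_id',
    incremental_strategy='merge'
) }}

select *
from {{ source('payments', 'transactions') }}

{% if is_incremental() %}
  -- Reprocess everything inside the lookback window so that late-arriving rows
  -- within the maximum accepted delay are still picked up and merged.
  where event_date >= (
      select max(event_date) - {{ var('lookback_days', 3) }}
      from {{ this }}
  )
{% endif %}
```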
  23. Case: Incremental Strategy for Late-arriving Facts • Example: ◦ Payment

    transactions stuck in intermediate state for more than a day ◦ Average traffic: ~100 billion transactions / day ◦ Event resolution time: ~3 days w/ outliers @ongchinhwee
  24. Case: Incremental Strategy for Late-arriving Facts • Design Question to

    Consider ◦ Accuracy vs Performance ▪ Choose accuracy? • Scan a wider date range / more date partitions on the source table to update the data in the destination table ▪ Choose performance? • Accept a small possibility of data loss from late arrivals outside the lookback “window” @ongchinhwee
  25. Case: Incremental Strategy for Late-arriving Dimensions • Example: ◦ Customer

    information for new customers arrives > 1 day later than transactions ◦ 5000 customer records (out of 3 million customers) impacted ◦ We need to join both sets of data for transactions monitoring @ongchinhwee
  26. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ Slowly-Changing Dimensions - which timestamp to use? ▪ Use valid time (event timestamp)? • Dimension data may arrive much later (~1 year?) in system ⇒ longer lookback when scanning dimension table. ▪ Use system time (system updated_at timestamp)? • System time != real world events e.g. late customer onboarding @ongchinhwee
  27. Case: Incremental Strategy for Late-arriving Dimensions • Design Questions to

    Consider ◦ How to handle late-arriving dimensions in data streams? ▪ Custom incremental logic • Design objective: performant and “close enough” data • Event lookback “window” on early-arriving facts to handle incremental updates on dimensions @ongchinhwee
  28. Lookback “window” w/ maximum accepted delay of 3 days for

    facts and late-arriving dimensions @ongchinhwee
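A sketch of what this can look like on the dimension side (illustrative names, Snowflake/Postgres-style date arithmetic): each incremental run re-joins the last 3 days of early-arriving facts against the latest dimension rows, so customer attributes that arrive within the accepted delay are filled in on a later run.

```sql
-- Sketch: re-join the last 3 days of facts against the latest dimension rows
-- on every incremental run, so dimensions arriving within the 3-day window
-- eventually populate the destination table. Names are illustrative.
{{ config(materialized='incremental', unique_key='transaction_id') }}

with recent_facts as (
    select *
    from {{ ref('fct_transactions') }}
    {% if is_incremental() %}
    where event_date >= current_date - 3   -- maximum accepted delay: 3 days
    {% endif %}
)

select
    f.transaction_id,
    f.event_date,
    f.amount,
    f.customer_id,
    c.customer_segment          -- stays null until the dimension row arrives
from recent_facts f
left join {{ ref('dim_customer') }} c
    on c.customer_id = f.customer_id
```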
  29. What if we want our data to be eventually accurate

    in data warehouse / data lake? @ongchinhwee
  30. What if we want our data to be eventually accurate

    in data warehouse / data lake? @ongchinhwee Do a periodic batch refresh to rebuild your tables!
  31. What if we need to rewind history on the state

    of the data warehouse / data lake? @ongchinhwee
  32. Imagine this scenario in your data team: @ongchinhwee • On

    1 August 2024, the audit team requests a review of trade valuations for 30 June 2024 as of 1 July 2024. • Due to late-arriving market information, a retroactive update was made on 15 July 2024 to one of the trade valuation records for 30 June 2024. • However, you need to provide the trade valuations based on what was known on 1 July 2024.
  33. Bitemporal models for system versioning • Why bitemporal models? ◦

    Retroactive changes to valid-time tuple versioning models ▪ Usually requires full-refresh → overwrites event history! ◦ Preserve history over time with time-travel for data audit and governance @ongchinhwee
  34. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models @ongchinhwee
  35. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning @ongchinhwee
  36. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning ◦ Change data capture (CDC) streams on event-aware data ▪ Capture DML changes on data records with “valid time ranges” @ongchinhwee
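A minimal sketch of such a history table, together with an “as of” query for the audit scenario from earlier (table and column names are illustrative): valid time records the business date a valuation applies to, while system time records when the warehouse knew about it.

```sql
-- Bitemporal history table sketch (illustrative names).
-- valid_date              = valid (event) time: the business date the valuation applies to
-- system_from / system_to = system time: when this version was known to the warehouse
create table trade_valuation_history (
    trade_id         varchar,
    valuation_amount decimal(18, 2),
    valid_date       date,
    system_from      timestamp,
    system_to        timestamp      -- null means this is the current version
);

-- Audit query: trade valuations for 30 June 2024 as they were known on 1 July 2024,
-- ignoring the retroactive correction recorded on 15 July 2024.
select trade_id, valuation_amount
from trade_valuation_history
where valid_date = date '2024-06-30'
  and system_from <= timestamp '2024-07-01 00:00:00'
  and (system_to is null or system_to > timestamp '2024-07-01 00:00:00');
```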
  37. Key Takeaways • Design proactive systems and modeling strategies for

    late-arriving data to ensure data quality in data platforms • Manage accuracy vs performance tradeoffs with lookback “window” thresholds and periodic batch refreshes @ongchinhwee
  38. Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get these slides here: https://bit.ly/fossasia-late-arriving-data @ongchinhwee