Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack
Event: Postgres Ibiza 2024
Date: 9 September 2024
Location: Palacio de Congresos de Ibiza, Santa Eulària des Riu, Ibiza, Spain

Modeling data with versioning enables data teams to track and reproduce the history of both fast- and slow-changing data over time. In the era of read-optimized cloud data warehouses and ELT (Extract-Load-Transform) workflows, there is still value in integrating dimensional data modeling practices with modern functional approaches in the modern data stack, in order to design efficient and reproducible data models for analytics reporting use cases that require accurate point-in-time reporting of historical dimensions. In this talk, I explore how we can design history-preserving data models and reproducible data transformation workflows by adopting “change-aware” dimensional modeling practices in the modern data stack.

Ong Chin Hwee

September 09, 2024
Transcript

  1. Modeling Data Fast, Slow and Back-In-Time Change-aware Dimensional Data Modeling

    in the Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 9 September 2024 Postgres Ibiza 2024, Ibiza, Spain
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee
  3. Before we begin, some disclaimers: @ongchinhwee • The insights in

    this talk are based on my own technical expertise and do not in any way represent the views of my team or employer. • Scope is primarily focused on structured tabular data ◦ Similar concepts with different implementations exist for unstructured data, but that’s not the main focus of this talk
  4. Imagine this scenario in your data team: @ongchinhwee • Finance

    requests a regeneration of a business-critical report dated 11 months ago for Client A • Client A was upgraded from Tier 2 to Tier 1 three months ago, which increased the rebate earn rate • However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)
  5. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  6. What is data versioning? @ongchinhwee • Concept of capturing state

    changes while keeping track of successive versions of data over time • Creates a unique reference for a collection of data when changes occur while retaining previous versions
  7. What is data versioning? @ongchinhwee • Types of data versioning:

    ◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data
  8. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range?
  9. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range? • When you need to use point-in-time values to track business metrics over time
  10. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  11. Design Patterns for Change Data Capture @ongchinhwee • Data versioning

    based on a combination of: ◦ Data version identifiers ◦ Timestamps ◦ Status indicators • Log trigger / tuple versioning ◦ Also known as Type 2 Slowly Changing Dimensions (SCDs) • Transaction logs
  12. What do we mean by the “Modern Data Stack”? @ongchinhwee

    • Cloud-based • Built around the cloud data warehouse / lake • Modular and customizable - “choose the best tool for a specific job”
  13. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse
  14. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse • Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes
  15. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  16. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) ◦ Modern approaches ▪ Data snapshots ▪ Incremental models @ongchinhwee
  17. Traditional (Kimball-style) Dimensional Data Modelling • Developed by Kimball in

    1996; updated in 2013 during the emergence of cloud data warehouses • Introduced the concept of facts vs dimensions (star schema) • Designed for storage and compute efficiency @ongchinhwee
  18. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Fact Tables ▪ Contain metrics and facts about a business process e.g. process time, transaction amount ▪ Are typically fast-evolving during a business process event ▪ Eventually reach a final state at a point in time upon completion @ongchinhwee
  19. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describe the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Do not have a “final state” @ongchinhwee
  20. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describe the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Do not have a “final state” → problem of Slowly Changing Dimensions (SCDs) @ongchinhwee
  21. What are Slowly Changing Dimensions (SCDs)? • Change tracking techniques

    to handle state changes in dimensions @ongchinhwee

    SCD Type | How state changes are handled
    ---------|-------------------------------
    0        | Ignore changes
    1        | Overwrite
    2        | Row versioning (w/ “validity period”)
    3        | Previous value column
    4        | History table (“mini-dimension”)
  22. What are Slowly Changing Dimensions (SCDs)? • Type 0 Slowly

    Changing Dimensions (Type 0 SCD) ◦ Fixed dimension e.g. account opening date ◦ Ignores any changes ◦ Assumes that the attribute will never change @ongchinhwee
  23. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value ◦ Destroys history - not possible to rewind back to previous versions of the data @ongchinhwee
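    A minimal sketch of the Type 1 behaviour (the table and column names here are hypothetical):

        -- Type 1 SCD: overwrite the attribute in place; the previous value is lost
        UPDATE dim_client
        SET    tier = 'Tier 1'
        WHERE  client_id = 42;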
  24. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator @ongchinhwee
  25. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record @ongchinhwee
  26. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row @ongchinhwee
  27. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row ▪ Previous version of the attribute is updated with row expiration timestamp @ongchinhwee
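    A minimal Postgres sketch of the Type 2 mechanics described above (table, column and key names are hypothetical):

        -- Hypothetical Type 2 SCD dimension: one row per version of each client
        CREATE TABLE dim_client (
            client_sk      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
            client_id      INT NOT NULL,                                     -- natural key
            tier           TEXT NOT NULL,
            row_effective  TIMESTAMPTZ NOT NULL DEFAULT now(),
            row_expiration TIMESTAMPTZ NOT NULL DEFAULT 'infinity',
            is_current     BOOLEAN NOT NULL DEFAULT TRUE
        );

        -- On a detected change: expire the current row, then add the new version
        UPDATE dim_client
        SET    row_expiration = now(), is_current = FALSE
        WHERE  client_id = 42 AND is_current;

        INSERT INTO dim_client (client_id, tier)
        VALUES (42, 'Tier 1');  -- a new surrogate key is assigned automatically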
  28. Type 2 SCD: What is tuple versioning? • Change data

    capture mechanism that records changes to a mutable upstream table over time • Implements Type-2 Slowly Changing Dimensions on mutable table sources • Detects changes based on an `updated_at` timestamp @ongchinhwee
  29. Components of tuple versioning • Target schema • Column for

    timestamp-based tracking • Primary key / unique identifier(s) for record • Whether to invalidate records no longer in source @ongchinhwee
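    These components map almost one-to-one onto a dbt snapshot; a sketch, assuming a hypothetical `app.clients` source with an `updated_at` column:

        {% snapshot clients_snapshot %}
        {{ config(
            target_schema='snapshots',        -- target schema
            unique_key='client_id',           -- primary key / unique identifier
            strategy='timestamp',             -- timestamp-based tracking
            updated_at='updated_at',          -- column used to detect changes
            invalidate_hard_deletes=True      -- invalidate records no longer in source
        ) }}
        select * from {{ source('app', 'clients') }}
        {% endsnapshot %}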
  30. What are Slowly Changing Dimensions (SCDs)? • Slowly Changing Dimensions

    (SCDs) ◦ Type 3, Type 4 etc. ▪ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses • Type 3: add new column to store previous attribute value • Type 4: add history table to keep record of state changes More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross @ongchinhwee
  31. Data Snapshots • Read-only (immutable) copies of the state of

    a data source at a particular point in time • Stored at the staging area of a data warehouse @ongchinhwee
  32. Incremental models • Limiting the data transformation to a specified

    subset of source data ◦ Usually on rows in source data that have been created or updated since the last scheduled run • Significantly optimises runtime of transformations on large data + reduces compute cost @ongchinhwee
  33. @ongchinhwee Anatomy of incremental materialization on dbt Adapted from: dbt-dummy

    project (https://github.com/gmyrianthous/dbt-dummy/) [Diagram callouts: condition for incremental load; where to get the load from; where to insert the incremental load; how to insert/update the incremental load]
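    The callouts correspond to recognizable parts of a dbt incremental model; a minimal sketch, assuming a hypothetical `app.orders` source:

        {{ config(
            materialized='incremental',  -- how to insert/update the incremental load
            unique_key='order_id'        -- which existing rows to update on conflict
        ) }}

        select order_id, status, amount, updated_at
        from {{ source('app', 'orders') }}  -- where to get the load from
        {% if is_incremental() %}
        -- condition for incremental load: only rows changed since the last run;
        -- {{ this }} is the existing target table (where the load is inserted)
        where updated_at > (select max(updated_at) from {{ this }})
        {% endif %}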
  34. @ongchinhwee Anatomy of incremental materialization with Type-2 SCD model as

    source [Diagram callouts: condition for incremental load vs backfilling; where to get the load from; where to insert the incremental load; how to insert/update the incremental load]
  35. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? @ongchinhwee
  36. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? @ongchinhwee
  37. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes @ongchinhwee
  38. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes • Query filter for incremental run @ongchinhwee
  39. What if we need to rewind history on the state

    of the data warehouse / data lake? @ongchinhwee
  40. Imagine this scenario in your data team: @ongchinhwee • On

    1 August 2024, the Audit team requests a review of trade valuations for 30 June 2024, as of 1 July 2024. • Due to late-arriving market information, one of the trade valuation records for 30 June 2024 was retroactively updated on 15 July 2024. • However, you need to provide the trade valuations based on what was known on 1 July 2024.
  41. Bitemporal models for system versioning • Fundamental Concepts: System Time

    vs Valid Time ◦ Valid Time ▪ Also known as actual time ▪ Time when the event occurred in the real world ▪ Mutable, with an effective period determined by changes in event state @ongchinhwee
  42. Bitemporal models for system versioning • Fundamental Concepts: System Time

    vs Valid Time ◦ System Time ▪ Also known as record time (or “transaction time” in databases) ▪ Time when the data for the event is recorded in the system ▪ Immutable by design, based on the current state of the data @ongchinhwee
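    A sketch of how the two time axes can live on the same table, using Postgres range types (table and column names are hypothetical):

        -- Hypothetical bitemporal table: each row is one (valid-time, system-time) version
        CREATE TABLE trade_valuation_history (
            trade_id      INT NOT NULL,
            valuation     NUMERIC NOT NULL,
            valid_period  tstzrange NOT NULL,  -- when the value held in the real world
            system_period tstzrange NOT NULL   -- when the row was current in the system
        );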
  43. Bitemporal models for system versioning • Why bitemporal models? ◦

    Retroactive changes to valid-time tuple versioning models ▪ Usually requires full-refresh → overwrites event history! ◦ Preserve history over time with time-travel for data audit and governance @ongchinhwee
  44. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models @ongchinhwee
  45. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning @ongchinhwee
  46. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning ◦ Change data capture (CDC) streams on event-aware data ▪ Capture DML changes on data records with “valid time ranges” @ongchinhwee
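    With both periods captured, the audit scenario from slide 40 reduces to a filter on the two time axes (a sketch against the hypothetical table above):

        -- Trade valuations for 30 June 2024, as they were known on 1 July 2024
        SELECT trade_id, valuation
        FROM   trade_valuation_history
        WHERE  valid_period  @> TIMESTAMPTZ '2024-06-30'
        AND    system_period @> TIMESTAMPTZ '2024-07-01';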
  47. Bitemporal models in other databases / data warehouse • MariaDB

    ◦ Support for both system-versioned tables and application-time periods ◦ Includes the following SQL extensions: ▪ PERIOD FOR syntax supports both system and event time periods ▪ Immutable system time (vs. mutable time periods) ◦ Design pattern: Time ranges for system time and event time Ref: Temporal Tables - MariaDB Knowledge Base @ongchinhwee
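    A sketch of the MariaDB syntax, combining both period types on one hypothetical table:

        -- System versioning: MariaDB keeps row history automatically
        CREATE TABLE client_tier (
            client_id  INT,
            tier       VARCHAR(10),
            date_start DATE,
            date_end   DATE,
            PERIOD FOR tier_period(date_start, date_end)  -- application-time period
        ) WITH SYSTEM VERSIONING;

        -- Read the table as of a past system time
        SELECT * FROM client_tier
        FOR SYSTEM_TIME AS OF TIMESTAMP'2024-07-01 00:00:00';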
  48. Bitemporal models in other databases / data warehouse • CockroachDB

    ◦ Support for time travel queries ◦ Includes the following SQL extensions on top of Postgres: ▪ Access historical data based on timestamp (“AS OF SYSTEM TIME”) ◦ Design pattern: Tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time Ref: AS OF SYSTEM TIME - CockroachDB Documentation @ongchinhwee
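    A sketch of a CockroachDB time travel query (the table name is hypothetical):

        -- Read historical data at a fixed point in system time
        SELECT * FROM trade_valuations
        AS OF SYSTEM TIME '2024-07-01 00:00:00';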
  49. Bitemporal models in other databases / data warehouse • Snowflake

    Time Travel ◦ Supports time travel retention up to 90 days prior (for Enterprise Edition) ◦ Includes the following SQL extensions ▪ Access historical data based on timestamp ▪ UNDROP command for tables, schemas and databases ◦ Design pattern: Tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time Ref: Understanding & Using Time Travel - Snowflake Documentation @ongchinhwee
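    A sketch of the equivalent Snowflake queries (the table name is hypothetical):

        -- Read a table as of a past timestamp
        SELECT * FROM trade_valuations
            AT(TIMESTAMP => '2024-07-01 00:00:00'::timestamp_tz);

        -- Restore an accidentally dropped table
        UNDROP TABLE trade_valuations;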
  50. Native features and extensions in Postgres releases • pg_bitemporal ◦

    Postgres extension by Henrietta (Hettie) Dombrovskaya ◦ Leverages GiST indexes with exclusion constraints for bitemporal models • (Future) Temporal keys ◦ Ongoing work by Peter Eisentraut and team to implement temporal keys with exclusion constraints as a native feature in Postgres ◦ Current problem: empty ranges and multiranges @ongchinhwee
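    The shared Postgres building block is an exclusion constraint over a range column, which stops two versions of the same key from having overlapping validity periods (a sketch, not the pg_bitemporal schema itself):

        CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets GiST enforce = on scalar columns

        CREATE TABLE client_tier_history (
            client_id    INT NOT NULL,
            tier         TEXT NOT NULL,
            valid_period tstzrange NOT NULL,
            -- no two rows for the same client may have overlapping validity periods
            EXCLUDE USING gist (client_id WITH =, valid_period WITH &&)
        );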
  51. A re-look at the original design of Postgres • Original

    design is intended to support time travel ◦ The original Postgres storage system design is “no-overwrite” ▪ A copy of the data record is written to a secondary store on each write ▪ “Vacuuming” of historical records + archiving data permanently into an immutable store ◦ However, this design was deprecated due to technological constraints at that time. Ref: The Design of POSTGRES by Michael Stonebraker and Lawrence A. Rowe @ongchinhwee
  52. Data Warehousing in the Modern Data Stack @ongchinhwee “Since storage

    and compute are dirt cheap, and engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?” • Does not apply for very large dimensions • Does not preclude the importance of dimensional data modelling Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset
  53. Is Kimball’s Dimensional Data Modelling still relevant? • Yes, in

    some ways. • Dimensional data modelling is still needed for these use cases: ◦ Aggregation of facts ◦ Metrics drill-down based on dimensions ▪ e.g. how many customers in Singapore spent > US$400 per month ◦ Financial reporting and audit @ongchinhwee
  54. Snapshot all your upstream source data! • What if assumptions

    about your data change? ◦ Source data update mechanism (append-only vs overwrite) ◦ Source table schema (columns added/altered/dropped) ◦ Business logic • Store data snapshots of upstream source in staging area ◦ Build SCDs from data snapshots @ongchinhwee
  55. Snapshot the state of your data warehouse data! • What

    if we need to “time travel” the data back to the past? ◦ Restore data object changes (e.g. accidental table deletion) ◦ Periodic data backups ◦ Audit log of retroactive changes to data records • Store database object snapshots in immutable store ◦ Append-only snapshot record for each database transaction ◦ Build SCDs from database object snapshots @ongchinhwee
  56. Source-target schema reconciliation Detect schema diffs between upstream source system

    and the data warehouse • Useful for detecting schema changes (columns added/altered/dropped) before they break downstream models @ongchinhwee
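    In Postgres, a lightweight sketch of this is diffing information_schema.columns between the two sides (schema and table names are assumptions; the source catalog would typically be read via a foreign data wrapper or a staged copy):

        -- Columns in the staged source copy that are missing, or typed differently,
        -- in the warehouse model
        SELECT column_name, data_type
        FROM   information_schema.columns
        WHERE  table_schema = 'staging' AND table_name = 'clients'
        EXCEPT
        SELECT column_name, data_type
        FROM   information_schema.columns
        WHERE  table_schema = 'warehouse' AND table_name = 'dim_clients';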
  57. Use Type 2 SCDs + incremental model (in most cases)

    • In most cases, Type 2 SCDs are sufficient for tracking and capturing data changes • Incremental models may be slightly complex, but pay off in efficiency + cost when data scales quickly and changes (quickly and slowly) over time @ongchinhwee
  58. Strategies on designing incremental models • Design with upstream data

    update mechanism in mind ◦ Affects how the incremental load is applied to the target model • Incremental strategy depends on the data warehouse / lake ◦ Specify a unique key (for a data warehouse) or partitioning key (for a data lake) to update existing records • Performance tip: Filter rows early! @ongchinhwee
  59. Key Takeaways • Dimensional data modeling techniques are still relevant

    in the Modern Data Stack • Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic • Data snapshots are useful in cases where re-modelling is required due to retroactive changes in the data @ongchinhwee
  60. Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get these slides here: https://bit.ly/pgibz-change-aware-data @ongchinhwee