Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack
Event: Postgres Ibiza 2024
Date: 9 September 2024
Location: Palacio de Congresos de Ibiza, Santa Eulària des Riu, Ibiza, Spain

Modeling data with versioning enables data teams to track and reproduce the history of both fast- and slow-changing data over time. In the era of read-optimized cloud data warehouses and ELT (Extract-Load-Transform) workflows, there is still value in integrating dimensional data modeling practices with modern functional approaches in the modern data stack, in order to design efficient and reproducible data models for analytics reporting use cases that require accurate point-in-time reporting of historical dimensions. In this talk, I explore how we can design history-preserving data models and reproducible data transformation workflows by adopting “change-aware” dimensional modeling practices in the modern data stack.

Ong Chin Hwee

September 09, 2024
Transcript

  1. Modeling Data Fast, Slow and Back-In-Time Change-aware Dimensional Data Modeling

    in the Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 9 September 2024 Postgres Ibiza 2024, Ibiza, Spain
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee
  3. Before we begin, some disclaimers: @ongchinhwee • The insights in

    this talk are based on my own technical expertise and do not in any way represent the views of my team or employer. • Scope is primarily focused on structured tabular data ◦ Similar concepts with different implementations exist for unstructured data, but that’s not the main focus of this talk
  4. Imagine this scenario in your data team: @ongchinhwee • Finance

    requests a regeneration of a business-critical report dated 11 months ago for Client A • Client A was upgraded from Tier 2 to Tier 1 three months ago, which increased the rebate earn rate • However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)
  5. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  6. What is data versioning? @ongchinhwee • Concept of capturing state

    changes while keeping track of successive versions of data over time • Creates a unique reference for a collection of data when changes occur while retaining previous versions
  7. What is data versioning? @ongchinhwee • Types of data versioning:

    ◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data
  8. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range?
  9. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range? • When you need to use point-in-time values to track business metrics over time
  10. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  11. Design Patterns for Change Data Capture @ongchinhwee • Data versioning

    based on a combination of: ◦ Data version identifiers ◦ Timestamps ◦ Status indicators • Log trigger / tuple versioning ◦ Also known as Type 2 Slowly Changing Dimensions (SCDs) • Transaction logs
  12. What do we mean by the “Modern Data Stack”? @ongchinhwee

    • Cloud-based • Built around the cloud data warehouse / lake • Modular and customizable - “choose the best tool for a specific job”
  13. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse
  14. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse • Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes
  15. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  16. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) ◦ Modern approaches ▪ Data snapshots ▪ Incremental models @ongchinhwee
  17. Traditional (Kimball-style) Dimensional Data Modelling • Developed by Kimball in

    1996; updated in 2013 during the emergence of cloud data warehouses • Introduced the concept of facts vs dimensions (star schema) • Designed for storage and compute efficiency @ongchinhwee
  18. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Fact Tables ▪ Contain metrics and facts about a business process e.g. process time, transaction amount ▪ Are typically fast-evolving during a business process event ▪ Eventually reach a final state at a point in time upon completion @ongchinhwee
  19. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describe the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Do not have a “final state” @ongchinhwee
  20. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describe the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Do not have a “final state” → problem of Slowly Changing Dimensions (SCDs) @ongchinhwee
  21. What are Slowly Changing Dimensions (SCDs)? • Change tracking techniques

    to handle state changes in dimensions @ongchinhwee

    SCD Type | How state changes are handled
    ---------|-------------------------------
    0        | Ignore changes
    1        | Overwrite
    2        | Row versioning (w/ “validity period”)
    3        | Previous value column
    4        | History table (“mini-dimension”)
  22. What are Slowly Changing Dimensions (SCDs)? • Type 0 Slowly

    Changing Dimensions (Type 0 SCD) ◦ Fixed dimension e.g. account opening date ◦ Ignores any changes ◦ Assumes that the attribute will never change @ongchinhwee
  23. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value ◦ Destroys history - not possible to rewind back to previous versions of the data @ongchinhwee
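    A minimal sketch of the Type 1 behaviour (the table and column names here are hypothetical):

        -- Type 1 SCD: overwrite the attribute in place; the previous value is lost
        UPDATE dim_client
        SET    tier = 'Tier 1'
        WHERE  client_id = 42;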
  24. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator @ongchinhwee
  25. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record @ongchinhwee
  26. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row @ongchinhwee
  27. What are Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row ▪ Previous version of the attribute is updated with row expiration timestamp @ongchinhwee
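    A minimal Postgres sketch of the Type 2 mechanics described above (table, column and key names are hypothetical):

        -- Hypothetical Type 2 SCD dimension: one row per version of each client
        CREATE TABLE dim_client (
            client_sk      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
            client_id      INT NOT NULL,                                     -- natural key
            tier           TEXT NOT NULL,
            row_effective  TIMESTAMPTZ NOT NULL DEFAULT now(),
            row_expiration TIMESTAMPTZ NOT NULL DEFAULT 'infinity',
            is_current     BOOLEAN NOT NULL DEFAULT TRUE
        );

        -- On a detected change: expire the current row, then add the new version
        UPDATE dim_client
        SET    row_expiration = now(), is_current = FALSE
        WHERE  client_id = 42 AND is_current;

        INSERT INTO dim_client (client_id, tier)
        VALUES (42, 'Tier 1');  -- a new surrogate key is assigned automatically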
  28. Type 2 SCD: What is tuple versioning? • Change data

    capture mechanism that records changes to a mutable upstream table over time • Implements Type-2 Slowly Changing Dimensions on mutable table sources • Detects changes based on an `updated_at` timestamp @ongchinhwee
  29. Components of tuple versioning • Target schema • Column for

    timestamp-based tracking • Primary key / unique identifier(s) for record • Whether to invalidate records no longer in source @ongchinhwee
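    These components map almost one-to-one onto a dbt snapshot; a sketch, assuming a hypothetical `app.clients` source with an `updated_at` column:

        {% snapshot clients_snapshot %}
        {{ config(
            target_schema='snapshots',        -- target schema
            unique_key='client_id',           -- primary key / unique identifier
            strategy='timestamp',             -- timestamp-based tracking
            updated_at='updated_at',          -- column used to detect changes
            invalidate_hard_deletes=True      -- invalidate records no longer in source
        ) }}
        select * from {{ source('app', 'clients') }}
        {% endsnapshot %}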
  30. What are Slowly Changing Dimensions (SCDs)? • Slowly Changing Dimensions

    (SCDs) ◦ Type 3, Type 4 etc. ▪ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses • Type 3: add new column to store previous attribute value • Type 4: add history table to keep record of state changes More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross @ongchinhwee
  31. Data Snapshots • Read-only (immutable) copies of the state of

    a data source at a particular point in time • Stored at the staging area of a data warehouse @ongchinhwee
  32. Incremental models • Limiting the data transformation to a specified

    subset of source data ◦ Usually on rows in source data that have been created or updated since the last scheduled run • Significantly optimises runtime of transformations on large data + reduces compute cost @ongchinhwee
  33. @ongchinhwee Anatomy of incremental materialization on dbt Adapted from: dbt-dummy

    project (https://github.com/gmyrianthous/dbt-dummy/) [Diagram callouts: condition for incremental load; where to get the load from; where to insert the incremental load; how to insert/update the incremental load]
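    The callouts correspond to recognizable parts of a dbt incremental model; a minimal sketch, assuming a hypothetical `app.orders` source:

        {{ config(
            materialized='incremental',  -- how to insert/update the incremental load
            unique_key='order_id'        -- which existing rows to update on conflict
        ) }}

        select order_id, status, amount, updated_at
        from {{ source('app', 'orders') }}  -- where to get the load from
        {% if is_incremental() %}
        -- condition for incremental load: only rows changed since the last run;
        -- {{ this }} is the existing target table (where the load is inserted)
        where updated_at > (select max(updated_at) from {{ this }})
        {% endif %}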
  34. @ongchinhwee Anatomy of incremental materialization with Type-2 SCD model as

    source [Diagram callouts: condition for incremental load vs backfilling; where to get the load from; where to insert the incremental load; how to insert/update the incremental load]
  35. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? @ongchinhwee
  36. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? @ongchinhwee
  37. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes @ongchinhwee
  38. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes • Query filter for incremental run @ongchinhwee
  39. What if we need to rewind history on the state

    of the data warehouse / data lake? @ongchinhwee
  40. Imagine this scenario in your data team: @ongchinhwee • On

    1 August 2024, the Audit team requests a review of trade valuations for 30 June 2024, as of 1 July 2024. • Due to late-arriving market information, one of the trade valuation records for 30 June 2024 was retroactively updated on 15 July 2024. • However, you need to provide the trade valuations based on what was known on 1 July 2024.
  41. Bitemporal models for system versioning • Fundamental Concepts: System Time

    vs Valid Time ◦ Valid Time ▪ Also known as actual time ▪ Time when the event occurred in the real world ▪ Mutable, with an effective period determined by changes in event state @ongchinhwee
  42. Bitemporal models for system versioning • Fundamental Concepts: System Time

    vs Valid Time ◦ System Time ▪ Also known as record time (or “transaction time” in databases) ▪ Time when the data for the event is recorded in the system ▪ Immutable by design, based on the current state of the data @ongchinhwee
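    A sketch of how the two time axes can live on the same table, using Postgres range types (table and column names are hypothetical):

        -- Hypothetical bitemporal table: each row is one (valid-time, system-time) version
        CREATE TABLE trade_valuation_history (
            trade_id      INT NOT NULL,
            valuation     NUMERIC NOT NULL,
            valid_period  tstzrange NOT NULL,  -- when the value held in the real world
            system_period tstzrange NOT NULL   -- when the row was current in the system
        );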
  43. Bitemporal models for system versioning • Why bitemporal models? ◦

    Retroactive changes to valid-time tuple versioning models ▪ Usually requires full-refresh → overwrites event history! ◦ Preserve history over time with time-travel for data audit and governance @ongchinhwee
  44. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models @ongchinhwee
  45. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning @ongchinhwee
  46. Bitemporal models for system versioning • Some implementations of bitemporal

    models ◦ Data snapshots of tuple versioning models ◦ History table for tracking state changes w/ system time partitioning ◦ Change data capture (CDC) streams on event-aware data ▪ Capture DML changes on data records with “valid time ranges” @ongchinhwee
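    With both periods captured, the audit scenario from slide 40 reduces to a filter on the two time axes (a sketch against the hypothetical table above):

        -- Trade valuations for 30 June 2024, as they were known on 1 July 2024
        SELECT trade_id, valuation
        FROM   trade_valuation_history
        WHERE  valid_period  @> TIMESTAMPTZ '2024-06-30'
        AND    system_period @> TIMESTAMPTZ '2024-07-01';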
  47. Bitemporal models in other databases / data warehouse • MariaDB

    ◦ Support for both system-versioned tables and application-time periods ◦ Includes the following SQL extensions: ▪ PERIOD FOR syntax supports both system and event time periods ▪ Immutable system time (vs. mutable time periods) ◦ Design pattern: Time ranges for system time and event time Ref: Temporal Tables - MariaDB Knowledge Base @ongchinhwee
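    A sketch of the MariaDB syntax, combining both period types on one hypothetical table:

        -- System versioning: MariaDB keeps row history automatically
        CREATE TABLE client_tier (
            client_id  INT,
            tier       VARCHAR(10),
            date_start DATE,
            date_end   DATE,
            PERIOD FOR tier_period(date_start, date_end)  -- application-time period
        ) WITH SYSTEM VERSIONING;

        -- Read the table as of a past system time
        SELECT * FROM client_tier
        FOR SYSTEM_TIME AS OF TIMESTAMP'2024-07-01 00:00:00';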
  48. Bitemporal models in other databases / data warehouse • CockroachDB

    ◦ Support for time travel queries ◦ Includes the following SQL extensions on top of Postgres: ▪ Access historical data based on timestamp (“AS OF SYSTEM TIME”) ◦ Design pattern: Tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time Ref: AS OF SYSTEM TIME - CockroachDB Documentation @ongchinhwee
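    A sketch of a CockroachDB time travel query (the table name is hypothetical):

        -- Read historical data at a fixed point in system time
        SELECT * FROM trade_valuations
        AS OF SYSTEM TIME '2024-07-01 00:00:00';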
  49. Bitemporal models in other databases / data warehouse • Snowflake

    Time Travel ◦ Supports time travel retention up to 90 days prior (for Enterprise Edition) ◦ Includes the following SQL extensions ▪ Access historical data based on timestamp ▪ UNDROP command for tables, schemas and databases ◦ Design pattern: Tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time Ref: Understanding & Using Time Travel - Snowflake Documentation @ongchinhwee
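    A sketch of the equivalent Snowflake queries (the table name is hypothetical):

        -- Read a table as of a past timestamp
        SELECT * FROM trade_valuations
            AT(TIMESTAMP => '2024-07-01 00:00:00'::timestamp_tz);

        -- Restore an accidentally dropped table
        UNDROP TABLE trade_valuations;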
  50. Native features and extensions in Postgres releases • pg_bitemporal ◦

    Postgres extension by Henrietta (Hettie) Dombrovskaya ◦ Leverages GiST indexes with exclusion constraints for bitemporal models • (Future) Temporal keys ◦ Ongoing work by Peter Eisentraut and team to implement temporal keys with exclusion constraints as a native feature in Postgres ◦ Current problem: empty ranges and multiranges @ongchinhwee
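    The shared Postgres building block is an exclusion constraint over a range column, which stops two versions of the same key from having overlapping validity periods (a sketch, not the pg_bitemporal schema itself):

        CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets GiST enforce = on scalar columns

        CREATE TABLE client_tier_history (
            client_id    INT NOT NULL,
            tier         TEXT NOT NULL,
            valid_period tstzrange NOT NULL,
            -- no two rows for the same client may have overlapping validity periods
            EXCLUDE USING gist (client_id WITH =, valid_period WITH &&)
        );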
  51. A re-look at the original design of Postgres • Original

    design is intended to support time travel ◦ The original Postgres storage system design is “no-overwrite” ▪ A copy of the data record is written to a secondary store on each write ▪ “Vacuuming” of historical records + archiving data permanently into an immutable store ◦ However, this design was deprecated due to technological constraints at that time. Ref: The Design of POSTGRES by Michael Stonebraker and Lawrence A. Rowe @ongchinhwee
  52. Data Warehousing in the Modern Data Stack @ongchinhwee “Since storage

    and compute are dirt cheap, and engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?” • Does not apply for very large dimensions • Does not preclude the importance of dimensional data modelling Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset
  53. Is Kimball’s Dimensional Data Modelling still relevant? • Yes, in

    some ways. • Dimensional data modelling is still needed for these use cases: ◦ Aggregation of facts ◦ Metrics drill-down based on dimensions ▪ e.g. how many customers in Singapore spent > US$400 per month ◦ Financial reporting and audit @ongchinhwee
  54. Snapshot all your upstream source data! • What if assumptions

    about your data change? ◦ Source data update mechanism (append-only vs overwrite) ◦ Source table schema (columns added/altered/dropped) ◦ Business logic • Store data snapshots of upstream source in staging area ◦ Build SCDs from data snapshots @ongchinhwee
  55. Snapshot the state of your data warehouse data! • What

    if we need to “time travel” the data back to the past? ◦ Restore data object changes (e.g. accidental table deletion) ◦ Periodic data backups ◦ Audit log of retroactive changes to data records • Store database object snapshots in immutable store ◦ Append-only snapshot record for each database transaction ◦ Build SCDs from database object snapshots @ongchinhwee
  56. Source-target schema reconciliation Detect schema diffs between upstream source system

    and the data warehouse • Useful for detecting schema changes (columns added/altered/dropped) before they break downstream models @ongchinhwee
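    In Postgres, a lightweight sketch of this is diffing information_schema.columns between the two sides (schema and table names are assumptions; the source catalog would typically be read via a foreign data wrapper or a staged copy):

        -- Columns in the staged source copy that are missing, or typed differently,
        -- in the warehouse model
        SELECT column_name, data_type
        FROM   information_schema.columns
        WHERE  table_schema = 'staging' AND table_name = 'clients'
        EXCEPT
        SELECT column_name, data_type
        FROM   information_schema.columns
        WHERE  table_schema = 'warehouse' AND table_name = 'dim_clients';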
  57. Use Type 2 SCDs + incremental model (in most cases)

    • In most cases, Type 2 SCDs are sufficient for tracking and capturing data changes • Incremental models may be slightly complex, but pay off in efficiency + cost when data scales quickly and changes (quickly and slowly) over time @ongchinhwee
  58. Strategies on designing incremental models • Design with upstream data

    update mechanism in mind ◦ Affects how the incremental load is applied to the target model • Incremental strategy depends on the data warehouse / lake ◦ Specify a unique key (for a data warehouse) or partitioning key (for a data lake) to update existing records • Performance tip: Filter rows early! @ongchinhwee
  59. Key Takeaways • Dimensional data modeling techniques are still relevant

    in the Modern Data Stack • Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic • Data snapshots are useful in cases where re-modelling is required due to retroactive changes in the data @ongchinhwee
  60. Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get these slides here: https://bit.ly/pgibz-change-aware-data @ongchinhwee