Agile / Engineering Culture / Developer Experience • Team Coaching • Data Engineering
Shuhsi Lin
Working in smart manufacturing & AI, with data and people
Scalable & Reliable + Maintainable
  What are scalable and reliable (+ maintainable) pipelines?
03 dbt & Data Pipelines on K8S
  How to be scalable and reliable (+ maintainable) with dbt + K8S
04 Recap & More
  More to do for scalable, reliable, and maintainable data pipelines
orders with specific recipes to make pizza
• Data (tables)
  ◦ Order
  ◦ Recipes
  ◦ Customer
  ◦ Inventory
  ◦ …
Assume we:
- Operate 20,000 branches worldwide
- Serve 4 million customers per day
- Make 5 million pizzas per day
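To get a feel for the scale these assumptions imply, the figures can be turned into average rates. This is only a back-of-the-envelope sketch: it assumes load is spread evenly over the day and across branches, which real traffic will not be, and the function name is hypothetical.

```python
# Back-of-the-envelope load estimate from the stated assumptions.
BRANCHES = 20_000
CUSTOMERS_PER_DAY = 4_000_000
PIZZAS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_load():
    """Return average rates, assuming perfectly even load (a simplification)."""
    return {
        "pizzas_per_second": PIZZAS_PER_DAY / SECONDS_PER_DAY,
        "pizzas_per_branch_per_day": PIZZAS_PER_DAY / BRANCHES,
        "customers_per_branch_per_day": CUSTOMERS_PER_DAY / BRANCHES,
        "pizzas_per_customer": PIZZAS_PER_DAY / CUSTOMERS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in daily_load().items():
        print(f"{name}: {value:.2f}")
```

Peak-hour rates would be several times the average, so a pipeline sized only for these numbers would likely fall behind at dinner time.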
Kleppmann, Chris Riccomini, O'Reilly, 2025

Reliable: The ability of a system to perform its required functions consistently over time without failure.
• Data Integrity and Consistency
• Fault Tolerance and Error Handling
• Recoverability and Disaster Recovery
• Monitoring, Alerting, and Observability
• Security and Compliance

Scalable: The system's ability to handle increasing amounts of data, higher processing loads, and more complex transformations efficiently and effectively.
• Handling Increased Load and Concurrency
• Performance Optimization
• Dynamic Scaling Strategies and Resource Management
• Elasticity and Automated Scaling
• Reliability, Fault Tolerance, and Automation

Maintainable: The ease with which a data pipeline can be understood, modified, extended, and troubleshot over its lifecycle.
• Modular, Reusable, and Evolvable Design
• Standardization and Best Practices
• Simplicity and Ease of Understanding
• Configurability and Operability
• Comprehensive Documentation and Knowledge Sharing
• Version Control and Collaboration Practices
• Automated Testing and Validation
Deployment Difficulties
• Lack of Modular System
• Lack of Visibility
• Inefficient Debugging
• Order Processing Failures
• Data Inconsistency
• Schema Changes
• Single Points of Failure

Challenges: Real-Time Order Processing and Delivery Optimization
[Figure: orders request, real-time delivery estimates, ETL/ELT pipelines, data store/target, order processing, delivery optimization]
that software engineers use to build applications.
• Centralized
• Version Control
• Documentation
• Modularity
• Open-Source
https://www.getdbt.com/product/what-is-dbt
https://github.com/dbt-labs/dbt-core
What is dbt?
model = a single .sql file
dbt code = SQL + Jinja
• SQL select statement
dbt models reference each other
• Creates natural dependencies
• dbt determines model execution order
1 command
• Create DAG
• Parallel execution
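The way references between models turn into an execution order can be sketched in a few lines. This is an illustration of the idea, not how dbt is implemented: dbt renders Jinja properly, whereas this sketch uses a regular expression to find `ref(...)` calls, and the model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: one SQL select statement per model.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_recipes": "select * from raw.recipes",
    "order_recipes": (
        "select o.order_id, o.order_date, r.recipe_id "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_recipes') }} r on o.recipe_id = r.recipe_id"
    ),
    "daily_pizza_counts": (
        "select order_date, count(*) as pizzas "
        "from {{ ref('order_recipes') }} group by order_date"
    ),
}


def dependencies(models):
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def execution_batches(models):
    """Group models into batches; models in one batch could run in parallel."""
    sorter = TopologicalSorter(dependencies(models))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(MODELS), start=1):
        print(number, batch)
```

The two staging models have no dependencies, so they land in the same batch and could run in parallel; each later batch waits for the one before it.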
correctly reflects the real-world object/event (Accuracy)
• Completeness: the expected comprehensiveness; are all datasets and data items recorded?
• Consistency: data across all systems reflects the same information and is in sync across the data stores
• Timeliness: information is available when it is expected and needed
• Uniqueness: there is only one instance of the information in a database
• Validity: information conforms to a specific format and follows business rules
https://kamal-ahmed.github.io/DQ-Dimensions.github.io/
https://hub.getdbt.com/infinitelambda/dq_tools/latest/
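Several of these dimensions can be expressed as small checks over rows. The following is a minimal sketch in plain Python, assuming a hypothetical orders table; the column names, allowed statuses, and helper names are illustrative and do not come from any particular package.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows with deliberate quality problems.
ORDERS = [
    {"order_id": 1, "customer_id": "C1", "status": "delivered",
     "loaded_at": datetime(2025, 1, 1, 12, 0)},
    {"order_id": 2, "customer_id": None, "status": "baking",
     "loaded_at": datetime(2025, 1, 1, 12, 5)},
    {"order_id": 2, "customer_id": "C3", "status": "eaten",
     "loaded_at": datetime(2025, 1, 1, 12, 10)},
]
VALID_STATUSES = {"placed", "baking", "delivered"}


def duplicate_keys(rows, key):
    """Uniqueness: return key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        if row[key] in seen:
            dupes.add(row[key])
        seen.add(row[key])
    return dupes


def missing_count(rows, column):
    """Completeness: count rows where the column is missing (None)."""
    return sum(1 for row in rows if row[column] is None)


def invalid_values(rows, column, allowed):
    """Validity: return values that are outside the allowed set."""
    return [row[column] for row in rows if row[column] not in allowed]


def is_fresh(rows, column, now, max_age):
    """Timeliness: the newest row must be no older than max_age."""
    return now - max(row[column] for row in rows) <= max_age


if __name__ == "__main__":
    print("duplicates:", duplicate_keys(ORDERS, "order_id"))
    print("missing customer_id:", missing_count(ORDERS, "customer_id"))
    print("invalid status:", invalid_values(ORDERS, "status", VALID_STATUSES))
    print("fresh:", is_fresh(ORDERS, "loaded_at",
                             datetime(2025, 1, 1, 13, 0), timedelta(hours=2)))
```

Accuracy and consistency are harder to check this way, since they need a trusted reference (the real world, or another system) to compare against.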
Generic data tests
• Packages
  ◦ dbt_utils
  ◦ dbt_expectations
  ◦ elementary
  ◦ …
Test data schema
• dbt (model) contract
Test data code
• dbt unit test
• Recce
frameworks in dbt. (2023)
Two Types of Testing
DEV/TEST env, CI: data frozen, code changed
Validating the code that processes data before it is deployed to prod.
• ETL code
  ◦ pytest, ...
• model code
  ◦ dbt unit testing
  ◦ Recce
Prod env: code frozen, data changed
Validating the data as it's loaded into production.
• data content
  ◦ pydantic
  ◦ great expectations
  ◦ dbt test
  ◦ dbt_utils/expectations/elementary…
• data schemas
  ◦ dbt data contracts
[Figure: in each environment, code takes input data and produces output data]
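The difference between the two kinds of testing can be shown side by side. This is a minimal sketch in plain Python with a hypothetical pricing rule and hypothetical column names; in practice the first half would typically be a pytest or dbt unit test and the second a dbt data test or a schema validator.

```python
def order_total(unit_price, quantity, discount=0.0):
    """Transformation code under test (a hypothetical business rule)."""
    return round(unit_price * quantity * (1 - discount), 2)


def test_order_total():
    # Code test: the input data is fixed, so a failure means the code changed.
    assert order_total(10.0, 3) == 30.0
    assert order_total(10.0, 3, discount=0.1) == 27.0


def validate_order(row):
    """Data test: the code is fixed, so a failure means the incoming data changed."""
    errors = []
    if row.get("quantity") is None or row["quantity"] <= 0:
        errors.append("quantity must be positive")
    if row.get("unit_price") is None or row["unit_price"] < 0:
        errors.append("unit_price must be non-negative")
    return errors


if __name__ == "__main__":
    test_order_total()  # would run in CI, before deployment
    incoming = [
        {"quantity": 2, "unit_price": 9.5},
        {"quantity": 0, "unit_price": 9.5},
    ]
    for row in incoming:  # would run in production, on every load
        print(row, validate_order(row))
```

The first check runs once per code change; the second runs on every load, because new data can break a pipeline whose code has not changed at all.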
dbt run on Kubernetes
• Scalability
• High Availability
• Resource Optimization
• Automation
• Monitoring and Logging
[Figure: within the ETL/ELT pipelines, a CronJob (schedule: "0 * * * *") configured through a ConfigMap starts dbt run, which runs SQL queries against the database / data warehouse]
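A CronJob that starts `dbt run` could look roughly like the manifest built below. This is a sketch under assumptions: the image name, resource names, and ConfigMap name are hypothetical, the hourly schedule is a guess at what the slide intends, and a real setup would also need warehouse credentials (usually from a Secret). The manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def dbt_cronjob(schedule="0 * * * *",
                image="registry.example.com/dbt-pizza:latest",  # hypothetical image
                config_map="dbt-config"):                       # hypothetical ConfigMap
    """Build a Kubernetes CronJob manifest that runs `dbt run` on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "dbt-run"},
        "spec": {
            "schedule": schedule,           # five cron fields; default is hourly
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still going
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,      # retry a failed pod up to two times
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{
                                "name": "dbt",
                                "image": image,
                                "command": ["dbt", "run"],
                                # Expose ConfigMap keys as environment variables.
                                "envFrom": [{"configMapRef": {"name": config_map}}],
                            }],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(dbt_cronjob(), indent=2))
```

Setting `concurrencyPolicy` to `Forbid` is one way to keep overlapping runs from writing to the same tables; whether that is the right trade-off depends on how long a run takes compared with the schedule interval.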