Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Unleashing Confidence in SQL Development throug...

Avatar for Tobias Lampert Tobias Lampert
April 25, 2024
96

Unleashing Confidence in SQL Development through Unit Testing

As the landscape of data-driven applications expands, the need for robust SQL development practices becomes increasingly critical. This conference talk addresses the challenges faced by data teams in maintaining and evolving complex SQL models for their Data Warehouses, and shows how unit testing can play a vital role in ensuring data quality. With the ease of mind of an automatically verified SQL logic, changes to existing data models can be shipped with confidence, ultimately contributing to faster deployment cycles. Learn how test cases with small sets of static mock data can be defined effortlessly so that they assure the correctness of the SQL logic and help pinpoint potential code errors easily.

Avatar for Tobias Lampert

Tobias Lampert

April 25, 2024
Tweet

Transcript

  1. Icon About me Tobias Lampert Analytics Engineer, Team BI Lotum

    media GmbH Back end developer turned data+AI guy • Data Engineering • Data Science • Data Architecture • Data Platforms
  2. Icon Play together. We create mobile games that millions of

    friends and families play together every day. Bad Nauheim, Germany 50 Employees 900M+ Downloads About Lotum
  3. Icon Lotum franchises 400M+ Downloads 100M+ Downloads 100M+ Downloads 100M+

    Downloads 4 Pics 1 Word Quiz Planet Word Blitz The Test
  4. Icon Analytics Engineering at Lotum many data transformations performed by

    complex SQL statements SQL extract raw data transform & load staging area Data Warehouse Analytics tracking and other data sources ~300M events per day
  5. Icon Analytics Engineering at Lotum How can we make sure

    the pipeline produces the expected output after a code change?
  6. Icon The Ripple Effect of a Small Change Imagine making

    a minor tweak to a SQL statement… 1 …it seems harmless, yet it quietly introduces an error. 2 correcting this requires orchestrating a costly and time-consuming data backfill. 5 How can this be prevented? …weeks pass before anyone notices… 3 …by then, the error has already compromised the data insights… 4
  7. Icon Testing SQL Statements Testing makes sure data is trustworthy!

    Why does it matter? ensure accuracy & reliability meet defined requirements and specifications detect errors early in the development cycle catch stealthy errors
  8. Icon Traditional SQL Testing Approach snapshot of live result table

    compare & verify temporary result table table 1 table 2 table n snapshots of production tables Query Engine execute SQL statement
  9. Icon Disadvantages • Stateful Relies on specific data states •

    Data duplication Requires a copy of production data • Limited scope Works with existing data only • Difficult to isolate Requires a dedicated test environment • Non-atomic Pinpointing issues can be challenging • Can be slow and resource-intensive
  10. Icon Test Case Unit Tests for SQL statements compare execute

    SQL statement Input Data Expected Output Data Actual Output Data Result of the actual behavior of the query Rows from each source table Describes the intended behavior of the query
  11. Icon Test Cases for SQL statements How should a test

    case look like? A test case should have as few input rows with as many empty fields as possible Atomic Only required fields Compact
  12. Icon Let’s write some tests GitHub Copilot “writing tests for

    SQL code is not a common practice” but it should be!
  13. Icon SQL Unit Testing Our requirements No test data persisted

    in the database Production-like Testing Test definition independent from SQL statement Test cases as code
  14. Icon Defining Test Cases as Python Dicts Each table row

    is represented as a dict name department salary John 1 100 Jack 2 50 Jill 3 200 { "name": "John", "department": 1, "salary": 100 } { "name": "Jack", "department": 2, "salary": 50 } { "name": "Jill", "department": 3, "salary": 200 } Employees
  15. Icon Defining Test Cases as Python Dicts Simple Test case

    #1 SQL statement SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department Input data: [ { "department": 1, "salary": 100 } ] Expected output data: [ { "department": 1, "avg_salary": 100 } ]
  16. Icon Defining Test Cases as Python Dicts Simple Test case

    #2 SQL statement SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department Input data: [ { "department": 1, "salary": 100 }, { "department": 1, "salary": 50 } ] Expected output data: [ { "department": 1, "avg_salary": 75 } ]
  17. Icon [ { "id": 12345, "first_name": "John", "last_name": "Doe", "hire_date":

    date(2000, 1, 1), "department": 1, "salary": 100, "position": "Engineer", "email": "[email protected]", "manager_id": 98765, "full_time": True, "address": "123 Main Street", "city": "Springsville" } ] Defining compact test cases
  18. Icon default_employee = { "id": 12345, "first_name": "John", "last_name": "Doe",

    "hire_date": date(2000, 1, 1), "department": 1, "salary": 100, "position": "Engineer", "email": "[email protected]", "manager_id": 98765, "full_time": True, "address": "123 Main Street", "city": "Springsville" } Defining compact test cases
  19. Icon Defining compact test cases Test Case: Duplicate email address

    [ default_employee | { "id": 1, "email": "[email protected]" }, default_employee | { "id": 2, "email": "[email protected]" } ] Test Case: Average salary calculation [ default_employee | { "id": 1, "salary": 100 }, default_employee | { "id": 2, "salary": 50 } ] Test Case: Hire date in the future [ default_employee | { "hire_date": date(2025, 1, 1) } ]
  20. Icon Verifying Test Cases How can we execute a SQL

    statement without having the test data stored in our database tables?
  21. Icon Combine SQL logic and input data populate virtual tables

    with static data 2 substitute original table references by virtual tables 3 insert virtual tables into the statement for each reference 1 Insert test data into the SQL statement itself
  22. Icon Transforming Python dicts into SQL Input data for table

    “employees”: [ { department: 1, salary: 100 } ] SQL CTE: WITH mockdata_employees AS ( SELECT 1 AS department, 100 AS salary )
  23. Icon Transforming Python dicts into SQL Input data for table

    “employees”: [ { department: 1, salary: 100 }, { department: 1, salary: 50 } ] SQL CTE: WITH mockdata_employees AS ( SELECT 1 AS department, 100 AS salary UNION ALL SELECT 1 AS department, 50 AS salary )
  24. Icon Injecting test data SELECT department, AVG(salary) AS avg_salary FROM

    employees GROUP BY department WITH mockdata_employees AS ( SELECT 1 AS department, 100 AS salary ) CTE with static test case input data Original SQL statement
  25. Icon Injecting test data WITH mockdata_employees AS ( SELECT 1

    AS department, 100 AS salary ) SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department Original SQL statement including CTE with static test data
  26. Icon Injecting test data WITH mockdata_employees AS ( SELECT 1

    AS department, 100 AS salary ) SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department Original SQL statement including CTE with static test data
  27. Icon Injecting test data WITH mockdata_employees AS ( SELECT 1

    AS department, 100 AS salary ) SELECT department, AVG(salary) AS avg_salary FROM mockdata_employees GROUP BY department SQL statement with test data injected via CTE No references to live database tables Can be executed by the query engine without needing real database tables
  28. Icon Putting it all together input data as Python dicts

    expected output as Python dicts compare SQL statement with test data injected CTEs with mock data Query Engine original SQL statement + actual query result for test case
  29. Icon Integration with pytest def test_single_case(): employees = MockedTable( name="employees",

    data=employees_input_rows ) expected = MockedTable( name="report", data=expected_output_rows ) actual = MockedTable( name="report", references=[employees] ) assert expected == actual [ { department: 1, salary: 100 }, { department: 1, salary: 50 } ] [ { department: 1, avg_salary: 75 } ]
  30. Icon Traditional Approach vs. Unit Testing Stateful Data duplication Limited

    scope Difficult to isolate Non-atomic Can be slow Executed on the fly No data needed in the database Can handle test cases with unseen data Run tests at any time, in any environment Exact test failure diagnosis Light weight test cases, executed quickly Traditional approach Unit testing
  31. Icon Conclusion Improves Code Quality • Ensures expected behavior •

    Can verify correctness for unencountered data scenarios Minimizes Errors • Find issues during development • Safeguards against regressions Boosts Developer Confidence • No need for manual verifications • Makes extensive data comparisons unnecessary • Deploying with peace of mind!
  32. Icon Thank you for your attention! Contact me on LinkedIn:

    linkedin.com/in/tlampert we’re hiring! lotum.com/jobs