Jiaqi Liu - Building a Data Pipeline with Testing in Mind

It’s one thing to build a robust data pipeline in Python, but a whole other challenge to find the tooling and build out the framework that allows for testing a data process. In order to truly iterate and develop a codebase, one has to be able to confidently test during the development process and monitor the production system.

In this talk, I hope to address the key components for building out end-to-end testing for data pipelines by borrowing concepts from how we test Python web services. Just as we want to check for healthy status codes from our API responses, we want to be able to check that a pipeline is working as expected given the correct inputs. We’ll talk about key features that allow a data pipeline to be easily testable, and how to identify time-series metrics that can be used to monitor the health of a data pipeline.

https://us.pycon.org/2018/schedule/presentation/161/

PyCon 2018

May 11, 2018

Transcript

  1. Building a Data Pipeline with Testing in Mind
     Jiaqi Liu, Software Engineer at Button; Director, @WomenWhoCodeNYC; @jiaqicodes
  2. Agenda
     • Data Pipelines
     • Challenges with Testing Data Pipelines
     • Designing Features:
       • Well Defined Schemas
       • Dry Run Mode
       • Storing Metadata
     • Testing, Monitoring & Alerting
  3. ETL Pipeline
     • Extract data from a source: this could be scraping a site, reading a large file, or consuming a real-time stream of data feeds.
     • Transform the data: this could be joining it with additional information for an enhanced data set, running it through a machine learning model, or aggregating it in some way.
     • Load the data into a data warehouse or a user-facing dashboard, wherever the end storage and display for the data might be.
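A minimal sketch of that extract-transform-load flow in Python (the CSV source, the field names, and the plain list standing in for a warehouse are illustrative assumptions, not from the talk):

```python
import csv
from typing import Iterable, Iterator

def extract(path: str) -> Iterator[dict]:
    # Extract: read raw records from a source (here, a local CSV file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records: Iterable[dict]) -> Iterator[dict]:
    # Transform: enrich or normalize each record; here, just coerce a field.
    for record in records:
        record["amount"] = float(record["amount"])
        yield record

def load(records: Iterable[dict], sink: list) -> None:
    # Load: write to the end store (a list stands in for a warehouse).
    sink.extend(records)

warehouse: list = []
load(transform(extract("orders.csv")), warehouse)
```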
  4. Batch vs. Stream
     • Batch: a periodic process that reads data in bulk (typically from a filesystem or a database)
     • Stream: a high-throughput, low-latency system that reads data from a stream or a queue
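One way to picture the difference in code, as a toy sketch (the file path, the in-process queue, and the handle() step are invented for illustration):

```python
import queue

def handle(record: str) -> None:
    # Placeholder for the real per-record processing step.
    print(record)

def run_batch(path: str) -> None:
    # Batch: a scheduled job reads the whole input in bulk, then exits.
    with open(path) as f:
        for line in f:
            handle(line.strip())

def run_stream(q: "queue.Queue[str | None]") -> None:
    # Stream: a long-running consumer processes records as they arrive.
    while True:
        record = q.get()        # blocks until a message is available
        if record is None:      # sentinel value shuts the consumer down
            break
        handle(record)
```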
  5. Problems
     • Batch job is never scheduled
     • Batch job takes too long to run
     • Data is malformed or corrupt
     • Data is lost
     • Stream is backed up; stream data is lost
     • Non-deterministic models
  6. Data Pipeline Concerns
     • Data Integrity: data is exposed, lost, or malformed; a statistical model is producing highly inaccurate results
     • Delayed Processing: speed in data processing could be core to the business
  7. It’s not enough to know that the pipeline is healthy; you also have to know that the data being processed is accurate.
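A sketch of what a data-level check might add on top of a simple liveness check (the column names and the 1% threshold are made-up examples):

```python
def check_output_quality(rows: list[dict]) -> list[str]:
    # A healthy pipeline tells you the job ran; these checks ask whether
    # the data it produced is actually plausible.
    problems = []
    if not rows:
        problems.append("no rows produced")
        return problems
    missing_ids = sum(1 for r in rows if not r.get("user_id"))
    if missing_ids / len(rows) > 0.01:          # more than 1% missing IDs
        problems.append(f"{missing_ids} rows missing user_id")
    if any(r.get("amount", 0) < 0 for r in rows):
        problems.append("negative amounts found")
    return problems
```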
  8. Interpretability: not just understanding what a model predicted, but also why. This allows for debugging and auditing machine learning models.
  9. Because Button is a marketplace, we see the side effects of user behavior in our data and have to decipher what assumptions are safe to make.
  10. Features to Include
      • Well Defined Schemas
      • Capturing Metadata about the Pipeline
      • Having a Test Run Feature
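One way those three features might fit together in a pipeline entry point; a minimal sketch, assuming a dict-based schema, invented field names, and illustrative metadata keys:

```python
import datetime

# Assumed schema: column name -> required Python type (illustrative only).
SCHEMA = {"user_id": str, "amount": float}

def validate(row: dict) -> None:
    # Well-defined schema: fail fast on unexpected shapes or types.
    for col, typ in SCHEMA.items():
        if col not in row or not isinstance(row[col], typ):
            raise ValueError(f"bad value for {col!r}: {row.get(col)!r}")

def run_pipeline(rows: list[dict], sink: list, dry_run: bool = False) -> dict:
    for row in rows:
        validate(row)
        if not dry_run:        # dry-run (test run) mode exercises every step
            sink.append(row)   # except the final write
    # Metadata about the run, kept for monitoring and debugging later.
    return {
        "rows_in": len(rows),
        "rows_out": 0 if dry_run else len(rows),
        "dry_run": dry_run,
        "finished_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```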
  11. Functional Tests
      • Can also be known as Integration Tests
      • In the case of data, these are the golden tests
      • Set the gold standard for data in and data out
      • Don’t need to be logic-specific the way Unit Tests are
      • Build a Golden Test framework and define fixtures (expected input and expected output)
      [Diagram: a failed gold test]
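A minimal golden test in pytest style, reusing the hypothetical run_pipeline() from the earlier sketch (the module name and fixture file names are invented for illustration):

```python
import json
import pathlib

from mypipeline import run_pipeline  # hypothetical module from the sketch above

FIXTURES = pathlib.Path(__file__).parent / "fixtures"

def test_pipeline_matches_golden_output():
    # Golden test: known-good input in, output compared against a
    # checked-in "gold standard" fixture.
    input_rows = json.loads((FIXTURES / "input.json").read_text())
    expected = json.loads((FIXTURES / "expected_output.json").read_text())

    sink: list = []
    run_pipeline(input_rows, sink)

    assert sink == expected, "pipeline output drifted from the golden fixture"
```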