
Designing Data Pipelines for Machine Learning Applications

Alexis Seigneurin

March 14, 2019


Transcript

  1. Me
     • Data Engineer
     • Kafka, Spark, AWS…
     • Blog: aseigneurin.github.io
     • Twitter: @aseigneurin
  2. Data Pipeline
     • It is the journey of your data
     • Ingest, transform, output… all in streaming
     • Kafka
     • Sometimes called a DAG
     • Apache Airflow
  3. Key elements
     • Streaming
     • Transformations = independent jobs
     • (= different technologies?)
     • Data can be consumed by multiple consumers
  4. Machine Learning Application
     • Use historical data to make predictions on new data
     • Train a model on historical data
     • Apply the model to new data
  5. A batch-oriented process
     • Training on a batch of data
     • Can take hours (days?)
     • Validate + deploy the new model
  6. Batch + Streaming
     • Training in batch to create a model
     • Streaming app to use the model
  7. Training
     • Done from time to time (e.g. every other week)
     • With a fixed data set
     • Export a model
     • A new model is deployed once it has been validated by a Data Scientist
  8. Streaming app
     • Use Kafka as a backbone
     • Kafka Streams to implement transformations
     • Need a way to use the model on the JVM
  9. Architecture
     [Diagram — Batch: labeled data → ML Training → ML Model. Streaming: unlabeled data (in Kafka) → Apply the model (in Kafka Streams) → data with predictions.]
  10. Python libraries
      • scikit-learn, TensorFlow, Keras…
      • ⚠ Need to build and expose a REST API
      • ⚠ Scaling can be complicated
  11. Cloud-hosted services
      • AWS SageMaker, Google Cloud Machine Learning Engine…
      • No code to write to serve the models
      • ⚠ Less control over how the model is served
  12. Preparing the model
      • Set column types: numeric, enum…
      • Split the dataset: 70/20/10 (training/validation/test)
      • Algorithm: Gradient Boosting Machine
  13. { "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591,

    "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA" } Stream processing { "date": "2014-05-02 00:00:00", "bedrooms": 2, "bathrooms": 2.0, "sqft_living": 2591, "sqft_lot": 5182, "floors": 0.0, "waterfront": "0", "view": "0", "condition": "0", "sqft_above": 2591, "sqft_basement": 0, "yr_built": 1911, "yr_renovated": 1995, "street": "Burke-Gilman Trail", "city": "Seattle", "statezip": "WA 98155", "country": "USA", "price": 781898.4215855601 } Kafka Streams application Kafka topic Kafka topic
  14. Kafka Streams
      • Client library for Java and Scala
      • DSL: stream(), map(), filter(), to()…
      • Aggregations, joins, windowing
      • KStream / KTable
      • Simple deployment model
      • Makes it possible to create "microservices"
  15. Kafka Streams app
      • Read from the source topic
      • Apply the model
      • Write to the output topic
      • Start consuming
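Below is a minimal sketch of such an application, assuming JSON records serialized as strings; the topic names ("houses", "houses-priced") and the applyModel() placeholder are illustrative, not from the talk:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class PredictionApp {

        // Placeholder: a real app would score with an embedded model
        // (e.g. an H2O MOJO) and add a "price" field to the record
        static String applyModel(String houseJson) {
            return houseJson;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "house-price-predictor");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> houses = builder.stream("houses");  // read from the source topic
            houses.mapValues(PredictionApp::applyModel)                 // apply the model
                  .to("houses-priced");                                 // write to the output topic

            new KafkaStreams(builder.build(), props).start();           // start consuming
        }
    }

Note that the application.id doubles as the consumer group, which matters for the scaling, A/B testing and high-availability patterns later in the deck.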
  16. Deploying the application
      • Kafka Streams = a plain Java app
      • Run the app:
        • On a VM (e.g. EC2)
        • In Kubernetes
        • …
  17. Serving the model
      • Your choice of ML framework constrains you!
      • Embedded model or REST API?
      • Are you ok running Python code in production?
      • (Spotify is not - JVM only)
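If the model is served behind a REST API rather than embedded, each record triggers an HTTP call from the streaming app. A sketch using the JDK 11+ HTTP client; the model-service host, port and path are assumptions:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestModelClient {

        private static final HttpClient HTTP = HttpClient.newHttpClient();

        // POSTs the record to a hypothetical scoring endpoint and
        // returns the enriched record from the response body
        static String predict(String houseJson) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://model-service:8080/predict")) // hypothetical endpoint
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(houseJson))
                        .build();
                return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                throw new RuntimeException("Model call failed", e);
            }
        }
    }

This predict() can be dropped into mapValues() in place of the embedded call; the trade-off is a network round trip per record, but the model can then be updated without redeploying the streaming app.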
  18. Feature Engineering
      • Need to apply the same transformations in batch & streaming
      • Scaling, date conversions, text extraction…
      • Challenging!
  19. Feature Engineering
      • Option 1: Use the same UDFs
      • Option 2: Featran? (github.com/spotify/featran)
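As a sketch of option 1, the transformations can live in a plain class with no framework dependencies, so the exact same code runs in the batch training job and in the streaming app; the min/max bounds below are made-up examples that would really be computed on the training set:

    // Pure functions with no framework dependency, callable from both
    // the batch training job and the streaming application.
    public final class Features {

        private Features() {}

        // Min-max scaling; bounds are assumed values computed on the training set
        public static double scaleSqftLiving(double sqftLiving) {
            final double min = 290.0;
            final double max = 13540.0;
            return (sqftLiving - min) / (max - min);
        }

        // Example of a date conversion applied identically in batch and streaming
        public static int yearFromDate(String date) {
            return Integer.parseInt(date.substring(0, 4)); // "2014-05-02 00:00:00" -> 2014
        }
    }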
  20. Know your data
      • Calculate statistics from your training data
      • E.g. min, max, average, median… of the square footage
      • E.g. nulls?
  21. Check your new data
      • In streaming:
        • Check: min, max, null…
      • In batch:
        • Check: average, median…
      ‣ Outliers → update the model?
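A sketch of the streaming-side checks: records outside the bounds observed at training time are routed to a dead-letter topic instead of being scored. The House type, topic name and bounds are illustrative, and a serde for House is assumed to be configured:

    import org.apache.kafka.streams.kstream.KStream;

    public class QualityChecks {

        // Bounds observed on the training data (example values, computed offline)
        static final double SQFT_MIN = 290.0;
        static final double SQFT_MAX = 13540.0;

        // Route out-of-bounds records to a dead-letter topic, keep the rest
        static KStream<String, House> checked(KStream<String, House> input) {
            input.filterNot(QualityChecks::inBounds).to("houses-outliers");
            return input.filter(QualityChecks::inBounds);
        }

        static boolean inBounds(String key, House h) {
            return h.sqftLiving != null
                    && h.sqftLiving >= SQFT_MIN
                    && h.sqftLiving <= SQFT_MAX;
        }

        // Minimal record type for the sketch
        static class House { Double sqftLiving; }
    }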
  22. Enrich your data
      • Add features from other datasets
      • Use reference datasets:
        • Zip code → details about the location
        • User ID → details about the user
        • IP address → location
        • …
  23. 2 apps to deploy
      [Diagram: a Kafka Streams enrichment app joins the input Kafka topic (KStream) with reference data arriving as a CDC stream (KTable), and writes to an output Kafka topic.]
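A sketch of the enrichment join, assuming the event stream is keyed by zip code (Kafka Streams requires the stream and the table to be co-partitioned on the same key); topic names and the merge() placeholder are illustrative:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class EnrichmentApp {

        public static void topology(StreamsBuilder builder) {
            // Reference data, fed by a CDC stream, materialized as a table (key = zip code)
            KTable<String, String> zipCodes = builder.table("zipcodes");

            // Events keyed by zip code; the join adds location details to each record
            KStream<String, String> houses = builder.stream("houses-by-zip");

            houses.join(zipCodes, (house, zip) -> merge(house, zip))
                  .to("houses-enriched");
        }

        // Placeholder: combine the event with the reference record
        static String merge(String houseJson, String zipJson) {
            return houseJson;
        }
    }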
  24. Pre-production testing
      [Diagram: the production application and a pre-prod application, each with its own ML model, consume the same input topic with different consumer groups; the output topic (production) and the output topic (pre-prod) are then compared.]
  25. A/B testing
      • 2 models running in parallel
      • Compare the outputs
      • Split:
        • Embedded model → partition-based split
        • REST API → more control
  26. A/B (pre-split)
      [Diagram: two Kafka Streams applications, one with ML model A and one with ML model B, consume the input topic in the same consumer group — partitions are split between them — and both write to the same output topic.]
  27. A/B (post-split)
      [Diagram: two Kafka Streams applications with different consumer groups each consume the full input topic; the app with ML model A writes to temp topic A, the app with model B to temp topic B; a merge step combines A and B into the output topic.]
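A sketch of a simple key-based split between two models, approximating the partition-based approach of slide 25: both branches write to the same output topic, as in the pre-split variant. The bucketing function and model placeholders are illustrative:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;

    public class AbTestApp {

        public static void topology(StreamsBuilder builder) {
            KStream<String, String> input = builder.stream("houses");

            // Deterministic split on the record key: ~50% of keys go to each model
            KStream<String, String> bucketA = input.filter((k, v) -> bucket(k) < 50);
            KStream<String, String> bucketB = input.filter((k, v) -> bucket(k) >= 50);

            // Each branch is scored by its own model; both write to the same output topic
            bucketA.mapValues(AbTestApp::scoreWithModelA).to("houses-priced");
            bucketB.mapValues(AbTestApp::scoreWithModelB).to("houses-priced");
        }

        static int bucket(String key) {
            return key == null ? 0 : Math.floorMod(key.hashCode(), 100);
        }

        // Placeholders for the two model versions under comparison
        static String scoreWithModelA(String json) { return json; }
        static String scoreWithModelB(String json) { return json; }
    }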
  28. Updates
      • Update:
        • Embedded model → deploy a new version of the app
        • REST API → update the API
      • Stop/start vs rolling updates
  29. High Availability
      • Deploy more than one instance of your app
      • Same consumer group / different physical locations
  30. High Availability
      [Diagram: three Kafka brokers, one in each of Availability Zones 1, 2 and 3; two Kafka Streams app instances, on VM1 / Availability Zone 1 and VM2 / Availability Zone 2.]
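A sketch of the configuration side of this setup: all instances run with the same application.id, which is also the consumer group, so Kafka rebalances partitions across instances and a surviving instance picks up the work if one dies. Broker addresses are illustrative:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class HaConfig {

        static Properties props() {
            Properties props = new Properties();
            // Same application.id on every instance = same consumer group,
            // so partitions are rebalanced across instances on failure
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "house-price-predictor");
            // Brokers spread across availability zones (hypothetical addresses)
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                      "broker-az1:9092,broker-az2:9092,broker-az3:9092");
            // Keep a warm copy of each state store on another instance for faster failover
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
            return props;
        }
    }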
  31. Summary
      • Training in batch / predictions in streaming
      • Model: embedded vs REST API
      • Kafka + Kafka Streams
      • Design for testing, H/A, updates…
      • Blog: Realtime Machine Learning predictions with Kafka and H2O.ai
        https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html