Docker and Python: making them play nicely and securely for Data Science and Machine Learning

"Docker containers are a popular way to create reproducible development environments without having to install complex dependencies on your local machine. Developers all over the world use them for production and R&D environments.

However, using Docker for Machine Learning is not always straightforward. Plus most of the tutorials and content out there focus on how to use Docker to containerize apps rather than focusing on Data Science solutions.

In this talk, Tania shares some tips and tricks on how to effectively use Docker for Machine Learning and Data Science, helping to make your work more robust and reproducible."

Tania Allard

May 04, 2020

Transcript

  1. DOCKER AND PYTHON
    Making them play nicely and securely for Data Science and Machine Learning
    TANIA ALLARD, PHD
    Sr. Developer Advocate @Microsoft
    ixek | https://bit.ly/pycon2020-ml-docker

  2. @ixek
    @trallard
    trallard.dev

  3. THESE SLIDES
    https://bit.ly/pycon2020-ml-docker

  4. WHAT YOU’LL LEARN TODAY
    -Why use Docker?
    -Docker for Data Science and Machine Learning
    -Security and performance
    -Do not reinvent the wheel, automate
    -Tips and tricks for using Docker

  5. DEV LIFE WITHOUT DOCKER OR CONTAINERS
    Your application
    How are your users or colleagues meant to know what dependencies they need?
    ImportError: No module named x, y, z

  6. WHAT IS DOCKER?
    A tool that helps you to create, deploy and run your applications or projects
    by using containers.
    This is a container

  7. HOW DO CONTAINERS HELP ME?
    They provide a solution to the
    problem of how to get software to
    run reliably when moved from one
    computing environment to another
    Your laptop
    Test environment
    Staging environment
    Production environment

  8. DEV LIFE WITH CONTAINERS
    Your application
    Libraries, dependencies,
    runtime environment,
    configuration files

  9. THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
    Each app is containerised
    At the app level: each runs as an isolated process
    [Diagram, bottom to top: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APP containers]

  10. THAT SOUNDS A LOT LIKE A VIRTUAL MACHINE
    CONTAINERS
    [Diagram, bottom to top: INFRASTRUCTURE, HOST OPERATING SYSTEM, DOCKER, five APP containers]
    VIRTUAL MACHINES
    At the hardware level
    Full OS + app + binaries + libraries
    [Diagram, bottom to top: INFRASTRUCTURE, HYPERVISOR, several VIRTUAL MACHINEs, each with a GUEST OS and an APP]

  11. IMAGE VS CONTAINER
    -Image: archive with all the data needed to run the app
    -When you run an image it creates a container
    [Diagram: Docker image, tagged Latest and 1.0.2; $ docker run creates a container]

  12. COMMON PAIN POINTS IN DS AND ML
    -Complex setups / dependencies
    -Reliance on data / databases
    -Fast-evolving projects (iterative R&D process)
    -Docker is complex and it can take a lot of time to upskill
    -Are containers secure enough for my data / model / algorithm?

  13. DOCKER FOR DATA
    SCIENCE AND
    MACHINE LEARNING

  14. HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
    https://twitter.com/dstufft/status/1095164069802397696

  15. HOW IS IT DIFFERENT FROM WEB APPS FOR EXAMPLE?
    -Not every deliverable is an app
    -Not every deliverable is a model either
    -Heavily relies on data
    -Mixture of wheels and compiled packages
    -Security access levels - for data and software
    -Mixture of stakeholders: data scientists, software engineers, ML engineers

  16. BUILDING DOCKER IMAGES
    Dockerfiles are used to create Docker images by providing a set of
    instructions to install software, configure your image or copy files

  17. DISSECTING DOCKER IMAGES
    Base image
    Main instructions
    Entry command
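
    The three parts named on this slide can be sketched in a minimal Dockerfile. This is an illustration rather than the Dockerfile shown in the talk: the base image tag, the `requirements.txt` file and the `app.py` script are all assumptions.

    ```dockerfile
    # Base image: an official Python image (the tag is an assumption)
    FROM python:3.11-slim-bookworm

    # Main instructions: install dependencies and copy the project files
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY app.py .

    # Entry command: what runs when a container starts (app.py is hypothetical)
    CMD ["python", "app.py"]
    ```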

  18. DISSECTING DOCKER IMAGES
    Each instruction creates a layer (like an onion)
    [Diagram, stacked layers from top to bottom: INSTALL PANDAS, INSTALL REQUESTS, INSTALL FLASK, BASE IMAGE]
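
    A sketch of how the layers on this slide could map onto a Dockerfile. The base image tag and the version pins are illustrative assumptions, not values from the talk. Each RUN instruction adds one layer on top of the base image.

    ```dockerfile
    # Base image: the bottom of the stack
    FROM python:3.11-slim-bookworm

    # Each RUN below creates one new layer on top of the previous one.
    # The version pins are illustrative assumptions.
    RUN pip install --no-cache-dir flask==3.0.3
    RUN pip install --no-cache-dir requests==2.32.3
    RUN pip install --no-cache-dir pandas==2.2.2
    ```

    Running `docker history` on the built image lists its layers, one line per instruction.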

  19. CHOOSING THE BEST BASE IMAGE
    If building from scratch use the official Python images
    https://hub.docker.com/_/python
    https://github.com/docker-library/docs/tree/master/python

  20. THE JUPYTER DOCKER STACK
    Need Conda, notebooks and the scientific Python ecosystem?
    Try Jupyter Docker stacks
    https://jupyter-docker-stacks.readthedocs.io/
    [Diagram, image hierarchy from the base upwards: ubuntu@SHA, base-notebook, minimal-notebook, then scipy-notebook and r-notebook, then tensorflow-notebook, datascience-notebook and pyspark-notebook, then all-spark-notebook]
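
    One way to build on a stack image is to use it as the base and add only the project's extra dependencies. The sketch below assumes the `jupyter/scipy-notebook` image and a hypothetical `requirements.txt`. The `STACK_TAG` build argument is an illustrative name, and its default should be replaced with a pinned tag.

    ```dockerfile
    # Start from a Jupyter Docker Stacks image; the tag is passed in as a build
    # argument so that it can be pinned (the default here is an assumption)
    ARG STACK_TAG=latest
    FROM jupyter/scipy-notebook:${STACK_TAG}

    # requirements.txt is a hypothetical file with extra, pinned dependencies
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install --no-cache-dir -r /tmp/requirements.txt
    ```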

  21. BEST PRACTICES
    -Always know what you are expecting
    -Provide context with LABELS
    -Split complex RUN statements and sort them
    -Prefer COPY over ADD to add files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
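
    A minimal sketch of these practices in a single Dockerfile. The label values, the system packages and the `requirements.txt` file are placeholders chosen for illustration.

    ```dockerfile
    # Know what you are expecting: name an explicit base image tag
    FROM python:3.11-slim-bookworm

    # Provide context with LABEL (the values are placeholders)
    LABEL maintainer="Your Name <you@example.com>" \
          description="Example data science image" \
          version="0.1.0"

    # One package per line, sorted alphabetically, so diffs stay easy to review
    RUN apt-get update \
        && apt-get install --no-install-recommends -y \
            curl \
            git \
            unzip \
        && rm -rf /var/lib/apt/lists/*

    # COPY only copies local files, while ADD can also fetch URLs
    # and unpack archives, which makes its behaviour less predictable
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install --no-cache-dir -r /tmp/requirements.txt
    ```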

  22. SPEED UP YOUR BUILD
    -Leverage build cache
    -Install only necessary packages
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

  23. SPEED UP YOUR BUILD AND PROOF
    -Leverage build cache
    -Install only necessary packages
    -Explicitly ignore files
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
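
    Instruction order decides how much of the build cache can be reused. The sketch below assumes a project with a `requirements.txt` at its root. The paths and the ignore patterns in the comments are illustrative.

    ```dockerfile
    FROM python:3.11-slim-bookworm
    WORKDIR /project

    # Copy only the dependency list first: this layer and the install below
    # are reused from the cache until requirements.txt itself changes
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the frequently changing project files last, so that editing code
    # does not invalidate the cached dependency layers above
    COPY . .

    # Files matched by a .dockerignore file next to this Dockerfile are left
    # out of the build context, for example (one pattern per line):
    #   .git
    #   data
    #   **/.ipynb_checkpoints
    ```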

  24. MOUNT VOLUMES TO ACCESS DATA
    -You can use bind mounts to directories (unless you are using a database)
    -Avoid issues by creating a non-root user
    https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
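
    A sketch of a non-root user whose IDs can be matched to the host user, so that files written to a bind mount are not owned by root on the host. The user name, the default IDs, the image name and the paths are assumptions made for this example.

    ```dockerfile
    FROM python:3.11-slim-bookworm

    # Build arguments so the IDs can be set to match the host user
    ARG USER_NAME=analyst
    ARG USER_UID=1000
    ARG USER_GID=1000
    RUN groupadd --gid "${USER_GID}" "${USER_NAME}" \
        && useradd --uid "${USER_UID}" --gid "${USER_GID}" --create-home "${USER_NAME}"

    # Everything from here on, including the running container, uses this user
    USER ${USER_NAME}
    WORKDIR /home/${USER_NAME}

    # Example build and run commands (typed in a shell, not part of the build):
    #   docker build --build-arg USER_UID="$(id -u)" --build-arg USER_GID="$(id -g)" -t myproject:0.1.0 .
    #   docker run --rm -it -v "$(pwd)/data:/home/analyst/data" myproject:0.1.0
    ```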

  25. SECURITY AND
    PERFORMANCE

  26. MINIMISE PRIVILEGE - FAVOUR LESS PRIVILEGED USER
    Lock down your container:
    -Run as non-root user (Docker containers run as root by default)
    -Minimise capabilities
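
    A short sketch of both points, assuming a numeric UID and GID of 1000 and a hypothetical image name. The user is set in the Dockerfile, while capabilities are restricted when the container is started, so the run command appears only as a comment.

    ```dockerfile
    FROM python:3.11-slim-bookworm

    # Switch away from root using a numeric UID:GID (the IDs are assumptions);
    # no named account is required for this to work
    USER 1000:1000

    # Report the effective user ID when the container starts
    CMD ["python", "-c", "import os; print('running as uid', os.getuid())"]

    # Capabilities are dropped at run time, for example:
    #   docker run --rm --cap-drop=ALL --security-opt=no-new-privileges myimage:0.1.0
    ```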

  27. DON’T LEAK SENSITIVE INFORMATION
    Remember Docker images are like onions. If you copy keys in an intermediate layer they are cached.
    Keep them out of your Dockerfile.
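
    To illustrate the problem: a file copied in one instruction stays in that layer even if a later instruction deletes it. The sketch below shows the anti-pattern as a comment, followed by one possible alternative, a BuildKit secret mount. The talk does not mention secret mounts, so treat that part as an assumption; the `pipconf` id and the file paths are hypothetical.

    ```dockerfile
    # syntax=docker/dockerfile:1
    FROM python:3.11-slim-bookworm
    WORKDIR /project
    COPY requirements.txt .

    # Anti-pattern, shown only as a comment: the file stays in the COPY layer
    # even though a later instruction deletes it
    #   COPY pip.conf /etc/pip.conf
    #   RUN pip install -r requirements.txt && rm /etc/pip.conf

    # Alternative: mount the file for the duration of this single RUN only,
    # so it is never written to an image layer
    RUN --mount=type=secret,id=pipconf,target=/etc/pip.conf \
        pip install --no-cache-dir -r requirements.txt

    # Example build command (typed in a shell):
    #   DOCKER_BUILDKIT=1 docker build --secret id=pipconf,src="$HOME/.config/pip/pip.conf" -t myproject:0.1.0 .
    ```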

  28. USE MULTI STAGE BUILDS
    -Fetch and manage secrets in an intermediate build stage
    -Not all your dependencies will have been packed as wheels, so you might need a compiler - build a compile and a runtime image
    -Smaller images overall
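
    A minimal sketch of a two-stage build with a compile image and a runtime image, where the virtual environment is copied from one to the other. The stage names follow the slides; the base image tag, the system packages and `requirements.txt` are assumptions.

    ```dockerfile
    # Stage 1: compile image, with the build tools needed for packages
    # that are not available as wheels
    FROM python:3.11-slim-bookworm AS compile-image
    RUN apt-get update \
        && apt-get install --no-install-recommends -y build-essential gcc \
        && rm -rf /var/lib/apt/lists/*

    # Install everything into a virtual environment that can be copied as a unit
    RUN python -m venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Stage 2: runtime image, without compilers or build-time files
    FROM python:3.11-slim-bookworm AS runtime-image
    COPY --from=compile-image /opt/venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    CMD ["python"]
    ```

    Only the last stage ends up in the tagged image, so the compiler and anything else used in the first stage are left behind.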

  29. USE MULTI STAGE BUILDS
    [Diagram: Compile-image, copy virtual environment, Runtime-image]
    $ docker build --pull --rm -f "Dockerfile" \
    -t trallard:data-scratch-1.0 "."

  30. USE MULTI STAGE BUILDS
    [Diagram: Runtime-image becomes the FINAL IMAGE, trallard:data-scratch-1.0]

  31. PROJECT TEMPLATES
    Need a standard project template?
    Use Cookiecutter Data Science
    https://drivendata.github.io/cookiecutter-data-science/
    Or Cookiecutter Docker Science
    https://github.com/docker-science/cookiecutter-docker-science

  32. DO NOT REINVENT THE WHEEL
    Make use of existing tools like repo2docker.
    Already configured and optimised for Data Science / Scientific computing.
    https://repo2docker.readthedocs.io/en/latest
    $ conda install -c conda-forge jupyter-repo2docker
    $ jupyter-repo2docker "."

  33. DO NOT REINVENT THE WHEEL
    Make use of existing tools like repo2docker.
    Already configured and optimised for Data Science / Scientific computing.
    https://repo2docker.readthedocs.io/en/latest

  34. DELEGATE TO YOUR CONTINUOUS INTEGRATION TOOL
    Set up continuous integration (Travis, GitHub Actions, whatever you prefer)
    and delegate your build to it. Also, build often.
    https://repo2docker.readthedocs.io/en/latest

  35. THIS WORKFLOW
    -Code in version control
    -Trigger on tag / Also scheduled trigger
    -Build image
    -Push image

  36. TOP TIPS
    1. Rebuild your images frequently - get security updates for system packages
    2. Never work as root / minimise the privileges
    3. You do not want to use Alpine Linux (go for buster, stretch or the Jupyter stack)
    4. Always know what you are expecting: pin / version EVERYTHING (use pip-tools, conda, poetry or pipenv)
    5. Leverage build cache

  37. TOP TIPS
    6. Use one Dockerfile per project
    7. Use multi-stage builds - need to compile code? Need to reduce your image size?
    8. Make your images identifiable (test, production, R&D) - also be careful when accessing databases and using ENV variables / build variables
    9. Do not reinvent the wheel! Use repo2docker
    10. Automate - no need to build and push manually
    11. Use a linter

  38. THANK YOU
    @ixek
    @trallard
    trallard.dev
