
Scaling your data infrastructure


Scaling your data infrastructure @ PyConNove

barrachri

April 20, 2018


Transcript

  1. Scaling your data infrastructure

    Christian Barra @ PyConNove
  2. THE AGENDA

    The data science workflow. Scaling is not just a matter of machine. When the size of your data matters.
  3. HOW YOU BUILD, ITERATE AND SHARE DEPENDS ON MANY THINGS

    Your users, your product, your team, your company, your tech stack, your domain.
  4. "We really care about versioning. We have Untitled_1.ipynb, Untitled_2.ipynb and Untitled_3.ipynb."

    Homer Simpson, Chief Data Scientist, Data Beer Inc
  5. "Since JSON is a plain text format, they can be version-controlled and shared with colleagues."

    Ex IPython Notebook documentation
  6. PARQUET

    Parquet + object storage =
    You can query it using SQL.
    pandas has native support.
    Forget about CSV.
  7. SCALING FROM DIFFERENT SIDES

    Approach: code optimization, a bigger machine, use multiple cores, more machines.
    Frameworks: Dask, Ray, Spark.
    pandas: read by chunks.
    scikit-learn: partial fit.
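
The last two items on slide 7, reading by chunks with pandas and `partial_fit` in scikit-learn, combine naturally when a dataset does not fit in memory. The sketch below is one possible way to wire them together, not the speaker's code: the synthetic data, the chunk size and the choice of `SGDClassifier` are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical in-memory CSV standing in for a file too large to load at once.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
frame = pd.DataFrame({"x": x, "label": (x > 0).astype(int)})
csv_buffer = io.StringIO(frame.to_csv(index=False))

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit must know every class on its first call

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_csv(csv_buffer, chunksize=100):
    model.partial_fit(chunk[["x"]], chunk["label"], classes=classes)

# Scoring on the full frame is only possible here because the toy data is small.
accuracy = model.score(frame[["x"]], frame["label"])
print(f"accuracy: {accuracy:.2f}")
```

Only estimators that implement `partial_fit` support this pattern, so the choice of model is constrained by the need to train incrementally.
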
  8. "I don’t want to use Spark/JVM, what do you have for me?"

    Happy Python user
  9. Use pandas through Ray to query Parquet files in an object storage.

    Work in progress
  10. PROBLEM #1

    If you trained a model with scikit-learn 0.18.1, will the same model work with 0.19.1?
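
The question on slide 10 is a real risk: scikit-learn's documentation warns that a pickled model is not guaranteed to load, or to behave the same, under a different library version. The talk's own answer is to ship Docker images. As a complementary guard, and not something the slides show, the library version can be stored next to the serialized model and checked before loading. The helpers `dump_with_version` and `load_checked` below are hypothetical names for this sketch.

```python
import pickle

import sklearn
from sklearn.linear_model import LogisticRegression


def dump_with_version(model) -> bytes:
    """Serialize the model plus the scikit-learn version that trained it."""
    payload = {
        "sklearn_version": sklearn.__version__,
        # The model is pickled separately so that the version can be
        # inspected without unpickling the estimator itself.
        "model_bytes": pickle.dumps(model),
    }
    return pickle.dumps(payload)


def load_checked(blob: bytes):
    """Return the model only if it was trained under the running version."""
    payload = pickle.loads(blob)
    trained_with = payload["sklearn_version"]
    if trained_with != sklearn.__version__:
        raise RuntimeError(
            f"model trained with scikit-learn {trained_with}, "
            f"but {sklearn.__version__} is installed"
        )
    return pickle.loads(payload["model_bytes"])


# Tiny made-up training set, just enough to have a fitted estimator.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)

restored = load_checked(dump_with_version(clf))
print(restored.predict(X))
```

A strict equality check is deliberately conservative. Only unpickle data from a trusted source, since loading a pickle can execute arbitrary code.
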
  11. TO RECAP

    1. It’s damn easy to move things around
    2. You get versioning for free
    3. Stack agnostic
    4. Move Docker images around
  12. TAKEAWAYS

    Unified data warehouse.
    Keep your code running on one machine.
    Use Docker.
    Try Ray.
    Bring CI/CD to your data science workflow.
    Object storage is cool.
    Distributed computing is hard.
    I didn’t have another point.