Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scaling your data infrastructure
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
barrachri
April 20, 2018
Technology
1
220
Scaling your data infrastructure
Scaling your data infrastructure @ PyConNove
barrachri
April 20, 2018
Tweet
Share
More Decks by barrachri
See All by barrachri
Will Tech Save Us?
barrachri
0
120
How software can feed the World
barrachri
1
180
How to fight with yourself and win.
barrachri
0
330
Introduction to Statistics with Python
barrachri
0
440
EuroPython 2015 and the future
barrachri
2
120
Start with Flask
barrachri
3
190
Django & Docker
barrachri
6
1.1k
Other Decks in Technology
See All in Technology
最強のAIエージェントを諦めたら品質が上がった話 / how quality improved after giving up on the strongest AI agent
kt2mikan
0
190
Agent ServerはWeb Serverではない。ADKで考えるAgentOps
akiratameto
0
110
楽しく学ぼう!ネットワーク入門
shotashiratori
4
3.4k
モブプログラミング再入門 ー 基本から見直す、AI時代のチーム開発の選択肢 ー / A Re-introduction of Mob Programming
takaking22
5
1.6k
プラットフォームエンジニアリングはAI時代の開発者をどう救うのか
jacopen
6
3.6k
AI時代の「本当の」ハイブリッドクラウド — エージェントが実現した、あの頃の夢
ebibibi
0
130
コンテキスト・ハーネスエンジニアリングの現在
hirosatogamo
PRO
1
120
スクリプトの先へ!AIエージェントと組み合わせる モバイルE2Eテスト
error96num
0
180
20260311 技術SWG活動報告(デジタルアイデンティティ人材育成推進WG Ph2 活動報告会)
oidfj
0
360
OCHaCafe S11 #2 コンテナ時代の次の一手:Wasm 最前線
oracle4engineer
PRO
2
140
決済サービスを支えるElastic Cloud - Elastic Cloudの導入と推進、決済サービスのObservability
suzukij
2
650
Tebiki Engineering Team Deck
tebiki
0
27k
Featured
See All Featured
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.4k
So, you think you're a good person
axbom
PRO
2
2k
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
320
Automating Front-end Workflow
addyosmani
1370
200k
Design in an AI World
tapps
0
170
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
231
22k
The Cost Of JavaScript in 2023
addyosmani
55
9.8k
The Pragmatic Product Professional
lauravandoore
37
7.2k
Facilitating Awesome Meetings
lara
57
6.8k
Rails Girls Zürich Keynote
gr2m
96
14k
Future Trends and Review - Lecture 12 - Web Technologies (1019888BNR)
signer
PRO
0
3.3k
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
64
53k
Transcript
Scaling your data infrastructure C H R I S T
I A N B A R R A @ P Y C O N N O V E
THE AGENDA 2 3 START THE DATA SCIENCE WORKFLOW SCALING
IS NOT JUST A MATTER OF MACHINE WHEN THE SIZE OF YOUR DATA MATTERS 1
THE AGENDA 4 5 CONTAINERIZED DATA SCIENCE CASSINY: PUT ALL
THE THINGS TOGETHER END
THE DATA SCIENCE WORKFLOW
HEXAGON PRESENTATION TEMPLATE
HOW YOU BUILD, ITERATE AND SHARE DEPENDS ON MANY THINGS
Your Users Your Product Your Team Your Company Your Tech Stack Your Domain
SCIKIT-LEARN DOCKER DATA SCIENCE TOOLBELT PANDAS JUPYTER RAY
SCALING IS NOT JUST A MATTER OF MACHINES
We all use it.
We really care about versioning. We have Untitled_1.ipynb, Untitled_2.ipynb and
Untitled_3.ipynb. HOMER SIMPSON C H I E F D A T A S C I E N T I S T D A T A B E E R I N C
Since JSON is a plain text format, they can be
version-controlled and shared with colleagues. E X I P Y T H O N N O T E B O O K D O C U M E N T A T I O N
THEY GOT IT RIGHT
BUT WE KEEP IMPROVING
90% OF JUPITER IS MADE BY HYDROGEN
THE HARD THING ABOUT STORAGE
PARQUET P A R Q U E T + O
B J E C T S T O R A G E = YO U C A N Q U E R Y I T U S I N G S Q L PA N DA S H A S N AT I V E S U P P O R T F O R G E T A B O U T C S V
WHEN THE SIZE OF YOUR DATA MATTERS
IT’S TOO SLOW DOESN’T FIT IN YOUR RAM
CODE OPTIMIZATION APPROACH SCALING FROM DIFFERENT SIDES A BIGGER MACHINE
USE MULTIPLE CORES MORE MACHINES FRAMEWORKS: DASK RAY SPARK PANDAS: READ BY CHUNKS SCIKIT-LEARN: PARTIAL FIT
chunks & partial_fit 1 M A C H I N
E
Multiple machines. n M A C H I N E
S
I don’t want to use Spark/JVM, what do you have
for me? H A P P Y P Y T H O N U S E R
WHAT IS RAY?
A high-performance distributed execution engine REDIS SCHEDULER WORKER ARROW &
PLASMA
Use pandas through ray to query parquet files in an
object storage. W O R K I N P R O G R E S S
CONTAINERIZED DATA SCIENCE
If you trained a model with scikit-learn 0.18.1, will the
same model work with 0.19.1? P R O B L E M # 1
How do you share your models? P R O B
L E M # 2
How do you put your models in production? P R
O B L E M # 3
Containerize everything. T H E A N S W E
R
1. It’s damn easy to move things around 2. You
get versioning for free 3. Stack agnostic 4. Move Docker images around T O R E C A P
CASSINY: PUT ALL THE THINGS TOGETHER
CLEAR REQUIREMENTS CONTAINERIZED EASY OBJECT STORAGE JUPYTER + IPYTHON PLATFORM
AGNOSTIC
OPEN SOURCE
DEMO
TAKEAWAYS UNIFIED DATA WAREHOUSE KEEP YOUR CODE RUNNING ON ONE
MACHINE USE DOCKER TRY RAY BRING CI/CD TO YOUR DATASCIENCE WORKFLOW OBJECT STORAGE IS COOL DISTRIBUTED COMPUTING IS HARD I DIDN’T HAVE ANOTHER POINT
None