Ivory - A Data Store for Data Science
Ambiata
October 20, 2014
Technology
Transcript
IVORY: A DATA STORE FOR DATA SCIENCE
http://github.com/ambiata/ivory
© Ambiata 2014

DATA SCIENCE IN THE REAL WORLD

PROBLEM #1

“DATA WRANGLING”

WHAT WE START WITH

WHAT WE NEED

Feature vectors (rows are instances, columns are features)

instance   gender  balance  purchases  zipcode  prop_online  num_accs
89340218   M       0.00     3          3001     1.00         1
48149407   -       634.83   16         4670     0.6875       3
18452274   F       15.12    2          -        0.50         1
07499337   M       33.56    2          -        1.00         1
62948721   F       98.34    12         3303     0.8333       4
93754723   -       523.81   23         2046     0.4782       1
00272446   F       1086.05  17         -        1.00         2
13374497   F       224.81   9          -        0.2222       1
31989993   M       78.21    2          2134     0.50         1
46474236   -       126.48   4          -        0.0          1

A typical workflow
[Diagram: Data set B, Data set C, Data set D → Feature Eng → Model train → Score]
Multiple data sources:
• Transaction logs
• Database snapshots
• Segmentation models
• 3rd-party data
Feature engineering:
• Data sources are joined
• Instances are created
• Features are engineered
The cool stuff:
• Models are built
• Instances are scored

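The feature engineering steps on this slide (join the sources, create instances, engineer features) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in: the sample records, the field names and the `build_feature_vectors` helper are assumptions made for illustration, not part of Ivory. Only the feature names `gender`, `purchases` and `prop_online` come from the feature vector slide.

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, amount, channel).
transactions = [
    ("c1", 20.0, "online"),
    ("c1", 5.0, "store"),
    ("c2", 12.5, "online"),
]

# Hypothetical database snapshot keyed by customer_id.
accounts = {"c1": {"gender": "F"}, "c2": {"gender": "M"}, "c3": {"gender": "F"}}


def build_feature_vectors(transactions, accounts):
    """Join both sources and derive one feature vector per customer."""
    purchases = defaultdict(int)
    online = defaultdict(int)
    for customer_id, _amount, channel in transactions:
        purchases[customer_id] += 1
        if channel == "online":
            online[customer_id] += 1

    vectors = {}
    for customer_id, account in accounts.items():
        n = purchases[customer_id]
        vectors[customer_id] = {
            "gender": account["gender"],
            "purchases": n,
            # Proportion of purchases made online; None when there are none.
            "prop_online": online[customer_id] / n if n else None,
        }
    return vectors


vectors = build_feature_vectors(transactions, accounts)
```

Each model build that repeats this kind of join on its own is the duplicated effort that the later slides describe as feature engineering in a silo.
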
Feature preparation: 85%
Modelling: 15%

Multiple data science projects
[Diagram: Data sets A to E; Feature Eng 1, 2 and 3; Train 1 / Score 1, Train 2 / Score 2, Train 3 / Score 3]
Feature engineering is in a silo: no reuse between model builds

PROBLEM #2

“LAB TO FACTORY” AKA DEV OPS

• Continually receiving data
• Want to leverage a history of all this data
• Continually training + scoring
• Data may need to be corrected
• Need to extend data model on-the-fly

LAMBDA ARCHITECTURE

query = function(all data)

[Diagram labels: New data stream, Magical query engine, Query]

[Diagram labels: New data stream, BATCH LAYER, SERVING LAYER, SPEED LAYER, All data, Precomputed views, Stream processing, Incremental views, Real-time store, Query]

[Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Query]

[Diagram labels: New data stream, All data, Feature view, Stream processing, Incremental views, Real-time store, Query, Model train and score]

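The idea behind "query = function(all data)" and the split into precomputed and incremental views can be sketched as follows. This is a toy illustration under stated assumptions: the event shape, the per-customer count and the function names `batch_view`, `incremental_view` and `query` are all hypothetical, chosen only to show how merging a batch view with an incremental view gives the same answer as recomputing over all data.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer: recompute the view from the full history."""
    return Counter(event["customer"] for event in all_data)


def incremental_view(new_events):
    """Speed layer: cover only events that arrived after the last batch run."""
    return Counter(event["customer"] for event in new_events)


def query(customer, precomputed, incremental):
    """Serving side: merge both views to answer as if over all data."""
    return precomputed[customer] + incremental[customer]


# Hypothetical events: history already seen by the batch layer, plus new arrivals.
history = [{"customer": "c1"}, {"customer": "c2"}, {"customer": "c1"}]
recent = [{"customer": "c1"}]

precomputed = batch_view(history)
incremental = incremental_view(recent)
```

A count merges by simple addition; other features may need a different merge step, which is one reason the batch recomputation is kept as the source of truth.
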
IVORY

[Diagram: Data sets A to E; Ivory; Train 1 / Score 1, Train 2 / Score 2, Train 3 / Score 3]
A shared feature view asset

[Diagram labels: New data stream, All data, Ivory, Stream processing, Incremental views, Real-time store, Query, Model train and score]
An immutable, batch-oriented data store

[Diagram labels: HDFS / S3, Ivory, Feature vectors]
An extensible data model, backed by HDFS/S3

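One way to picture an immutable store with an extensible data model is an append-only collection of facts from which feature vectors are derived on demand. The sketch below is an assumption made for illustration and is not Ivory's API or storage format: the `Fact` record, its fields and the `snapshot` helper are hypothetical names.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    date: str  # ISO date, so string comparison matches date order


def snapshot(facts, as_of):
    """Latest value per (entity, attribute) on or before as_of."""
    latest = {}
    # sorted() is stable, so a fact appended later wins a tie on date.
    for fact in sorted(facts, key=lambda f: f.date):
        if fact.date <= as_of:
            latest[(fact.entity, fact.attribute)] = fact.value
    vectors = {}
    for (entity, attribute), value in latest.items():
        vectors.setdefault(entity, {})[attribute] = value
    return vectors


# Facts are never updated in place; new values are appended.
facts = (
    Fact("c1", "balance", 10.0, "2014-01-01"),
    Fact("c1", "balance", 25.0, "2014-02-01"),
)
# Extending the data model means appending facts for a new attribute.
facts = facts + (Fact("c1", "zipcode", "3001", "2014-02-15"),)
```

Because nothing is overwritten, `snapshot(facts, "2014-01-15")` still yields the feature vector as it stood on that date, while a later date picks up both the newer balance and the newly added attribute.
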
Apache V2 Licence
github.com/ambiata/ivory