Roksolana
September 14, 2023
Productionizing Big Data - stories from the trenches
Presented at ScalaDays 2023 (Madrid, Spain)
Transcript
Productionizing big data - stories from the trenches
Roksolana Diachuk
• Engineering manager at Captify
• Women Who Code Kyiv Data Engineering Lead
• Speaker
AdTech
AdTech methodologies deliver the right content at the right time to the right consumer
You have your pipelines in production. What’s next?
Types of issues
• Low performance
• Human errors
• Data source errors
Story #1. Unlucky query
Problem
Drop 13 months of user profiles
Reporting
Problem
[Diagram: 13 months, hour=22042001]
Loading mechanism
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
val minTime = currentDay.minus(config.feedPeriod)
listFiles.filter(file => file.eventDateTime isAfter minTime)
Solution
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P1M"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
…
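The loading mechanism above derives a cut-off from an ISO-8601 period and keeps every file newer than it, so the configured period alone decides how much data a run touches. A minimal sketch of that behaviour with plain java.time follows. FeedFile, filesToLoad and the sample dates are hypothetical and not taken from the talk.

```scala
import java.time.{LocalDate, Period}

object LoadWindow {
  // Hypothetical stand-in for a listed file; only its event date matters here.
  final case class FeedFile(name: String, eventDate: LocalDate)

  // Same shape as the slide's filter: keep files newer than (currentDay - period).
  def filesToLoad(files: Seq[FeedFile], currentDay: LocalDate, periodToLoad: String): Seq[FeedFile] = {
    val minTime = currentDay.minus(Period.parse(periodToLoad)) // e.g. "P5D", "P13M"
    files.filter(_.eventDate.isAfter(minTime))
  }

  // Arbitrary sample data, chosen only for illustration.
  val today: LocalDate = LocalDate.of(2022, 4, 20)
  val files: Seq[FeedFile] = Seq(
    FeedFile("old", LocalDate.of(2021, 4, 1)),
    FeedFile("recent", LocalDate.of(2022, 3, 1)),
    FeedFile("latest", LocalDate.of(2022, 4, 18))
  )

  val fiveDays: Seq[FeedFile] = filesToLoad(files, today, "P5D")
  val thirteenMonths: Seq[FeedFile] = filesToLoad(files, today, "P13M")
}
```

With this sample, "P5D" selects only the newest file, while "P13M" selects all three, which is why a single config value can widen a routine reload to more than a year of data.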
Story #2. Missing data
Data ingestion
[Diagram: Data from Partner X, Data costs attribution, Extractor]
Problem
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)
Solution
• Rename old columns
• Reload data for the week
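Solution
import scala.util.matching.Regex
val colRegex: Regex = """X (.+)""".r
// Pair each old "X ..." column name with its new "XX ..." name
val oldNewColumnsMapping = df.schema.fieldNames.collect {
  case oldColName @ colRegex(pattern) => (oldColName, "XX " + pattern)
}
// Rename the matched columns one at a time
oldNewColumnsMapping.foldLeft(df) {
  case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}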
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
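The renaming rule can be tried without Spark by applying the same regex to a plain list of column names. This is a minimal sketch; ColumnRename and renameOld are hypothetical names, and the sample header uses only the columns spelled out on the slide.

```scala
import scala.util.matching.Regex

object ColumnRename {
  // Same pattern as on the slide: a single "X", a space, then the rest of the name.
  val colRegex: Regex = """X (.+)""".r

  // A regex used as an extractor must match the whole string,
  // so names that already start with "XX " fall through unchanged.
  def renameOld(columns: Seq[String]): Seq[String] =
    columns.map {
      case colRegex(rest) => "XX " + rest
      case other          => other
    }

  val oldHeader: Seq[String] =
    Seq("X Advertiser ID", "Language", "X Device Type", "X Media Cost (USD)")
  val newHeader: Seq[String] = renameOld(oldHeader)
}
```

Because already-renamed columns do not match, running the rename twice gives the same header, which makes it safe to apply during a reload.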
Story #3. Divide and conquer
Problem
[Diagram: processing_time (part-*.parquet), filtering, aggregations, created (part-*.parquet)]
Problem
• Slow processing
• Large parquet files
• Failing job that consumes lots of resources
Solution
• Write new partitioned state
• Run downstream jobs with smaller states
• Generate seed partition column - xxhash64(fullUrl, domain)
Solution
[Diagram: processing_time (part-*.parquet); created (bucket=0 … bucket=9, each with part-*.parquet files); processing_time (part-*.parquet)]
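The bucketing idea above can be sketched in plain Scala: hash the two key fields and reduce the hash to one of ten buckets, matching bucket=0 … bucket=9 on the slide. MurmurHash3 is used here only as a stand-in, because xxhash64 is a Spark SQL function; in Spark the same column could presumably be built with pmod(xxhash64(col("fullUrl"), col("domain")), lit(10)). The modulo step, the Page type and the sample URLs are assumptions.

```scala
import scala.util.hashing.MurmurHash3

object Bucketing {
  val NumBuckets = 10 // assumed from bucket=0 … bucket=9 on the slide

  // Hypothetical record carrying the two fields named on the slide.
  final case class Page(fullUrl: String, domain: String)

  // Stand-in hash: MurmurHash3 instead of Spark's xxhash64.
  // floorMod keeps the result in 0 until NumBuckets even for negative hashes.
  def bucketOf(p: Page): Int =
    Math.floorMod(MurmurHash3.orderedHash(Seq(p.fullUrl, p.domain)), NumBuckets)

  // Split one large state into smaller per-bucket states.
  def partition(pages: Seq[Page]): Map[Int, Seq[Page]] = pages.groupBy(bucketOf)

  // Illustrative sample only.
  val sample: Seq[Page] = (1 to 50).map(i => Page(s"https://example.com/page/$i", "example.com"))
  val byBucket: Map[Int, Seq[Page]] = partition(sample)
}
```

The same key always lands in the same bucket, so downstream jobs can process one bucket at a time against a much smaller state.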
Story #4. Catch the evolution train
Data organisation evolution
Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency with another team
• Unhappy users
[Diagram: Log-level data, Mapper, Ingestor, Transformer, Data costs calculator, Data costs attribution]
[Diagram of components: Data extractor, Impala loader, Data costs attribution]
Solution
XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD)
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
26 columns, 82 columns
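One way to make "missing columns from the source" visible early is to compare the expected header with the one actually received. This is a minimal sketch in plain Scala. Which header counts as expected and which as actual is an assumption made for illustration, and only column names spelled out on the slide are used.

```scala
object SchemaDiff {
  // Columns present in `expected` but absent from `actual`, in expected order.
  def missingColumns(expected: Seq[String], actual: Seq[String]): Seq[String] = {
    val present = actual.toSet
    expected.filterNot(present)
  }

  // Assumed roles for the two headers shown on the slide.
  val expected: Seq[String] = Seq(
    "XX Advertiser ID", "Language", "XX Device Type",
    "XX Partner Currency", "XX CPM Fee (USD)"
  )
  val actual: Seq[String] = Seq(
    "XX Advertiser ID", "Language", "XX Device Type", "XX Media Cost (USD)"
  )

  val missing: Seq[String] = missingColumns(expected, actual)
}
```

A check like this could run before ingestion and fail fast, rather than letting a narrower file flow into downstream tables.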
Solution
[Diagram: Data extractor, New ingestion job]
Solution
// final step is writing the data
df.write
  .partitionBy("event_date", "event_hour")
  .mode(SaveMode.Overwrite)
  .parquet(dstPath)
Why this solution doesn’t work
[Diagram: data_feed (clicks.csv.gz, views.csv.gz, activity.csv.gz); event_date (clicks1.parquet, clicks2.parquet)]
Attribution data source
Impressions, Clicks, Conversions
Solution
[Diagram: impressions, clicks, conversions; clicks.csv.gz, views.csv.gz, activity.csv.gz]
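One plausible reading of the slides above is that all feed types were written to a single destination, where an overwrite for one feed can replace files written for another (with Spark's default static partition overwrite, SaveMode.Overwrite clears the existing contents of the target path). A minimal sketch of the alternative, one destination per feed type, follows. The file-to-feed pairing, the base path and all names in FeedPaths are assumptions; the slide lists the names but not how they map.

```scala
object FeedPaths {
  // Assumed pairing of source files to feed types (not stated on the slide).
  val feedTypeByFile: Map[String, String] = Map(
    "clicks.csv.gz"   -> "clicks",
    "views.csv.gz"    -> "impressions",
    "activity.csv.gz" -> "conversions"
  )

  // Hypothetical layout: <basePath>/<feedType>, so overwriting one feed
  // never touches the files of another. Unknown files yield None.
  def destination(basePath: String, fileName: String): Option[String] =
    feedTypeByFile.get(fileName).map(feed => s"$basePath/$feed")

  // Illustrative base path only.
  val destinations: Seq[String] =
    feedTypeByFile.keys.toSeq.sorted.flatMap(destination("/data/data_feed", _))
}
```

Each feed then gets its own dstPath for the partitioned write, and a file with no known feed type is surfaced as None instead of being written somewhere by accident.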
Story #5. Cleanup time
Corrupted data
[Diagram: Data from Partner X, Ingestor]
Corrupted data
[Diagram: Data from Partner X, Ingestor]
IllegalArgumentException: Can't convert value to BinaryType data type
Solution
• Adjust pipeline
• Reload data for 3 days on S3
• Relaunch Databricks autoloader
Current solution
[Diagram: impressions, videoevents, conversions, impressions, conversions, clicks, clicks, videoevents]
Current solution
[Diagram: impressions, conversions, clicks, videoevents]
Better solution
[Diagram: impressions, videoevents, conversions, impressions, conversions, clicks, clicks, videoevents]
Conclusions
Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is the key
3. Distribute data transformation load
4. Plan major changes carefully
Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard
My contact info
dead_flowers22, roksolana-d, roksolanadiachuk, roksolanad