Productionizing Big Data - stories from the trenches

Roksolana Diachuk
• Engineering manager at Captify
• Women Who Code Kyiv Data Engineering Lead
• Speaker

AdTech
AdTech methodologies deliver the right content at the right time to the right consumer.

You have your pipelines in production. What’s next?

Types of issues
• Low performance
• Human errors
• Data source errors

Story #1. Unlucky query

Problem
Drop 13 months of user profiles

Reporting

Problem
13 months
hour=22042001

Loading mechanism
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
val minTime = currentDay.minus(config.feedPeriod)
listFiles.filter(file => file.eventDateTime isAfter minTime)

Solution
loader.ImpalaLoaderConfig.periodToLoad: "P5D"
loader.ImpalaLoaderConfig.periodToLoad: "P1M"
loader.ImpalaLoaderConfig.periodToLoad: "P13M"
…
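
The effect of `periodToLoad` on the loader's file filter can be sketched in plain Scala. This is a minimal illustration, not the actual loader: `FeedFile`, `filesToLoad`, the sample date, and the one-file-per-day listing are all assumptions made for the example; only the `minus` / `isAfter` filtering logic and the period strings come from the slides.

```scala
import java.time.{LocalDate, Period}

// Hypothetical stand-in for one listed file; the slides only show `eventDateTime`.
final case class FeedFile(name: String, eventDate: LocalDate)

object PeriodFilterSketch {
  // Keep only files whose event date is strictly after (currentDay - periodToLoad).
  def filesToLoad(files: Seq[FeedFile], currentDay: LocalDate, periodToLoad: String): Seq[FeedFile] = {
    val minTime = currentDay.minus(Period.parse(periodToLoad))
    files.filter(_.eventDate.isAfter(minTime))
  }

  // Illustrative data only: one file per day for the last 400 days.
  val today: LocalDate = LocalDate.of(2022, 4, 20)
  val files: Seq[FeedFile] =
    (0 until 400).map(d => FeedFile(s"part-$d", today.minusDays(d.toLong)))

  // Number of files selected for each period named on the slides.
  val selectedCounts: Map[String, Int] =
    Seq("P5D", "P1M", "P13M").map(p => p -> filesToLoad(files, today, p).size).toMap
}
```

Comparing the entries of `selectedCounts` shows how much more data a single load touches once the period jumps from days to months, which is presumably why the period was widened in steps rather than all at once.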

Story #2. Missing data

Data ingestion
Diagram: Data from Partner X, Extractor, Data costs attribution

Problem
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
X Advertiser ID, Language, X Device Type, …, X Media Cost (USD)

Solution
• Rename old columns
• Reload data for the week

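Solution
import scala.util.matching.Regex
// Match columns named "X <rest>" and capture <rest>
val colRegex: Regex = """X (.+)""".r
// Pair each old column name with its "XX "-prefixed replacement
val oldNewColumnsMapping = df.schema.fieldNames.collect {
  case oldColName @ colRegex(rest) => (oldColName, "XX " + rest)
}
// Apply the renames one by one
oldNewColumnsMapping.foldLeft(df) { case (data, (oldName, newName)) =>
  data.withColumnRenamed(oldName, newName)
}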

Solution
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)

Story #3. Divide and conquer

Problem
Diagram: processing_time (part-*.parquet), filtering, aggregations, created (part-*.parquet)

Problem
• Slow processing
• Large parquet files
• Failing job that consumes lots of resources

Solution
• Write new partitioned state
• Run downstream jobs with smaller states
• Generate seed partition column: xxhash64(fullUrl, domain)

Solution
processing_time
  part-*.parquet
created
  bucket=0
    part-*.parquet
    part-*.parquet
  …
  bucket=9
    part-*.parquet
    part-*.parquet
processing_time
  part-*.parquet
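
A bucketed state like the one above could be produced roughly as follows. This is a sketch under assumptions, not the job from the talk: `BucketedStateSketch`, `withBucket`, `writeBucketed`, and the `bucket` column name are chosen for illustration, and the default of 10 buckets is inferred from the `bucket=0` to `bucket=9` directories. `xxhash64` and `pmod` are standard Spark SQL functions (Spark 3.0+).

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, lit, pmod, xxhash64}

object BucketedStateSketch {
  // Derive a stable bucket id in [0, numBuckets) from the hash of the key columns.
  // pmod keeps the result non-negative even when the 64-bit hash is negative.
  def withBucket(state: DataFrame, numBuckets: Int = 10): DataFrame =
    state.withColumn(
      "bucket",
      pmod(xxhash64(col("fullUrl"), col("domain")), lit(numBuckets.toLong))
    )

  // Write the state so that each bucket lands in its own bucket=N directory.
  def writeBucketed(state: DataFrame, dstPath: String): Unit =
    withBucket(state).write
      .partitionBy("bucket")
      .mode(SaveMode.Overwrite)
      .parquet(dstPath)
}
```

Because the same key always hashes to the same bucket, a downstream job can read a single `bucket=N` directory and still see every row for the keys it owns, which is what lets each run work on a smaller state.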

Story #4. Catch the evolution train

Data organisation evolution

Problem
• Missing columns from the source
• Impala to Databricks migration speed
• Dependency with another team
• Unhappy users

Diagram: Log-level data, Mapper, Ingestor, Transformer, Data costs calculator, Data costs attribution

Diagram: Data costs attribution, Data extractor, Impala loader

Solution
XX Advertiser ID, Language, XX Device Type, …, XX Partner Currency, XX CPM Fee (USD)
XX Advertiser ID, Language, XX Device Type, …, XX Media Cost (USD)
26 columns, 82 columns

Solution
Diagram: Data extractor, New ingestion job

Solution
// final step is writing the data
df.write
  .partitionBy("event_date", "event_hour")
  .mode(SaveMode.Overwrite)
  .parquet(dstPath)

Why this solution doesn’t work
data_feed
  clicks.csv.gz
  views.csv.gz
  activity.csv.gz
event_date
  clicks1.parquet
  clicks2.parquet

Attribution data source
Impressions, Clicks, Conversions

Solution
impressions, clicks, conversions
clicks.csv.gz, views.csv.gz, activity.csv.gz
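
One plausible reading of these slides is that writing every source file to a shared destination with `SaveMode.Overwrite` lets one event type replace another's output, and that the fix gives each event type its own directory. The routing step might be sketched as below. The file-to-event-type mapping is an assumption (the slides list both sets of names but not how they pair up), and `EventTypeRoutingSketch`, `destination`, and the base path in the usage note are hypothetical.

```scala
object EventTypeRoutingSketch {
  // Assumed mapping from source file to event type; not confirmed by the slides.
  val eventTypeByFile: Map[String, String] = Map(
    "clicks.csv.gz"   -> "clicks",
    "views.csv.gz"    -> "impressions",
    "activity.csv.gz" -> "conversions"
  )

  // Build a destination that belongs to exactly one event type, so an
  // overwrite for one type cannot touch the output of another.
  // Unknown files yield None instead of silently landing in a shared path.
  def destination(basePath: String, fileName: String): Option[String] =
    eventTypeByFile.get(fileName).map(eventType => s"$basePath/$eventType")
}
```

For example, `EventTypeRoutingSketch.destination("/data/attribution", "views.csv.gz")` yields a path under `impressions`, which would then be passed as `dstPath` to the partitioned write for that event type only.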

Story #5. Cleanup time

Corrupted data
Diagram: Data from Partner X, Ingestor
IllegalArgumentException: Can't convert value to BinaryType data type

Solution
• Adjust pipeline
• Reload data for 3 days on S3
• Relaunch Databricks Auto Loader

Current solution
Diagram: impressions, videoevents, conversions, clicks

Better solution
Diagram: impressions, videoevents, conversions, clicks

Conclusions

Prevention mechanisms
1. Set up clear expectations with stakeholders
2. Observability is the key
3. Distribute data transformation load
4. Plan major changes carefully

Conclusions
1. Data setup is always changing
2. Errors can be prevented
3. There are multiple approaches with different tools
4. Data evolution is hard

My contact info
dead_flowers22, roksolana-d, roksolanadiachuk, roksolanad
