How Traveloka Handle Data Pipeline for Big Things?

Andi N. Dirgantara
Lead Data Engineer

Speaker Profile
• I’m Andi Nugroho Dirgantara
• 5+ years as a software engineer
• 3+ years as a data engineer (big data)
• Lead Data Engineer, Traveloka
• Working remotely from Malang
• Lead, FB DevC Malang
• Big Data and JavaScript lover
• Father of a 3+ year old son
• Gamer
  ◦ Steam Account: hellowin_cavemen
  ◦ Battle Tag: Hellowin#11826

Background Story
Why do we need a data pipeline?

How we use our data
• Business Intelligence
• Analytics
• Personalization
• Fraud Detection
• Ads optimization
• Cross selling
• A/B testing
• etc.

Problems
Diagram (overly simplified data architecture on Traveloka, split into Product Side and Data Side): Client (Web, Android, etc.), Backend, Database, Big Data Platform, ?, Data Processing (Analytics, Machine Learning, etc.)
1. Incoming streams, such as tracking events, have very high throughput. How do we handle those streams until they are persisted to storage?
2. Even once the data is on persistent storage, how do we manipulate it?

What We Need?
• We need stream and batch processing that is able to scale
• Then we tried “googling” it ...
• ... and it turns out that causes another problem ...
• Which technology stack is best for us?

Data Pipeline Implementation
in Traveloka, under the Data Architecture Production Team

Simplified Data Pipeline Architecture
Diagram components: Data Lake (BigQuery, Hive (S3)), Spark, Dataflow, Bigtable, Service; input, output
• Dataflow is a drop-in replacement for Spark in some cases
• Spark's feature set is still much richer than Dataflow's (Spark ML, Spark SQL, etc.)
• The Dataflow v2 API is compatible with Apache Beam
• Apache Beam pipelines can run on several processing platforms (Apex, Flink, Spark, Dataflow, Gearpump)
Apache Beam works with ...
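The portability point above (one pipeline, several execution platforms) can be illustrated with a minimal standard-library sketch. This is not the Apache Beam API: the names Pipeline, apply_steps, run_local and run_threaded are hypothetical and exist only for this illustration. It assumes nothing beyond the idea that a pipeline is described once and then handed to interchangeable runners.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """An ordered list of element-wise transforms, independent of any runner.

    Hypothetical illustration class, not part of Apache Beam.
    """

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self


def apply_steps(steps, element):
    """Run one element through all steps; returns [] if it is filtered out."""
    for kind, fn in steps:
        if kind == "map":
            element = fn(element)
        elif not fn(element):
            return []
    return [element]


def run_local(pipeline, data):
    """Hypothetical runner 1: a plain sequential loop."""
    return [out for x in data for out in apply_steps(pipeline.steps, x)]


def run_threaded(pipeline, data, workers=4):
    """Hypothetical runner 2: the same steps executed on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda x: apply_steps(pipeline.steps, x), data)
        # pool.map yields results in input order, so both runners agree.
        return [out for chunk in results for out in chunk]
```

The pipeline definition never mentions how it is executed, which is the property that lets the same logic move between engines when the stack changes.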

How Stream/Batch Processing Works
Diagram: Large Data, split into Chunks (partitioned), processed on Nodes
• Any large data set that can be partitioned is compatible with these pipelines
• Large data is partitioned into smaller chunks
• The small chunks are processed in parallel on several nodes
• Eventually the results can be collected onto a single node again, but this is not required
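The partition, process, collect flow described above can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions: worker threads stand in for cluster nodes, the per-chunk work is a simple sum, and the function names partition, process_chunk and run are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_chunks):
    """Split a list into at most num_chunks contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when data is shorter than num_chunks
            chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Work done independently on one 'node' (here: a partial sum)."""
    return sum(chunk)


def run(data, num_nodes=4):
    chunks = partition(data, num_nodes)
    # Each worker thread stands in for one node of the cluster.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Optional final step: collect the partial results on a single node.
    return sum(partials)
```

In a real cluster the partial results could just as well be written out per node, which is why the final collection step is optional.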

Persistent Storage for Large Pre-computed Data is Needed
Diagram: Dataflow, Bigtable, Service
• Bigtable is Google's proprietary counterpart of HBase (HBase is an open-source implementation modeled on the Bigtable paper)
• It can be used as a Data Lake as well
• It’s a wide-column NoSQL store
• Supports high throughput
• The row key acts as the primary key, and operations on a single row are atomic
• Its “get” API is claimed to have O(1) complexity
• More details can be read in its paper: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
• Services consume data from Bigtable instead of from the Data Lake
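To make the row-key access pattern concrete, here is a small in-memory stand-in written with the Python standard library. It is not the Bigtable client API, and its performance characteristics say nothing about Bigtable itself. The class name TinyWideColumnTable, its methods, and the example row keys are all hypothetical. It only models three ideas from the slide: rows addressed by a row key, atomic writes within one row, and a service doing point lookups on pre-computed data.

```python
import bisect
import threading


class TinyWideColumnTable:
    """In-memory stand-in for a row-key-addressed, wide-column table."""

    def __init__(self):
        self._rows = {}          # row key -> {column: value}
        self._sorted_keys = []   # row keys kept in lexicographic order
        self._lock = threading.Lock()

    def mutate_row(self, row_key, columns):
        """Apply all column writes for one row as a single atomic step."""
        with self._lock:
            if row_key not in self._rows:
                bisect.insort(self._sorted_keys, row_key)
                self._rows[row_key] = {}
            self._rows[row_key].update(columns)

    def get(self, row_key):
        """Point lookup by row key; returns a copy, or None if absent."""
        with self._lock:
            row = self._rows.get(row_key)
            return dict(row) if row is not None else None

    def scan_prefix(self, prefix):
        """Return (row_key, columns) pairs whose key starts with prefix, in key order."""
        with self._lock:
            start = bisect.bisect_left(self._sorted_keys, prefix)
            out = []
            for key in self._sorted_keys[start:]:
                if not key.startswith(prefix):
                    break
                out.append((key, dict(self._rows[key])))
            return out
```

A batch job would call mutate_row with pre-computed values (for example under a key such as "user#42"), and the serving side would only ever call get or scan_prefix, never touching the Data Lake directly.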

Dataflow Pros Cons
Pros
• Easier to maintain (managed by GCP)
• Good integration with other GCP managed tools
  ◦ BigQuery
  ◦ PubSub
  ◦ Cloud Storage
• Enterprise ready, with 24/7 support
Cons
• Less mature compared to the Hadoop ecosystem
• The API is still limited (no Scala API support)
• No ML API comparable to Spark ML
• No query API comparable to Spark SQL
• Closed source

More Data Pipeline Still In Use
• This presentation covers perhaps no more than 20% of the total data pipeline in use at Traveloka
• Data pipeline technology is very dynamic; the current pipeline might be obsolete next year, and a migration would then be needed
• We are still using Databricks, Kafka, and PubSub in some services

Conclusions

• Maintaining a data pipeline is quite hard, so don’t forget to invest in proper monitoring
• The data pipeline technology stack is evolving very fast; it is the Data Engineer's responsibility to adapt to every change
• There is no silver bullet or one-size-fits-all technology

References and Other Presentations
• How Big Data Platform Handle big Things (https://speakerdeck.com/hellowin/how-big-data-platform-handle-big-things)
• How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive (https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c)

Thank you for your time.

We are hiring... visit https://www.traveloka.com/en/careers
