Data Lake Implementation in Traveloka

Andi N. Dirgantara

January 23, 2018

Transcript

  1. Speaker Profile
    • I’m Andi Nugroho Dirgantara
    • 5+ years as a software engineer
    • 3+ years as a data engineer (big data)
    • Lead Data Engineer, Traveloka
    • Lead, FB DevC Malang
    • Big Data and JavaScript lover
    • Father of a 3+ year old son
    • Gamer
      ◦ Steam Account: hellowin_cavemen
      ◦ Battle Tag: Hellowin#11826
  2. How we use our data
    • Business Intelligence
    • Analytics
    • Personalization
    • Fraud Detection
    • Ads optimization
    • Cross selling
    • AB Test
    • etc.
  3. Problems
    Overly simplified data architecture at Traveloka:
    • Product Side
      ◦ Client (Web, Android, etc.)
      ◦ Backend
      ◦ Database
    • Data Side
      ◦ Big Data Platform (?)
      ◦ Data Processing (Analytics, Machine Learning, etc.)
    How to accommodate the following without disrupting the production side?
    • Data Scientists
    • Data Analysts
    • Business Intelligence Tools
    It should be:
    • Scalable
    • Query-able
    • Fault tolerant (reliable)
  4. Data Lake by Definitions
    • A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. - http://searchaws.techtarget.com
    • A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. - Tamara Dull (SAS), https://www.kdnuggets.com
    • It stores the data in its native/raw format
    • The schema is applied at query time
    • Sometimes it’s also just a “marketing label” that simplifies how people refer to Hadoop-compatible technology, much like the term “big data” is used for distributed storage and query engines
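    The point on slide 4 that the schema is applied at query time can be illustrated with a minimal sketch. It is not taken from the deck: the sample events, the `BOOKING_SCHEMA` mapping, and both function names are hypothetical, and plain Python stands in for what an engine such as Hive or Presto would do over files in a lake.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
# Every value is a string and fields vary per record; nothing is enforced at write time.
RAW_EVENTS = [
    '{"user_id": "42", "event": "search", "price": "120.5"}',
    '{"user_id": "7", "event": "booking", "price": "300", "coupon": "NEWYEAR"}',
    '{"user_id": "42", "event": "booking"}',
]

# The schema belongs to the query, not to the stored data (hypothetical schema).
BOOKING_SCHEMA = {"user_id": int, "event": str, "price": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and cast only the fields the schema names."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the read.
            row[field] = cast(value) if value is not None else None
        yield row


def total_booking_price(raw_lines):
    """Sum the price of booking events, treating a missing price as 0."""
    rows = read_with_schema(raw_lines, BOOKING_SCHEMA)
    return sum(r["price"] or 0.0 for r in rows if r["event"] == "booking")
```

    A different query could read the same raw lines with a different schema, for example one that includes `coupon`, without rewriting any stored data.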
  5. Data Lake implementation on Data Team Side
    • Input: Backend Data Source
      ◦ Stream Processing (Kafka, PubSub, etc.)
      ◦ DBs
      ◦ Data Warehouse
      ◦ etc.
    • Big Data Platform: Hive (S3), Presto, BigQuery
    • Output: Data Processing
      ◦ Analytics
      ◦ Machine Learning
      ◦ etc.
    Hive + Presto
    • Deployed on Amazon Web Services (AWS)
    • Self hosted and self managed
    • Hadoop family
    BigQuery
    • Deployed on Google Cloud Platform (GCP)
    • Managed service
    • GCP family
  6. Hive + Presto Pros and Cons
    Pros
    • More flexible to manage (self managed)
      ◦ Able to define nodes, replication factor, cluster, etc.
      ◦ Able to specify node specs
    • Good integration with the rest of the Hadoop ecosystem
      ◦ Spark
      ◦ Kafka
      ◦ Impala
    • More mature
    • Open source
    Cons
    • Harder to maintain (also because it is self managed)
  7. BigQuery
    Pros
    • Easier to maintain (managed by GCP)
    • Good integration with other GCP managed tools
      ◦ Dataflow
      ◦ PubSub
      ◦ Cloud Storage
    • Enterprise ready, support is 24/7
    Cons
    • Less mature compared to the Hadoop ecosystem
    • API is still limited (no Scala API support)
    • Unable to store data on S3; it needs to be on Cloud Storage
    • Closed source
  8. Conclusions
    • We still use AWS and GCP side by side
    • Maintainability is one thing, but in industry its value is everything
    • The Big Data stack is moving very fast
    • It’s the Data Engineer’s responsibility to make the migration agile
    • There’s no “one size fits all” solution
  9. References and Other Presentations
    • How Big Data Platform Handle big Things (https://speakerdeck.com/hellowin/how-big-data-platform-handle-big-things)
    • How to Improve Data Warehouse Efficiency using S3 over HDFS on Hive (https://blog.andi.dirgantara.co/how-to-improve-data-warehouse-efficiency-using-s3-over-hdfs-on-hive-e9da90ea378c)