


Transforming Big Data with Spark and Shark @ Amazon reInvent

Talk by Michael Franklin and Matei Zaharia


Reynold Xin

November 20, 2012

Transcript

  1. It’s All Happening On-line. Every: click, ad impression, billing event, fast forward, pause, …,
     friend request, transaction, network message, fault, …
     User Generated (Web, Social & Mobile), Internet of Things / M2M, Scientific Computing
  2. Petabytes+ (Volume), Unstructured (Variety), Real-Time (Velocity).
     Our view: more data should mean better answers.
     •  Must balance Cost, Time, and Answer Quality
  3. [image slide]

  4. Algorithms: Machine Learning and Analytics. Machines: Cloud Computing.
     People: Crowdsourcing & Human Computation. Massive and Diverse Data.
     UC Berkeley
  5. Organized for Collaboration: Alex Bayen (Mobile Sensing), Anthony Joseph (Security/Privacy),
     Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases),
     Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems),
     *Mike Jordan (Machine Learning), Scott Shenker (Networking)
  6. [image slide]

  7. •  UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster.
        @TCGA: 5 PB = 20 cancers x 1000 genomes
     •  Sequencing costs (150X) → Big Data
        [Chart: cost per genome ($K, log scale from $0.1 to $100,000), 2001–2014]
     David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,”
     New York Times, 12/5/2011
     •  See Dave Patterson’s Talk: Thursday 3–4, BDT205
  8. [AMPLab software stack diagram]
     MLBase (declarative machine learning); BlinkDB (approx. QP); Shark (SQL) + Streaming;
     Spark with Shared RDDs (distributed memory); Streaming, Hadoop MR, MPI, GraphLab, etc.;
     Mesos (cluster resource manager); HDFS
     Legend: AMPLab (released), AMPLab (in progress), 3rd party
  9. [image slide]

  10. [image slide]

  11. val lines = spark.textFile("hdfs://...")            // Base RDD
      val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
      val messages = errors.map(_.split('\t')(2))
      val cachedMsgs = messages.cache()
      cachedMsgs.filter(_.contains("foo")).count          // Action
      cachedMsgs.filter(_.contains("bar")).count
      [Diagram: Driver sends tasks to Workers reading HDFS Blocks 1–3; each Worker holds a cache (Cache 1–3) and returns results]
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
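      Outside the interactive shell, the only extra step is creating the SparkContext the snippet
      calls `spark`. A minimal standalone sketch (the master URL, app name, and package layout
      follow the current Apache Spark distribution and are placeholders, not the talk's exact code):

        import org.apache.spark.{SparkConf, SparkContext}

        object LogMining {
          def main(args: Array[String]): Unit = {
            // "local[*]" and the app name are illustrative placeholders.
            val conf  = new SparkConf().setAppName("LogMining").setMaster("local[*]")
            val spark = new SparkContext(conf)

            val lines      = spark.textFile("hdfs://...")                                    // base RDD
            val cachedMsgs = lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2)).cache()

            // Actions trigger execution; the second query reuses the in-memory data.
            println(cachedMsgs.filter(_.contains("foo")).count())
            println(cachedMsgs.filter(_.contains("bar")).count())

            spark.stop()
          }
        }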
  12. [Diagram: scattered “+” and “–” points, a target separating line, and a random initial line]
  13. Load data in memory once:  map(readPoint), cache()
      Initial parameter vector:  w
      Repeated MapReduce steps to do gradient descent:
        map:    p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        reduce: _ + _
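      Those fragments are the inner step of Spark's classic logistic-regression example. A minimal
      sketch of the surrounding loop, assuming the SparkContext is bound to `spark` as on the
      earlier slide; `Point`, `readPoint`, `numFeatures`, and `iterations` are illustrative names,
      not the talk's exact code:

        case class Point(x: Array[Double], y: Double)             // y is the +1 / -1 label

        def readPoint(line: String): Point = {
          val fields = line.split(' ').map(_.toDouble)            // assumed format: label, then features
          Point(fields.tail, fields.head)
        }

        def dot(a: Array[Double], b: Array[Double]): Double =
          (a zip b).map { case (u, v) => u * v }.sum

        val numFeatures = 10                                      // placeholder dimensionality
        val iterations  = 30

        // Load the data once and keep it in memory across iterations.
        val points = spark.textFile("hdfs://...").map(readPoint).cache()

        // Random initial parameter vector (the "random initial line" on the previous slide).
        var w = Array.fill(numFeatures)(scala.util.Random.nextDouble())

        for (_ <- 1 to iterations) {
          // One MapReduce step of gradient descent over the cached points.
          val gradient = points.map { p =>
            val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
            p.x.map(_ * scale)
          }.reduce((a, b) => (a zip b).map { case (u, v) => u + v })
          w = (w zip gradient).map { case (wi, gi) => wi - gi }   // step against the gradient
        }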
  14. [Chart: running time (min) vs. number of iterations (1–30), Hadoop vs. Spark]
      Hadoop: 110 s / iteration
      Spark: first iteration 80 s, further iterations 1 s
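      Taken at face value, those per-iteration numbers explain the gap in the chart: 30 iterations
      cost roughly 30 × 110 s ≈ 55 minutes on Hadoop, versus about 80 s + 29 × 1 s ≈ 110 s (under
      two minutes) on Spark once the data is cached.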
  15. Java API (out now):
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("error"); }
        }).count();
      PySpark (coming soon):
        lines = sc.textFile(...)
        lines.filter(lambda x: 'error' in x) \
             .count()
  16. [Shark architecture diagram: Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer,
      Physical Plan, Execution, Cache Mgr.) → Spark, with the Meta store and HDFS alongside]
  17. Row Storage:     (1, john, 4.1)  (2, mike, 3.5)  (3, sally, 6.4)
      Column Storage:  [1, 2, 3]  [john, mike, sally]  [4.1, 3.5, 6.4]
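      A rough illustration of the difference (the names below are made up for this sketch, not
      Shark's internal types): row storage keeps one object per record, while a columnar store
      keeps one array per column, so a scan touches only the contiguous, unboxed data it needs.

        // Row storage: one object per record.
        case class Record(id: Int, name: String, score: Double)
        val rows = Array(Record(1, "john", 4.1), Record(2, "mike", 3.5), Record(3, "sally", 6.4))

        // Column storage: one array per column; primitive columns stay unboxed.
        val ids    = Array(1, 2, 3)
        val names  = Array("john", "mike", "sally")
        val scores = Array(4.1, 3.5, 6.4)

        // Scanning a single column reads one contiguous primitive array:
        val rowTotal = rows.map(_.score).sum    // walks every Record object
        val colTotal = scores.sum               // touches only the score column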
  18. [Bar chart: Selection query — Shark, Shark (disk), Hive; y-axis 0–100 s; fastest bar labeled 1.1 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  19. [Bar chart: Group By query — Shark, Shark (disk), Hive; y-axis 0–600 s; fastest bar labeled 32 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  20. [Bar chart: Join query — Shark (copartitioned), Shark, Shark (disk), Hive; y-axis 0–1800 s;
      fastest bar labeled 105 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  21. [Bar charts: Shark, Shark (disk), Hive on three Conviva queries — Query 1 fastest bar 0.8 s
      (y-axis 0–70 s), Query 2 fastest bar 0.7 s (y-axis 0–70 s), Query 3 fastest bar 1.0 s
      (y-axis 0–100 s)]
      100 m2.4xlarge nodes, 1.7 TB Conviva dataset
  22. We are sincerely eager to hear your feedback on this

    presentation and on re:Invent. Please fill out an evaluation form when you have a chance.