


Transforming Big Data with Spark and Shark @ Amazon reInvent

Talk by Michael Franklin and Matei Zaharia


Reynold Xin

November 20, 2012

Transcript

  1. It’s All Happening On-line. Every: click, ad impression, billing event, fast forward, pause, …,
     friend request, transaction, network message, fault, …
     User Generated (Web, Social & Mobile), Internet of Things / M2M, Scientific Computing
  2. Petabytes+ (Volume), Unstructured (Variety), Real-Time (Velocity).
     Our view: more data should mean better answers.
     •  Must balance Cost, Time, and Answer Quality
  3. [image slide]

  4. Algorithms: Machine Learning and Analytics. Machines: Cloud Computing.
     People: Crowdsourcing & Human Computation. Massive and Diverse Data.
     UC Berkeley
  5. Organized for Collaboration: Alex Bayen (Mobile Sensing), Anthony Joseph (Security/Privacy),
     Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases),
     Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems),
     *Mike Jordan (Machine Learning), Scott Shenker (Networking)
  6. [image slide]

  7. •  UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster.
        @TCGA: 5 PB = 20 cancers x 1000 genomes
     •  Sequencing costs (150X) → Big Data
        [Chart: cost per genome ($K, log scale from $0.1 to $100,000), 2001–2014]
     David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,”
     New York Times, 12/5/2011
     •  See Dave Patterson’s Talk: Thursday 3–4, BDT205
  8. [AMPLab software stack diagram]
     MLBase (declarative machine learning); BlinkDB (approx. QP); Shark (SQL) + Streaming;
     Spark with Shared RDDs (distributed memory); Streaming, Hadoop MR, MPI, GraphLab, etc.;
     Mesos (cluster resource manager); HDFS
     Legend: AMPLab (released), AMPLab (in progress), 3rd party
  9. [image slide]

  10. [image slide]

  11. val lines = spark.textFile("hdfs://...")            // Base RDD
      val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
      val messages = errors.map(_.split('\t')(2))
      val cachedMsgs = messages.cache()
      cachedMsgs.filter(_.contains("foo")).count          // Action
      cachedMsgs.filter(_.contains("bar")).count
      [Diagram: Driver sends tasks to Workers reading HDFS Blocks 1–3; each Worker holds a cache (Cache 1–3) and returns results]
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
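      Outside the interactive shell, the only extra step is creating the SparkContext the snippet
      calls `spark`. A minimal standalone sketch (the master URL, app name, and package layout
      follow the current Apache Spark distribution and are placeholders, not the talk's exact code):

        import org.apache.spark.{SparkConf, SparkContext}

        object LogMining {
          def main(args: Array[String]): Unit = {
            // "local[*]" and the app name are illustrative placeholders.
            val conf  = new SparkConf().setAppName("LogMining").setMaster("local[*]")
            val spark = new SparkContext(conf)

            val lines      = spark.textFile("hdfs://...")                                    // base RDD
            val cachedMsgs = lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2)).cache()

            // Actions trigger execution; the second query reuses the in-memory data.
            println(cachedMsgs.filter(_.contains("foo")).count())
            println(cachedMsgs.filter(_.contains("bar")).count())

            spark.stop()
          }
        }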
  12. [Diagram: scattered “+” and “–” points, a target separating line, and a random initial line]
  13. Load data in memory once:  map(readPoint), cache()
      Initial parameter vector:  w
      Repeated MapReduce steps to do gradient descent:
        map:    p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        reduce: _ + _
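      Those fragments are the inner step of Spark's classic logistic-regression example. A minimal
      sketch of the surrounding loop, assuming the SparkContext is bound to `spark` as on the
      earlier slide; `Point`, `readPoint`, `numFeatures`, and `iterations` are illustrative names,
      not the talk's exact code:

        case class Point(x: Array[Double], y: Double)             // y is the +1 / -1 label

        def readPoint(line: String): Point = {
          val fields = line.split(' ').map(_.toDouble)            // assumed format: label, then features
          Point(fields.tail, fields.head)
        }

        def dot(a: Array[Double], b: Array[Double]): Double =
          (a zip b).map { case (u, v) => u * v }.sum

        val numFeatures = 10                                      // placeholder dimensionality
        val iterations  = 30

        // Load the data once and keep it in memory across iterations.
        val points = spark.textFile("hdfs://...").map(readPoint).cache()

        // Random initial parameter vector (the "random initial line" on the previous slide).
        var w = Array.fill(numFeatures)(scala.util.Random.nextDouble())

        for (_ <- 1 to iterations) {
          // One MapReduce step of gradient descent over the cached points.
          val gradient = points.map { p =>
            val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
            p.x.map(_ * scale)
          }.reduce((a, b) => (a zip b).map { case (u, v) => u + v })
          w = (w zip gradient).map { case (wi, gi) => wi - gi }   // step against the gradient
        }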
  14. [Chart: running time (min) vs. number of iterations (1–30), Hadoop vs. Spark]
      Hadoop: 110 s / iteration
      Spark: first iteration 80 s, further iterations 1 s
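      Taken at face value, those per-iteration numbers explain the gap in the chart: 30 iterations
      cost roughly 30 × 110 s ≈ 55 minutes on Hadoop, versus about 80 s + 29 × 1 s ≈ 110 s (under
      two minutes) on Spark once the data is cached.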
  15. Java API (out now):
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("error"); }
        }).count();
      PySpark (coming soon):
        lines = sc.textFile(...)
        lines.filter(lambda x: 'error' in x) \
             .count()
  16. [Shark architecture diagram: Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer,
      Physical Plan, Execution, Cache Mgr.) → Spark, with the Meta store and HDFS alongside]
  17. Row Storage:     (1, john, 4.1)  (2, mike, 3.5)  (3, sally, 6.4)
      Column Storage:  [1, 2, 3]  [john, mike, sally]  [4.1, 3.5, 6.4]
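      A rough illustration of the difference (the names below are made up for this sketch, not
      Shark's internal types): row storage keeps one object per record, while a columnar store
      keeps one array per column, so a scan touches only the contiguous, unboxed data it needs.

        // Row storage: one object per record.
        case class Record(id: Int, name: String, score: Double)
        val rows = Array(Record(1, "john", 4.1), Record(2, "mike", 3.5), Record(3, "sally", 6.4))

        // Column storage: one array per column; primitive columns stay unboxed.
        val ids    = Array(1, 2, 3)
        val names  = Array("john", "mike", "sally")
        val scores = Array(4.1, 3.5, 6.4)

        // Scanning a single column reads one contiguous primitive array:
        val rowTotal = rows.map(_.score).sum    // walks every Record object
        val colTotal = scores.sum               // touches only the score column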
  18. [Bar chart: Selection query — Shark, Shark (disk), Hive; y-axis 0–100 s; fastest bar labeled 1.1 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  19. [Bar chart: Group By query — Shark, Shark (disk), Hive; y-axis 0–600 s; fastest bar labeled 32 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  20. [Bar chart: Join query — Shark (copartitioned), Shark, Shark (disk), Hive; y-axis 0–1800 s;
      fastest bar labeled 105 s]
      100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al.)
  21. [Bar charts: Shark, Shark (disk), Hive on three Conviva queries — Query 1 fastest bar 0.8 s
      (y-axis 0–70 s), Query 2 fastest bar 0.7 s (y-axis 0–70 s), Query 3 fastest bar 1.0 s
      (y-axis 0–100 s)]
      100 m2.4xlarge nodes, 1.7 TB Conviva dataset
  22. We are sincerely eager to hear your feedback on this

    presentation and on re:Invent. Please fill out an evaluation form when you have a chance.