Short introduction to data processing using Apache Spark, with the ubiquitous word count example. Most of the presentation was live coding, which is unfortunately not part of the slides.
• Spark is a better tool for working with Big Data (but also useful for not-so-big data)
• Interactive Data Analysis
• Iterative Algorithms, Machine Learning, Graph Processing
• “Real-Time” Stream Processing
• An RDD (Resilient Distributed Dataset) is like a list, just bigger
• An RDD consists of multiple partitions
• Partitions are distributed across the cluster, cached in memory, spilled to disk, automatically recovered on failure, etc.
• Partitions must fit in RAM, but only N at a time (see the sketch below)
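As a rough illustration of partitions (not from the talk; the local master URL and the numbers are made up), one can create an RDD with an explicit number of partitions and then work one partition at a time:

import org.apache.spark.SparkContext

object PartitionsSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "partitions-sketch")

    // Ask for 8 partitions explicitly; the RDD behaves like one big
    // collection, but each worker only needs one partition's worth of
    // data in memory at a time.
    val numbers = sc.parallelize(1 to 1000000, 8)
    println(numbers.partitions.length)   // 8

    // mapPartitions processes one partition (as an iterator) at a time
    val partitionSizes = numbers.mapPartitions(it => Iterator(it.size))
    println(partitionSizes.collect().mkString(", "))

    sc.stop()
  }
}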
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}
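The same pipeline also works interactively, which is how much of the live coding would look. A quick sketch in the Spark shell, where a SparkContext named sc is already provided (the input path is just a placeholder):

// In bin/spark-shell, `sc` is created for you.
val lines = sc.textFile("hdfs:///data/sample.txt")
val counts = lines
  .flatMap(_.split("\\W+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Inspect a few results directly instead of writing them out to files.
counts.take(10).foreach(println)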
• Nothing happens until the RDD is computed by an action (transformations are lazy)
• RDDs that are expensive to re-compute can be cached, persisted, and/or replicated (see the sketch below)
• rdd.persist(MEMORY_AND_DISK_SER_2)
• cache is an alias for persist(MEMORY_ONLY)
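A small sketch of what persisting might look like in the Spark shell; the input path and the parsing step are placeholders:

import org.apache.spark.storage.StorageLevel

// Imagine the parsing step is expensive to redo on every action.
val parsed = sc.textFile("hdfs:///logs/part-*")
  .map(line => line.toLowerCase.split("\t"))

// Keep the parsed data around: serialized, spilled to disk when it does
// not fit in memory, and replicated to two nodes.
parsed.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
// parsed.cache() would be the same as parsed.persist(StorageLevel.MEMORY_ONLY)

// Persisting is lazy as well: the first action materializes the data,
// and later actions reuse the persisted partitions instead of re-reading
// and re-parsing the input.
parsed.count()
parsed.first()

parsed.unpersist()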
Word count in Spark (Scala):

object WordCount {
  def main(args: Array[String]) {
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}

The same program in Hadoop MapReduce (Java):

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
• Spark runs on top of Hadoop HDFS
• The Spark code base is much smaller (and written in Scala)
• Many active contributors
• Much easier to program
• Hadoop Map/Reduce should be considered legacy
• ... this is both good and bad!
• Spark offers an easy-to-use, semantically clean API
• Operationally, the complexity is still there
• It requires tuning and hand-holding!
• Spark relies on reflection-based serialization (Java or Kryo); see the sketch below
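As an illustration of the kind of tuning involved, the sketch below switches Spark to the Kryo serializer and registers an application class with it; the class names and master URL are made up for the example:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical domain class we want to ship across the cluster efficiently.
case class LogRecord(host: String, status: Int, bytes: Long)

// Registering classes with Kryo avoids writing full class names into
// every serialized record.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[LogRecord])
  }
}

object KryoSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[2]")   // illustrative master URL
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)

    val sc = new SparkContext(conf)
    // ... build RDDs of LogRecord as usual; shuffles and serialized
    // storage levels now go through Kryo instead of Java serialization.
    sc.stop()
  }
}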