Interface Design for Spark Community

Interfaces3
Reynold Xin
Aug 22, 2014 @ Databricks Retreat
Repurposed Jan 27, 2015 for Spark community

Spark’s two improvements over Hadoop MR
• Performance: “100X” faster than Hadoop MR
• Programming model: easier to use

Hadoop MR

    // Mapper: emits (word, 1) for every token in the input line.
    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    // Reducer: sums the counts collected for each word.
    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

“It has been the easiest learning experience that I went through” - Burak, coerced by Reynold

• Undergrad CS education cares more about implementation of functionality
• PhD research cares more about prototyping and validating ideas
• Neither requires thinking hard about interface design

“The most important aspect of any module is not how it implements the facilities it provides, but the way in which it provides those facilities in the first place.”
– Damian Conway, “Ten Essential Development Practices”

Examples of Interfaces
• public programming APIs (e.g. RDD)
• external modules we expose (matplotlib)
• default imports in notebooks
• internal module methods (e.g. tree store)
• command line arguments
• configuration options

Why is interface design important?
• If you write code, you are already doing design
• Interfaces can be our biggest asset
• or our biggest liability!

Public Interfaces as Assets
• Great public interfaces capture emotions and in turn capture customers
• Customers invest heavily in (public) interfaces
• Cost of switching interfaces is HIGH: rewriting & retraining
• Network effect: each “customer” brings value to another by writing apps and talking about it

Internal Interfaces as Assets
• Great internal interfaces capture emotions and in turn capture developers
• Developers reinforce our leadership
• Well-designed internal interfaces enable us to move faster
  • e.g. compression codec vs connection manager

Interfaces as Liabilities
• Bad public interfaces increase support burden
  • groupByKey anyone?
• Bad internal interfaces increase cost of maintenance and innovation

Good Interfaces
• Easy to learn & use
• Sufficiently powerful
• Anticipating an inability to know future needs
• Backward compatible

“Other than hiring Reza and buying him drinks, how do I get better at it?”
– Andy Konwinski

Process
1. Identify modules: separation of concerns
2. For each module: don’t sweat implementation details, but take time to identify interfaces, minimize them, and think about how they evolve
3. Design, prototype & program using the interfaces
4. Write out a short design doc and ask for feedback
5. Implement the interface, and iterate

Guidelines

Keep it simple, stupid (KISS)
• Easier to learn / use
• Easier to document
• Easier to implement (fewer bugs)
• Easier to optimize narrow interfaces
• Easier to throw out / re-implement
• Easier to support long term

Ways to Simplify Design

• Remove: Get rid of anything that isn’t essential to the application. This could mean content, too, like the language you use in the navigation labels.
• Organize: Arrange the elements of the interface so that they fit into logical chunks. This might mean grouping based on a person’s mental model (how they think), or tying in to a more familiar interface pattern.
• Hide: Place the most important elements within reach (make them obvious), and hide the others, making them accessible through navigation.
• Displace: Push some of the functionality to another device, or feature, so that the one interface isn’t responsible for displaying every possible interaction.

Name Matters
• Class, variable, method names should be self-explanatory
• Avoid cryptic names (e.g. operator overloading)
• Be consistent

Bad Examples in Spark
• ExecutorLauncher
• ExecutorRunner
• DriverRunner
• DriverWrapper
• Client
• Client (another one)
• ClientBase
• AppClient

Bad Examples in Spark
Class                  Where it is used
ExecutorLauncher       yarn-client
ExecutorRunner         standalone
DriverRunner           standalone
DriverWrapper          standalone
Client                 standalone
Client (another one)   yarn
ClientBase             yarn
AppClient              standalone

Documentation Matters

• Explicit typing for public interfaces is also part of the documentation

Minimize Accessibility
• Make classes and members as private as possible, even for internal modules
• This maximizes information hiding
• Enables modules to be used, understood, built, tested, and debugged independently
• Many Scala developers have a bad habit of leaving everything wide open
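As a minimal sketch of the information-hiding point above: the class below (EventLog and all its members are hypothetical names, not Spark code) exposes two methods and keeps its state and helper private, so both can change without breaking callers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical module: only record() and snapshot() form its interface.
final class EventLog {
    // Private state: callers cannot come to depend on the backing collection.
    private final List<String> events = new ArrayList<>();

    // Private helper: free to change or disappear without breaking anyone.
    private static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    public void record(String event) {
        events.add(normalize(event));
    }

    // Read-only view, so internal state cannot be mutated from outside.
    public List<String> snapshot() {
        return Collections.unmodifiableList(events);
    }
}
```

Because callers see only record() and snapshot(), the list could later be replaced by a ring buffer or a file without touching any calling code.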

Principle of least astonishment
• Use your common sense; interfaces should not surprise users
• e.g. Tachyon format command accidentally deletes file

Composability
• LogisticRegressionWithSGD
• LogisticRegressionWithADMM
• LogisticRegressionWithLBFGS
• LogisticRegressionWithNewton
• LinearRegressionWithSGD
• …

Composability
• LogisticRegression.fit(data, method="admm")
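One way to read the two Composability slides: with the optimizer as a parameter, M algorithms and N optimizers need M + N components instead of M × N classes. The sketch below is only an illustration of that shape. Optimizer, FittedModel and both fit methods are hypothetical names, not MLlib's actual API, and the fitting itself is a placeholder.

```java
// All names here are hypothetical; this is not MLlib's actual API.
enum Optimizer { SGD, LBFGS, ADMM, NEWTON }

final class FittedModel {
    private final String algorithm;
    private final Optimizer optimizer;

    FittedModel(String algorithm, Optimizer optimizer) {
        this.algorithm = algorithm;
        this.optimizer = optimizer;
    }

    Optimizer optimizer() { return optimizer; }

    String describe() { return algorithm + " fitted with " + optimizer; }
}

final class LogisticRegression {
    private LogisticRegression() {}

    // One entry point; the optimizer is an argument, not part of the class name.
    static FittedModel fit(double[][] data, Optimizer method) {
        // Placeholder: a real implementation would run the chosen solver on data.
        return new FittedModel("LogisticRegression", method);
    }
}

final class LinearRegression {
    private LinearRegression() {}

    static FittedModel fit(double[][] data, Optimizer method) {
        // Placeholder, as above.
        return new FittedModel("LinearRegression", method);
    }
}
```

Adding a fifth optimizer in this design means adding one enum constant, not one new class per algorithm.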

Long-term Maintainability
• When in doubt, leave it out
• Every interface added increases complexity
• Easier to add than remove in the future
• Avoid exposing dependency on 3rd party libraries
  • e.g. MLlib’s use of Breeze (+)
  • e.g. Spark’s use of Guava Optional (-)
• Don’t let implementation details impact interface design
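A rough sketch of the "avoid exposing 3rd party libraries" bullet, in the spirit of the Breeze (+) example: the library type is used only in private code, and the public interface mentions only types we own. ThirdPartyVector is a stand-in defined here so the example is self-contained, and MathVector and DenseVector are hypothetical names. None of this is Spark or Breeze code.

```java
// Stand-in for a type that would come from an external library (hypothetical).
final class ThirdPartyVector {
    final double[] values;

    ThirdPartyVector(double[] values) { this.values = values; }

    double dot(ThirdPartyVector other) {
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * other.values[i];
        }
        return sum;
    }
}

// Public interface: no third-party type appears in any signature.
interface MathVector {
    int size();
    double get(int i);
    double dot(MathVector other);
}

final class DenseVector implements MathVector {
    private final double[] values;

    public DenseVector(double... values) { this.values = values.clone(); }

    public int size() { return values.length; }

    public double get(int i) { return values[i]; }

    // The library type is confined to private code, so it can be swapped out later.
    private ThirdPartyVector toThirdParty() { return new ThirdPartyVector(values); }

    public double dot(MathVector other) {
        if (other.size() != size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        double[] copy = new double[other.size()];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = other.get(i);
        }
        return toThirdParty().dot(new ThirdPartyVector(copy));
    }
}
```

Had dot() accepted or returned ThirdPartyVector directly, every caller would be tied to that library's release cycle, which is the Guava Optional (-) situation on the slide.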

• KISS
• Remove, hide, organize, displace
• Name matters
• Documentation matters
• Minimize accessibility
• Compose interfaces for expressivity
• Long-term maintainability
• …

Interface Design
• Years of effort; impossible to do overnight
• Critical in building out a strong platform
• Critical in ensuring the long-term pace of innovation
• We scored better than anybody else out there, but we still have a long way to go

References
• Eric S. Raymond, Basics of the Unix Philosophy, http://www.faqs.org/docs/artu/ch01s06.html
• Joshua Bloch, How to Design a Good API and Why it Matters, http://lcsd05.cs.tamu.edu/slides/keynote.pdf
• Richard Gabriel, The Rise of “Worse is Better”, http://www.jwz.org/doc/worse-is-better.html (I don’t actually agree with the article)
