◇ Support for both batch and stream processing ◇ SQL, MLlib, GraphX ◇ Up to 100x faster in memory and 10x faster on disk ◇ Can run on a standalone cluster, Mesos or YARN ◇ HDFS for data storage
◇ In-memory computation instead of disk ◇ Less disk I/O ◇ Lazy evaluation of tasks - optimizes the data processing workflow ◇ In-memory caching is helpful when multiple operations access the same dataset
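Eg (a minimal sketch of lazy evaluation plus caching; the events.txt input path and the ERROR filter are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

object LazyCacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LazyCacheExample").getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy - nothing is read or computed yet
    val lines  = sc.textFile("events.txt")           // hypothetical input path
    val errors = lines.filter(_.contains("ERROR"))

    // Cache the filtered RDD because two actions below reuse it
    errors.cache()

    // Actions trigger the actual computation
    val errorCount  = errors.count()   // first action: reads the file, filters, populates the cache
    val firstErrors = errors.take(10)  // second action: served from the cache, no re-read

    println(s"$errorCount errors; sample: ${firstErrors.mkString(", ")}")
    spark.stop()
  }
}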
◇ Supports two kinds of operations - transformations and actions ◇ Immutable - every transformation returns a new RDD ◇ Fault tolerant - lost partitions are recomputed from their lineage Eg: rdd.map(person => person.age).filter(_ > 18)
Dataframe ◇ Distributed data organized into named columns ◇ Like a table in a relational db ◇ As of Spark 2.0 a Dataframe is nothing but a Dataset[Row] Eg: df.select("age").filter("age > 18")
Datasets ◇ Available since 1.6 ◇ Has all the benefits of an RDD along with Spark SQL's optimized execution engine ◇ Has a strongly typed API as well as an untyped one (the Dataframe) Eg: ds.map(person => person.age).filter(_ > 18)
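Eg (the three API snippets above as one self-contained sketch; the Person case class and the tiny in-memory sequence are assumptions made for illustration):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical record type for the examples

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApiComparison").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ana", 23), Person("Bo", 17), Person("Cid", 35))

    // RDD: functional and strongly typed, but opaque to the Catalyst optimizer
    val rddAges = spark.sparkContext.parallelize(people)
      .map(person => person.age)
      .filter(_ > 18)

    // DataFrame (Dataset[Row]): named columns, optimized by Catalyst, untyped rows
    val dfAges = people.toDF()
      .select("age")
      .filter("age > 18")

    // Dataset: typed like an RDD, optimized like a DataFrame
    val dsAges = people.toDS()
      .map(person => person.age)
      .filter(_ > 18)

    println(rddAges.collect().toSeq)  // tiny demo data, so collect() is fine here
    dfAges.show()
    dsAges.show()
    spark.stop()
  }
}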
◇ Prefer larger executors - less movement of data across the network ◇ The more cores per executor, the more tasks can run in parallel ◇ A single-core executor doesn't take full advantage of running multiple tasks in the same JVM
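A sketch of sizing fewer, larger executors through configuration; the executor count, core count and memory values are purely illustrative, and the same settings are usually passed as spark-submit flags (--num-executors, --executor-cores, --executor-memory):

import org.apache.spark.sql.SparkSession

object ExecutorSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExecutorSizing")
      .config("spark.executor.instances", "10") // fewer, larger executors (YARN-style setting)
      .config("spark.executor.cores", "5")      // 5 tasks can run in parallel inside each JVM
      .config("spark.executor.memory", "14g")   // shared by the 5 cores of that executor
      .getOrCreate()

    // ... job logic ...
    spark.stop()
  }
}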
◇ Have smaller partitions - beyond 2000 partitions Spark tracks shuffle output with a more compact, compressed map status ◇ Avoid groupByKey while working with RDDs - Switch to reduceByKey if aggregating (map-side combine) - Apply filters before grouping rather than after
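Eg (a sketch of the map-side combine difference; the small key/value pairs are made-up demo data):

import org.apache.spark.sql.SparkSession

object AggregationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AggregationExample").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

    // groupByKey shuffles every value across the network before summing
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so far less data is shuffled
    val viaReduce = pairs.reduceByKey(_ + _)

    println(viaGroup.collect().toSeq)   // sums per key: a -> 4, b -> 6 (tiny demo data)
    println(viaReduce.collect().toSeq)  // same result, cheaper shuffle
    spark.stop()
  }
}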
◇ Avoid actions that bring all the data to the driver for large datasets (e.g. collect) ◇ Too many partitions - Every task sends a map status object back to the driver - The driver has to deal with many map output status requests from the reducers
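Eg (a sketch of keeping large results off the driver; input/ and output/ are placeholder paths):

import org.apache.spark.sql.SparkSession

object DriverFriendlyOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DriverFriendlyOutput").getOrCreate()

    val big = spark.read.parquet("input/")   // hypothetical large dataset

    // Instead of big.collect(), which pulls everything onto the driver:
    val sample = big.limit(20).collect()           // inspect only a bounded sample
    big.write.mode("overwrite").parquet("output/") // let executors write out in parallel

    println(s"sampled ${sample.length} rows")
    spark.stop()
  }
}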
◇ Shuffles are expensive - they involve disk I/O, data serialization, and network I/O ◇ Avoid shuffles when possible - Broadcast the smaller dataset while joining; shuffling all of the data of the larger table across the network could be more expensive than just putting a copy of the small table on every executor ◇ More cores per executor still achieves parallelism with fewer shuffles over the network vs having more executors with fewer cores
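Eg (a sketch of a broadcast join; the orders/ path, the country_id join key and the tiny countries table are assumptions made for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BroadcastJoinExample").getOrCreate()
    import spark.implicits._

    val orders    = spark.read.parquet("orders/")                        // large table
    val countries = Seq((1, "US"), (2, "IN")).toDF("country_id", "name") // small table

    // broadcast() ships the small table to every executor, so the large table
    // is joined in place instead of being shuffled across the network
    val joined = orders.join(broadcast(countries), Seq("country_id"))

    joined.show(5)
    spark.stop()
  }
}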
◇ Watch the executor logs for data spilling to disk - UnsafeExternalSorter: Thread 75 spilling sort data of 141.0 MB to disk (90 times so far) ◇ Ensure that spark.memory.fraction isn't too low ◇ If you're running with a minimal number of nodes, increase them ◇ By default the spilled data is written to /tmp; you can add more disks by specifying them in spark.local.dirs
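A sketch of the two knobs mentioned above; the values and disk paths are illustrative, and on managed clusters the local directories are usually set through the cluster manager instead:

import org.apache.spark.sql.SparkSession

object SpillTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SpillTuning")
      .config("spark.memory.fraction", "0.6")              // don't starve execution/storage memory
      .config("spark.local.dirs", "/mnt/disk1,/mnt/disk2") // spread spill files over more disks
      .getOrCreate()                                       // (may be overridden by SPARK_LOCAL_DIRS / LOCAL_DIRS)

    // ... job logic ...
    spark.stop()
  }
}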
◇ Use the Parquet file format - around 75% data compression on average ◇ Push down filters (1.6 onwards) to reduce disk I/O - with filter pushdown and column pruning only the required columns and row groups are read ◇ Use DirectParquetOutputCommitter - avoids expensive renames
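Eg (a sketch of column pruning plus filter pushdown on Parquet; people.json and people.parquet are placeholder paths with an assumed name/age schema):

import org.apache.spark.sql.SparkSession

object ParquetPushdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetPushdown").getOrCreate()

    val people = spark.read.json("people.json")
    people.write.mode("overwrite").parquet("people.parquet")

    // Only the needed columns and matching row groups are read from the Parquet
    // files, instead of scanning the whole dataset
    val adults = spark.read.parquet("people.parquet")
      .select("name", "age")
      .filter("age > 18")

    adults.explain()   // the physical plan should list PushedFilters on the Parquet scan
    adults.show()
    spark.stop()
  }
}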
◇ Replace groupByKey with aggregate functions wherever possible
◇ Never collect/coalesce large data onto a single node - let the executors write out the data in a distributed manner
◇ Use the Parquet file format and Kryo serialization
◇ Switch to Datasets to take full advantage of Spark's SQL engine
◇ Default partitions may not work all the time - know when to increase or decrease them
◇ Cache data, especially in jobs where it is re-used across stages - avoids re-computation
◇ Prefer larger executors with more cores - achieves more parallelism
◇ Broadcast variables that are shared by all executors - also while performing joins with a smaller dataset
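For the Parquet-plus-Kryo item, a sketch of enabling Kryo serialization; the Person class is a hypothetical application type registered up front:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)  // hypothetical application class

object KryoConfig {
  def main(args: Array[String]): Unit = {
    // Kryo must be configured before the SparkContext is created
    val conf = new SparkConf()
      .setAppName("KryoConfig")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Person]))

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // ... job logic ...
    spark.stop()
  }
}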