Reynold Xin
January 07, 2013

Beating State-of-the-art By -10000% @ CIDR Gong Show

I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.

Transcript

  1. Beating State-of-the-art By -10000%. Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
  2. Beating State-of-the-art By -10000% (NOT A TYPO). Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
  3. “The bar for open source software is at historical low.” i.e. “This is the right time to do grad school.”
  4. Shark: How to do SQL query processing efficiently in “MapReduce” style. SQL on top of Spark. Hive compatible (UDF, Type, InputFormat, Metadata).
  5. “You need to beat Hadoop by at least 100X to publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
  6. [Bar chart: runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Bar labels as extracted: 110, 94, 64, 0.96, 1, 0.7.]
  7. I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
  8. I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. But I later found out that it is still 10X slower than the latest version of GraphLab :(
  9. A lot of open questions for fault-tolerant, distributed graph computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?