Upgrade to Pro — share decks privately, control downloads, hide ads and more …

GraphX: Graph Parallelism Made Simple @ AMPLab ...

Reynold Xin
May 20, 2013
1.3k

GraphX: Graph Parallelism Made Simple @ AMPLab Retreat

GraphX is a new graph processing system on Spark. GraphX enables algorithms to blend graph and tabular views of graph data and can easily express graph parallel abstractions like GraphLab and Pregel. GraphX borrows many of the optimizations from the GraphLab platform to enable efficient distributed graph computation. This is work-in-progress.

Reynold Xin

May 20, 2013
Tweet

Transcript

  1. Graphs are Essential to Data Mining and Machine Learning Identify

    influential people and information Find communities Target ads and products Model complex data dependencies
  2. B C D E F A Specialized Graph Systems 1. 

    APIs to capture complex graph dependencies
  3. Specialized Graph Systems 1.5 423 0 100 200 300 400

    500 GraphLab Hadoop Runtime (in minutes, counting 34.8 billion triangles)
  4. How can data-parallel engines support graph computations efficiently? Spark Shark

    SQL HDFS / Hadoop Storage Mesos Resource Manager Spark Streaming GraphX MLBase
  5. How can data-parallel engines support graph computations efficiently? 1.  The

    right interface for expressing graph computations 2.  Efficient implementation of that interface
  6. Remainder of the Talk 1.  Resilient Distributed Graphs (RDGs) 2. 

    Phase One Implementation 3.  Phase Two Implementation (Future Work)
  7. Resilient Distributed Graphs An extension of Spark RDDs » Immutable, partitioned

    set of vertices and edges » Can be constructed using RDD[Edge] and RDD[Vertex] Tight integration with Spark » Use data-parallel engine (Spark) to do efficient ETL » Consume results of graph computations in Spark
  8. Resilient Distributed Graphs def vertices: RDD[Vertex] def edges: RDD[Edge] def

    edgesWithVertices: RDD[EdgeWithVertices] def mapVertices(mapFunc): Graph def mapEdges(mapFunc): Graph def filterVertices(predicate): Graph def filterEdges(predicate): Graph
  9. Example: Vertex Degree B C D E F A sum:

    5 A: 5 B: 1 C: 1 D: 2 E: 3 F: 2
  10. updateVertices Taking a set of update “messages”, and apply them

    to the vertices using a user-specified function.
  11. Example: updateVertices B C D E F A A: 5

    B: 1 C: 1 D: 2 E: 3 F: 2
  12. Example: updateVertices B C D E F A A: 5

    B: 1 C: 1 D: 2 E: 3 F: 2
  13. Example: updateVertices B C D E F A A: 5

    B: 1 C: 1 D: 2 E: 3 F: 2 5 1 1 2 3 2
  14. Resilient Distributed Graphs RDD-like primitives: » map, reduce, filter… Graph computation

    primitives: » aggregateNeighbors: RDD[(VID, Value)] » updateVertices: Graph Surprising expressive and general: » Implemented GraphLab and Pregel API in 20 lines of code
  15. GraphX Phase 1 Implemented RDG abstraction on Spark » Using existing

    Spark operators (map, reduce, filter, join) » Minimal network communication (eq. GraphLab) Higher level APIs: » Pregel (20 lines of code) » GraphLab/PowerGraph (25) Algorithms: » PageRank (5), Connected components (10), Shortest path (10), Alternating Least Squares (40)
  16. PageRank Performance 22 165 1340 0 200 400 600 800

    1000 1200 1400 1600 GraphLab GraphX Hadoop Runtime (in seconds, PageRank for 10 iterations)
  17. Phase 2: Improved Perf Introduce 2 new primitives in Spark

    » “Mutable” RDD, i.e. update-in-place for iterative computations » Pre-built hash-indexes for record lookups
  18. GraphX 1.  Graph-parallel primitives implementable in data-parallel engines (Spark). 2. 

    Currently slower than GraphLab, but » No need for specialized systems » Easier ETL, and easier consumption of output » Interactive graph data mining 3.  Phase 2 (with small additions to Spark), will bring performance closer to specialized engines.
  19. Resilient Distributed Graphs 1.  Graph-parallel primitives implementable in data-parallel engines

    (Spark) MPP databases. 2.  Phase 1 » map, reduce, filter, join = select, group by, where, join » Implementable in MPP databases using only UDFs! 3.  Phase 2 » Materialized views » New hash-indexes and access methods for optimized update-in-place
  20. Resilient Distributed Graphs 1.  Expressive graph-parallel primitives » Pregel, GraphLab (20

    lines of code) » PageRank (5 lines of code) 2.  Existing data-parallel engines and MPP databases can support these primitives without any modifications 3.  Can significantly improve performance using a few new operators and access methods (come to the next retreat!)