Science London Dec 10, 2014 Aapo Kyrölä Ph.D., Carnegie Mellon University 2014 (now: Facebook) hBp://www.cs.cmu.edu/~akyrola TwiBer: @kyrpov Big Data – small machine
handle my problem in a reasonable time. 2. I need to solve the problem very fast. Our work expands the space of feasible (graph) problems on one machine: -‐ Our experiments use the same graphs, or bigger, than previous papers on distributed graph computaPon. (+ we can do TwiBer graph on a laptop) -‐ Most data not that “big” anyway. Our work raises the bar on required performance for a “complicated” system.
big problems… 1. Programmer productivity – Global state – Can use “real data” for development 2. Inexpensive to install, administer, less power. 3. Scalability: – 10x machines doing a full job each = 10x throughput
= M[u,v] – Note: always sparse graphs • Intuitive, human-understandable representation – Easy to visualize and explain. • Unifies collaborative filtering (typically matrix based) with recommendation in social networks. – Random walk algorithms. • Local view à vertex-centric computation
directed edges: e = (source, destination) – each edge and vertex associated with a value (user-defined type) – vertex and edge values can be modified • (structure modification also supported) Data Data Data Data Data Data Data Data Data Data 9 GraphChi – Aapo Kyrola A B
Data Data Data Data Data Vertex-centric Programming • “Think like a vertex” • Popularized by the Pregel and GraphLab projects MyFunc(vertex) { // modify neighborhood } Data Data Data Data Data
intervals, each associated with a shard on disk. – sub-graph = interval of vertices GraphChi’s Data Storage shard(1) interval(1) interval(2) interval(P) shard(2) shard(P) 1 n v1 v2 13 Expensive graph parPPoning not required
/ GraphLab Inc) • Includes: – Alternative Least Squares (ALS) – Sparse-ALS – SVD++ – LibFM (factorization machines) – GenSGD – Item-similarity based methods – PMF – CliMF (contributed by Mark Levy) – …. Note: In the C++ -‐version. See Danny’s blog for more informaPon: hBp:// bickson.blogspot.com/ 2012/12/collaboraPve-‐ filtering-‐with-‐graphchi.html
Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-‐Scale Parallel CollaboraPve Filtering for the Neolix Prize” (2008) • Task: Predict ratings for items (movies) by users. • Model: – Latent factor model (see next slide)
Wild Strawberries The CelebraPon La Dolce Vita Women on the Verge of a Nervous Breakdown 4 3 2 5 0.4 2.3 -‐1.8 2.9 1.2 -‐3.2 2.8 0.9 0.2 4.1 8.7 2.9 0.04 2.1 3.141 2.3 2.5 3.9 0.02 0.04 User’s raPng of a movie modeled as a dot-‐product: <factor(user), factor(movie)>!
time (user or movie) • For each user: – Estimate latent(user): minimize least squares of dot-product predicted ratings • GraphChi executes the update function for each vertex (in parallel), and loads edges (ratings) from disk – Latent factors in memory: need O(V) memory. – If factors don’t fit in memory, can replicate to edges. and thus store on disk Scales to very large problems!
(8 cores) GraphChi (Mac Mini) 0 2 4 6 8 10 12 Minutes Ne1lix (99M edges), D=20 Remark: Neolix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU bounded, so should be sub-‐linear in #raPngs).
Jaccard] for each movie-pair that has at least one viewer in common. – Similarity(X, Y) ~ # common viewers • Problem: enumerating all pairs takes too much time.
La Dolce Vita Women on the Verge of a Nervous Breakdown 3 SoluPon: Enumerate all triangles of the graph. New problem: how to enumerate triangles if the graph does not fit in RAM?
of the verPces; • Load the list of neighbors of pivots into RAM • Use GraphChi to load all verPces from disk, one by one, and compare their neighbors to neighboring pivots’ neighbor list • Repeat with a new set of pivots. Triangle Enumeration in GraphChi
most important (non-friend) persons for a person: – Example: Pick top 10 nodes visited by 10,000-step random walk (with restart). • Used by Twitter as first step in their “Who to Follow” –algorithm (Gupta et al., WWW’13)
network visualizaPon, by Akshay Java, 2009 Distributed graph systems: -‐ Each hop across parPPon boundary is costly. Disk-‐based “single-‐ machine” graph systems: -‐ “Paging” from disk is costly. DrunkardMob -‐ RecSys '13
‘13) – Reverse thinking: simulate m/billions of short walks in parallel. – Handle one vertex a time (instead of one walk a time). Note: Need to store only current posiPon of each walk in memory (4B/walk) ! DrunkardMob -‐ RecSys '13
3000 5000 Number of walks Seconds DrunkardMob in−memory walks (Cassovary) (a) Comparison to in-memory walks 0e+00 0 2000 4000 6000 Gra Seconds • • • • • • • • • • • • • • • • (b) Runn CompePPve with in-‐memory walks. However, if you can fit your graph in memory – no need for DrunkardMob. DrunkardMob -‐ RecSys '13
(similar to RocksDB, LevelDB) • Fast in- and out-edge queries using sparse and compressed indices – Storage model optimized for large graphs. • Columnar data storage for fast analytical computation and schema changes Read more from my thesis / arxiv.
759.8 5.9 0 50 100 150 200 GraphChi-‐DB Neo4j MySQL milliseconds 50-‐percenPle 1264 1631 4776 0 1000 2000 3000 4000 5000 6000 GraphChi-‐DB GraphChi-‐DB + Pagerank MySQL milliseconds 99-‐percenPle Latency percenGles over 100K random queries Graph: 1.5B edges GraphChi-‐DB is the most scalable DB with large power-‐law graphs
Easier to work with, better economics. • GraphChi and Parallel Sliding Window –algorithm allow processing graphs in big chunks from disk • GraphChi’s collaborative filtering toolkit for matrix- and graph-oriented recommendation algorithms – Scales to big problems, high efficiency by storing critical data in memory. • GraphChi-DB adds online database features: – Graph database that can do analytical computation.
http://github.com/graphchi-java • http://github.com/graphchiDB-scala Thank you! [email protected] twiBer: @kyrpov See also GraphLab Create by graphlab.com!
in-‐edges A: out-‐edges A B B: in-‐edges B: out-‐edges x Moral: You can either access in-‐ or out-‐edges sequenPally, but not both! Random write! Random read! Processing sequenPally
P For v in interval(p) updateFuncPon(v) For T iteraPons: For v=1 to V updateFuncPon(v) “Asynchronous”: updates immediately visible (vs. bulk-‐synchronous).