crowd workers, and everyday consumers, sales, marke@ng… they all benefit from insights derived from data. Democratization of Data Developers and DBAs are no longer the only ones genera@ng, processing and analyzing data.
Problem: Matrices become large > Mutable state leads to concise algorithms but complicates parallelism and fault tolerance Matrix userItem = new Matrix(); Matrix coOcc = new Matrix(); > Cannot lose state aTer failure > Need to manage state to support data-‐parallelism
state but with performance and fault tolerance of distributed dataflow systems > Programming distributed dataflow graphs requires learning new programming models Imperative Big Data Processing
state to allow data and pipeline parallelism 16 Task Elements (TEs) process data State Elements (SEs) represent state Dataflows represent data > Task Elements have local access to State Elements
program > SEEP: data-‐parallel processing placorm • Transla@on occurs in two stages: – Sta-c code analysis: From Java to SDG – Bytecode rewri-ng: From SDG to SEEP [SIGMOD’13] Program.java
Live variable analysis TE and SE access code assembly SEEP runnable SOOT Framework Javassist > Extract state and state access paderns through sta@c code analysis > Genera@on of runnable code using TE and SE connec@ons Translation Process Janino
Live variable analysis TE and SE access code assembly SEEP runnable Javassist > Extract state and state access paderns through sta@c code analysis > Genera@on of runnable code using TE and SE connec@ons Translation Process Annotated Program.java SOOT Framework Janino
Matrix coOcc = new SeepMatrix(); void addRa@ng(int user, int item, int ra@ng) { userItem.setElement(user, item, ra@ng); updateCoOccurrence(@Global coOcc, userItem); } Partial State and Global Annotations > @Global annotates variable to indicate access to all par@al instances > @ParNal field annotaNon indicates par-al state
> Node failures may lead to state loss CheckpoinNng State • No updates allowed while state is being checkpointed • Checkpoin@ng state should not impact data processing path > Task elements access local in-‐memory state Physical nodes Challenges of Making SDGs Fault Tolerant
• Backups large and cannot be stored in memory • Large writes to disk through network have high cost State Backup > Node failures may lead to state loss CheckpoinNng State • No updates allowed while state is being checkpointed • Checkpoin@ng state should not impact data processing path > Task elements access local in-‐memory state Physical nodes Challenges of Making SDGs Fault Tolerant
2:1 5:1 100 1000 Throughput (1000 requests/s) Latency (ms) Workload (state read/write ratio) Throughput Latency Combines batch and online processing to serve fresh results over large mutable state Processing with Large Mutable State > addRa@ng and getRec func@ons from recommender algorithm, while changing read/write ra@o
dataflow frameworks SDG: Stateful Dataflow Graphs – Abstrac@ons for distributed mutable state – AnnotaNons to disambiguate types of distributed state and state access – Checkpoint-‐based fault tolerance mechanism 51 Summary
dataflow frameworks SDG: Stateful Dataflow Graphs – Abstrac@ons for distributed mutable state – AnnotaNons to disambiguate types of distributed state and state access – Checkpoint-‐based fault tolerance mechanism 52 Summary Thank you! Any QuesNons? @raulcfernandez [email protected] h)ps://github.com/raulcf/SEEPng/