really.
• Strong Eventual Consistency (SEC): “Replicas that deliver the same updates have equivalent state.”
• Primary requirement: eventual replica-to-replica communication.
• Order insensitive! (Commutativity)
• Duplicate insensitive! (Idempotence)
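To see why these two properties matter, consider a state-based grow-only set whose merge is set union: union is commutative and idempotent, so replicas converge regardless of delivery order or duplicate delivery. A minimal Erlang sketch (illustrative only, not Lasp’s implementation):

    %% G-Set sketch: state is an ordset, merge is union.
    merge(A, B) -> ordsets:union(A, B).

    demo() ->
        U1 = ordsets:from_list([1]),
        U2 = ordsets:from_list([2]),
        %% Order insensitive: either merge order yields the same state.
        true = merge(U1, U2) =:= merge(U2, U1),
        %% Duplicate insensitive: re-applying an update changes nothing.
        Merged = merge(U1, U2),
        true = merge(Merged, U1) =:= Merged,
        ok.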
• Convergent data structures: the primary data abstraction is the CRDT.
• Enables composition: provides functional composition of CRDTs that preserves the SEC property.
    %% Create initial set.
    S1 = declare(set),
    %% Add elements to initial set and update.
    update(S1, {add, [1,2,3]}),
    %% Create second set.
    S2 = declare(set),
    %% Apply map operation between S1 and S2.
    map(S1, fun(X) -> X * 2 end, S2).
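Note that map/3 establishes a dataflow relationship rather than a one-shot computation: as elements are later added to S1, their doubled counterparts appear in S2, so the composition itself preserves SEC.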
runtime system that can scale to large numbers of nodes, is resilient to failures, and provides efficient execution.
• Well-matched to Lattice Processing (Lasp):
• Epidemic broadcast mechanisms provide weak ordering but are resilient and efficient.
• Lasp’s programming model is tolerant to message reordering, disconnections, and node failures.
• “Selective receive”: nodes selectively receive and process messages based on interest, as sketched below.
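Erlang’s selective receive gives a concrete picture of this: a receive clause pattern-matches only the messages a node is interested in, leaving everything else in the mailbox. A minimal sketch (handle_update/1 is a hypothetical placeholder):

    loop(TopicOfInterest) ->
        receive
            %% Process only updates for the topic this node cares about;
            %% non-matching messages remain queued in the mailbox.
            {update, TopicOfInterest, Value} ->
                handle_update(Value),
                loop(TopicOfInterest)
        end.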
in a client-server or peer-to-peer mode.
• Broadcast (via gossip, tree, etc.): efficient dissemination of both program state and application state via gossip, broadcast tree, or hybrid mode.
• Auto-discovery: integration with Mesos; auto-discovery of Lasp nodes for ease of configurability.
are paid according to a minimum number of impressions.
• Clients will go offline: clients have limited connectivity and the system still needs to make progress while clients are offline.
[Figure series: the ad counter dataflow. User-maintained CRDTs (Ads, Contracts) feed Lasp operations: Product pairs ads with contracts, Filter derives “Ads With Contracts”, and Union combines the Rovio and Riot ads. Each ad has its own counter (Rovio Ad Counter 1 and 2, Riot Ad Counter 1 and 2); clients hold a single copy of each counter, Increment it on impression, and once a Read observes 50,000 impressions the ad is Removed. Client side: single copy at client.]
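In Lasp’s API this dataflow can be expressed roughly as follows (a sketch modeled on the published Lasp ad counter example; ad_matches_contract/2 is a hypothetical predicate):

    %% User-maintained source collections.
    Ads = declare(set),
    Contracts = declare(set),
    %% Lasp-maintained derived collections.
    AdsContracts = declare(set),
    AdsWithContracts = declare(set),
    %% Pair every ad with every contract, then keep only the pairs
    %% where the contract actually covers the ad.
    product(Ads, Contracts, AdsContracts),
    filter(AdsContracts,
           fun({Ad, Contract}) -> ad_matches_contract(Ad, Contract) end,
           AdsWithContracts).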
cross-node message passing.
• Known scalability limitations: analyzed in various academic publications.
• Single connection: head-of-line blocking.
• Full membership: all-to-all failure detection with heartbeats and timeouts.
port: similar to the Solaris sunrpc-style portmap, a known port for mapping to dynamic port-based services.
• Bridged networking: problematic for clustering in bridged networking with dynamic port allocation.
Entertainment.
• Runtime configuration: application controlled through runtime environment variables.
• Membership: full membership with Distributed Erlang via EPMD.
deployment of both EPMD and the Lasp application.
• Single EPMD instance per slave: controlled through the use of host networking and HOSTNAME:UNIQUE constraints in Mesos.
• Lasp: local execution using host networking; connects to the local EPMD.
• Service discovery: facilitated by clustering EPMD instances through Sprinter.
too much needed to be adapted for things to work correctly.
• Single orchestration task: dispatched events, controlled when to start and stop the evaluation, and performed log aggregation.
• Bottleneck: events immediately dispatched; would require blocking for processing acknowledgment.
• Unrealistic: events do not queue up all at once for processing by the client.
of memory.
• Weeks spent adding instrumentation: process-level, VM-level, and Erlang Observer instrumentation to identify CPU- and memory-heavy processes.
• Dissemination too expensive: 1,000 threads to a single dissemination process (one Mesos task) leads to backed-up message queues and memory leaks.
• Unrealistic: two different dissemination mechanisms, thread-to-thread and node-to-node; one is synthetic.
EPMD during execution.
• Lost connection: EPMD loses connections with nodes for some arbitrary reason.
• EPMD task restarted by Mesos: restarted for an unknown reason, which leads Lasp instances to restart in their own container.
5 GiB of state within 90 seconds.
• Delta dissemination: delta dissemination only provides around a 30% decrease in state transmission.
• Unbounded queues: message buffers would lead to VMs crashing because of large memory consumption.
service with an abstract interface, initially on EPMD, and migrate later once tested.
• Adapt Lasp and the broadcast layer: integrate the pluggable membership service throughout the stack and liberate existing libraries from Distributed Erlang.
• Build a service discovery mechanism: mechanize node discovery outside of EPMD, based on the new membership service.
configuration of protocols used for cluster membership.
• Several protocol implementations (a hypothetical configuration sketch follows this list):
• Full membership via EPMD.
• Full membership via TCP.
• Client-server membership via TCP.
• Peer-to-peer membership via TCP (with HyParView).
• Visualization: provides a force-directed, graph-based visualization engine for real-time cluster debugging.
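A sketch of how one of these protocols might be selected at runtime; the environment key and manager module name follow Partisan’s naming conventions but should be treated as illustrative assumptions, not verified API:

    %% Choose the membership protocol before starting the application
    %% (key and module names are assumptions).
    ok = application:set_env(partisan, partisan_peer_service_manager,
                             partisan_hyparview_peer_service_manager),
    {ok, _} = application:ensure_all_started(partisan),
    %% Join a known peer to form the cluster.
    ok = partisan_peer_service:join(Peer).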
have full visibility into the entire graph.
• Failure detection: performed by peer-to-peer heartbeat messages with a timeout.
• Limited scalability: the heartbeat interval increases as the node count increases, leading to false or delayed detection.
• Testing: used to create the initial test suite for Partisan.
in the system as peers; a client has only the server as a peer.
• Failure detection: nodes heartbeat, with a timeout, all peers they are aware of.
• Limited scalability: the server is a single point of failure, with limited scalability of visibility.
• Testing: used for baseline evaluations as the “reference” architecture.
(fixed) and passive (log n); the passive view is used for failure replacement in the active view (see the sketch after this list).
• Failure detection: performed by monitoring active TCP connections to peers, with keep-alive enabled.
• Very scalable (10k+ nodes during academic evaluation): however, probabilistic; potentially leads to isolated nodes during churn.
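The failure-replacement step can be sketched as follows, assuming simple list-based views (real HyParView also handles symmetry, shuffling, and retries):

    %% On losing an active peer, promote a random passive peer
    %% into the active view (simplified).
    handle_peer_down(Peer, #{active := Active0, passive := Passive0} = State) ->
        Active = lists:delete(Peer, Active0),
        case Passive0 of
            [] ->
                State#{active := Active};
            _ ->
                Candidate = lists:nth(rand:uniform(length(Passive0)), Passive0),
                State#{active := [Candidate | Active],
                       passive := lists:delete(Candidate, Passive0)}
        end.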
to cluster all nodes and ensure a connected overlay network; reads information from Marathon.
• Node-local: operates at each node and is responsible for taking actions to ensure a connected graph; required for probabilistic protocols.
• Membership-mode specific: knows, based on the membership mode, how to properly cluster nodes, and enforces proper join behaviour.
view for analysis.
• Elected node (or group) analyses: periodically analyses the information in S3 for the following (a sketch of the symmetry check follows this list):
• Isolated node detection: identifies isolated nodes and takes corrective measures to repair the overlay.
• Verifies symmetric relationships: ensures that if a node knows about another node, the relationship is symmetric; prevents “I know you, but you don’t know me.”
• Periodic alerting: alerts regarding disconnected graphs so external measures can be taken, if necessary.
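The symmetry check itself is simple once views are collected; a sketch assuming views arrive as {Node, Peers} pairs (the collected format is an assumption):

    %% Return directed edges that lack a reverse edge, i.e.
    %% “A knows B, but B does not know A.”
    asymmetric_edges(Views) ->
        Known = sets:from_list([{A, B} || {A, Peers} <- Views, B <- Peers]),
        [{A, B} || {A, B} <- sets:to_list(Known),
                   not sets:is_element({B, A}, Known)].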
a cluster of nodes and configure simulations at runtime.
• Each simulation:
• Different application scenario: uniquely execute a different application scenario at runtime, based on runtime configuration.
• Result aggregation: aggregate results at the end of execution and archive them.
• Plot generation: automatically generate plots for the execution and aggregate the results of multiple executions.
• Minimal coordination: work must be performed with minimal coordination, as a single orchestrator is a scalability bottleneck for large applications.
containing counters that each node manipulates.
• Simulates a workflow: nodes use this operation to simulate a lock-step workflow for the experiment, as sketched below.
• Event generation: event generation toggles a boolean for the node to show completion.
• Log aggregation: completion triggers log aggregation.
• Shutdown: upon log aggregation completion, nodes shut down.
• External monitoring: when events complete execution, nodes automatically begin the next experiment.
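The lock-step idea reduces to a shared completion flag per node; a plain-map sketch for illustration (the experiments use a CRDT map instead):

    %% A node marks its stage complete; when all flags are set,
    %% the next stage (e.g. log aggregation) may begin.
    mark_complete(Node, Flags) ->
        maps:put(Node, true, Flags).

    all_complete(Nodes, Flags) ->
        lists:all(fun(N) -> maps:get(N, Flags, false) end, Nodes).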
you exceed a few nodes: message queues, memory, delays.
• Partial views: required; rely on transitive dissemination of information and partial network knowledge.
• Results: reduced Lasp’s memory footprint to 75 MB; larger in practice for debugging.
mechanism: random promotion of isolated nodes; mainly issues of symmetry.
• FIFO across connections: not per connection, but the protocol assumes FIFO across all connections, leading to false disconnects.
• Unrealistic system model: you need per-message acknowledgements for safety.
• Pluggable protocols help debugging: being able to switch to full membership or client-server helps distinguish protocol problems from application problems.
Connectivity, but transient partitions and isolated nodes at 500–1000 nodes (across 140 instances).
• Limited financially and by Amazon: harder to run larger evaluations because we’re limited financially (as a university) and by Amazon limits.
• Mean state reduction per client: around a 100x improvement over our initial PaPoC 2016 evaluation results.
your cluster: all of these things lead to easier debugging.
• Control changes: no Lasp PR is accepted without divergence, state-transmission, and overhead graphs.
• Automation: developers use graphs when they are easy to make; lower the difficulty of generation so developers understand how changes alter system behaviour.
• Make work easily testable: when you test locally and deploy globally, you need to make things easy to test, deploy, and evaluate (for good science, I say!).