Highly Available Transactions: Virtues and Limitations

pbailis

September 04, 2014

Transcript

  1. Highly Available Transactions: Virtues and Limitations. Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joe Hellerstein, Ion Stoica. UC Berkeley & University of Sydney. VLDB 2014, Hangzhou, China, 4 Sept. 2014.
  2–9. High Availability [Gilbert and Lynch, ACM SIGACT News 2002]: the system guarantees a response, even during network partitions (async network). (Single definition repeated across build slides 2–9.)
  10. NETWORK PARTITIONS. “Network partitions should be rare but net gear continues to cause more issues than it should.” --James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010]
  11. NETWORK PARTITIONS
    MSFT LAN: avg. 40.8 failures/day (95th %ile: 136); 5 min median time to repair (up to 1 week) [SIGCOMM 2011]
    UC WAN: avg. 16.2–302.0 failures/link/year; avg. downtime of 24–497 minutes/link/year [SIGCOMM 2011]
    HP LAN: 67.1% of support tickets are due to the network; median incident duration 114–188 min [HP Labs 2012]
  12. [Screenshot of the first page of “The Network Is Reliable: An informal survey of real-world communications failures,” Peter Bailis and Kyle Kingsbury, CACM, September 2014 issue, DOI: 10.1145/2643130. The article opens with Peter Deutsch’s fallacy that “the network is reliable” and surveys evidence that real-world partitions are common.]
  13–14. High Availability (recap): the system guarantees a response, even during network partitions (async network) [Gilbert and Lynch, ACM SIGACT News 2002]. Corollary: low latency, especially over the WAN [“PACELC,” Abadi, IEEE Computer 2012].
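To make the corollary concrete, here is a minimal sketch (my illustration, not code from the talk; all names are hypothetical) of a replica that answers every request from local state and ships updates to peers asynchronously. Nothing on the request path waits on another site, so the replica keeps responding during partitions and its latency is independent of WAN round trips.

```python
import queue
import threading

class AvailableReplica:
    """A replica that never blocks on the network to answer a request."""

    def __init__(self):
        self.store = {}               # local copy of the data
        self.outbox = queue.Queue()   # async replication stream to peers
        self.lock = threading.Lock()

    def read(self, key):
        # Served entirely from local state: returns *some* value even if
        # every other replica is unreachable.
        with self.lock:
            return self.store.get(key)

    def write(self, key, value):
        # Apply locally, then queue the update for background shipping;
        # the client never waits on a WAN round trip.
        with self.lock:
            self.store[key] = value
        self.outbox.put((key, value))  # a replicator thread drains this

r = AvailableReplica()
r.write("x", 1)          # returns immediately; replication is queued
assert r.read("x") == 1
```

The trade-off is the usual one: peers may serve stale data until the outbox drains.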
  15–19. LOW LATENCY. Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
    LAN: 0.5 ms (1x)
    Co-located WAN: 1–3.5 ms (2–7x)
    WAN: 22–360 ms (44–720x)
    (Single table repeated across build slides 15–19.)
  20. “AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput, Aggressive Scale-out. cf. “Coordination Avoidance in Database Systems,” to appear in VLDB 2015.
  21. Serializability supported? (18 databases surveyed)
    YES: Actian Ingres, Greenplum, IBM DB2, IBM Informix, MySQL, MS SQL Server, Oracle BDB, Oracle BDB JE, Postgres 9.2.2, VoltDB
    NO: Aerospike, Persistit, Clustrix, MemSQL, NuoDB, Oracle 11G, SAP HANA, ScaleDB
    8/18 databases surveyed did not support serializability; 15/18 used weak models by default.
  22–24. Which of these isolation models can be highly available (“HA?”): serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability? Those that can are Highly Available Transactions (HATs).
  25–28. [Figure: taxonomy of isolation and consistency models, color-coded Unavailable / Sticky Available / Highly Available. Legend: † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees.]
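As a toy worked example (values invented for illustration) of the two anomalies the legend names: lost update occurs when two read-modify-write transactions both read the old value and one increment silently disappears; write skew occurs when two transactions read overlapping state but write disjoint items, together violating an invariant neither violates alone.

```python
# Lost update: two concurrent read-modify-writes on the same counter.
balance = 100
t1_read = balance           # T1 reads 100
t2_read = balance           # T2 reads 100
balance = t1_read + 20      # T1 writes 120
balance = t2_read + 30      # T2 writes 130: T1's +20 is lost
assert balance == 130       # any serial order would have produced 150

# Write skew: disjoint writes that jointly break an invariant
# ("at least one doctor stays on call").
on_call = {"alice": True, "bob": True}
t1_sees = sum(on_call.values())           # T1 sees 2 on call
t2_sees = sum(on_call.values())           # T2 sees 2 on call
if t1_sees >= 2:
    on_call["alice"] = False              # T1 takes alice off call
if t2_sees >= 2:
    on_call["bob"] = False                # T2 takes bob off call
assert sum(on_call.values()) == 0         # invariant violated
```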
  29–33. Read Committed (RC): transactions buffer writes until commit time; replicas never serve dirty or non-final writes. ANSI Repeatable Read (RR): transactions read from a snapshot of the DB; transactions buffer reads from replicas. Takeaway: unavailable implementations ⇏ unavailable semantics.
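A minimal sketch of the RC recipe above (my illustration under simplifying assumptions, not the paper's implementation): the client buffers writes until commit and then ships the complete write set, so a replica only ever holds committed, final versions and no read can observe a dirty or intermediate write, all without coordinating with other clients.

```python
class Replica:
    """Holds only committed data, so every value it serves is final."""
    def __init__(self):
        self.committed = {}

    def get(self, key):
        return self.committed.get(key)

    def install(self, write_set):
        # Installs a transaction's complete, final write set at once.
        self.committed.update(write_set)

class RCClient:
    """Read Committed via client-side write buffering."""
    def __init__(self, replica):
        self.replica = replica
        self.buffer = {}              # uncommitted writes live client-side

    def write(self, key, value):
        self.buffer[key] = value      # nothing reaches a replica yet

    def read(self, key):
        return self.replica.get(key)  # only committed data is visible

    def commit(self):
        self.replica.install(self.buffer)
        self.buffer = {}

c = RCClient(Replica())
c.write("x", 1)
assert c.read("x") is None   # no dirty read: the write is still buffered
c.commit()
assert c.read("x") == 1
```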
  34. [Taxonomy figure repeated; see slides 25–28.]
  35. Strengthening the guarantees:
    + ANSI Repeatable Read: snapshot reads of database state (database does not change), including predicate-based reads
    + Read Atomic Isolation (TA): observe all or none of another txn’s updates
    + Causal Consistency: read your writes; time doesn’t go backwards; writes follow reads
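One of the causal building blocks, read your writes, can likewise be provided without synchronous coordination. A hypothetical sketch (the StaleReplica stand-in and its write_async method are my assumptions, not the paper's API): the session overlays its own writes on whatever a possibly-stale replica returns, so a write is never invisible to a later read in the same session.

```python
class StaleReplica:
    """Stand-in for a replica that may lag behind this session's writes."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write_async(self, key, value):
        # A real system would replicate in the background; modeling the
        # worst case here, the write has not arrived yet.
        pass

class Session:
    """Read-your-writes by overlaying the session's own writes."""
    def __init__(self, replica):
        self.replica = replica
        self.my_writes = {}          # everything this session has written

    def write(self, key, value):
        self.my_writes[key] = value
        self.replica.write_async(key, value)

    def read(self, key):
        # Prefer the session's own write over a replica copy that may
        # not have received it yet.
        if key in self.my_writes:
            return self.my_writes[key]
        return self.replica.read(key)

s = Session(StaleReplica())
s.write("x", 1)
assert s.read("x") == 1      # visible even though the replica lags
```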
  36. Experimental Validation: Thrift-based sharded key-value store with LevelDB for persistence (cluster A, cluster B). Focus on “CP” vs. HAT overheads. Code: https://github.com/pbailis/hat-vldb2014-code
  37–40. Single-region setup: 2 clusters in us-east, 5 servers/cluster, transactions of length 8, 50% reads / 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, and Master.] Mastered deployment saw 2x the latency of HATs; HATs sustained 128K ops/s.
  41–43. Two-region setup: clusters in us-east and us-west, 5 servers/DC, transactions of length 8, 50% reads / 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, and Master; “300ms” annotated.] Mastered deployment saw 2–70x the latency of HATs.
  44–46. Five-region setup: CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads / 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, and Master; “800ms” annotated.] Mastered deployment saw 8–186x the latency of HATs.
  47. Also in the paper: in-depth discussion of isolation guarantees; extending “AP” to the transactional context; sticky availability and sessions; discussion of atomicity and durability; more evaluation.
  48. This paper is all about coordination + isolation levels (some surprising results!). How else can databases benefit? How do we address whole programs? Our experience: isolation levels are unintuitive!
  49. RAMP Transactions: a new isolation model and coordination-free implementation of indexing, matviews, and multi-put [SIGMOD 2014].
    I-confluence: which integrity constraints are enforceable without coordination? OLTPBench suite plus a general theory [VLDB 2015].
    Real-world applications: analysis of open-source applications for coordination requirements; similar results [in preparation].
    Distributed optimization: numerical convex programs have close analogues to transaction-processing techniques [in preparation].
  50. PUNCHLINE: coordination is avoidable surprisingly often, but we need to understand use cases + semantics. The use cases are staggering in number and often in plain sight (hint: look to applications and big systems in the wild). We have a huge opportunity to improve theory and practice by understanding what’s possible.