keeping-time-velocity.pdf

kavya
October 20, 2017

Transcript

1. distributed key-value store. three nodes. assume no failures; all operations succeed. userX PUT { key: v }. (datastore’s timeline)

2. distributed key-value store. three nodes. assume no failures; all operations succeed. userX PUT { key: v }, then userX PUT { key: v2 }. (datastore’s timeline)

3. distributed key-value store. three nodes. assume no failures; all operations succeed. userX PUT { key: v }, then userX PUT { key: v2 }, then userY GET key: ? (datastore’s timeline)

4. distributed key-value store. three nodes. assume no failures; all operations succeed. userX PUT { key: v }, then userX PUT { key: v2 }, then userY GET key: the value returned depends on the data store’s consistency model. (datastore’s timeline)
5. consistency model: the set of guarantees the system makes about which events will be visible, and when; i.e. the set of valid timelines of events. These guarantees are informed and enforced by the timekeeping mechanisms the system uses.
6. computer clocks: the system clock, NTP, Unix time. stepping back: bridging the gap. other timekeeping mechanisms: Spanner, Riak.
7. the model: a distributed datastore. multiple nodes, for fault tolerance, scalability, performance; logical (processes) or physical (machines). nodes are sequential, and communicate by message-passing, i.e. they are connected by an unreliable network, with no shared memory. data may be replicated and partitioned.
8. computers have clocks…

    func measureX() {
        start := time.Now()
        x()
        end := time.Now()
        // Time x takes.
        elapsed := end.Sub(start)
    }

…can we use them?

9. as above; …can we use them? hardware clocks drift. NTP is slow, etc. the system clock keeps Unix time. ?
10. a caveat: details vary by language, OS, architecture, hardware… but the details don’t matter today. That said, we will be assuming Linux on an x86 processor.
11. hardware clocks drift. computer clocks are not hardware clocks, but are “run” by the hardware and the OS kernel. time.Now() → clock_gettime(CLOCK_REALTIME): a syscall to get the value of a particular computer clock; here, the system clock or wall clock, which gives the current Unix timestamp. (time.Now() also reads a MONOTONIC clock.)
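A minimal sketch of the clock_gettime call the slide names, assuming Linux and the golang.org/x/sys/unix package (illustrative, not from the talk):

    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    func main() {
        var wall, mono unix.Timespec

        // The system clock (wall clock): the current Unix timestamp.
        // NTP can step it, so it may jump, even backwards.
        if err := unix.ClockGettime(unix.CLOCK_REALTIME, &wall); err != nil {
            panic(err)
        }

        // The monotonic clock: time since an arbitrary start point.
        // It never jumps backwards, so it suits measuring elapsed time.
        if err := unix.ClockGettime(unix.CLOCK_MONOTONIC, &mono); err != nil {
            panic(err)
        }

        fmt.Printf("CLOCK_REALTIME:  %d.%09d\n", wall.Sec, wall.Nsec)
        fmt.Printf("CLOCK_MONOTONIC: %d.%09d\n", mono.Sec, mono.Nsec)
    }

Since Go 1.9, time.Now() carries both a wall and a monotonic reading, which is why end.Sub(start) in the measureX snippet measures elapsed time correctly even if the wall clock is stepped in between.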
12. the system clock is a counter kept by hardware and the OS kernel. at system boot, it is set from the hardware clock (the Real Time Clock (RTC), which keeps UTC time) or from an external source like NTP. subsequently, it is incremented using a hardware ticker: “hey HPET, interrupt me in 10ms”; then, when interrupted, the kernel knows to increment by 10ms. in a “tickless” kernel, the interrupt interval (“tick”) is dynamically calculated.

13. the system clock is a counter kept by hardware and the OS kernel: set from the hardware clock (or an external source like NTP) at system boot, and incremented using a hardware ticker subsequently. these are the hardware clocks that drift, which causes the system clocks of different computers to change at different rates.
14. NTP is slow, etc. NTP synchronizes the system clock to a highly accurate clock network: it gradually adjusts the clock rate (“skew”), or sets a new value (“step”) if the differential is too large. but: you need trusted, reachable NTP servers; NTP is slow, up to hundreds of ms over the public internet; and stepping results in discontinuous jumps in time.
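A sketch of measuring the local offset that NTP would skew or step away, using the third-party github.com/beevik/ntp client (my choice for illustration; the talk doesn’t name a library):

    package main

    import (
        "fmt"
        "log"

        "github.com/beevik/ntp" // third-party NTP client, assumed for this sketch
    )

    func main() {
        // Query a public NTP server; pool.ntp.org is a placeholder host.
        resp, err := ntp.Query("pool.ntp.org")
        if err != nil {
            log.Fatal(err)
        }
        // ClockOffset estimates how far the local system clock is from the
        // server's clock; a daemon would skew or step based on this.
        fmt.Println("estimated offset:", resp.ClockOffset)
    }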
15. the system clock keeps Unix time: the “number of seconds since the epoch”, midnight UTC, 01.01.1970. it increases by exactly 86,400 seconds per day. so, midnight of the 1000th day after the epoch = 86,400,000, etc. …but a UTC day is not a constant 86,400 seconds!
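Checking that arithmetic in Go (plain standard library, nothing from the talk assumed):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const secondsPerDay = 86_400           // Unix time advances exactly this much per day
        ts := int64(1000 * secondsPerDay)      // the 1000th day after the epoch: 86,400,000
        fmt.Println(time.Unix(ts, 0).UTC())    // 1972-09-27 00:00:00 +0000 UTC
    }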
16. interlude: UTC is a messy compromise between: atomic time, measured using atomic clocks (very stable; this is what we want to use, e.g. for the (SI) second), and astronomical time, based on the Earth’s rotation (matches the Earth’s position; sometimes useful, we’re told). So, UTC: based on atomic time, adjusted to stay in sync with the Earth’s rotational period.

17. interlude: UTC is a messy compromise between atomic time, measured using atomic clocks (very stable; this is what we want to use), and astronomical time, based on the Earth’s rotation (matches the Earth’s position; sometimes useful, we’re told). but, a problem…
18. the Earth’s rotation slows down over time, so an astronomical day “takes longer” in absolute (atomic) terms. To compensate for this drift, UTC periodically adds a leap second: 23:59:59 → 23:59:60 → 00:00:00. …so a UTC day may be 86,400 or 86,401 seconds!
19. the system clock keeps Unix time. Unix time can’t represent the extra second, but we want the computer’s “current time” to be aligned with UTC (in the long run). so while UTC goes 23:59:59 → 23:59:60 → 00:00:00 (leap second), Unix time goes 23:59:59 → 23:59:59 → 00:00:00: the second repeats! Unix time is not monotonic.
20. hardware clocks drift; NTP is slow, etc.; the system clock keeps Unix time: so system clocks are not synchronized, and not monotonic, across nodes. example: N1 (fast clock) and N2. A: userX PUT { k: v } on N1 → timestampA = 150.

21. continuing the example: A: userX PUT { k: v } on N1 (fast) → timestampA = 150; then B: userX PUT { k: v2 } on N2 → timestampB = 50.

22. the later write has the smaller timestamp. ruh roh.
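A toy sketch of why these skewed timestamps bite, assuming last-write-wins resolution (the policy and all names here are my illustration, not from the slides):

    package main

    import "fmt"

    // write is a toy record carrying the wall-clock timestamp its node assigned.
    type write struct {
        value string
        ts    int64
    }

    // lastWriteWins keeps whichever write has the higher timestamp.
    func lastWriteWins(a, b write) write {
        if b.ts >= a.ts {
            return b
        }
        return a
    }

    func main() {
        // N1's clock is fast: the first write gets timestamp 150.
        first := write{value: "v", ts: 150}
        // N2's clock is behind: the later write gets timestamp 50.
        second := write{value: "v2", ts: 50}
        // The stale value wins: ruh roh.
        fmt.Println(lastWriteWins(first, second).value) // "v"
    }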
23. prelude: the timekeeping mechanism a system uses depends on: the desired consistency model, i.e. what the valid timelines of events are; the desired availability, i.e. how “responsive” the system is; the desired performance, i.e. read and write latency, and so throughput [these being the costs of higher consistency (CAP theorem, etc.)].
24. spanner:
• Distributed relational database; supports distributed transactions.
• Horizontally scalable: data is partitioned.
• Geo-replicated for fault tolerance.
• Performant.
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”

25. spanner, as above; diagram: a savings table on N1, a checking table on N2.

26. spanner, as above; diagram: savings on N1, replicated as N1, G1; checking on N2, replicated as N2, G2.

27. spanner, as above.

28. spanner, as above.
29. example: savings (replicas N1, G1) and checking (replicas N2, G2), in the US and EU partitions. minimum total balance requirement = 200; current total balance = 200. T1: deposit 100. T2: debit 100.
30. need: a consistent timeline across replicas (consensus); to order transactions across the system as well; and the order to correspond to the observed commit order, i.e. reads never contain T2 if they don’t also contain T1: a “globally consistent transaction order that corresponds to the observed commit order”. want: the desired consistency guarantees (reads from replicas, consistent snapshot reads) at the desired performance, i.e. performant.
31. if T1 commits before T2 starts to commit, T1 is ordered before T2, even if T1 and T2 are across the globe: order of transactions == observed order. can we enforce this ordering using commit timestamps? yes, if we had perfectly synchronized clocks… or, if we can know the clock uncertainty perfectly, and account for it.
32. TrueTime tracks and exposes the uncertainty about perceived time across system clocks: it explicitly represents time as an interval, not a point. TT.now() returns [earliest, latest], an interval that contains “true now”: earliest is the earliest time that could be “true now”; latest is the latest.
33. commit_ts(T1) = TT.now().latest. the G1 leader then waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest; then, it commits and replies. claim: if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.

34. as above; diagram: T1’s commit wait on the G1 leader, from commit-timestamp assignment until commit.

35. as above; T1 commits. this guarantees that the commit_ts for the next transaction is higher, despite different clocks.

36. likewise on the G2 leader: commit_ts(T2) = TT.now().latest; wait out one full uncertainty window, i.e. until commit_ts < TT.now().earliest; then commit and reply. so T2’s commit timestamp is larger than T1’s.
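A minimal Go sketch of TT.now() and the commit-wait rule. TrueTime itself is internal Google infrastructure; the fixed ±epsilon here is a stand-in for its GPS/atomic-clock uncertainty bound, and all names are mine:

    package main

    import (
        "fmt"
        "time"
    )

    // Interval is TrueTime's explicit representation of "now":
    // the true time is guaranteed to lie within [Earliest, Latest].
    type Interval struct {
        Earliest, Latest time.Time
    }

    // now returns a toy TrueTime reading, assuming a fixed uncertainty
    // around the local wall clock.
    func now() Interval {
        const epsilon = 7 * time.Millisecond // ~Google's 2012 figure
        t := time.Now()
        return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
    }

    // commit assigns a commit timestamp and performs commit wait: it returns
    // only once the timestamp is guaranteed to be in the past everywhere,
    // so any transaction that starts to commit later gets a higher one.
    func commit() time.Time {
        commitTS := now().Latest
        for !now().Earliest.After(commitTS) {
            time.Sleep(time.Millisecond) // wait out the uncertainty window
        }
        return commitTS
    }

    func main() {
        t1 := commit()
        t2 := commit() // starts after t1 committed
        fmt.Println(t2.After(t1)) // true
    }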
37. TrueTime provides externally consistent transaction commit timestamps, and so enables external consistency without coordination. …this is neat. but note: the uncertainty window affects commit-wait time, and so write latency and throughput. Google uses impressive (and expensive!) infrastructure to keep it small; ~7ms as of 2012.
38. riak:
• Distributed key-value database: a data item = <key: blob>, e.g. {“uuid1234”: {“name”:”ada”}}.
• Highly available: data partitioned and replicated, decentralized, i.e. all replicas serve reads and writes.
• Eventually consistent: “if no new updates are made to an object, eventually all accesses will return the last updated value.”
39. want: if no new updates are made to an object, eventually all accesses return the last updated value. timekeeping need: determine causal updates, for convergence to the latest.
40. causal updates converge. three replicas (N1, N2, N3); read_quorum = write_quorum = 1; cart starts as [ ]. userX { cart : [ A ] }, then userX { cart : [ D ] }.
41. want: if no new updates are made to an object, eventually all accesses return the last updated value; and any node serves reads and writes, for availability. timekeeping need: determine causal updates, for convergence to the latest; and determine conflicting updates.
42. concurrent updates conflict. three replicas (N1, N2, N3); read_quorum = write_quorum = 1; cart starts as [ ]. userX { cart : [ A ] } and userY { cart : [ B ] }.
43. vector clocks: logical clocks that use versions as “timestamps”; a means to establish causal ordering. the events, across N1, N2, N3: A = userX { cart : [ A ] }, B = userY { cart : [ B ] }, C, and D = userX { cart : [ D ] }.
44. vector clocks: each node keeps a vector of counters, one entry per node (n1, n2, n3); every vector starts at (0, 0, 0).

45. A: userX { cart : [ A ] } on n1 → (1, 0, 0). B: userY { cart : [ B ] } on n3 → (0, 0, 1).

46. C: userX GET cart on n1 → n1’s vector becomes (2, 0, 0); returns (2, 0, 0) to userX.

47. userX { cart : [ D ] }, carrying the causal context (2, 0, 0).

48. D: the PUT lands on n2, which increments its own entry → (0, 1, 0).

49. n2 merges the incoming causal context with its vector: max((2, 0, 0), (0, 1, 0)) = (2, 1, 0). so D’s vector clock is (2, 1, 0).
50. vector clocks: a means to establish causal ordering. VCx ≺ VCy indicates x precedes y. versions: { cart : [ A ] } = (2, 0, 0), { cart : [ D ] } = (2, 1, 0), { cart : [ B ] } = (0, 0, 1). since (2, 0, 0) ≺ (2, 1, 0), { cart : [ A ] } precedes { cart : [ D ] }.

51. if that doesn’t hold for x and y in either direction, they conflict: (2, 1, 0) and (0, 0, 1) are incomparable, so { cart : [ D ] } conflicts with { cart : [ B ] }.
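A minimal Go sketch of these vector-clock operations: element-wise max for merging, and the ≺ test (names are mine, not Riak’s):

    package main

    import "fmt"

    // VC is a vector clock: one counter per node.
    type VC []int

    // merge returns the element-wise max of two clocks
    // (as n2 does with the causal context above).
    func merge(a, b VC) VC {
        out := make(VC, len(a))
        for i := range a {
            out[i] = a[i]
            if b[i] > a[i] {
                out[i] = b[i]
            }
        }
        return out
    }

    // precedes reports whether a ≺ b: a ≤ b element-wise, and a ≠ b.
    func precedes(a, b VC) bool {
        strictly := false
        for i := range a {
            if a[i] > b[i] {
                return false
            }
            if a[i] < b[i] {
                strictly = true
            }
        }
        return strictly
    }

    // concurrent reports whether neither clock precedes the other: a conflict.
    func concurrent(a, b VC) bool {
        return !precedes(a, b) && !precedes(b, a)
    }

    func main() {
        cartA := VC{2, 0, 0}                     // { cart : [ A ] }
        cartD := merge(VC{2, 0, 0}, VC{0, 1, 0}) // (2, 1, 0): { cart : [ D ] }
        cartB := VC{0, 0, 1}                     // { cart : [ B ] }
        fmt.Println(precedes(cartA, cartD))   // true: A precedes D
        fmt.Println(concurrent(cartD, cartB)) // true: D, B conflict
    }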
52. the earlier example again, with N1 faster by 10ms: A: userX PUT { k: v } → timestampA = (1, 0); B: userX PUT { k: v2 } → timestampB = (1, 1). the later write now carries the larger timestamp, despite the skew.
53. logical clocks are a clever proxy for physical time: vector clocks, and dotted version vectors, the more precise form that Riak uses. …this is pretty neat too. but: they need to be passed around, and they are divorced from physical time.
54. TrueTime:
+ timestamps that correspond to wall-clock time.
- specialized infrastructure.
logical clocks:
+ we can all do causality tracking!
- timestamps don’t correspond to wall-clock time.
55. hybrid logical clocks: augmented logical clocks, <max_physical_time_seen, logical clock>. example, N1 faster by 20ms (N1 reads 50 then 55; N2 reads 40 then 45): A: userX PUT { k: v } → timestampA = <50, 1>; the context <50, 1> travels with the next request; B: userX PUT { k: v2 } → timestampB = <100, 2>.
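A sketch of the hybrid-logical-clock update rules, following the HLC paper cited at the end (field names and the exact numbers are illustrative):

    package main

    import "fmt"

    // HLC timestamp: the max physical time seen, plus a logical counter
    // to order events that share that physical component.
    type HLC struct {
        Physical int64
        Logical  int64
    }

    // tick advances the clock for a local or send event, given the
    // node's current physical (wall-clock) time.
    func (c *HLC) tick(physicalNow int64) HLC {
        if physicalNow > c.Physical {
            c.Physical, c.Logical = physicalNow, 0
        } else {
            c.Logical++ // wall clock hasn't passed the max time seen
        }
        return *c
    }

    // recv merges an incoming timestamp, keeping the max physical time seen.
    func (c *HLC) recv(remote HLC, physicalNow int64) HLC {
        switch {
        case physicalNow > c.Physical && physicalNow > remote.Physical:
            c.Physical, c.Logical = physicalNow, 0
        case remote.Physical > c.Physical:
            c.Physical, c.Logical = remote.Physical, remote.Logical+1
        case c.Physical > remote.Physical:
            c.Logical++
        default: // equal physical components
            if remote.Logical > c.Logical {
                c.Logical = remote.Logical
            }
            c.Logical++
        }
        return *c
    }

    func main() {
        var n1, n2 HLC
        a := n1.tick(50)    // PUT { k: v } on N1, whose fast clock reads 50
        b := n2.recv(a, 45) // PUT { k: v2 } on N2, whose clock reads only 45
        fmt.Println(a, b)   // b orders after a despite N2's slow clock
    }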
56. “A person with a watch knows what time it is. A person with two watches is never sure.” - Segal’s Law, reworded. @kavya719 speakerdeck.com/kavya719/keeping-time-in-real-systems Special thanks to Eben Freeman for reading drafts of this.
57. references:
Spanner, original paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf
Spanner, Brewer’s 2017 paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Logical Clocks: http://amturing.acm.org/p558-lamport.pdf
Dotted Version Vectors: https://arxiv.org/abs/1011.5808
Hybrid Logical Clocks: https://www.cse.buffalo.edu//tech-reports/2014-04.pdf
58. consensus: replicas must agree on the order of transactions, i.e. a consistent timeline across replicas. the leader proposes a write to the other replicas; the write commits iff n replicas ACK it. Spanner uses Paxos and 2PC (other protocols: 3PC, Raft, Zab). consensus …is a logical proxy for physical time: it provides a unified timeline across nodes.
59. but consensus… compromises availability: if n replicas are not available to ACK, writes cannot commit. compromises performance: it increases write latency and decreases throughput, with multiple coordination rounds until a write commits. …so, we don’t want to use consensus to order transactions across partitions, e.g. T1 and T2.
60. happens-before orders events; it establishes causality and concurrency. X ≺ Y if one of: X and Y are from the same actor (thread or node); X and Y are a synchronization pair; or X ≺ E ≺ Y across actors. if X not ≺ Y and Y not ≺ X: concurrent! Formulated in Lamport’s Time, Clocks, and the Ordering of Events paper in 1978.
61. causality and concurrency (events A, B, C, D on N1, N2, N3): A ≺ C (same actor); C ≺ D (synchronization pair); so, A ≺ D (transitivity).
62. causality and concurrency: …but B ? D and D ? B; neither happens-before the other. So, B and D are concurrent!
63. applied to the cart example (A = { cart : [ A ] }, B = { cart : [ B ] }, D = { cart : [ D ] }, on N1, N2, N3): A ≺ D, so D should update A; B and D are concurrent, so B and D need resolution.
64. Riak stores a vector clock (in fact a more precise form, the “dotted version vector”) with each version of the data. GET and PUT operations on a key pass around a causal context object that contains the vector clocks. Therefore, Riak is able to determine causal updates versus conflicts.
65. conflict resolution in riak: behavior is configurable. assuming vector clock analysis is enabled:
• last-write-wins, i.e. the version with the higher timestamp is picked.
• merge, iff the underlying data type is a CRDT.
• return conflicting versions to the application: riak stores “siblings”, i.e. the conflicting versions, and returns them to the application for resolution.
66. returning conflicting versions to the application: D: { cart: [ “date crepe” ] } with clock (2, 1, 0); B: { cart: [ “blueberry crepe” ] } with clock (0, 0, 1). Riak stores both versions; the next operation returns both to the application, which must resolve the conflict: { cart: [ “blueberry crepe”, “date crepe” ] } with clock (2, 1, 1), which creates a causal update.
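A small sketch of that application-side resolution, reusing the VC and merge helpers from the vector-clock sketch above (the union merge is one possible choice by the application, per the slide; Riak does not impose it):

    // resolveSiblings merges two conflicting cart versions by set union,
    // and stamps the result with the element-wise max of their clocks,
    // so the merged write causally dominates both siblings.
    func resolveSiblings(b, d []string, vcB, vcD VC) ([]string, VC) {
        seen := map[string]bool{}
        var merged []string
        for _, item := range append(append([]string{}, b...), d...) {
            if !seen[item] {
                seen[item] = true
                merged = append(merged, item)
            }
        }
        return merged, merge(vcB, vcD) // max((0,0,1), (2,1,0)) = (2,1,1)
    }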
67. …what about resolving those conflicts? Riak doesn’t (by default). instead, it exposes the happens-before graph to the application for conflict resolution.