events will be visible, and when. set of valid timelines of events These guarantees are informed and enforced by the timekeeping mechanisms used by the system.
(processes) or physical (machines). are sequential. communicate by message-passing i.e. connected by unreliable network, no shared memory. data may be replicated, partitioned a distributed datastore:
end := time.Now() // Time x takes. elapsed := end.Sub(start) } …can we use them? hardware clocks drift. NTP is slow etc. the system clock keeps Unix time. ?
hardware, the OS kernel. time.Now() MONOTONIC clock_gettime(CLOCK_REALTIME) sys call to get the value of a particular computer clock The system clock or wall clock. Gives the current UNIX timestamp. hardware clocks drift
Real Time Clock (RTC) keeps UTC time at system boot “hey HPET, interrupt me in 10ms” then when interrupted, knows to increment by 10ms. “tickless” kernel: the interrupt interval (“tick”) is dynamically calculated. incr using a hardware ticker. subsequently the system clock is a counter kept by hardware, OS kernel.
incr using a hardware ticker. these are the hardware clocks that drift. causes system clocks of different computers to change at different rates. at system boot subsequently the system clock is a counter kept by hardware, OS kernel.
highly accurate clock network: need trusted, reachable NTP servers. NTP is slow, up to hundreds of ms over public internet. stepping results in discontinuous jumps in time. } gradually adjusts clock rate (“skew”) sets a new value (“step”) if differential is too large.
400 seconds per day. So,1000th day after the epoch = 86400000 etc. …but a UTC day is not a constant 86, 400 seconds! “number of seconds since epoch” midnight UTC, 01.01.1970
time based on the Earth’s rotation astronomical time very stable; this is what we want to use e.g. the (SI) second matches the Earth’s position; sometimes useful (we’re told) So, UTC: based on atomic time adjusted to be in sync with the Earth’s rotational period.
measured using atomic clocks atomic time astronomical time very stable; this is what we want to use e.g. the (SI) second matches the Earth’s position; sometimes useful (we’re told) but problem…
this drift, UTC periodically adds a second. So, an astronomical day “takes longer” in absolute (atomic) terms. …so a UTC day may be 86, 400 or 86, 401 seconds! 23:59:59 23:59:60 00:00:00 leap second
computer’s “current time” to be aligned with UTC (in the long run): The system clock keeps UNIX time 23:59:59 23:59:59 00:00:00 repeats ! Unix Unix time is not monotonic. 23:59:59 23:59:60 00:00:00 leap second UTC
consistency model what the valid timelines of events are desired availability how “responsive” the system is desired performance read and write latency and so, throughput ] costs of higher consistency (CAP theorem, etc.)
scalable data is partitioned • Geo-replicated for fault tolerance • Performant • Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”
scalable data is partitioned • Geo-replicated for fault tolerance • Performant • Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.” savings N1 checking N2
scalable data is partitioned • Geo-replicated for fault tolerance • Performant • Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”
scalable data is partitioned • Geo-replicated for fault tolerance • Performant • Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”
consistent snapshot reads consistent timeline across replicas. to order transactions across the system as well. the order to correspond to the observed commit order. want reads to never contain T2, if they don’t also contain T1. “globally consistent transaction order that corresponds to observed commit order“. performant consensus.
ordered before T2. Can we enforce ordering using commit timestamps? order of transactions == observed order even if T1, T2 across the globe! Yes, if perfectly synchronized clocks. …or, if you can know clock uncertainty perfectly, and account for it. }
system clocks. t tt } explicitly represents time as an interval, not a point. TT.now() [earliest, latest] interval that contains “true now”. earliest is the earliest time that could be “true now”; latest is the latest.
until commit_ts < TT.now().earliest then, commits and replies. if T1 commits before T2 starts to commit, T1 ’s commit timestamps is smaller than T2 ’s. T1 commit ts G1 leader T1
until commit_ts < TT.now().earliest then, commits and replies. G1 leader if T1 commits before T2 starts to commit, T1 ’s commit timestamps is smaller than T2 ’s. T1 commit wait T1 commit ts
until commit_ts < TT.now().earliest then, commits and replies. G1 leader if T1 commits before T2 starts to commit, T1 ’s commit timestamps is smaller than T2 ’s. T1 commits guarantees commit_ts for next transaction is higher, despite different clocks. ] commit wait T1 commit ts
consistency without coordination. …this is neat. The uncertainty window affects commit wait time, and so write latency and throughput. Google uses impressive and expensive! infrastructure to keep this small; ~7ms as of 2012. but note
<key: blob> {“uuid1234”: {“name”:”ada”}} • Highly available: data partitioned and replicated, decentralized i.e. all replicas serve reads, writes. • Eventually consistent: “if no new updates are made to an object, eventually all accesses will return the last updated value.”
all accesses will return the last updated value. timekeeping want: need: determine causal updates for convergence to latest. any node serves reads and writes for availability determine conflicting updates.
n2 n3 { cart : [ A ] } { cart : [ D ] } { cart : [ B ] } If that doesn’t hold for x and y, they conflict VCx ≺ VCy indicates x precedes y means to establish causal ordering. { cart : [ D ] } conflicts with { cart : [ B ] } 0 0 1 2 1 0 vector clocks
logical clocks logical clocks are a clever proxy for physical time. vector clocks, dotted version vectors, a more precise form that Riak uses. …this is pretty neat too.
A person with two watches is never sure.” - Segal’s Law, reworded. @kavya719 speakerdeck.com/kavya719/keeping-time-in-real-systems Special thanks to Eben Freeman for reading drafts of this.
across replicas N1 N1 G …is logical proxy for physical time. provides a unified timeline across nodes. leader proposes write to other replicas, write commits iff n replicas ACK it. Spanner uses Paxos, 2PC (other protocols are 3PC, Raft, Zab). consensus
to ACK writes. compromises performance — increases write latency, decreases throughput; multiple coordination rounds until a write commits. but consensus … so, don’t want to use consensus to order transactions across partitions. e.g. T1, T2
— are a synchronization pair — X ≺ E ≺ Y across actors. IF X not ≺ Y and Y not ≺ X , concurrent! orders events Formulated in Lamport’s Time, Clocks, and the Ordering of Events paper in 1978. establishes causality and concurrency. (threads or nodes)
context object, that contains the vector clocks. a more precise form, “dotted version vector” Riak stores a vector clock with each version of the data. Therefore, able to determine causal updates versus conflicts.
analysis enabled: • last-write-wins i.e. version with higher timestamp picked. • merge, iff the underlying data type is a CRDT • return conflicting versions to application riak stores “siblings” or conflicting versions, returned to application for resolution.