
Building a Distributed Message Log from Scratch - SCaLE 16x

Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!

Tyler Treat

March 11, 2018


Transcript

  1. @tyler_treat Tyler Treat
     - Managing Partner @ Real Kinetic
     - Messaging & distributed systems
     - Former nats.io core contributor
     - bravenewgeek.com
  2. @tyler_treat Outline
     - The Log
       -> What?
       -> Why?
     - Implementation
       -> Storage mechanics
       -> Data-replication techniques
       -> Scaling message delivery
     - Trade-offs and lessons learned
  3. @tyler_treat Examples in the wild:
     -> Apache Kafka
     -> Amazon Kinesis
     -> NATS Streaming
     -> Apache Pulsar
  4. @tyler_treat The purpose of this talk is to learn…
     -> a bit about the internals of a log abstraction.
     -> how it can achieve these goals (performance, availability, scalability).
     -> some applied distributed systems theory.
  5. @tyler_treat You will probably never need to build something like this yourself, but it helps to know how it works.
  6. @tyler_treat Some first principles…
     • The log is an ordered, immutable sequence of messages
     • Messages are atomic (meaning they can’t be broken up)
     • The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
     • The log can be played back from any arbitrary position
     • The log is stored on disk
     • Sequential disk access is fast*
     • OS page cache means sequential access often avoids disk
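To make these first principles concrete, here is a minimal in-memory sketch of the log abstraction in Go. It is illustrative only (a real log is segmented on disk and enforces retention, as the next slides show), but it captures the core contract: appends are ordered and immutable, and the log can be replayed from any offset.

    // Minimal sketch of a log: an ordered, immutable sequence of messages
    // addressed by offset. Not the Kafka or NATS Streaming implementation.
    package main

    import "fmt"

    type Log struct {
        messages [][]byte // append-only; existing entries are never mutated
    }

    // Append adds a message to the end of the log and returns its offset.
    func (l *Log) Append(msg []byte) int {
        l.messages = append(l.messages, msg)
        return len(l.messages) - 1
    }

    // ReadFrom plays the log back from an arbitrary offset.
    func (l *Log) ReadFrom(offset int) [][]byte {
        if offset < 0 || offset >= len(l.messages) {
            return nil
        }
        return l.messages[offset:]
    }

    func main() {
        var l Log
        l.Append([]byte("msg 1"))
        l.Append([]byte("msg 2"))
        l.Append([]byte("msg 3"))
        for i, m := range l.ReadFrom(1) {
            fmt.Printf("offset %d: %s\n", i+1, m)
        }
    }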
  7. @tyler_treat iostat output (0% iowait, no device activity):
     avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
               13.53   0.00    11.28     0.00    0.00  75.19
     Device:   tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
     xvda     0.00        0.00        0.00         0         0
  8. @tyler_treat Storage Mechanics (diagram): the log (offsets 0-5) is split into segment files, each paired with an index file. Log segment 0 file covers offsets 0-2 and log segment 3 file covers offsets 3-5; index segment 0 and index segment 3 files map relative offsets 0-2 within each segment to positions in the corresponding log file.
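A sketch of how a segmented layout like the one above can be navigated, assuming (for illustration) an in-memory index per segment rather than the real on-disk format: binary-search the segments' base offsets to find the segment that owns a logical offset, then look up the physical file position in that segment's index.

    // Sketch: resolve a logical offset to a segment and a physical position.
    // The segment type and its map-based index are assumptions for clarity.
    package main

    import (
        "fmt"
        "sort"
    )

    type segment struct {
        baseOffset int64
        index      map[int64]int64 // offset relative to base -> byte position in the log file
    }

    // findSegment returns the segment containing offset; segments must be
    // sorted by baseOffset.
    func findSegment(segments []segment, offset int64) *segment {
        i := sort.Search(len(segments), func(i int) bool {
            return segments[i].baseOffset > offset
        })
        if i == 0 {
            return nil // offset precedes the oldest retained segment
        }
        return &segments[i-1]
    }

    func main() {
        segments := []segment{
            {baseOffset: 0, index: map[int64]int64{0: 0, 1: 128, 2: 256}},
            {baseOffset: 3, index: map[int64]int64{0: 0, 1: 512, 2: 1024}},
        }
        offset := int64(4)
        seg := findSegment(segments, offset)
        pos := seg.index[offset-seg.baseOffset]
        fmt.Printf("offset %d -> segment %d, physical position %d\n", offset, seg.baseOffset, pos)
    }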
  9. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  10. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  11. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  12. @tyler_treat Data-Replication Techniques
     1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView
     2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication
  13. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  14. @tyler_treat Data-Replication Techniques
     1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
     2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication
  15. @tyler_treat Replication in Kafka
     1. Select a leader
     2. Maintain in-sync replica set (ISR) (initially every replica)
     3. Leader writes messages to write-ahead log (WAL)
     4. Leader commits messages when all replicas in ISR ack
     5. Leader maintains high-water mark (HW) of last committed message (see the sketch after this list)
     6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk
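The sketch below illustrates steps 4-6: the high-water mark is the highest offset that every replica in the ISR has acknowledged, so the leader can only advance it as fast as the slowest ISR member. The replicaState type and logEndOffset field are illustrative names, not Kafka internals.

    // Sketch: the HW is the minimum log end offset across the ISR, minus one.
    package main

    import "fmt"

    type replicaState struct {
        id           string
        logEndOffset int64 // next offset the replica would write
    }

    // highWaterMark returns the last offset replicated to every ISR member.
    func highWaterMark(isr []replicaState) int64 {
        hw := isr[0].logEndOffset
        for _, r := range isr[1:] {
            if r.logEndOffset < hw {
                hw = r.logEndOffset
            }
        }
        return hw - 1
    }

    func main() {
        isr := []replicaState{
            {id: "b1", logEndOffset: 6}, // leader holds offsets 0-5
            {id: "b2", logEndOffset: 5}, // follower holds offsets 0-4
            {id: "b3", logEndOffset: 4}, // follower holds offsets 0-3
        }
        fmt.Println("HW:", highWaterMark(isr)) // prints HW: 3, matching the diagrams below
    }

Because the slowest ISR member gates the HW, a lagging or failed follower stalls commits, which is exactly what the ISR-shrinking sequence in the following slides addresses.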
  16. @tyler_treat Replication in Kafka (diagram): writes go to the leader; b1 (leader) holds offsets 0-5, b2 (follower) holds 0-4, b3 (follower) holds 0-3; HW: 3 on all three; ISR: {b1, b2, b3}.
  17. @tyler_treat Leader fails (diagram): same replica state as slide 16 (b1 leader 0-5, b2 follower 0-4, b3 follower 0-3, HW: 3, ISR: {b1, b2, b3}).
  18. @tyler_treat Leader fails (diagram): same state as slide 17.
  19. @tyler_treat Leader fails (diagram): same state as slide 17.
  20. @tyler_treat Leader fails (diagram): after failover, b2 (leader) holds 0-3, b3 (follower) holds 0-3, HW: 3, ISR: {b2, b3}; only messages up to the HW survive, so the uncommitted offsets 4-5 are lost.
  21. @tyler_treat Follower fails (diagram): b1 (leader) holds 0-5, b2 (follower) holds 0-4, b3 (follower) holds 0-3; HW: 3; ISR: {b1, b2, b3}.
  22. @tyler_treat Follower fails (diagram): same state as slide 21.
  23. @tyler_treat Follower fails (diagram): same state; the leader waits replica.lag.time.max.ms for b3 to catch up.
  24. @tyler_treat Follower fails (diagram): after replica.lag.time.max.ms, b3 is removed from the ISR; ISR: {b1, b2}.
  25. @tyler_treat Follower fails (diagram): with b3 out of the ISR, offset 5 replicates to b2 and the HW advances to 5 on b1 and b2; b3 remains at 0-3 with HW: 3.
  26. @tyler_treat Follower fails (diagram): same state as slide 25.
  27. @tyler_treat Follower fails (diagram): b3 recovers and catches up, fetching offset 4 and moving to HW: 4.
  28. @tyler_treat Follower fails (diagram): b3 fetches offset 5 and reaches HW: 5; ISR is still {b1, b2}.
  29. @tyler_treat Follower fails (diagram): b3, now caught up, is added back to the ISR: {b1, b2, b3}.
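A sketch of the ISR-shrinking behavior shown in the preceding diagrams, assuming a simple per-follower "last caught up" timestamp; the names and structure are illustrative, not how Kafka actually tracks replica.lag.time.max.ms.

    // Sketch: drop followers from the ISR once they have lagged the leader
    // for longer than the configured maximum, so the HW can keep advancing.
    package main

    import (
        "fmt"
        "time"
    )

    type follower struct {
        id           string
        lastCaughtUp time.Time // last time this follower fetched up to the leader's log end
    }

    // shrinkISR returns the ISR with lagging followers removed.
    func shrinkISR(isr []follower, now time.Time, maxLag time.Duration) []follower {
        var out []follower
        for _, f := range isr {
            if now.Sub(f.lastCaughtUp) <= maxLag {
                out = append(out, f)
            }
        }
        return out
    }

    func main() {
        now := time.Now()
        isr := []follower{
            {id: "b2", lastCaughtUp: now.Add(-2 * time.Second)},
            {id: "b3", lastCaughtUp: now.Add(-30 * time.Second)}, // failed follower
        }
        for _, f := range shrinkISR(isr, now, 10*time.Second) {
            fmt.Println("still in ISR:", f.id)
        }
    }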
  30. @tyler_treat Replication in NATS Streaming
     1. Raft replicates client state, messages, and subscriptions
     2. Conceptually, two logs: Raft log and message log
     3. Parallels work implementing Raft in RabbitMQ
  31. @tyler_treat Replication in NATS Streaming
     • Initially used Raft group per topic and separate metadata group
     • A couple issues with this:
       -> Topic scalability
       -> Increased complexity due to lack of ordering between Raft groups
  32. @tyler_treat Scaling Raft: with a single topic, one node is elected leader and it heartbeats messages to followers.
  33. @tyler_treat Scaling Raft: technique 1 is to run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
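A sketch of technique 1, assuming a small fixed set of Raft groups: topics and groups are hashed onto a ring, and each topic maps to the first group at or after its hash. The group names are made up, and virtual nodes (which a production consistent-hash ring would add for balance) are omitted.

    // Sketch: map topics to a fixed number of Raft groups with a hash ring.
    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    type ring struct {
        points []uint32          // sorted hash points on the ring
        groups map[uint32]string // hash point -> Raft group
    }

    func newRing(groups []string) *ring {
        r := &ring{groups: make(map[uint32]string)}
        for _, g := range groups {
            h := hashKey(g)
            r.points = append(r.points, h)
            r.groups[h] = g
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // groupFor returns the first group at or after the topic's hash, wrapping around.
    func (r *ring) groupFor(topic string) string {
        h := hashKey(topic)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0
        }
        return r.groups[r.points[i]]
    }

    func hashKey(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func main() {
        r := newRing([]string{"raft-group-0", "raft-group-1", "raft-group-2"})
        for _, topic := range []string{"purchases", "inventory", "accounts"} {
            fmt.Printf("topic %q -> %s\n", topic, r.groupFor(topic))
        }
    }

With a fixed group count, a plain hash-mod placement would behave much the same; the ring mainly pays off if groups are ever added or removed.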
  34. @tyler_treat Scaling Raft: technique 2 is to run an entire node’s worth of topics as a single group using a layer on top of Raft. https://www.cockroachlabs.com/blog/scaling-raft
  35. @tyler_treat Dual Writes (diagram): the Raft log holds msg 1, msg 2, sub, msg 3; the Store holds the committed messages msg 1, msg 2.
  36. @tyler_treat Dual Writes (diagram): the Raft log grows to msg 1, msg 2, sub, msg 3, add peer, msg 4; the Store holds the committed messages msg 1, msg 2, msg 3.
  37. @tyler_treat Dual Writes (diagram): same state as slide 36.
  38. @tyler_treat Dual Writes (diagram): on commit, msg 4 is written to the Store as well, which now holds msg 1 through msg 4.
  39. @tyler_treat Dual Writes (diagram): the Raft log’s entries sit at physical offsets 0-5, while the Store’s messages have logical offsets 0-3.
  40. @tyler_treat Dual Writes (diagram): the Store is replaced by an Index mapping logical offsets 0-3 to the physical offsets of msg 1 through msg 4 in the Raft log.
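A sketch of the indexing idea from the Dual Writes slides above: the Raft log interleaves messages with other operations (subscriptions, membership changes), so rather than copying committed messages into a second store, an index maps each client-visible logical offset to the physical offset of the corresponding message entry in the Raft log. The entry types and names are illustrative.

    // Sketch: build a logical-offset -> physical-offset index over a Raft log
    // that contains both message and non-message entries.
    package main

    import "fmt"

    type entryType int

    const (
        entryMsg entryType = iota
        entrySub
        entryAddPeer
    )

    type raftEntry struct {
        typ  entryType
        data string
    }

    func main() {
        // Physical offsets 0-5; only four of the entries are messages.
        raftLog := []raftEntry{
            {entryMsg, "msg 1"}, {entryMsg, "msg 2"}, {entrySub, "sub"},
            {entryMsg, "msg 3"}, {entryAddPeer, "add peer"}, {entryMsg, "msg 4"},
        }

        // Index: logical (client-visible) offset -> physical offset in the Raft log.
        var index []int
        for physical, e := range raftLog {
            if e.typ == entryMsg {
                index = append(index, physical)
            }
        }

        logical := 2 // third message
        fmt.Printf("logical offset %d -> physical offset %d -> %q\n",
            logical, index[logical], raftLog[index[logical]].data)
    }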
  41. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  42. @tyler_treat Performance
     1. Publisher acks
       -> broker acks on commit (slow but safe)
       -> broker acks on local log append (fast but unsafe)
       -> publisher doesn’t wait for ack (fast but unsafe)
     2. Don’t fsync, rely on replication for durability
     3. Keep disk access sequential and maximize zero-copy reads
     4. Batch aggressively (see the sketch below)
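A sketch of item 4, "batch aggressively", under an assumed writer structure (not the actual NATS Streaming or Kafka code): buffer messages and hand them to the log in one sequential write instead of one write per message.

    // Sketch: accumulate appends and flush them with a single write.
    package main

    import (
        "bytes"
        "fmt"
    )

    type batchedWriter struct {
        buf     bytes.Buffer
        pending int
        maxMsgs int
        flush   func([]byte) // e.g. a single sequential write to the log file
    }

    func (w *batchedWriter) Append(msg []byte) {
        w.buf.Write(msg)
        w.buf.WriteByte('\n')
        w.pending++
        if w.pending >= w.maxMsgs {
            w.Flush()
        }
    }

    func (w *batchedWriter) Flush() {
        if w.pending == 0 {
            return
        }
        w.flush(w.buf.Bytes())
        w.buf.Reset()
        w.pending = 0
    }

    func main() {
        w := &batchedWriter{
            maxMsgs: 3,
            flush: func(b []byte) {
                fmt.Printf("flushing %d bytes in one write\n", len(b))
            },
        }
        for i := 1; i <= 7; i++ {
            w.Append([]byte(fmt.Sprintf("msg %d", i)))
        }
        w.Flush() // flush the remainder of the batch
    }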
  43. @tyler_treat Questions:
     -> How do we ensure continuity of reads/writes?
     -> How do we replicate data?
     -> How do we ensure replicas are consistent?
     -> How do we keep things fast?
     -> How do we ensure data is durable?
  44. @tyler_treat Durability
     1. Quorum guarantees durability (see the sketch after this list)
       -> Comes for free with Raft
       -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with replication factor 3, min.insync.replicas=2, and acks=all
     2. Disable unclean leader elections
     3. At odds with availability, i.e. no quorum == no reads/writes
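A minimal illustration of why a quorum guarantees durability: a write is only acknowledged once a majority of replicas hold it, so any future majority (and therefore any future leader) must overlap at least one replica that has the write. This is a generic majority check, not Kafka's min.insync.replicas logic or Raft's actual commit rule.

    // Sketch: an offset counts as committed once a majority of the cluster
    // has acknowledged it (acks includes the leader itself).
    package main

    import "fmt"

    func committed(acks, clusterSize int) bool {
        return acks >= clusterSize/2+1
    }

    func main() {
        fmt.Println(committed(1, 3)) // false: leader only, lost if the leader dies
        fmt.Println(committed(2, 3)) // true: survives any single-node failure
    }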
  45. @tyler_treat (diagram) Writes flow into Topic: purchases (partitioned into Accounts A-M and Accounts N-Z) and Topic: inventory (partitioned into SKUs A-M and SKUs N-Z), which in turn feed caches, databases, and indexes.
  46. @tyler_treat High Fan-Out
     1. Observation: with an immutable log, there are no stale/phantom reads
     2. This should make it “easy” (in theory) to scale to a large number of consumers
     3. With Raft, we can use “non-voters” to act as read replicas and load balance consumers
  47. @tyler_treat Push vs. Pull
     • In Kafka, consumers pull data from brokers
     • In NATS Streaming, brokers push data to consumers
     • Design implications:
       • Fan-out
       • Flow control
       • Optimizing for latency vs. throughput
       • Client complexity
  48. @tyler_treat Competing Goals
     1. Performance
       -> Easy to make something fast that’s not fault-tolerant or scalable
       -> Simplicity of mechanism makes this easier
       -> Simplicity of “UX” makes this harder
     2. Scalability and fault-tolerance
       -> At odds with simplicity
       -> Cannot be an afterthought
     3. Simplicity
       -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
       -> Easy to let server handle complexity; hard when that needs to be distributed, consistent, and fast
  49. @tyler_treat “A complex system that works is invariably found to have evolved from a simple system that works.”
  50. @tyler_treat Trade-Offs and Lessons Learned
     1. Competing goals
     2. Aim for simplicity
     3. You can’t effectively bolt on fault-tolerance
  51. @tyler_treat “A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.”
  52. @tyler_treat Trade-Offs and Lessons Learned
     1. Competing goals
     2. Aim for simplicity
     3. You can’t effectively bolt on fault-tolerance
     4. Lean on existing work
  53. @tyler_treat Trade-Offs and Lessons Learned
     1. Competing goals
     2. Aim for simplicity
     3. You can’t effectively bolt on fault-tolerance
     4. Lean on existing work
     5. There are probably edge cases for which you haven’t written tests
  54. @tyler_treat There are many failure modes, and you can only write so many tests. Formal methods and property-based/generative testing can help.
  55. @tyler_treat Trade-Offs and Lessons Learned
     1. Competing goals
     2. Aim for simplicity
     3. You can’t effectively bolt on fault-tolerance
     4. Lean on existing work
     5. There are probably edge cases for which you haven’t written tests
     6. Be honest with your users
  56. @tyler_treat Don’t try to be everything to everyone. Be explicit about design decisions, trade-offs, guarantees, defaults, etc.