
When an innocent-looking ListOffsets Call Took Down Our Kafka Cluster

Troubleshooting Case Study in a large-scale Kafka deployment operation at LY


Transcript

  1. When an innocent-looking ListOffsets Call Took Down Our Kafka Cluster
     © LY Corporation
     Haruki Okada, LY Corporation, LYMQ Division
  2. Self introduction
     • Haruki Okada (X/GitHub: @ocadaruma)
     • Technical lead of the IMF team at LY Corporation
     • The team is responsible for providing the company-wide Kafka platform
     • Interests: Distributed Systems, Formal methods, …
  3. Apache Kafka at LY Corporation
     • Kafka is widely adopted by many services for various purposes
     • We provide a multi-tenant, large-scale Kafka cluster
  4. Apache Kafka at LY Corporation: Scale
     • Peak message throughput: 33.6 million messages / sec
     • Daily incoming messages: 1+ trillion messages
     • Daily in/out bytes: 3 petabytes
  5. Operating Kafka at scale
     • Kafka is mature, battle-tested software
     • However, at this scale, we sometimes hit issues that no one has encountered before
  6. Agenda
     • Overview of the phenomenon
     • Investigation
     • Key Takeaways
  7. Phenomenon
     • One day, we restarted one broker in the cluster to roll out a patch, as usual
     • Right after that, the entire cluster's performance degraded
  8. Phenomenon
     • Even worse, on the restarted broker, all request handler threads were almost exhausted and unable to process any requests => Availability damaged
  9. Why did the request handlers get busy?
     • The typical cause we could imagine is too many API calls => Not the case. The request rate was stable
  10. Next step: Check the profile
     • Very fortunately, all our brokers have continuous async-profiler profiling enabled!
     • It was introduced as a countermeasure after another performance issue in the past…
     • Refs: "Time travel stack trace analysis with async-profiler" @JJUG CCC 2019 Fall
  11. The Surprising Finding
     • Surprisingly, the flame graph showed that the request handlers were almost entirely occupied by ListOffsets API handling
  12. ListOffsets API
     • A simple API to query a partition's offsets:
       • Earliest offset
       • Latest offset
       • Offset for a specific timestamp
     • Usually considered harmless
       • Doesn't touch the actual data
       • Only requires DESCRIBE permission for the topic
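The three query kinds on this slide can be sketched over an in-memory partition log of (offset, timestamp) records. This models the API's semantics only, not Kafka's implementation; the helper names are mine.

```python
# Semantics sketch of the three ListOffsets query kinds, over an in-memory
# partition log of (offset, timestamp) records. Not Kafka's implementation.

def earliest_offset(log):
    """Offset of the first record in the partition."""
    return log[0][0]

def latest_offset(log):
    """The log-end offset: one past the last record."""
    return log[-1][0] + 1

def offset_for_timestamp(log, ts):
    """Offset of the first record with timestamp >= ts, or None."""
    for offset, record_ts in log:
        if record_ts >= ts:
            return offset
    return None

log = [(0, 100), (1, 150), (2, 300)]
print(earliest_offset(log))            # 0
print(latest_offset(log))              # 3
print(offset_for_timestamp(log, 200))  # 2
```

None of these lookups needs to read record payloads, which is why the call is usually considered harmless.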
  13. Which part was time consuming?
     • Most of the time, the threads were blocked by lazy time-index materialization
  14. How ListOffsets works
     • When ListOffsets is called with "maxTimestamp" specified, it internally queries the "largest timestamp" of ALL log segments to identify the log segment that contains the target offset
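The slide's point, that a max-timestamp query is linear in the number of segments, can be sketched as follows. The class and method names are illustrative, not Kafka's actual internals.

```python
# Sketch of why a max-timestamp ListOffsets query touches every segment:
# the broker cannot know which segment holds the record with the largest
# timestamp without consulting each segment's time metadata.
# Illustrative names only, not Kafka's actual classes.

class Segment:
    def __init__(self, base_offset, largest_timestamp):
        self.base_offset = base_offset
        self.largest_timestamp = largest_timestamp
        self.reads = 0  # how many times this segment's metadata was consulted

    def read_largest_timestamp(self):
        self.reads += 1
        return self.largest_timestamp

def offset_for_max_timestamp(segments):
    # Every segment is consulted, even old, inactive ones.
    best = max(segments, key=lambda s: s.read_largest_timestamp())
    return best.base_offset

segments = [Segment(0, 100), Segment(1000, 900), Segment(2000, 500)]
print(offset_for_max_timestamp(segments))   # 1000
print(all(s.reads == 1 for s in segments))  # True: every segment was touched
```

With tens of thousands of segments, that per-segment read is exactly where the cost concentrates.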
  15. Index files are lazily loaded
     • In Kafka, time indexes are implemented as memory-mapped files
     • Since creating an mmap has a certain overhead, and we usually don't need to open it for an inactive segment, they are loaded "lazily", on demand
     • Refs: KIP-263: Allow broker to skip sanity check of inactive segments on broker startup
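The lazy-loading pattern described above can be shown in a minimal, generic sketch: the expensive open happens on first access and the result is cached. This is the general pattern only, not Kafka's TimeIndex code, and the file name is illustrative.

```python
# Minimal sketch of lazy, on-demand index materialization: the expensive
# open happens on first access and is cached afterward. Generic pattern,
# not Kafka's TimeIndex implementation.

class LazyTimeIndex:
    def __init__(self, path):
        self.path = path
        self._index = None
        self.open_count = 0  # exposed just for this demo

    def get(self):
        if self._index is None:  # first access pays the materialization cost
            self.open_count += 1
            self._index = self._materialize()
        return self._index

    def _materialize(self):
        # A real broker would mmap the index file here; we just fake it.
        return {"path": self.path}

# Illustrative segment index file name, following Kafka's naming scheme.
idx = LazyTimeIndex("00000000000000000000.timeindex")
idx.get()
idx.get()
print(idx.open_count)  # 1: materialized only once, on first use
```

After a broker restart every index is back in the unmaterialized state, so the first query that sweeps all segments pays the full cost at once.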
  16. Suspect: ListOffsets against a large partition
     • A large partition has many log segments
     • To query the offset corresponding to the timestamp, we may need to open many log segment files
     • Since the broker was just restarted, none of them were open yet
  17. The hypothesis proved!
     • We dumped the ListOffsets request contents into logs and found that a single client was sending ListOffsets with "max-timestamp" queries against many partitions, covering 50K+ segments in total
  18. How we handled it
     • Handling: Contacted the client and asked them to stop
     • Permanent solution:
       • Not trivial to address fundamentally
       • Prevent by authorization => ListOffsets only requires Topic-Describe permission, which we can't deny
       • ListOffsets optimization => We would have to store additional on-disk data just to record the "largest timestamp", which is not optimal
     • Our current strategy (implementation underway):
       • Detect potentially risky ListOffsets calls early, and contact the client manually
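The detection strategy mentioned above could look roughly like the sketch below: estimate how many segments each client's max-timestamp calls would touch and flag clients above a threshold. The function names, data shapes, and threshold are all assumptions for illustration, not LY's actual implementation.

```python
# Hypothetical sketch of detecting risky ListOffsets usage: sum the segment
# counts of the partitions each client queries with "max-timestamp", and flag
# clients whose total exceeds a threshold. Names and the threshold value are
# assumptions, not the real implementation.

SEGMENT_THRESHOLD = 10_000  # assumed alerting threshold

def risky_clients(requests, segments_per_partition, threshold=SEGMENT_THRESHOLD):
    """requests: list of (client_id, partition) max-timestamp ListOffsets calls.
    segments_per_partition: partition name -> current segment count."""
    touched = {}
    for client_id, partition in requests:
        touched[client_id] = touched.get(client_id, 0) + segments_per_partition[partition]
    return {c for c, n in touched.items() if n > threshold}

reqs = [("clientA", "t-0"), ("clientA", "t-1"), ("clientB", "t-0")]
seg_counts = {"t-0": 8_000, "t-1": 7_000}
print(risky_clients(reqs, seg_counts))  # {'clientA'}: 15,000 segments touched
```

Flagging early, before the broker restarts with cold indexes, gives operators time to reach the client while the calls are still cheap.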
  19. Key Takeaways
     • ListOffsets looks innocent, but it can easily take down a broker depending on how it is used
     • Continuous profiling is very useful for troubleshooting
       • Without it, we would have suffered from this issue for much longer