Identifying topics that caused actual disk read

Shota Kondo
LINE Corporation

※ This talk was presented at the following event:
https://kafka-apache-jp.connpass.com/event/284247/

LINE Developers

June 16, 2023
Transcript

  1. Speaker
    • Shota Kondo
    • Member of the LINE IMF team
      • The team is responsible for developing and maintaining the company-wide Kafka platform
      • Provides a multi-tenant shared Kafka cluster
  2. Today’s topic
    • “Identifying topics that caused actual disk read”
    • Why?
      • Disk reads can degrade broker performance
  3. Request handling in Kafka broker
    • Network threads receive requests and send responses to clients after processing
    • Request handler threads do the actual processing of client requests
  4. Disk read in Fetch request handling
    • Consumers usually read topic data from the latest log segments (in page cache)
    • Sometimes a consumer tries to fetch old data that is not in page cache
  5. Disk read and its impact
    • Reading data from HDD is slower than reading from page cache (memory)
    • Blocking in a network thread adds latency to subsequent requests
  6. Applying solutions to such performance degradation
    • Several solutions can be considered
      • Warming up topic data if it is small enough
      • Setting a smaller log segment size to prevent inode lock contention in xfs while reading topic data
        • https://speakerdeck.com/line_developers/investigating-kafka-performance-issue-caused-by-lock-contention-in-xfs
    • To proceed with them, we have to know the topic names
    • If disk read metrics carried the file name, we could use that
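The step from a file name to a topic name can be sketched as follows. Kafka stores each partition's log segments in a directory named `<topic>-<partition>`, so the topic can be recovered from the segment's parent directory. The function name `topic_of` and the example path are hypothetical and not taken from the talk.

```python
import os
import re

# Kafka names each partition directory "<topic>-<partition>".
# Topic names may themselves contain "-", so match the trailing digits only.
_PARTITION_DIR = re.compile(r"^(?P<topic>.+)-(?P<partition>\d+)$")


def topic_of(segment_path):
    """Return (topic, partition) for a log segment path, or None if the
    parent directory does not look like a partition directory."""
    dirname = os.path.basename(os.path.dirname(segment_path))
    match = _PARTITION_DIR.match(dirname)
    if match is None:
        return None
    return match.group("topic"), int(match.group("partition"))


# Hypothetical path, for illustration only.
print(topic_of("/data/kafka/my-topic-12/00000000000000000000.log"))
```

Paths outside a partition directory (for example files directly under the log directory) yield `None` and can be reported under a catch-all label.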
  7. Requirements
    • Collect evidence of actual disk reads for each file
    • Expose the following information as Prometheus metrics
      • Read bytes as the value
      • File name as the label
    • No performance impact on the Kafka broker
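The metric shape described above (read bytes as the value, file name as the label) could look roughly like the sketch below. It renders the Prometheus text exposition format by hand so that it stays self-contained. A real exporter would more likely use a Prometheus client library. The metric name `kafka_disk_read_bytes_total` and the class name are assumptions, not names from the talk.

```python
from collections import defaultdict


class ReadBytesMetric:
    """Accumulates read bytes per file and renders them as a Prometheus counter."""

    def __init__(self, name="kafka_disk_read_bytes_total"):  # hypothetical metric name
        self.name = name
        self._bytes = defaultdict(int)

    def add(self, file_name, nbytes):
        self._bytes[file_name] += nbytes

    def render(self):
        lines = [f"# TYPE {self.name} counter"]
        for file_name in sorted(self._bytes):
            # Label values must escape backslash, double quote and newline.
            label = (file_name.replace("\\", "\\\\")
                              .replace('"', '\\"')
                              .replace("\n", "\\n"))
            lines.append(f'{self.name}{{file="{label}"}} {self._bytes[file_name]}')
        return "\n".join(lines) + "\n"


metric = ReadBytesMetric()
metric.add("my-topic-12/00000000000000000000.log", 4096)
metric.add("my-topic-12/00000000000000000000.log", 4096)
print(metric.render(), end="")
```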
  8. How to capture the disk read?
    • Hook the kernel function related to disk reads
    • Obtain the required information in the hook
  9. How to hook the kernel function?
    • eBPF (extended Berkeley Packet Filter)
      • A feature provided by the Linux kernel
      • Makes it possible to hook kernel events without modifying kernel code
    • bcc (BPF Compiler Collection)
      • Toolkit to compile and run eBPF programs with Python/Lua
  10. eBPF and BCC
    • Example code to hook the read() system call

        #!/usr/bin/python
        from bcc import BPF

        bpf_text = """
        int kprobe__sys_read(struct pt_regs *ctx) {
            bpf_trace_printk("read() syscall was invoked\\n");
            return 0;
        }
        """

        # Compile and attach the probe, then print the kernel trace output
        BPF(text=bpf_text).trace_print()
  11. What kernel function should be hooked?
    • If the data resides in page cache, it is returned without a disk read
    • Need to hook a function that is close to the storage device
  12. generic_make_request()
    • Kernel function that submits I/O requests to devices
    • Looks like a good function to hook

        void generic_make_request(struct bio *bio)
  13. Hook for generic_make_request()

        struct event_t {
            SOME_TYPE file;
            unsigned int bytes;
        };

        int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio)
        {
            /* Extract read file and bytes from argument */
            struct event_t event = {};
            event.file = FILE;
            event.bytes = BYTES;

            /* Pass the data from eBPF program to python script */
            events.perf_submit(ctx, &event, sizeof(event));
            return 0;
        }
  14. Does struct bio have file information…?

        struct bio {
            sector_t            bi_sector;  /* device address in 512 byte sectors */
            struct bio          *bi_next;   /* request queue link */
            struct block_device *bi_bdev;
            unsigned long       bi_flags;   /* status, command, etc */
            unsigned long       bi_rw;      /* bottom bits READ/WRITE,
                                             * top bits priority
                                             */
            unsigned short      bi_vcnt;    /* how many bio_vec's */
            unsigned short      bi_idx;     /* current index into bvl_vec */

            /* Number of segments in this BIO after
             * physical address coalescing is performed.
             */
            unsigned int        bi_phys_segments;

            unsigned int        bi_size;    /* residual I/O count */
            ...
        }
  15. Does struct bio have file information…?
    • bi_size looks usable as the read bytes
    • But the file being read can’t be extracted directly from this argument
    • Need to get the file from somewhere else
      • Another kernel function in an upper layer?
  16. generic_file_aio_read()
    • Generic filesystem read routine
    • Argument iocb has a pointer to the file struct

        ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                                      unsigned long nr_segs, loff_t pos)
  17. Hook for generic_file_aio_read()

        BPF_HASH(inotbl, u64, unsigned long, INO_TABLE_SIZE);

        int kprobe__generic_file_aio_read(struct pt_regs *ctx, struct kiocb *iocb,
                                          const struct iovec *iov,
                                          unsigned long nr_segs, loff_t pos)
        {
            u64 pid_tgid = bpf_get_current_pid_tgid();
            unsigned long ino;

            if (iocb->ki_filp->f_path.dentry->d_inode) {
                ino = iocb->ki_filp->f_path.dentry->d_inode->i_ino;
            } else {
                // Use 0 for a negative dentry (no inode attached)
                ino = 0;
            }
            // update() overwrites any earlier entry for this thread;
            // insert() would keep the inode of the thread's first read
            inotbl.update(&pid_tgid, &ino);
            return 0;
        }
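The hash table above is keyed by the value of `bpf_get_current_pid_tgid()`, which packs the process ID (tgid) into the upper 32 bits and the thread ID into the lower 32 bits. Because the key includes the thread ID, each broker thread keeps its own "most recently read inode" entry. A small sketch of how such a key decomposes on the Python side (the helper name `split_pid_tgid` is hypothetical):

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value of bpf_get_current_pid_tgid() into (tgid, tid)."""
    tgid = pid_tgid >> 32          # process ID as seen from user space
    tid = pid_tgid & 0xFFFFFFFF    # kernel-level thread ID
    return tgid, tid


# A key as it would be stored for thread 5678 of process 1234.
key = (1234 << 32) | 5678
print(split_pid_tgid(key))
```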
  18. We have file information now
    • File information can be obtained from generic_file_aio_read()
    • Then let’s refer to it in generic_make_request()
  19. Hook for generic_make_request()

        int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio)
        {
            // Only account read requests
            if (op_is_write(op_from_rq_bits(bio->bi_rw)))
                return 0;

            u64 pid_tgid = bpf_get_current_pid_tgid();
            unsigned long *pino = inotbl.lookup(&pid_tgid);

            struct event_t event = {};
            if (pino) {
                event.inode = *pino;
            } else {
                event.inode = 0;
            }
            event.bytes = bio->bi_size;

            events.perf_submit(ctx, &event, sizeof(event));
            return 0;
        }
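The hook reports an inode number, while the requirement is a file name label. The slides do not show how that mapping is done. One possible approach, sketched here under that assumption, is to scan the Kafka log directory from user space and index files by inode. The function name `build_inode_index` is hypothetical, and inode numbers are only unique within a single filesystem, so this sketch assumes one filesystem per index.

```python
import os


def build_inode_index(log_dir):
    """Walk log_dir and return a dict mapping inode number -> file path."""
    index = {}
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                index[os.stat(path).st_ino] = path
            except FileNotFoundError:
                # The segment was deleted between listing and stat(); skip it.
                continue
    return index
```

Since segments are rolled and deleted over time, such an index would need to be rebuilt periodically, or refreshed when an unknown inode shows up.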
  20. Receive data from eBPF program
    • Disk read stats are available now, so let’s just expose the metrics

        def record_event(cpu, data, size):
            event = b["events"].event(data)
            # Accumulate received data from eBPF program and expose as prometheus metrics

        b = BPF(text=bpf_text)
        b["events"].open_perf_buffer(record_event)
        while True:
            try:
                b.perf_buffer_poll()
            except KeyboardInterrupt:
                exit()
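The accumulation step left as a comment in `record_event` could be filled in along the following lines. This is a sketch, not the speaker's implementation: it decodes the raw perf buffer record with `ctypes` instead of `b["events"].event(data)` so that it runs without bcc, and it assumes `struct event_t` holds an `unsigned long inode` followed by an `unsigned int bytes`, as in the hook for generic_make_request().

```python
import ctypes
from collections import Counter


class Event(ctypes.Structure):
    # Assumed to mirror: struct event_t { unsigned long inode; unsigned int bytes; };
    _fields_ = [("inode", ctypes.c_ulong), ("bytes", ctypes.c_uint)]


# Total bytes read from disk, keyed by inode number.
read_bytes_by_inode = Counter()


def record_event(cpu, data, size):
    """Perf buffer callback: data is the address of one event_t record."""
    event = ctypes.cast(data, ctypes.POINTER(Event)).contents
    read_bytes_by_inode[event.inode] += event.bytes


# Simulate two 4 KiB reads on inode 42 without a running eBPF program.
sample = Event(inode=42, bytes=4096)
record_event(0, ctypes.addressof(sample), ctypes.sizeof(sample))
record_event(0, ctypes.addressof(sample), ctypes.sizeof(sample))
print(read_bytes_by_inode[42])
```

Keeping only a counter update in the callback keeps the per-event work small. Resolving inodes to file names and rendering metrics can then happen at scrape time.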
  21. Summary
    • Disk reads in network threads can block request processing
    • Per-file disk read stats help identify the topic that caused the disk read
    • eBPF provides a way to observe the system layer
      • And it’s not so hard