Identifying topics that caused actual disk read

Shota Kondo
LINE Corporation
2023.06.16

Speaker
• Shota Kondo
• Member of the LINE IMF team
  • The team is responsible for developing and maintaining the company-wide Kafka platform
  • Provides a multi-tenant shared Kafka cluster

Today’s topic
• “Identifying topics that caused actual disk read”
• Why?
  • Disk reads can degrade broker performance

Disk read and Kafka broker

Request handling in Kafka broker
• Network threads receive requests and send responses to clients after processing
• Request handler threads do the actual processing of client requests

Disk read in Fetch request handling
• Usually, consumers read topic data from the latest log segments (in page cache)
• Sometimes a consumer tries to fetch old data that is not in the page cache

Disk read and its impact
• Reading data from HDD is slower than reading from page cache (memory)
• Blocking in a network thread adds latency to the subsequent requests handled by that thread

To apply solutions to such performance degradation
• Several solutions can be considered
  • Warming up the topic data if it is small enough
  • Setting a smaller log segment size to prevent inode lock contention in xfs while reading topic data
    • https://speakerdeck.com/line_developers/investigating-kafka-performance-issue-caused-by-lock-contention-in-xfs
• To proceed with them, we have to know the topic names
  • If the disk read metrics had file names, we could use them
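
Why a file name is enough to identify the topic: Kafka keeps each partition in a directory named "<topic>-<partition>" under its log directory, so the parent directory of a segment file gives the topic. A minimal sketch, assuming that layout; the function name and the example path are hypothetical and not from the talk:

```python
import re
from pathlib import PurePosixPath

# Kafka stores each partition in a directory named "<topic>-<partition>".
_PARTITION_DIR = re.compile(r"^(?P<topic>.+)-(?P<partition>\d+)$")


def topic_from_segment_path(path):
    """Return (topic, partition) for a log segment path, or None if the
    parent directory does not look like a partition directory."""
    parent = PurePosixPath(path).parent.name
    match = _PARTITION_DIR.match(parent)
    if match is None:
        return None
    return match.group("topic"), int(match.group("partition"))


# Hypothetical path, for illustration only.
print(topic_from_segment_path(
    "/path/to/log.dir/example-topic-3/00000000000000000000.log"))
```

The greedy match keeps hyphens inside the topic name, so only the last "-<digits>" part is treated as the partition number.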

We already have disk read metrics, though…
device: xxx

What we actually need is
file: /data/kafka/…

Then, how do we collect per-file disk read stats?

Requirements
• Collect evidence of actual disk reads for each file
• Expose the following information as Prometheus metrics
  • Read bytes as the value
  • File name as the label
• No performance impact on the Kafka broker
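
To make the target shape concrete, here is a minimal sketch of the metric described above, rendered in the Prometheus text exposition format with only the standard library. The metric name `kafka_disk_read_bytes_total`, the function names, and the example path are assumptions for illustration, not names from the talk:

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(read_bytes_by_file, metric_name="kafka_disk_read_bytes_total"):
    """Render {file name: read bytes} as Prometheus text-format lines."""
    lines = [
        f"# HELP {metric_name} Bytes actually read from disk, per file.",
        f"# TYPE {metric_name} counter",
    ]
    for file_name, read_bytes in sorted(read_bytes_by_file.items()):
        label = escape_label_value(file_name)
        lines.append(f'{metric_name}{{file="{label}"}} {read_bytes}')
    return "\n".join(lines) + "\n"


# Hypothetical file name and value, for illustration only.
print(render_metrics(
    {"/path/to/log.dir/example-topic-0/00000000000000000000.log": 4096}), end="")
```

In practice a Prometheus client library would likely do the rendering; the sketch only shows the value and label layout the requirements ask for.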

How to capture the disk read?
• Hook the kernel function related to disk reads
• Obtain the required information in the hook

How to hook the kernel function?
• eBPF (extended Berkeley Packet Filter)
  • A feature provided by the Linux kernel
  • Makes it possible to hook kernel events without modifying kernel code
• bcc (BPF Compiler Collection)
  • Toolkit to compile and run eBPF programs with Python/Lua

eBPF and BCC
• Example code to hook the read() system call

    #!/usr/bin/python
    from bcc import BPF

    bpf_text = """
    int kprobe__sys_read(struct pt_regs *ctx) {
        bpf_trace_printk("read() syscall was invoked\\n");
        return 0;
    }
    """

    BPF(text=bpf_text).trace_print()

What kernel function should be hooked?
• If the data resides in the page cache, it is returned without a disk read
• Need to hook a function that is close to the storage device

generic_make_request()
• Kernel function to submit I/O requests to devices
• It looks like a good function to hook

    void generic_make_request(struct bio *bio)

Hook for generic_make_request()

    struct event_t {
        SOME_TYPE file;        /* placeholder: filled in on the later slides */
        unsigned int bytes;
    };

    /* Perf buffer used by events.perf_submit() below */
    BPF_PERF_OUTPUT(events);

    int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio)
    {
        /* Extract read file and bytes from argument */
        struct event_t event = {};
        event.file = FILE;     /* placeholder */
        event.bytes = BYTES;   /* placeholder */

        /* Pass the data from eBPF program to python script */
        events.perf_submit(ctx, &event, sizeof(event));
        return 0;
    }

Does struct bio have file information…?

    struct bio {
        sector_t             bi_sector;  /* device address in 512 byte sectors */
        struct bio          *bi_next;    /* request queue link */
        struct block_device *bi_bdev;
        unsigned long        bi_flags;   /* status, command, etc */
        unsigned long        bi_rw;      /* bottom bits READ/WRITE,
                                          * top bits priority
                                          */
        unsigned short       bi_vcnt;    /* how many bio_vec's */
        unsigned short       bi_idx;     /* current index into bvl_vec */

        /* Number of segments in this BIO after
         * physical address coalescing is performed.
         */
        unsigned int         bi_phys_segments;

        unsigned int         bi_size;    /* residual I/O count */
        ...
    }

Does struct bio have file information…?
• It looks like bi_size can be used as the read bytes
• But the file being read can’t be extracted directly from this argument
• Need to get the file being read from somewhere else
  • Another kernel function in an upper layer?

generic_file_aio_read()
• Generic filesystem read routine
• Argument iocb has a pointer to the file struct

    ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                                  unsigned long nr_segs, loff_t pos)

Hook for generic_file_aio_read()

    /* Thread (pid_tgid) -> inode number of the file it is currently reading */
    BPF_HASH(inotbl, u64, unsigned long, INO_TABLE_SIZE);

    int kprobe__generic_file_aio_read(struct pt_regs *ctx, struct kiocb *iocb,
                                      const struct iovec *iov,
                                      unsigned long nr_segs, loff_t pos)
    {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        unsigned long ino;

        if (iocb->ki_filp->f_path.dentry->d_inode) {
            ino = iocb->ki_filp->f_path.dentry->d_inode->i_ino;
        } else {
            // Set 0 if the dentry is negative (has no inode)
            ino = 0;
        }
        // update() overwrites the inode left by a previous read on this thread;
        // insert() would keep the old entry if the key already exists
        inotbl.update(&pid_tgid, &ino);
        return 0;
    }

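We have file information now
• File information can be obtained from generic_file_aio_read()
  • The inode number is stored per thread, keyed by pid_tgid
• Then let’s refer to it in generic_make_request()
  • A bio submitted by the same thread is matched to that inode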

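Hook for generic_make_request()

    /* event_t with the placeholder file field replaced by the inode number
     * (field type assumed from the value type of inotbl) */
    struct event_t {
        unsigned long inode;
        unsigned int bytes;
    };

    int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio)
    {
        // Only account read requests
        if (op_is_write(op_from_rq_bits(bio->bi_rw)))
            return 0;

        u64 pid_tgid = bpf_get_current_pid_tgid();
        unsigned long *pino = inotbl.lookup(&pid_tgid);

        struct event_t event = {};
        if (pino) {
            event.inode = *pino;
        } else {
            // No read recorded for this thread: inode unknown
            event.inode = 0;
        }
        event.bytes = bio->bi_size;

        events.perf_submit(ctx, &event, sizeof(event));
        return 0;
    }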
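
The hook above reports an inode number, while the goal is a file name. The slides do not show how the two are mapped, so the following is only one possible approach: walk the Kafka log directory in user space and index files by inode. The function name `build_inode_index` is hypothetical:

```python
import os


def build_inode_index(log_dir):
    """Map inode number -> file path for every file under log_dir."""
    index = {}
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                index[os.lstat(path).st_ino] = path
            except FileNotFoundError:
                # The segment was deleted between listing and stat; skip it.
                continue
    return index
```

Inode numbers are only unique within one filesystem, and segments are created and deleted over time, so an index like this would need to be built per data volume and refreshed periodically.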

Receive data from eBPF program
• Disk read stats are available now, so let’s just expose the metrics

    def record_event(cpu, data, size):
        event = b["events"].event(data)
        # Accumulate received data from eBPF program and expose as prometheus metrics

    b = BPF(text=bpf_text)
    b["events"].open_perf_buffer(record_event)
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            exit()
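
The "accumulate" step in record_event is left as a comment on the slide. A minimal sketch of what it could look like, assuming events expose `.inode` and `.bytes` attributes as in event_t; the class name is hypothetical, and dropping events with inode 0 is a choice made for this example, not something the talk states:

```python
import threading
from collections import Counter
from types import SimpleNamespace


class ReadBytesAccumulator:
    """Accumulate read bytes per inode. A lock lets the polling thread record
    events while another thread takes snapshots for metric export."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bytes_by_inode = Counter()

    def record(self, event):
        # `event` only needs `.inode` and `.bytes` attributes.
        if event.inode == 0:
            return  # inode unknown (no entry was found in inotbl)
        with self._lock:
            self._bytes_by_inode[event.inode] += event.bytes

    def snapshot(self):
        with self._lock:
            return dict(self._bytes_by_inode)


acc = ReadBytesAccumulator()
# SimpleNamespace stands in for the object returned by b["events"].event(data).
for ev in (SimpleNamespace(inode=42, bytes=4096),
           SimpleNamespace(inode=42, bytes=8192),
           SimpleNamespace(inode=0, bytes=512)):
    acc.record(ev)
print(acc.snapshot())  # {42: 12288}
```

Inside record_event, the call would then be `acc.record(event)`, with the snapshot served from whatever exporter thread publishes the metrics.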

Finally, we get per-file disk read stats!!
file: /data/kafka/…

Summary
• Disk reads in a network thread can block request processing
• Per-file disk read stats help to identify the topics that caused disk reads
• eBPF provides a way to observe the system layer
  • And it’s not so hard
