

Investigating Orphaned Memory Mappings for Kafka's Index and TimeIndex Files

This slide deck was presented at Apache Kafka Meetup Japan #16 on June 4, 2025.
In this session, we explore a file-mapping issue in Apache Kafka and explain how we identified and resolved it.


Transcript

  1. © LY Corporation Masahiro Mori (森 雅弘) Investigating

    Orphaned Memory Mappings for Kafka’s Index and TimeIndex Files
  2. © LY Corporation Masahiro Mori (森 雅弘) Data Pipeline Engineering

    Division, LY Corporation • 2023: joined Yahoo! JAPAN as a new grad • 2023: Yahoo! JAPAN (DataDelivery Dept.) - Kafka DevOps member • 2024~: LY Corporation (DPE) - Kafka DevOps member
  3. © LY Corporation Agenda 01 Background: Kafka Index & TimeIndex

    Files 02 Issue Overview: Disk-I/O Spike & Broker Crash 03 Root Cause Analysis: Orphaned Memory Mappings 04 Proposed Solutions 05 Validation: Experiments & Performance Evaluation 06 Conclusion & Next Steps 3
  4. © LY Corporation • Kafka broker stores each partition’s data

    in separate log segments • The segment size is configured by parameters such as log.segment.bytes • Since consumers can start consuming from an arbitrary offset, they need a way to quickly determine where within a segment to begin reading • ⇒ Use an index file and a timeindex file • The index file maps record offsets to byte offsets in the log segment • The timeindex file maps timestamps to record offsets 4 Segment & Indexes Background [Diagram: a broker holds each partition as a chain of segments (e.g., offsets 0-681, 682-1390, and the active segment 1391-1997), each backed by *.log, *.index, and *.timeindex files; the index maps offsets to byte positions (e.g., offset 128 -> byte 262144, offset 448 -> byte 693118) and the timeindex maps timestamps to offsets (e.g., 1747964243976 -> 128, 1747964244506 -> 448)]
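For illustration, here is a minimal sketch of the lookup idea (not Kafka's actual implementation; the 8-byte entry layout and the class/method names are assumptions made for the example): entries are fixed-size pairs of a 4-byte offset relative to the segment's base offset and a 4-byte byte position, and a lookup binary-searches the memory-mapped entries for the largest offset not greater than the target.

    // Minimal sketch, not Kafka's code: binary-search a memory-mapped offset index
    // whose entries are assumed to be (4-byte relative offset, 4-byte byte position).
    import java.nio.MappedByteBuffer;

    final class OffsetIndexLookupSketch {
        private static final int ENTRY_SIZE = 8;   // assumed entry layout: two 4-byte ints

        // Returns the byte position of the largest indexed offset <= targetOffset,
        // or 0 if the target precedes every indexed entry.
        static int lookupPosition(MappedByteBuffer idx, long baseOffset, long targetOffset, int entryCount) {
            int lo = 0, hi = entryCount - 1, best = -1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                long offsetAtMid = baseOffset + idx.getInt(mid * ENTRY_SIZE);
                if (offsetAtMid <= targetOffset) { best = mid; lo = mid + 1; }
                else                             { hi = mid - 1; }
            }
            return best < 0 ? 0 : idx.getInt(best * ENTRY_SIZE + 4);   // position field
        }
    }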
  5. © LY Corporation • In an actual Kafka directory layout,

    log files for each topic-partition are stored under the directories configured by log.dirs 5 Index & TimeIndex Files Background
    /kafka-logs/data/sandbox-1
    ├── 00000000002258326678.index
    ├── 00000000002258326678.log
    ├── 00000000002258326678.timeindex
    ├── 00000000002259017698.index
    ├── 00000000002259017698.log
    ├── 00000000002259017698.timeindex
    ...
    > bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /kafka-logs/data/sandbox-1/00000000002258326678.index
    Dumping /kafka-logs/data/sandbox-1/00000000002258326678.index
    offset: 2258326854 position: 172692
    offset: 2258326946 position: 284213
    offset: 2258327054 position: 433854
    offset: 2258327174 position: 588997
    offset: 2258327242 position: 778156
    offset: 2258327338 position: 876347
    offset: 2258327465 position: 1018407
    ...
    > bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /kafka-logs/data/sandbox-1/00000000002258326678.timeindex
    Dumping /kafka-logs/data/sandbox-1/00000000002258326678.timeindex
    timestamp: 1747958427295 offset: 2258326854
    timestamp: 1747958427333 offset: 2258326946
    timestamp: 1747958427360 offset: 2258327174
    timestamp: 1747958427371 offset: 2258327242
    timestamp: 1747958427392 offset: 2258327338
    timestamp: 1747958427445 offset: 2258327465
    ...
  6. © LY Corporation • During a Kafka upgrade, restarting certain

    brokers in the cluster took significantly longer than expected, and metrics showed a pronounced spike in disk I/O • At the same time, the total number of file descriptors (FDs) in use on the node dropped off sharply 6 Issue 1: Disk-I/O Spike & File Descriptor Release Issue Overview & Initial Observations [Graphs: Disk I/O, Allocated FDs]
  7. © LY Corporation • We observed that in clusters experiencing

    a disk I/O spike on restart, the number of file descriptors kept growing for over two months • In contrast, healthy clusters released file descriptors within about one day 7 Issue 1: Disk-I/O Spike - Persistent File Descriptor Growth Issue Overview & Initial Observations
  8. © LY Corporation • When we ran lsof, we saw

    that many file descriptors were marked as DEL in the FD column, indicating the underlying files had been deleted • On brokers with a large number of open file descriptors, we counted them and found that roughly 90% pointed to deleted files 8 Issue 1: Disk-I/O Spike - Deleted Files Issue Overview & Initial Observations
    > sudo cat /proc/$KAFKA_PID/maps
    ...
    7df218400000-7df218e00000 rw-s 00000000 08:11 1268777368 /kafka-logs/data/<topic>-<partition>/00000000011769274742.timeindex.deleted (deleted)
    7df219000000-7df219a00000 rw-s 00000000 08:11 1268777367 /kafka-logs/data/<topic>-<partition>/00000000011769274742.index.deleted (deleted)
    7df219c00000-7df21a600000 rw-s 00000000 08:11 1269108428 /kafka-logs/data/<topic>-<partition>/00000000213560781794.timeindex.deleted (deleted)
    7df21a800000-7df21b200000 rw-s 00000000 08:11 1269108427 /kafka-logs/data/<topic>-<partition>/00000000213560781794.index.deleted (deleted)
    7df21b400000-7df21be00000 rw-s 00000000 08:11 1270486651 /kafka-logs/data/<topic>-<partition>/00000000013427564689.timeindex.deleted (deleted)
    7df21c000000-7df21ca00000 rw-s 00000000 08:11 1270486650 /kafka-logs/data/<topic>-<partition>/00000000013427564689.index.deleted (deleted)
    7df21cc00000-7df21d600000 rw-s 00000000 08:11 1269727492 /kafka-logs/data/<topic>-<partition>/00000000047879681875.timeindex.deleted (deleted)
    7df21d800000-7df21e200000 rw-s 00000000 08:11 1269727484 /kafka-logs/data/<topic>-<partition>/00000000047879681875.index.deleted (deleted)
    > sudo lsof -p 2209787
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    java 2209787 kafka cwd DIR 253,0 55 403408548 /usr/apache/kafka/3.5.1
    java 2209787 kafka rtd DIR 253,0 329 512 /
    java 2209787 kafka txt REG 253,0 18544 2080389093 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el9_1.x86_64/jre/bin/java
    java 2209787 kafka DEL REG 8,17 1268777368 /kafka-logs/data/<topic>-<partition>/00000000011769274742.timeindex.deleted
    java 2209787 kafka DEL REG 8,17 1268777367 /kafka-logs/data/<topic>-<partition>/00000000011769274742.index.deleted
    java 2209787 kafka DEL REG 8,17 1269108428 /kafka-logs/data/<topic>-<partition>/00000000213560781794.timeindex.deleted
  9. © LY Corporation • The broker suddenly crashed with the

    following error log 9 Issue 2: Broker Crash Issue Overview & Initial Observations
    [2025-03-29 09:37:03,218] ERROR Error while appending records to <topic>-<partition> in dir /kafka-logs/data ...
    java.io.IOException: Map failed
        at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1127)
        at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1032)
        at org.apache.kafka.storage.internals.log.AbstractIndex.createMappedBuffer(AbstractIndex.java:467)
        at org.apache.kafka.storage.internals.log.AbstractIndex.createAndAssignMmap(AbstractIndex.java:105)
        at org.apache.kafka.storage.internals.log.AbstractIndex.<init>(AbstractIndex.java:83)
        at org.apache.kafka.storage.internals.log.TimeIndex.<init>(TimeIndex.java:65)
        at org.apache.kafka.storage.internals.log.LazyIndex.loadIndex(LazyIndex.java:242)
        at org.apache.kafka.storage.internals.log.LazyIndex.get(LazyIndex.java:179)
        at org.apache.kafka.storage.internals.log.LogSegment.timeIndex(LogSegment.java:146)
        at org.apache.kafka.storage.internals.log.LogSegment.readMaxTimestampAndOffsetSoFar(LogSegment.java:201)
        at org.apache.kafka.storage.internals.log.LogSegment.maxTimestampSoFar(LogSegment.java:211)
        at org.apache.kafka.storage.internals.log.LogSegment.append(LogSegment.java:262)
        at kafka.log.LocalLog.append(LocalLog.scala:417)
        ...
    Caused by: java.lang.OutOfMemoryError: Map failed
        at java.base/sun.nio.ch.FileChannelImpl.map0(Native Method)
        at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1124)
        ... 33 more
  10. © LY Corporation • The broker hit the Linux vm.max_map_count

    limit, causing it to crash • Deleted index files remained memory-mapped, so the total number of mappings continued to grow • ⇒ As a workaround, increase this value 10 Issue 2: Broker Crash – vm.max_map_count Issue Overview & Initial Observations
    https://stackoverflow.com/questions/43042144/kafka-server-failed-to-start-java-io-ioexception-map-failed
    # before
    > sysctl vm.max_map_count
    vm.max_map_count = 262144
    # after
    > sysctl vm.max_map_count
    vm.max_map_count = 1048576
    https://kafka.apache.org/documentation/#os
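To gauge how close a broker is to this limit, the mapping count in /proc/<pid>/maps can be compared with /proc/sys/vm/max_map_count. A minimal sketch (assumes Linux procfs; the class name and output format are ours, not part of Kafka):

    // Minimal monitoring sketch (assumes Linux procfs): count a process's memory
    // mappings and compare them against the vm.max_map_count limit.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class MapCountCheck {
        public static void main(String[] args) throws Exception {
            String pid = args.length > 0 ? args[0] : "self";   // pass the Kafka broker's PID
            long mappings;
            try (Stream<String> lines = Files.lines(Paths.get("/proc/" + pid + "/maps"))) {
                mappings = lines.count();                      // one line per memory mapping
            }
            long limit = Long.parseLong(
                Files.readString(Paths.get("/proc/sys/vm/max_map_count")).trim());
            System.out.printf("mappings=%d, vm.max_map_count=%d (%.1f%% used)%n",
                mappings, limit, 100.0 * mappings / limit);
        }
    }

Run it with the broker's PID as the argument; a count approaching the limit is an early warning before "Map failed" errors appear.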
  11. © LY Corporation • We reviewed past JIRA issues related

    to deleted files and found KAFKA-4614 [1] • If the GC unmaps a deleted file only after a long delay, the unmap can miss the filesystem's inode cache, trigger disk I/O, and degrade performance • The fix was to call AbstractIndex.forceUnmap() at file-deletion time, so the mapping is torn down while the cache still exists • The symptom in KAFKA-4614 matches what we observed • However, even though the KAFKA-4614 patch has been merged upstream, the deleted index/timeindex files were not being unmapped in our case 11 Comparison with KAFKA-4614 Root Cause Analysis [1] https://issues.apache.org/jira/browse/KAFKA-4614 [Diagram: OffsetIndex.delete() calls sys_unlink and the index.deleted file is only marked deleted at the VFS level while the mapping and inode cache remain; much later, a Mixed GC finally triggers sys_munmap, but by then the inode cache has been dropped, so the unmap hits the disks and GC threads block application threads on slow disk access]
  12. © LY Corporation • We suspected there was a bug

    in the index file deletion logic, so we spun up a dev cluster with aggressive log rotation settings to observe the behavior • We lowered log.segment.bytes and log.retention.ms for rapid turnover • While remote debugging via IDE, we used strace to capture system calls and concurrently inspected /proc/PID/maps • We confirmed that Kafka adds the .deleted suffix and schedules the deletion process, and that AbstractIndex.forceUnmap() is then called 12 Delete File Unmapping Behavior Root Cause Analysis
    Debug log from IDE:
    Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927) deleteSegments: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
    Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927) deleteSegments: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted
    Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404) unmap: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
    Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404) unmap: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted
    strace:
    [pid 2126035] openat(AT_FDCWD, "/kafka-logs/data/sandbox-5/00000000000000460772.timeindex", O_RDWR|O_CREAT, 0666) = 305
    [pid 2126035] mmap(NULL, 12, PROT_READ|PROT_WRITE, MAP_SHARED, 305, 0) = 0x7f40b802a000
    ...
    [pid 2126035] munmap(0x7f40b802a000, 12) = 0
    [pid 2126035] unlink("/kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted") = 0
  13. © LY Corporation • Observed that some index files—which had

    been unlinked and should have been unmapped—remained mapped in memory • A deeper inspection of /proc/PID/maps revealed that the same index file was mapped twice at different sizes • The first mapping was never unmapped, causing it to linger in memory 13 Duplicate Memory Mappings Root Cause Analysis
    > sudo cat /proc/$KAFKA_PID/maps | grep deleted
    7f3fc1800000-7f3fc2200000 rw-s 00000000 fc:02 9440471 /kafka-logs/data/sandbox-4/00000000000000461285.timeindex.deleted (deleted)
    7f3fc2400000-7f3fc2e00000 rw-s 00000000 fc:02 9440459 /kafka-logs/data/sandbox-4/00000000000000461285.index.deleted (deleted)
    > sudo cat /proc/$KAFKA_PID/maps | grep 236433.timeindex
    7f6908c00000-7f6909600000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
    7f6a2408c000-7f6a2408d000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
  14. © LY Corporation All methods on the following slides belong

    to the AbstractIndex class • The forceUnmap() method unmaps the MappedByteBuffer instance held in the mmap field of the AbstractIndex class • This mmap instance corresponds to the file mapping of the index file seen in procfs • New mmap instances are created in two places • In the constructor: createAndAssignMmap() • When the index file is resized: resize(), which changes the file size and recreates the mmap instance 14 forceUnmap() Root Cause Analysis https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
    public abstract class AbstractIndex implements Closeable {
        ...
        private volatile MappedByteBuffer mmap;
        ...
        public void forceUnmap() throws IOException {
            try {
                ByteBufferUnmapper.unmap(file.getAbsolutePath(), mmap);
            } finally {
                mmap = null;
            }
        }
  15. © LY Corporation • On index file resize, the old

    memory mapping is never freed • Kafka may resize an index file during segment rotation or when loading an active segment at broker startup • In this process, the ftruncate and mmap syscalls are issued, but the old mapping is not freed • As the code comment notes, Windows and z/OS do not allow modifying the file length while the file is mapped, so the unmap is performed only on those platforms • On Linux, safeForceUnmap() is not executed 15 Missing munmap on Index File Resize Root Cause Analysis https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
    public boolean resize(int newSize) throws IOException {
        int roundedNewSize = roundDownToExactMultiple(newSize, entrySize());
        RandomAccessFile raf = new RandomAccessFile(file, "rw");          /* openat */
        try {
            /* Windows or z/OS won't let us modify the file length while the file is mmapped :-( */
            if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
                safeForceUnmap();                                         /* munmap (not called on Linux) */
            raf.setLength(roundedNewSize);                                /* ftruncate */
            this.length = roundedNewSize;
            mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize);   /* mmap */
            return true;

    protected void safeForceUnmap() {
        if (mmap != null) {
            try {
                forceUnmap();
            } catch (Throwable t) {
                log.error("Error unmapping index {}", file, t);
            }
        }
    }

    public void forceUnmap() throws IOException {
        try {
            ByteBufferUnmapper.unmap(file.getAbsolutePath(), mmap);
        } finally {
            mmap = null;
        }
    }
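The leak pattern can be reproduced outside Kafka. A minimal standalone sketch (hypothetical /tmp path, not Kafka code): map a file, truncate it, and map it again without unmapping the first buffer; on Linux, /proc/self/maps then lists the file twice until the old MappedByteBuffer happens to be garbage-collected.

    // Standalone sketch of the leak pattern (hypothetical path, not Kafka code):
    // remap a file after resizing it without unmapping the previous buffer and
    // observe that /proc/self/maps lists the file twice.
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DoubleMappingDemo {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("/tmp/demo.index", "rw")) {
                raf.setLength(10 * 1024 * 1024);        // preallocated size, like a fresh index file
                MappedByteBuffer oldMap =               // first mapping; kept referenced so GC won't reclaim it
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());

                raf.setLength(4096);                    // "resize" (ftruncate); allowed on Linux while mapped
                MappedByteBuffer newMap =               // remap without unmapping oldMap
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());
                newMap.putInt(0, 42);                   // use the new mapping

                // On Linux this prints two mappings for the same file:
                // the stale 10 MiB one and the new 4 KiB one.
                for (String line : Files.readAllLines(Paths.get("/proc/self/maps")))
                    if (line.contains("demo.index"))
                        System.out.println(line);
            }
        }
    }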
  16. © LY Corporation 1. Patch Kafka Code • Explicitly call

    munmap when resizing an index file on Linux, just as is already done on Windows and z/OS 2. JVM Parameter Tuning • Not all clusters exhibit this issue • In healthy clusters, Mixed GC runs periodically reclaim mappings for deleted files • Older mmaps are not collected by the young GC • JVM tuning might alleviate the symptoms temporarily, but since we saw the problem without any JVM-side changes, it could recur under different workloads or configurations ⇒ We chose to patch the Kafka code as the long-term solution 16 Proposed Solutions
  17. © LY Corporation • In resize(), call safeForceUnmap() before ftruncate

    and remapping 17 Applied Changes: AbstractIndex.resize(), Kafka v3.8.1 Validation
    --- a/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
    +++ b/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
    @@ -198,9 +198,8 @@ public boolean resize(int newSize) throws IOException {
             try {
                 int position = mmap.position();
    -            /* Windows or z/OS won't let us modify the file length while the file is mmapped :-( */
    -            if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
    -                safeForceUnmap();
    +            safeForceUnmap();
                 raf.setLength(roundedNewSize);
                 this.length = roundedNewSize;
                 mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize);
  18. © LY Corporation • While researching how to safely lock

    mmap instances, we discovered KAFKA-7442 [1] • This issue reported the same problem we encountered, but the proposed PR was rejected on safety grounds • ⇒ Acquire a lock at every point where the mmap instance is accessed, since touching a buffer that another thread has already unmapped can crash the JVM 18 Applied Changes: lock, Kafka v3.8.1 Validation
    // storage/src/main/java/org/apache/kafka/storage/internals/log/OffsetIndex.java
    public OffsetPosition lookup(long targetOffset) {
        lock.lock();
        try {
            ByteBuffer idx = mmap().duplicate();
            int slot = largestLowerBoundSlotFor(idx, targetOffset, IndexSearchType.KEY);
            if (slot == -1)
                return new OffsetPosition(baseOffset(), 0);
            else
                return parseEntry(idx, slot);
        } finally {
            lock.unlock();
        }
    }
    [1] https://issues.apache.org/jira/browse/KAFKA-7442
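The writer side follows the same idea. A generic sketch of the pattern (not the actual patch; the class and method names are illustrative): the resize-and-unmap path and every read take the same lock, so a reader can never touch a buffer that has just been unmapped.

    // Generic sketch of the locking pattern (not the actual Kafka patch): all accesses
    // to the mapped buffer, including unmap-and-remap on resize, go through one lock.
    import java.nio.MappedByteBuffer;
    import java.util.concurrent.locks.ReentrantLock;

    abstract class LockedIndexSketch {
        private final ReentrantLock lock = new ReentrantLock();
        private volatile MappedByteBuffer mmap;

        // Resize path: unmap the old buffer and install the new one under the lock.
        void replaceMapping(MappedByteBuffer newMmap) {
            lock.lock();
            try {
                unmap(mmap);      // platform-specific forced unmap, e.g. via an unmapper utility
                mmap = newMmap;
            } finally {
                lock.unlock();
            }
        }

        // Reader path: duplicate and read the buffer only while holding the same lock.
        int readIntAt(int index) {
            lock.lock();
            try {
                return mmap.duplicate().getInt(index);
            } finally {
                lock.unlock();
            }
        }

        abstract void unmap(MappedByteBuffer buffer);
    }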
  19. © LY Corporation • Confirmed that no memory mappings remain

    for deleted index/timeindex files • Verified from broker metrics and logs that there are no adverse side effects • Request handler and network processor thread idle percentages remain on par with other brokers • No measurable impact on CPU, disk I/O, or memory utilization 19 Results Validation
    > sudo cat /proc/775707/maps | awk '/deleted/ && /index/' | wc -l
    0
  20. © LY Corporation • Issues • Some Kafka clusters experienced

    high disk I/O spikes on broker restart/shutdown • Brokers occasionally crashed after hitting the vm.max_map_count limit
    Root cause • Like KAFKA-4614, deleted metadata files were still held memory-mapped • Variations in traffic and JVM tuning led to very infrequent Mixed GC runs, so mappings weren't reclaimed in time • ⇒ Fundamental flaw: the index file resize logic never unmaps the old memory buffer
    Solution • Before resizing, invoke safeForceUnmap() to forcibly unmap the old buffer, and add locking around all mmap accesses
    Results • Confirmed the fix eliminates the issue with no measurable performance impact
    Next steps • Prepare and submit the patch for inclusion in the upstream Kafka project
    20 Conclusion & Next Steps