Investigating Orphaned Memory Mappings for Kafka's Index and TimeIndex Files
This slide deck was presented at Apache Kafka Meetup Japan #16 on June 4, 2025.
In this session, we explore a file-mapping issue in Apache Kafka and explain how we identified and resolved it.
Slide 4: Segment & Indexes (Background)
• A partition's records are stored in separate log segments
• The segment size is configured by parameters such as log.segment.bytes
• Since consumers can start consuming from an arbitrary offset, they need to quickly determine where within a segment to begin reading
  ⇒ Kafka keeps an index file and a timeindex file per segment
• The index file maps record offsets to byte offsets within the log segment
• The timeindex file maps timestamps to record offsets
[Figure: a broker's partition directory with several segments (*.log, *.index, *.timeindex), each covering a start/end offset range, the last one being the active segment; sample index entries map offsets (128, 448) to byte positions (262144, 693118), and timeindex entries map timestamps (1747964243976, 1747964244506) to offsets (128, 448)]
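To make the lookup concrete, here is a minimal, self-contained Java sketch, not Kafka's actual OffsetIndex code. It assumes the on-disk layout of a *.index file: 8-byte entries, a 4-byte offset relative to the segment's base offset followed by a 4-byte byte position, and assumes the file contains only valid entries. The class name IndexLookupSketch, the example target offset, and the command-line argument handling are ours, purely for illustration. A timeindex lookup works the same way over 12-byte entries (8-byte timestamp, 4-byte relative offset).

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of an offset-index lookup over a memory-mapped *.index file.
// Each entry is assumed to be 8 bytes: relative offset (int) + byte position (int).
public class IndexLookupSketch {
    private static final int ENTRY_SIZE = 8;

    // Returns the byte position of the largest indexed offset <= targetOffset,
    // or 0 if the target precedes the first indexed entry.
    static int lookup(MappedByteBuffer idx, long baseOffset, long targetOffset) {
        int entries = idx.limit() / ENTRY_SIZE;
        int lo = 0, hi = entries - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long offsetAtMid = baseOffset + idx.getInt(mid * ENTRY_SIZE);
            if (offsetAtMid <= targetOffset) { found = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return found == -1 ? 0 : idx.getInt(found * ENTRY_SIZE + 4);
    }

    public static void main(String[] args) throws IOException {
        Path indexFile = Path.of(args[0]);   // e.g. 00000000000000000000.index
        long baseOffset = 0L;                // base offset is encoded in the file name
        try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ)) {
            MappedByteBuffer idx = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println("start reading *.log at byte " + lookup(idx, baseOffset, 123L));
        }
    }
}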
Slide 6: Issue 1: Disk-I/O Spike & File Descriptor Release (Issue Overview & Initial Observations)
• Restarting brokers in the cluster took significantly longer than expected, and metrics showed a pronounced spike in disk I/O
• At the same time, the total number of file descriptors (FDs) in use on the node dropped sharply
[Charts: Disk I/O and Allocated FDs around the broker restart]
Slide 7: Issue 1: Disk-I/O Spike - Persistent File Descriptor Growth (Issue Overview & Initial Observations)
• On the clusters that showed a disk I/O spike on restart, the number of file descriptors kept growing for over two months
• In contrast, healthy clusters released file descriptors within about one day
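For reference, the per-process side of such an FD metric can be sampled from inside the JVM with the platform MXBean. The graphs on these slides come from our own node-level monitoring, so the sketch below is only an illustration; the class name FdCount is ours.

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Sketch: read the current process's open and maximum file descriptor counts
// via the Unix-specific operating system MXBean (HotSpot on Linux/macOS).
public class FdCount {
    public static void main(String[] args) {
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("open fds = " + os.getOpenFileDescriptorCount()
                + " / max = " + os.getMaxFileDescriptorCount());
    }
}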
Slide 9: Issue 2: Broker Crash (Issue Overview & Initial Observations)
• A broker crashed, leaving the following error log:

[2025-03-29 09:37:03,218] ERROR Error while appending records to <topic>-<partition> in dir /kafka-logs/data ...
java.io.IOException: Map failed
    at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1127)
    at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1032)
    at org.apache.kafka.storage.internals.log.AbstractIndex.createMappedBuffer(AbstractIndex.java:467)
    at org.apache.kafka.storage.internals.log.AbstractIndex.createAndAssignMmap(AbstractIndex.java:105)
    at org.apache.kafka.storage.internals.log.AbstractIndex.<init>(AbstractIndex.java:83)
    at org.apache.kafka.storage.internals.log.TimeIndex.<init>(TimeIndex.java:65)
    at org.apache.kafka.storage.internals.log.LazyIndex.loadIndex(LazyIndex.java:242)
    at org.apache.kafka.storage.internals.log.LazyIndex.get(LazyIndex.java:179)
    at org.apache.kafka.storage.internals.log.LogSegment.timeIndex(LogSegment.java:146)
    at org.apache.kafka.storage.internals.log.LogSegment.readMaxTimestampAndOffsetSoFar(LogSegment.java:201)
    at org.apache.kafka.storage.internals.log.LogSegment.maxTimestampSoFar(LogSegment.java:211)
    at org.apache.kafka.storage.internals.log.LogSegment.append(LogSegment.java:262)
    at kafka.log.LocalLog.append(LocalLog.scala:417)
    ...
Caused by: java.lang.OutOfMemoryError: Map failed
    at java.base/sun.nio.ch.FileChannelImpl.map0(Native Method)
    at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1124)
    ... 33 more
Slide 10: Issue 2: Broker Crash – vm.max_map_count (Issue Overview & Initial Observations)
• The broker's number of memory mappings hit the vm.max_map_count limit, causing it to crash
  ⇒ As a workaround, increase this value
• Deleted index files remained memory-mapped, so the total number of mappings continued to grow

# before
> sysctl vm.max_map_count
vm.max_map_count = 262144
# after
> sysctl vm.max_map_count
vm.max_map_count = 1048576

https://stackoverflow.com/questions/43042144/kafka-server-failed-to-start-java-io-ioexception-map-failed
https://kafka.apache.org/documentation/#os
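To watch how close a broker gets to this limit, one option is to compare the number of entries in /proc/<pid>/maps with the kernel setting. The Linux-only sketch below does this for the current process; the class name MapCountCheck is ours, and the one-mapping-per-line reading of /proc/self/maps is a Linux-specific assumption.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Linux-only sketch: count this JVM's memory mappings and compare against vm.max_map_count.
public class MapCountCheck {
    public static void main(String[] args) throws IOException {
        int mappings = Files.readAllLines(Path.of("/proc/self/maps")).size();
        long limit = Long.parseLong(
                Files.readString(Path.of("/proc/sys/vm/max_map_count")).trim());
        System.out.printf("mappings=%d limit=%d (%.1f%% used)%n",
                mappings, limit, 100.0 * mappings / limit);
    }
}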
Slide 11: Comparison with KAFKA-4614 (Root Cause Analysis)
• We searched for known issues about memory mappings of deleted files and found KAFKA-4614 [1]
• A GC-delayed unmap of a deleted file can miss the filesystem's inode cache, trigger disk I/O, and degrade performance
• The fix was to call AbstractIndex.forceUnmap() at file-deletion time, so the mapping is torn down while the inode cache still exists
• The symptom in KAFKA-4614 matches what we observed
• However, even though the KAFKA-4614 patch has been merged upstream, the deleted index/timeindex files are not being unmapped in our case

[Diagram: an index.deleted file stays mapped after sys_unlink (OffsetIndex.delete() only marks it deleted at the VFS level, so the inode survives); much later a Mixed GC triggers sys_munmap, by which time the filesystem's inode cache may be gone, so the unmap requires slow disk access and GC threads block application threads]

[1] https://issues.apache.org/jira/browse/KAFKA-4614
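The mechanism KAFKA-4614 describes can be reproduced outside Kafka. The Linux-only sketch below, our own demo class rather than Kafka code, maps a temporary file, unlinks it, and shows that /proc/self/maps still lists the mapping with a "(deleted)" suffix until the buffer is unmapped or garbage-collected.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Linux-only sketch: unlinking a memory-mapped file does not remove the mapping;
// /proc/self/maps keeps listing it as "(deleted)" while the MappedByteBuffer is alive.
public class DeletedMappingDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("demo", ".index");
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mmap = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096); // grows the file
            Files.delete(f);                                                          // unlink while mapped
            for (String line : Files.readAllLines(Path.of("/proc/self/maps")))
                if (line.contains(f.getFileName().toString()))
                    System.out.println(line);                                         // still listed, "(deleted)"
            mmap.put(0, (byte) 1);                                                    // keep the mapping alive until here
        }
    }
}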
Slide 12: Delete File Unmapping Behavior (Root Cause Analysis)
• We suspected a problem in the index file deletion logic, so we spun up a dev cluster with aggressive log rotation settings to observe the behavior
• We lowered log.segment.bytes and log.retention.ms for rapid turnover
• While remote debugging via the IDE, we used strace to capture system calls and concurrently inspected /proc/PID/maps
• We confirmed that Kafka adds the .deleted suffix, schedules the deletion, and then calls AbstractIndex.forceUnmap()

Debug log from IDE:
Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927)
deleteSegments: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927)
deleteSegments: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted
Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404)
unmap: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404)
unmap: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted

strace:
[pid 2126035] openat(AT_FDCWD, "/kafka-logs/data/sandbox-5/00000000000000460772.timeindex", O_RDWR|O_CREAT, 0666) = 305
[pid 2126035] mmap(NULL, 12, PROT_READ|PROT_WRITE, MAP_SHARED, 305, 0) = 0x7f40b802a000
...
[pid 2126035] munmap(0x7f40b802a000, 12) = 0
[pid 2126035] unlink("/kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted") = 0
Slide 13: Duplicate Memory Mappings (Root Cause Analysis)
• However, some index files that had been unlinked, and should therefore have been unmapped, remained mapped in memory
• A deeper inspection of /proc/PID/maps revealed that the same index file was mapped twice at different sizes
• The first mapping was never unmapped, causing it to linger in memory

> sudo cat /proc/$KAFKA_PID/maps | grep deleted
7f3fc1800000-7f3fc2200000 rw-s 00000000 fc:02 9440471 /kafka-logs/data/sandbox-4/00000000000000461285.timeindex.deleted (deleted)
7f3fc2400000-7f3fc2e00000 rw-s 00000000 fc:02 9440459 /kafka-logs/data/sandbox-4/00000000000000461285.index.deleted (deleted)

> sudo cat /proc/$KAFKA_PID/maps | grep 236433.timeindex
7f6908c00000-7f6909600000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
7f6a2408c000-7f6a2408d000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
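The duplicate entries above can be reproduced with a few lines of standalone Java. The Linux-only sketch below, our own demo class rather than Kafka code, maps a file, then truncates and remaps it at a smaller size without unmapping the first buffer, and prints the two live mappings of the same inode from /proc/self/maps.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Linux-only sketch: remapping a file at a new size without unmapping the old
// MappedByteBuffer leaves two live mappings of the same inode.
public class DuplicateMappingDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("segment", ".timeindex");
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "rw")) {
            FileChannel ch = raf.getChannel();
            MappedByteBuffer first = ch.map(FileChannel.MapMode.READ_WRITE, 0, 10 * 1024 * 1024);
            raf.setLength(4096);                                                   // shrink the file
            MappedByteBuffer second = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            for (String line : Files.readAllLines(Path.of("/proc/self/maps")))
                if (line.contains(f.getFileName().toString()))
                    System.out.println(line);                                      // two entries, different sizes
            first.get(0); second.get(0);                                           // keep both buffers alive
        }
    }
}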
Slide 14: forceUnmap() (Root Cause Analysis)
• We turned our attention to the AbstractIndex class
• The forceUnmap() method unmaps the MappedByteBuffer instance held in the mmap field of the AbstractIndex class
• This mmap instance corresponds to the file mapping of the index file seen in procfs
• New mmap instances are created in two places:
  • In the constructor: createAndAssignMmap()
  • When the index file is resized: resize(), which changes the file size and recreates the mmap instance

https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
public abstract class AbstractIndex implements Closeable {
    ...
    private volatile MappedByteBuffer mmap;
    ...
    public void forceUnmap() throws IOException {
        try {
            ByteBufferUnmapper.unmap(file.getAbsolutePath(), mmap);
        } finally {
            mmap = null;
        }
    }
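For context, explicit unmapping on Java 9+ is typically done through sun.misc.Unsafe.invokeCleaner, and the sketch below shows that mechanism in isolation. This is an assumption about what ByteBufferUnmapper relies on internally, not a copy of Kafka's implementation; the class name UnmapSketch is ours, and the buffer must never be touched after the call.

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// Sketch of explicit unmapping on Java 9+: invokeCleaner releases a mapped buffer's
// native memory (munmap) immediately instead of waiting for the GC to collect it.
public final class UnmapSketch {
    public static void unmap(ByteBuffer mapped) {
        if (mapped == null || !mapped.isDirect()) return;
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);
            unsafe.invokeCleaner(mapped);   // the buffer is invalid after this point
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("explicit unmap not supported on this JVM", e);
        }
    }
}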
Slide 15: Missing munmap on Index File Resize (Root Cause Analysis)
• When an index file is resized, the old memory mapping is never freed
• Kafka may resize an index file during segment rotation or when loading an active segment at broker startup
• In this process, the ftruncate and mmap syscalls are issued, but the old mapping is not freed
• As the code comment notes, Windows and z/OS do not allow changing the length of a file while it is mapped, so the unmap is performed only on those platforms
• On Linux, safeForceUnmap() is not executed

https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
public boolean resize(int newSize) throws IOException {
    int roundedNewSize = roundDownToExactMultiple(newSize, entrySize());
    RandomAccessFile raf = new RandomAccessFile(file, "rw");                 /* openat */
    try {
        /* Windows or z/OS won't let us modify the file length while the file is mmapped :-( */
        if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
            safeForceUnmap();                                                /* munmap (not executed on Linux) */
        raf.setLength(roundedNewSize);                                       /* ftruncate */
        this.length = roundedNewSize;
        mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize);  /* mmap */
        return true;
    ...

protected void safeForceUnmap() {
    if (mmap != null) {
        try {
            forceUnmap();
        } catch (Throwable t) {
            log.error("Error unmapping index {}", file, t);
        }
    }
}

public void forceUnmap() throws IOException {
    try {
        ByteBufferUnmapper.unmap(file.getAbsolutePath(), mmap);
    } finally {
        mmap = null;
    }
}
Slide 16: Proposed Solutions
1. Patch Kafka: unmap the old buffer when resizing an index file on Linux, as is already done on Windows and z/OS (see the sketch below)
2. JVM Parameter Tuning
   • Not all clusters exhibit this issue
   • In healthy clusters, Mixed GC periodically reclaims the mappings for deleted files
   • Older mmaps are not collected by the young GC
   • JVM tuning might alleviate the symptoms temporarily, but since we saw the problem without any JVM-side changes, it could recur under different workloads or configurations
⇒ We chose to patch the Kafka code as the long-term solution
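A minimal sketch of proposed solution 1, based on the resize() excerpt shown on the previous slide: unmap the old buffer on every OS before truncating, and keep the whole swap under the index's lock. This shows how the change could look, not the final upstream patch; fields such as file, length, and mmap come from AbstractIndex, and lock refers to the lock our patch takes around all mmap accesses (see the next slides).

// Sketch of a patched AbstractIndex.resize(): always unmap before truncating.
public boolean resize(int newSize) throws IOException {
    lock.lock();
    try {
        int roundedNewSize = roundDownToExactMultiple(newSize, entrySize());
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            safeForceUnmap();                                  // munmap the old mapping on Linux as well
            raf.setLength(roundedNewSize);                     // ftruncate
            this.length = roundedNewSize;
            mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize);  // fresh mmap
            return true;
        }
    } finally {
        lock.unlock();
    }
}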
Slide 18: Applied Changes: lock, Kafka v3.8.1 (Validation)
• While investigating how to safely unmap mmap instances, we discovered KAFKA-7442 [1]
• This issue reported the same problem we encountered, but the proposed PR was rejected on safety grounds
  ⇒ Acquire a lock at every point where the mmap instance is accessed

// storage/src/main/java/org/apache/kafka/storage/internals/log/OffsetIndex.java
public OffsetPosition lookup(long targetOffset) {
    lock.lock();
    try {
        ByteBuffer idx = mmap().duplicate();
        int slot = largestLowerBoundSlotFor(idx, targetOffset, IndexSearchType.KEY);
        if (slot == -1)
            return new OffsetPosition(baseOffset(), 0);
        else
            return parseEntry(idx, slot);
    } finally {
        lock.unlock();
    }
}

[1] https://issues.apache.org/jira/browse/KAFKA-7442
Slide 19: Results (Validation)
• After applying the patch, no memory mappings remain for deleted index/timeindex files
• Verified from broker metrics and logs that there are no adverse side effects
  • Request handler and network processor thread idle percentages remain on par with other brokers
  • No measurable impact on CPU, disk I/O, or memory utilization

> sudo cat /proc/775707/maps | awk '/deleted/ && /index/' | wc -l
0
Slide 20: Conclusion & Next Steps
• Symptoms
  • High disk I/O spikes on broker restart/shutdown
  • Brokers occasionally crashed after hitting the vm.max_map_count limit
• Root cause
  • Like KAFKA-4614, deleted index/timeindex files were still held memory-mapped
  • Variations in traffic and JVM tuning led to very infrequent Mixed GC runs, so mappings weren't reclaimed in time
  ⇒ Fundamental flaw: the index file resize logic never unmaps the old memory buffer
• Solution
  • Before resizing, invoke safeForceUnmap() to forcibly unmap the old buffer, and add locking around all mmap accesses
• Results
  • Confirmed the fix eliminates the issue with no measurable performance impact
• Next steps
  • Prepare and submit the patch for inclusion in the upstream Kafka project