Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Streaming I/O

Streaming I/O

Thomas Munro

June 01, 2024

More Decks by Thomas Munro

Other Decks in Programming


  1. PGConf.dev 2024 | Vancouver, Canada | Thomas Munro | Open

    source database hacker at Microsoft Streaming I/O New abstractions for e ffi cient fi le I/O
  2. Talk • Part I: Review OS I/O facilities • Part

    II: Read streams • Part III: Write streams • Part IV: Looking ahead
  3. • bu ff ered I/O vs
 direct I/O • vectored

    I/O AKA scatter/gather I/O • synchronous I/O vs asynchronous I/O Database I/O Programming
  4. read, write, worksync, iowait MULTICS (’65) UNIX (’69) read, write

    POSIX (’93) aio_read, aio_write, aio_suspend, RT signals ’80s, ’90s: p… = with position
 …v = vectored
 p…v = both Linux (’19) io_uring O_DIRECT Linux (’03?) libaio + kernel support POSIX giveth, but nobody noticeth MULTICS giveth UNIX taketh away Linux giveth (again), but this time everybody liketh! IBM S/360 (’65) DEC RX11 (’71) VMS (’77) NT (’93) AIO!
  5. What are the p… functions for? Primary motivation was threads

    (not relevant to us, yet) ssize_t read(int filedes, void *buf, size_t nbyte) ssize_t pread(int filedes, void *buf, size_t nbyte, off_t offset) • The concept of “current file position” doesn’t fit well with multi-threaded applications sharing “seekable” file descriptors • PostgreSQL 12 switched from lseek + read/write to pread/pwrite, removing extra system calls and book-keeping in vfd layer • Required wrapper function for Windows; unfortunately the wrappers are imperfect and change the current position, so we use the names pg_pread and pg_pwrite to remember that!
  6. Who needs vectored I/O? Systems that manage their own bu

    ff er
 pool (basically, databases) • We want to read contiguous blocks of a file into memory in one op • Replacement algorithm doesn’t try to find contiguous memory blocks (and shouldn’t!) • Can also map to single physical I/O ssize_t pread (int filedes, void *buf, size_t nbytes, off_t offset) ssize_t preadv(int filedes, struct iovec *iov, int iovcnt, off_t offset) struct iovec { void *iov_base; size_t iov_len; };
  7. • POSIX hasn’t standardised p…v yet (despite having p… and

    …v!), but every Unix now has them and they are proposed for POSIX issue 8 at https://austingroupbugs.net/view.php?id=1832 • Windows doesn’t have a general enough* equivalent operation, so we have pg_preadv and pg_pwritev macros that just fall back to looping over pg_pread and pg_pwrite on that OS.
 *The native scatter/gather operations work only on non-buffered, overlapped (asynchronous) file handles. We might be able to use them in future work… they were made for databases! De facto Unix portability (aliquando de jure)
  8. Who wants direct I/O? Systems that manage their own bu

    ff er pool (basically, databases*) • Buffering costs CPU (extra copying) • Buffering costs RAM (extra copy) • Read-ahead and write-behind heuristics not good enough • Buffered I/O is still good for many uses cases though! • Direct I/O requires tuning (*Also big data systems with extremely low access frequency) 4k 3k 2k 1G 2G 0G TPS
  9. - https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics “The exact semantics of Direct I/O (O_DIRECT) are

    not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.”
  10. • On OSes that support O_DIRECT (most!), it doesn’t work

    on all file systems • PostgreSQL 16 added debug_io_direct=data,wal for testing • Currently meeting all known alignment requirements • Early testing already revealed some interesting problems (btrfs checksums can corrupt themselves when we set hint bits!)
  11. Who wants asynchronous I/O? People using direct I/O! • Wtihout

    buffering, you lose read-ahead heuristics • We ought to be able to do better read-ahead and write-behind than the kernel anyway, we have much more information (cf. “advice”-based prefetching + sync_file_range()) • Minimum solution is to have I/O worker threads/processes running preadv/pwritev (at application level, or inside libc/librt) • Modern (and ancient) OSes offer ways to skip the scheduling and IPC overheads
  12. submission queue entries completion queue entries • io_uring_enter(): initiate and/or

    wait for many operations • Start multiple operations at once by writing them into a submission queue in user space memory and then telling the kernel • Consume completion notifications, either directly from userspace memory if possible, or by waiting if not • Directly modelled on kernel/device communication on the “other side” in hardware
  13. pread(8KB) pread(8KB) pread(8KB) pread(8KB) pread(8KB) preadv(128KB) preadv(128KB) preadv(128KB) preadv(128KB) preadv(128KB)

    preadv(128KB) preadv(128KB) preadv(128KB) io_uring_enter() io_uring_enter() [v15] Block-at-a-time, synchronous, bu ff ered [v17] Larger vectored I/O, optional direct I/O [WIP] I/O calls running ahead of time in worker processes [WIP] Kernel managed asynchronous I/O, no extra process overheads (O ffl oaded to worker)
  14. storage DMA engine SQ CQ NVMe io_uring SQ CQ direct

    I/O DMA SELECT COUNT(*) Lofty ideal Zero copy, zero wait, 100% on CPU bu ff er pool Video credit: Wallace & Gromit “The Wrong Trousers”, 1993, by Aardman
 Cycle loop from giphy.com
  15. “Reading” blocks of relation data A very common operation •

    PostgreSQL works in terms of 8KB blocks, calling ReadBuffer(relation identifier, block_number) to access each one • If the buffer is already in the buffer pool, it is pinned • If the buffer is not already in the buffer pool, it must be loaded first, possibly after evicting something else • In order to build larger I/Os and start the physical I/O asynchronously, we need to find all the places that do that, and somehow convince them to participate in a new grouping scheme
  16. for (i = 0; i < nblocks; ++i) { buf

    = ReadBuffer(…, i); ReleaseBuffer(buf); } static BlockNumber my_blocknum_callback(void *private_data); stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …); for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } io_combine_limit block numbers are pulled in here pinned buffers are pulled out here combined blocks read in with preadv()
  17. static BlockNumber my_blocknum_callback(void *private_data); stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …);

    for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } e ff ective_io_concurrency non-sequential block numbers hinted to kernel with posix_fadvise() preadv() deferred until absolutely necessary, so the hint as a good chance of working! By issuing POSIX_FADV_WILLNEED as soon as possible and preadv() as late as possible, we get a sort of poor man’s asynchronous I/O.
  18. 0 1 2 3 4 5 Arithmetic-driven:
 seq scan (17

 ANALYZE sampling (17 beta) Data-driven:
 bitmap heapscan (WIP)
 recovery (WIP) 0 1 2 3 4 5
  19. Ramping up and down • A stream doesn’t generally know

    if a SELECT … LIMIT 1 might not want more than one block, so it starts out reading just a single block and increases the look ahead distance only while that seems to be useful. • When regions of already-cached blocks are encountered, the look ahead distances slowly decreases. • In this way we don’t pay extra overheads such as extra pins and book keeping unless there is some bene fi t to it.
  20. A B C 1 io_combine_limit K * e ff ective_io_concurrency

    Tuning the look-ahead distance All cached Sequential I/O pattern detected: currently no point in look ahead further than io_combine_limit Random I/O pattern detected: currently fadvise used to control concurrency Distance moves up and down in response to randomness, hits and misses
  21. Fully cached data Even without I/O, there may be some

    gains available • If you can predict future, you can also prefetch memory. Early experiments show sign fi cant gains. • In the future, could we perform more e ffi cient bu ff er mapping table lookups (combining neighbouring blocks in tree-based lookup? batched partition locking? SIMD hash computation? …) Experim ental
  22. Out-of-order streams • The current behaviour of a read stream

    is to emit bu ff ers in the order that the callback emitted block numbers • Not all users require ordered data, and then cached blocks might as well be reordered before blocks that require I/O, to give the I/O the greatest chance of completing. This speeds up partially cached scans in early experiments. Experim ental
  23. Outer types of read streams • Raw fi les? •

    Relations but reading to temporarty bu ff ers? • Sockets? Future w ork
  24. Where can we write? • While reading happens all over

    the tree, the places that write data are few, and contained in just a few places: • Checkpointer: bu ff ers already sorted, read to be combined by write stream • Background writer • Regular backend, when evicting dirty bu ff ers • Non-shared temporary bu ff ers for CREATE INDEX and other bulk writes W IP
  25. pwrite(8KB) pwrite(8KB) pwrite(8KB) pwrite(8KB) pwrite(8KB) pwritev(128KB) pwritev(128KB) sync_file_range() pwritev(128KB) io_uring_enter()

    io_uring_enter() (O ffl oaded to worker) CHECKPOINT COPY CREATE INDEX W IP pwritev(128KB) pwritev(128KB) sync_file_range() pwritev(128KB)
  26. pwrite(8KB) pread(8KB) pwrite(8KB) pread(8KB) pwrite(8KB) pwritev(128KB) preadv(128KB) pwritev(128KB) preadv(128KB) pwritev(128KB)

    preadv(128KB) pwritev(128KB) preadv(128KB) io_uring_enter() io_uring_enter() (O ffl oaded to worker) VACUUM W IP
  27. ReadStream used by a consumer like VACUUM that dirities the

    buffers WriteStream inside BufferStrategy, combining writes buffers that are often sequential block numbers. Strategy write-behind Experim ental
  28. Controlling OS buffered writeback rate • When using bu ff

    ered I/O on Linux, PostgreSQL calls sync_ fi le_range() after writing {backend,checkpoint,bgwriter}_ fl ush_after bytes • This is necessary because Linux is very lazy about performing writes, so we have to insist; this is sort of the opposite of POSIX_FADV_WILLNEED! • We can subsume the existing “WritebackContext” mechanism into WriteStream W IP
  29. • v17 BETA 1 has basic synchronous ReadStream support (by

    me), currently used only by ANALYZE (Nazir Bilal Yavuz) and heapam Sequential Scan (Melanie Plageman) • Several more ReadStream user patches will be proposed for v18 • Can you see more places that can be streamified? (Access methods, …) • Current streaming benefits only uses synchronous I/O; in future proposals we might call that mode io_method=synchronous (?) • Andres Freund is working on true asynchronous subsystem (where all these ideas started); little/no changes should be needed to users of ReadStream to benefit from io_method=worker,io_uring
  30. fin