Investigating Multithreaded PostgreSQL

Thomas Munro

May 17, 2025

Transcript

  1. Thomas Munro Free range open source database hacker at Microsoft

    pgconf.dev 2025 Montréal Investigating Multithreaded PostgreSQL Concepts | Architectures | WIP
  2. A long time ago in a galaxy far, far away…

    • Some Unixen had developed their own incompatible threading APIs, some had none, so portability was thorny
    • POSIX standardised <pthread.h> in 1995
    • Windows’ threading API appeared in 1993
    • C standardised <threads.h> in 2011, but it is still missing from at least one important system
    - The Implementation of POSTGRES, Stonebraker et al, covering 1985-1990
  3. 1: Start a process by forking/cloning the current process
     2: Start a different executable in a child process
     3: Start a thread in the current process

    POSIX: 1: fork(); 2: fork() + exec(), vfork() + exec(), posix_spawn(); 3: pthread_create()
    Win32: 1: (none); 2: CreateProcess(); 3: CreateThread()
  4. • The fork() concept existed already in other systems, but early Unix simplified it to the extreme

    • The simple interface led to complex interactions with other features, slowness, and new variants like vfork(), rfork(), clone()
    • New systems should offer #2 and #3 only: a posix_spawn()-style interface for subprograms and otherwise threads
  5. fork(): Create a new process like the parent

    • Returns child PID in parent process, 0 in child process
    • Copies* MAP_PRIVATE mappings (code, variables, stack, heap…)
    • Shares MAP_SHARED mappings
    • Duplicates file descriptors
    • Other kernel resources and properties are … complicated
    *see next…
    [Diagram, labelled 1: parent and child each show /bin/foo, /lib/libfoo.so, heap, shmem, and descriptor 3 → file]
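
The private-versus-shared behaviour listed above can be seen in a few lines of C. This is a minimal sketch, not PostgreSQL code: fork_demo() and both counters are hypothetical names, and it assumes a POSIX system that provides MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ordinary global: lives in a private mapping, so the child gets a copy. */
static int private_counter = 0;

/*
 * Fork once, let the child increment both counters, then report what the
 * parent observes.  Returns 0 on success, -1 on failure.
 * (Hypothetical helper for illustration only.)
 */
int
fork_demo(int *private_seen, int *shared_seen)
{
    int        *shared_counter;
    int         status;
    pid_t       pid;

    assert(private_seen != NULL && shared_seen != NULL);

    /* Anonymous MAP_SHARED memory stays shared across fork(). */
    shared_counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_counter == MAP_FAILED)
        return -1;
    *shared_counter = 0;
    private_counter = 0;

    pid = fork();
    if (pid < 0)
    {
        munmap(shared_counter, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        /* Child: fork() returned 0.  This write hits the child's own copy... */
        private_counter++;
        /* ...while this one lands in memory the parent can also see. */
        (*shared_counter)++;
        _exit(0);
    }

    /* Parent: fork() returned the child's PID. */
    if (waitpid(pid, &status, 0) != pid)
    {
        munmap(shared_counter, sizeof(int));
        return -1;
    }
    *private_seen = private_counter;
    *shared_seen = *shared_counter;
    munmap(shared_counter, sizeof(int));
    return 0;
}
```

After a successful call the parent sees the private counter unchanged and the shared counter incremented once, because only the MAP_SHARED page is common to both processes.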
  6. Copy-on-write

    • In PDP Unix, all process memory was literally copied
    • Since the VAX era: copy-on-write (see also: overcommit)
    • The page table is still often copied on fork() and occupies memory
    • Linux huge pages share page tables*, but work is ongoing for default pages
    • Number of pages involved depends on configuration, libraries, etc., and can be very large!
    *perhaps not as well as it could?
  7. Simulating fork() on Windows: Enough to make PostgreSQL work if you’re lucky

    • The only way to create a child process is CreateProcess(), which shares selected handles but not memory with the child
    • The contents of the important global variables have to be restored by hand, libraries initialize themselves from scratch, etc…
    • The addresses of private mappings may be different, but we don’t care about those
    • The main shared memory region must be at the same address in all backends, so we jump through a number of hoops, even retrying
    • Reported to add ~40ms to parallel queries with tiny memory map, probably much more depending on configured libraries and page count
    [Diagram, labelled 1: parent and child each show foo.exe, foo.dll, heap, shmem, and descriptor 3 → file]
  8. fork() + exec(): Create a new process to run a different program

    • In pre-VAX Unixen, very slow due to wasted temporary copy
    • Even today, if page tables can’t be shared, must perform copying in proportion to the number of pages, just to throw the copy away
    [Diagram, labelled 2: parent shows /bin/foo, /lib/libfoo.so, heap, shmem, and descriptor 3 → file; the child's copy is replaced by /bin/bar and /lib/libbar.so]
  9. vfork() + exec(): Create a new process to run a different program… faster

    • Borrow the parent’s memory map to skip useless overheads
    • Child is only allowed to call exec() or exit(), and control doesn’t return to the parent until then (among other reasons, it shares the same stack so concurrency can’t work)
    • POSIX removed vfork() and supplied posix_spawn(), but vfork() lives on as an implementation detail called by many libc implementations of system(), popen(), posix_spawn()
    [Diagram, labelled 2: parent shows /bin/foo, /lib/libfoo.so, heap, shmem, and descriptor 3 → file; the child runs /bin/bar and /lib/libbar.so]
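
For comparison, starting a different program with posix_spawn() might look like the following. This is a sketch under assumptions: spawn_and_wait() is a hypothetical helper, and it assumes /bin/sh exists on the system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Run "/bin/sh -c <command>" in a child process and return its exit status,
 * or -1 if it could not be started or did not exit normally.
 * (Hypothetical helper for illustration only.)
 */
int
spawn_and_wait(const char *command)
{
    char *const argv[] = {"sh", "-c", (char *) command, NULL};
    pid_t       pid;
    int         status;

    assert(command != NULL);

    /* posix_spawn() reports failure through its return value, not errno. */
    if (posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ) != 0)
        return -1;

    /* The caller never runs any code in the child: no fork() semantics. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The two NULL arguments are the file-actions and attributes objects, which is where descriptor and signal setup for the child would go instead of code running between fork() and exec().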
  10. pthread_create(): Create a new thread of execution in the current process

    • Identified by opaque pthread_t handles
    • Global variables, file descriptors accessible (with significant new problems, see later)
    • Signals delivered to a process are handled by any thread not blocking, but can also be sent to a pthread_t within the process
    • Context switching between those threads may be more efficient
    [Diagram, labelled 3: a single process shows /bin/foo, /lib/libfoo.so, heap, shmem, and descriptor 3 → file]
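
A small sketch of the thread case, where a global is visible to every thread and therefore needs a lock. run_threads(), add_many() and the counter are hypothetical names for illustration, not anything from PostgreSQL.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_DEMO_THREADS 16

/* One copy of each global, visible to every thread in the process. */
static long shared_total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread entry point: add 1 to the shared total *arg times. */
static void *
add_many(void *arg)
{
    long        n = *(long *) arg;

    for (long i = 0; i < n; i++)
    {
        pthread_mutex_lock(&total_lock);
        shared_total++;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

/*
 * Start nthreads threads that each perform `increments` additions, wait for
 * them, and return the total.  Returns -1 if a thread could not be created.
 */
long
run_threads(int nthreads, long increments)
{
    pthread_t   threads[MAX_DEMO_THREADS];
    int         started = 0;
    int         failed = 0;

    assert(nthreads > 0 && nthreads <= MAX_DEMO_THREADS);
    shared_total = 0;

    for (int i = 0; i < nthreads; i++)
    {
        if (pthread_create(&threads[i], NULL, add_many, &increments) != 0)
        {
            failed = 1;
            break;
        }
        started++;
    }

    /* Join whatever was started, even on partial failure. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return failed ? -1 : shared_total;
}
```

Without the mutex the increments would race, which is one small instance of the new problems that shared globals bring.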
  11. C11 threads

    • Missing on macOS, but clues spotted that it is coming
    • Available in MSVC 2022, but missing in MinGW (?)
    • pthread_barrier_t equivalent missing (but easily implemented)
    • Some static initialisers missing, despite being implementable on POSIX and Windows
  12. C11 atomics

    • Widely supported, but needs MSVC 2022 on Windows
    • All of port/atomics.h except spin lock delays can be redirected to <stdatomic.h>, deleting a lot of code
    • Determining performance, and correctness implications of generated code, on many systems would take some effort, once we’re ready to consider C11
    (Orthogonal really, just nearby)
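
To give a feel for what redirecting to <stdatomic.h> could mean, here is a sketch of two thin wrappers over C11 atomics. The sketch_ names are hypothetical and are not the port/atomics.h API; only the <stdatomic.h> calls are standard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper type: a struct around a C11 atomic 32-bit integer. */
typedef struct sketch_atomic_u32
{
    _Atomic uint32_t value;
} sketch_atomic_u32;

/* Atomically add `add` and return the value held before the addition. */
static inline uint32_t
sketch_fetch_add_u32(sketch_atomic_u32 *ptr, uint32_t add)
{
    return atomic_fetch_add(&ptr->value, add);
}

/*
 * Atomically replace the value with `newval` if it equals *expected.
 * On failure, *expected is updated to the value actually found.
 */
static inline bool
sketch_compare_exchange_u32(sketch_atomic_u32 *ptr, uint32_t *expected,
                            uint32_t newval)
{
    assert(expected != NULL);
    return atomic_compare_exchange_strong(&ptr->value, expected, newval);
}
```

Both wrappers use the default sequentially consistent ordering. Choosing weaker orderings per call site is where the performance and correctness study mentioned above would come in.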
  13. Processes vs asynchronous I/O: Cross-process I/O completion hurts

    • Backends need to be able to consume completions for I/Os started by other backends. io_uring can do that if you share the user space queue and descriptor.
    • Designers of other relevant APIs didn’t conceive of such madness:
      • I/O Completion Ports (Windows)
      • IoRing (Windows)
      • POSIX AIO (FreeBSD and maybe more)
    • You can make them work with enough engineering and some performance loss (prototypes exist), but…
  14. Other things you can’t do with processes

    • Windows’ futexes don’t work cross-process
    • macOS and some others don’t support unnamed semaphores cross-process (pshared=1 is optional in POSIX)
    • Ditto for POSIX mutexes, condition variables, barriers (PTHREAD_PROCESS_SHARED is optional in POSIX)
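
For reference, this is roughly what a process model needs from the optional PTHREAD_PROCESS_SHARED feature. A minimal sketch: pshared_demo() and SharedState are hypothetical names, and the function simply returns -1 where the platform refuses the attribute.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The mutex must itself live in memory that both processes map. */
typedef struct SharedState
{
    pthread_mutex_t lock;
    long        counter;
} SharedState;

/*
 * Parent and child each add 1 to a shared counter n times under a
 * process-shared mutex.  Returns the final count, or -1 on any failure
 * (including a platform that does not support PTHREAD_PROCESS_SHARED).
 */
long
pshared_demo(long n)
{
    pthread_mutexattr_t attr;
    SharedState *st;
    pid_t       pid;
    int         status;
    long        result = -1;

    assert(n >= 0);

    st = mmap(NULL, sizeof(SharedState), PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;

    if (pthread_mutexattr_init(&attr) != 0)
        goto unmap;
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&st->lock, &attr) != 0)
    {
        pthread_mutexattr_destroy(&attr);
        goto unmap;
    }
    pthread_mutexattr_destroy(&attr);
    st->counter = 0;

    pid = fork();
    if (pid >= 0)
    {
        /* Both processes run this loop against the same mutex. */
        for (long i = 0; i < n; i++)
        {
            pthread_mutex_lock(&st->lock);
            st->counter++;
            pthread_mutex_unlock(&st->lock);
        }
        if (pid == 0)
            _exit(0);
        if (waitpid(pid, &status, 0) == pid)
            result = st->counter;
    }
    pthread_mutex_destroy(&st->lock);

unmap:
    munmap(st, sizeof(SharedState));
    return result;
}
```

With threads, a default-initialised mutex in ordinary memory is enough, so none of this attribute setup or its portability caveat applies.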
  15. • 2007 overview of database architectural choices

    • Some of the following citations are thus surely obsolete by now… except the ones about us
  16. Realistic thread model

    [Diagram: one process running on cores, containing several socket → thread → session chains]
  17. • Mostly obsolete* concept from uniprocessor days, generalized to the so-called M:N thread model on top of kernel threads

    • Context switching was managed by libc or the application, with obsolete getcontext() etc functions or horrible non-portable code
    • Windows has “fibers”
    *as far as C is concerned
    [Diagram: Green thread 1 with stack main() foo() foox(); Green thread 2 with stack main() bar() barx()]
    Bad idea
  18. Code shown on slide:

    /*
     * A struct encapsulating some elements of a user's session.  For now this
     * manages state that applies to parallel query, but in principle it could
     * include other things that are currently global variables.
     */
    typedef struct Session
    {
        dsm_segment *segment;       /* The session-scoped DSM segment. */
        dsa_area   *area;           /* The session-scoped DSA area. */

        /* State managed by typcache.c. */
        struct SharedRecordTypmodRegistry *shared_typmod_registry;
        dshash_table *shared_record_table;
        dshash_table *shared_typmod_table;
    } Session;

     /* GUCs */
    -int io_method = DEFAULT_IO_METHOD;
    -int io_max_concurrency = -1;
    +postmaster_guc int io_method = DEFAULT_IO_METHOD;
    +postmaster_guc int io_max_concurrency = -1;

     /* global control for AIO */
    -PgAioCtl *pgaio_ctl;
    +pg_global PgAioCtl *pgaio_ctl;

     /* current backend's per-backend state */
    -PgAioBackend *pgaio_my_backend;
    +session_local PgAioBackend *pgaio_my_backend;
  19. Attempts using global → thread_local

    • Heikki’s current branch: https://github.com/hlinnaka/postgres/tree/threading
    • Postgres Pro prototype
    • CMU Peloton (also ported to C++, another thing Berkeley POSTGRES deferred)
    • Multiple projects to port to Windows via that route
    • A developer who gave up two years ago after reading “Features we don’t want” on our Wiki!
    • A report of a commercial product in Japan (anyone know what that is?)
    • I myself prototyped hack-grade parallel query with threads
    • Probably many more!
  20. [Diagram: three layouts, each running checkpointer, backend 1, …: under postmaster; under supervisor; under <system service manager>]
  21. WIP

  22. General goals

    • Support both process and thread mode for some time, so that extensions have time to adjust
    • Be fully portable
    • Stabilise, study
  23. Non-goals for prototype: In other words, be as much like process model as possible

    • Sharing file descriptors
    • Sharing relcache, syscache
    • Sharing MemoryContexts between backends
    • Removing DSM, DSA
    • Removing all the serialization of state for parallel query workers
    • Removing the fake signal system from Windows backends
  24. pg_threads.h: Minimal harmonisation of POSIX and Windows APIs

    • Even if we adopted other parts of C11, it seems a bit too soon to use <threads.h>
    • Use C11 as naming guide, but add pg_ prefixes
    • Require a way to implement pg_thread_local
    • Add strangely missing static initializer macros (eg PG_MTX_STATIC_INIT)
    • Patch previously proposed (CF #5194), will repost improved update soon
    • Works out net zero-ish in line count because it obsoletes thread portability wrappers in pgbench, ecpg, libpq; more such opportunities exist
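
To make the idea concrete, a harmonisation header in this spirit could look like the sketch below. This is not the proposed patch. Only the names pg_thread_local and PG_MTX_STATIC_INIT come from the slide. pg_mtx_t, pg_mtx_lock(), pg_mtx_unlock() and bump_counters() are assumptions made for illustration, and the Windows branch assumes SRW locks as the backing primitive.

```c
#include <assert.h>
#include <stddef.h>

/* Thread-local storage: compiler-specific spelling (assumed choices). */
#ifdef _MSC_VER
#define pg_thread_local __declspec(thread)
#else
#define pg_thread_local __thread
#endif

#ifdef _WIN32
#include <windows.h>
/* Assumption: an SRW lock stands in for a mutex, since it has a static initializer. */
typedef SRWLOCK pg_mtx_t;
#define PG_MTX_STATIC_INIT SRWLOCK_INIT

static inline int
pg_mtx_lock(pg_mtx_t *mtx)
{
    AcquireSRWLockExclusive(mtx);
    return 0;
}

static inline int
pg_mtx_unlock(pg_mtx_t *mtx)
{
    ReleaseSRWLockExclusive(mtx);
    return 0;
}
#else
#include <pthread.h>
typedef pthread_mutex_t pg_mtx_t;
#define PG_MTX_STATIC_INIT PTHREAD_MUTEX_INITIALIZER

static inline int
pg_mtx_lock(pg_mtx_t *mtx)
{
    return pthread_mutex_lock(mtx) == 0 ? 0 : -1;
}

static inline int
pg_mtx_unlock(pg_mtx_t *mtx)
{
    return pthread_mutex_unlock(mtx) == 0 ? 0 : -1;
}
#endif

/* Usage: a statically initialised lock, a shared global, a per-thread global. */
static pg_mtx_t counter_lock = PG_MTX_STATIC_INIT;
static int  global_counter = 0;
static pg_thread_local int my_counter = 0;

/* Bump both counters; returns the calling thread's own count. */
int
bump_counters(void)
{
    int         rc;

    my_counter++;               /* no lock needed: one copy per thread */
    rc = pg_mtx_lock(&counter_lock);
    assert(rc == 0);
    global_counter++;
    rc = pg_mtx_unlock(&counter_lock);
    assert(rc == 0);
    return my_counter;
}
```

The static initializer is what lets counter_lock be usable before any startup code runs, which is the gap in <threads.h> that the slide calls strangely missing.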
  25. Removing non-MT-safe code

    • Many patches committed, eg:
      • Parser thread-safety
      • Removing dependencies on the global locale, preferring _l() functions, various workarounds
      • Using _r() functions
      • Removing static buffers
    • Work continues!
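
Two of the patterns above, in miniature. This is an illustrative sketch with hypothetical function names (format_id_unsafe(), format_id(), count_words()), not code from any of the committed patches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Not thread-safe: every caller shares one static buffer. */
const char *
format_id_unsafe(int id)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "id-%d", id);
    return buf;
}

/* Thread-safe replacement: the caller supplies the buffer. */
char *
format_id(int id, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "id-%d", id);
    return buf;
}

/*
 * Count space-separated words using strtok_r(), which keeps its position
 * in the caller's saveptr instead of in hidden global state like strtok().
 * Returns -1 if the text does not fit the local working copy.
 */
int
count_words(const char *text)
{
    char        copy[256];
    char       *saveptr;
    int         n = 0;

    if (strlen(text) >= sizeof(copy))
        return -1;
    strcpy(copy, text);         /* strtok_r() modifies its input */

    for (char *tok = strtok_r(copy, " ", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, " ", &saveptr))
        n++;
    return n;
}
```

In both cases the fix is the same move: state that used to hide in a static or global is handed to the function by the caller.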
  26. Signalectomy

    • SendInterrupt() proposal that will remove ProcSignal, CF #5118
    • Experimental work tries other wakeup mechanisms for interrupts (pipes, futexes, custom io_uring, kqueue, iocp user events)
    • Many cases of ad hoc signals are documented at: https://wiki.postgresql.org/wiki/Signals