on python-ideas
• Whirlwind of discussion relating to new async APIs over October
• Twisted folk were involved
• Outcome:
  o PEP 3156: Asynchronous I/O Support Rebooted
  o Tulip
• Both spearheaded by Guido
o No need to check if something is ready for reading or writing
o Underlying network scaffolding code does that for you
o Invokes your completion-oriented methods when appropriate
• …but UNIX is inherently readiness-oriented
• Quick summary of UNIX I/O:
  o read() and write():
    - No data available for reading? Block!
    - No buffer space left for writing? Block!
  o Not suitable when serving more than one client
  o (A blocked process is only unblocked when data is available for reading or buffer space is available for writing)
• How do you serve multiple clients?
o Single server process sits in an accept() loop
o fork() a child process to handle each new connection
o One process per connection; doesn't scale well
• Threadpools, one thread per connection
  o Popular with Java, late 90s/early 00s
  o Simplified programming logic
  o Client classes could issue blocking reads/writes
  o Only the blocking thread would be suspended
  o Still has scaling issues
• Single-threaded server, non-blocking I/O
  o Sockets set to non-blocking
    - Allows you to inquire whether a read or write would block ("readiness")
    - …and avoid it if so (and move on to the next client)
  o Requires an I/O multiplexing method
    - The ability to query the readiness of multiple sockets at once
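A minimal Python sketch (not from the talk) of the readiness idea above: a non-blocking socket refuses to block, so the caller has to ask a multiplexer such as select() when it is safe to read or write. The host/port and timeout are arbitrary placeholders.

    import select
    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.setblocking(False)            # reads/writes now fail instead of blocking

    try:
        data = sock.recv(4096)         # nothing has arrived yet...
    except BlockingIOError:
        # "Would block" -- so instead of blocking, wait for readiness:
        readable, writable, _ = select.select([sock], [sock], [], 5.0)
        if readable:
            data = sock.recv(4096)     # the kernel says this won't block now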
• select()
  o 4.2BSD (1984)
  o Pass in a set of file descriptors you're interested in (reading/writing/exceptional conditions)
  o Set of file descriptors = bit fields in an array of integers
  o Fine for small sets of descriptors; didn't scale well
• poll()
  o AT&T System V (1983)
  o Pass in an array of "pollfds": file descriptor + interested events
  o Scales a bit better than select()
• Both methods had O(n) kernel (and user) overhead
  o The entire set of fds you're interested in is passed to the kernel on each invocation
  o The kernel has to enumerate all fds – also O(n)
  o …and you have to enumerate all results – also O(n)
  o Expensive when you're monitoring tens of thousands of sockets and only a few are "ready"; you still need to enumerate your entire set to find the ready ones
handle thousands of simultaneous clients
• select()/poll() becoming bottlenecks
• The C10K problem (Kegel)
• Lots of seminal papers started coming out
• Notable:
  o Banga et al:
    - "A Scalable and Explicit Event Delivery Mechanism for UNIX"
    - June 1999 USENIX, Monterey, California
o FreeBSD: kqueue
o Linux: epoll
o Solaris: /dev/poll
• One thing they had in common: separate declaration of interest from inquiry about readiness
  o Register the set of file descriptors you're interested in ahead of time
  o The kernel gives you back an identifier for that set
  o You pass in that identifier when querying readiness
• Benefits:
  o Kernel work when checking readiness is now O(1)
• epoll and kqueue quickly became the preferred methods for I/O multiplexing
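A rough Python sketch of that "declare interest ahead of time, then query readiness" model, using the standard library's select.epoll wrapper (Linux-only; the port number is arbitrary):

    import select   # select.epoll is only available on Linux
    import socket

    server = socket.socket()
    server.bind(("localhost", 8080))
    server.listen(128)
    server.setblocking(False)

    ep = select.epoll()                          # epoll_create(): a kernel-side interest set
    ep.register(server.fileno(), select.EPOLLIN) # declare interest once, ahead of time

    while True:
        # Only ready descriptors come back; we don't hand the kernel the whole
        # set on every call the way select()/poll() force us to.
        for fd, events in ep.poll(1.0):
            if fd == server.fileno() and events & select.EPOLLIN:
                conn, _ = server.accept()        # reported ready: won't block
                conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)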
node.js…
• All single-threaded, all use non-blocking sockets
• An event loop ties everything together
  o It's literally an endless loop that runs until program termination
  o Calls an I/O multiplexing method upon each "run" of the loop
    - epoll/kqueue preferred; fall back to poll, then select
  o Enumerates the entire set of file descriptors
    - Data ready for reading without blocking? Great! read() it, then invoke the relevant protocol.data_received()
    - Data can be written without blocking? Great! Write it!
    - Nothing to do? Fine, skip to the next file descriptor.
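A condensed, illustrative sketch of such a loop (not Tulip's or Twisted's actual implementation): a single thread multiplexes readiness with select() and drives a completion-oriented protocol.data_received() callback. The Echo class and port are made up for the example.

    import select
    import socket

    class Echo:
        # Completion-oriented protocol: the loop calls us once data has arrived.
        def data_received(self, sock, data):
            sock.send(data)       # echo it back; a real loop would handle short writes

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 8080))
    server.listen(128)
    server.setblocking(False)

    connections = {}              # fd -> (socket, protocol instance)

    while True:                   # the event loop: runs until program termination
        fds = [server.fileno()] + list(connections)
        readable, _, _ = select.select(fds, [], [])     # readiness inquiry
        for fd in readable:
            if fd == server.fileno():
                conn, _ = server.accept()
                conn.setblocking(False)
                connections[conn.fileno()] = (conn, Echo())
            else:
                conn, proto = connections[fd]
                data = conn.recv(4096)                  # won't block: fd was ready
                if data:
                    proto.data_received(conn, data)     # completion-oriented callback
                else:
                    del connections[fd]
                    conn.close()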
Completion-oriented protocol classes
• Implementation details:
  o Single-threaded* server
  o + non-blocking sockets
  o + event loop
  o + I/O multiplexing method
  o = asynchronous I/O!
• [*] Not entirely true; separate threads are used, but only to encapsulate blocking calls that can't be done in a non-blocking fashion. They're still subject to the GIL.
+ non-blocking socket + event loop + I/O multiplex via kqueue/epoll
  o Well suited to Linux, BSD, OS X
  o …but there's actually nothing asynchronous about it at all
• What about other operating systems?
  o Windows NT 3.x (mid 90s)
    - Overlapped I/O (facilitated asynchronous I/O)
    - I/O Completion Ports (IOCP)
    - Kernel/Executive architecture promoted tight coupling between threads, I/O and synchronization primitives
  o AIX 5.3 (2004)
    - Implemented IOCP (API identical to Windows): CreateIoCompletionPort, GetQueuedCompletionStatus, etc.
    - Coupled it with AIO
  o Solaris 10 (2005)
    - Event ports
    - Same sort of goal, simpler (more UNIX-like) interface
Dave Cutler (et al)
• "Despised all things UNIX"
  o On the UNIX process I/O model: "getta byte, getta byte, getta byte byte byte"
• Got a call from Bill Gates in the late 80s
  o "Wanna build a new OS?"
• Led development of Windows NT
• Vastly different approach to threading, kernel objects, synchronization primitives and I/O mechanisms (versus POSIX/UNIX)
• (What works well on UNIX does not work well on Windows, and vice versa.)
• Threads have been first-class citizens since day 1 (not bolted on as an afterthought)
• Designed to be programmed in a completion-oriented fashion
• Overlapped I/O + IOCP + threads + kernel synchronization primitives = an excellent combination for achieving high performance
not to do…
• Penultimate place:
  o One thread per connection, blocking I/O calls
• Tied for last place:
  o accept() -> fork()
  o Single thread, non-blocking sockets, event loop, I/O multiplexing system call
• The best option on Linux/BSD is the absolute worst option on Windows
  o Windows doesn't have a kqueue/epoll equivalent (nor should it)
  o So you're stuck with select()…
on Windows!
• And we're using it in a single thread, with non-blocking sockets, via an event loop, in an entirely readiness-oriented fashion…
• All in an attempt to simulate asynchronous I/O…
• So we can drive completion-oriented protocols…
• Instead of using the native Windows facilities for achieving high-performance asynchronous I/O…
• …keeping in mind that these native facilities are already inherently completion-oriented?
posited some alternate implementation ideas for asynchronous I/O on python-ideas that were better suited to Windows (and AIX and Solaris):
  o Keep the completion-oriented APIs
  o Use Vista+ threadpool and IOCP facilities in lieu of a select() event loop
• I actually had an even more radical long-term goal in mind:
  o Oh, while we're at it: come up with a way for these threads, executing IOCP callbacks, to actually run Python code concurrently, across multiple cores, all within the same process/interpreter
  o i.e. solve the GIL issue
• But the proposal was far-fetched enough as it was, so I kept that part to myself
• Response: predominantly skepticism, one or two lukewarm, the rest uninterested
up my sleeve, so I decided to work on this full-time
• The aim was simple:
  o Keep the completion-oriented protocol classes
  o Focus on exploiting the stateless nature of the vast majority of TCP/IP services (an HTTP server is a perfect example)
  o Leverage contemporary (Vista+) techniques for handling socket I/O:
    - Vista thread pools
    - Interlocked facilities
    - IOCP and overlapped I/O
    - AcceptEx, DisconnectEx, TransmitFile, etc.
  o Figure out a way to get around the GIL, so that the callbacks could be executed within the same Python interpreter, on multiple threads, across multiple cores, concurrently
    - Without impeding the performance of normal single-threaded code (as previous GIL removal attempts did)
working over the course of about three months
• Two distinct parts to the work:
  o "Parallelizing" the interpreter
    - Allowing multiple threads to run CPython internals concurrently
    - …without removing the GIL or impeding single-threaded performance
  o An asynchronous API exposed to Python code
    - Leverages the parallel facilities above
    - Allows code to execute concurrently across all cores
    - Tight integration with platform support for asynchronous I/O
• I'd had a vague idea of how to go about the parallel aspect for a few years
• The async discussions on python-ideas provided the motivation to tie both things together
• Submission of arbitrary "work":
  o Calls func(args, kwds) from a parallel thread
• Submission of timers:
  o Calls func(args, kwds) from a parallel thread some 'time' in the future, or every interval
• Submission of "waits":
  o Calls func(args, kwds) from a parallel thread when 'obj' is signalled
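A sketch of how these submissions look from Python. async.submit_work() and async.run() appear in later examples in the talk; the timer and wait calls shown in comments use assumed names and signatures, not confirmed PyParallel API.

    import async  # PyParallel's async module (requires the PyParallel interpreter)

    def work():
        pass      # executes in a parallel thread, concurrently with other callbacks

    # Arbitrary "work": call work() from a parallel thread.
    async.submit_work(work)

    # Timers and waits follow the same shape; the names/signatures below are
    # assumptions for illustration only:
    #   async.submit_timer(5.0, work)         # call work() 5 seconds from now
    #   async.submit_wait(some_object, work)  # call work() when some_object is signalled

    async.run()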
• Interlocked singly-linked lists:
  o InterlockedFlushSList()
  o QueryDepthSList()
  o InterlockedPushEntrySList()
  o InterlockedPushListSList()
  o InterlockedPopEntrySList()
• Critical sections:
  o InitializeCriticalSectionAndSpinCount()
  o EnterCriticalSection()
  o LeaveCriticalSection()
  o TryEnterCriticalSection()
• Slim reader/writer (SRW) locks:
  o AcquireSRWLockShared()
  o AcquireSRWLockExclusive()
  o ReleaseSRWLockShared()
  o ReleaseSRWLockExclusive()
  o TryAcquireSRWLockExclusive()
  o TryAcquireSRWLockShared()
• One-time initialization:
  o InitOnceBeginInitialize()
  o InitOnceComplete()
o CreateEvent()
o SetEvent()
o WaitForSingleObject()
o WaitForMultipleObjects()
o SignalObjectAndWait()
• Thread pool facilities (Vista+):
  o TrySubmitThreadpoolCallback()
  o StartThreadpoolIo()
  o CloseThreadpoolIo()
  o CancelThreadpoolIo()
  o DisassociateCurrentThreadFromCallback()
  o CallbackMayRunLong()
  o CreateThreadpoolWait()
  o SetThreadpoolWait()
o AcceptEx()
o WSAEventSelect(FD_ACCEPT)
o DisconnectEx(TF_REUSE_SOCKET)
o Overlapped WSASend()
o Overlapped WSARecv()
o Tight integration between async socket I/O, I/O completion ports and the threadpool facilities (StartThreadpoolIo() etc.)
• Future enhancements with Registered I/O (Windows 8+)
• Main takeaway:
  o All of that stuff is very useful, and used by PyParallel
  o I didn't have to write any of it myself; I could concentrate on the problem at hand
  o I wouldn't have had that luxury if I were trying to prototype on Linux/BSD
work:
o No GIL removal
  - This was previously tried and rejected
  - Required fine-grained locking throughout the interpreter
  - Mutexes are expensive
  - Single-threaded execution became significantly slower
o Not using PyPy's approach via software transactional memory (STM)
  - Huge overhead
  - 64 threads trying to write to something: 1 wins and continues
  - 63 keep trying
  - 63 bottles of beer on the wall…
• Doesn't support "free threading"
  o Existing code using threading.Thread won't magically run on all cores
  o You need to use the new async APIs
• Main-thread execution
  o In comparison to existing Python: the thing that runs when the GIL is held
  o Only runs when parallel contexts aren't executing
• Parallel contexts
  o Created in the main thread
  o Only run when the main thread isn't running
  o Read-only visibility of the global namespace established in the main thread
• Common phrases:
  o "Is this a main-thread object?"
  o "Are we running in a parallel context?"
  o "Was this object created from a parallel context?"
for the `work` callback
• async.run()
  o The main thread suspends
  o Parallel contexts are allowed to run
  o Automatically executed across all cores (when sufficient work permits)
  o When all parallel contexts complete, the main thread resumes and async.run() returns
• 'a' = main-thread object
• 'b = a * 1'
  o Executed from a parallel context
  o 'b' = parallel-context object

    import async

    a = 1

    def work():
        b = a * 1

    async.submit_work(work)
    async.run()
• Multiple parallel contexts can run concurrently on separate cores
• Windows takes care of all the thread stuff for us:
  o Thread pool creation
  o Dynamically adjusting the number of threads based on load and physical cores
  o Cache/NUMA-friendly thread scheduling/dispatching
• Parallel threads execute the same interpreter, the same ceval loop, the same view of memory as the main thread, etc.
• But the CPython interpreter isn't thread safe!
  o Global statics used frequently (free lists)
  o Reference counting isn't atomic
  o Objects aren't protected by locks
  o Garbage collection definitely isn't thread safe
    - You can't have one thread performing a GC run, deallocating objects, whilst another thread attempts to access said objects concurrently
  o Creation of interned strings isn't thread safe
  o The bucket memory allocator isn't thread safe
  o The arena memory allocator isn't thread safe
• The interpreter assumes it's the only thread running (if it has the GIL held)
• The only possible way of allowing multiple threads to run the same interpreter concurrently would be to add fine-grained locking to all of the above
• This is what Greg Stein did ~13 years ago
  o Introduced fine-grained locks in lieu of a Global Interpreter Lock
  o Locking/unlocking introduced huge overhead
  o Single-threaded code was 40% slower
serves a very useful purpose
• Instead, intercept all thread-sensitive calls:
  o Reference counting: Py_INCREF/DECREF/CLEAR
  o Memory management: PyMem_Malloc/Free, PyObject_INIT/NEW
  o Free lists
  o Static C globals
  o Interned strings
• If we're the main thread, do what we normally do
• However, if we're a parallel thread, do a thread-safe alternative
thread, do X; if not, do Y
o X = thread-safe alternative
o Y = what we normally do
• "If we're a parallel thread"
  o Thread-sensitive calls are ubiquitous
  o But we want to have a negligible performance impact
  o So the challenge is how quickly we can detect whether we're a parallel thread
  o The quicker we can detect it, the less overhead is incurred
    #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())

• What's so special about _Py_get_current_thread_id()?
  o On Windows, you could use GetCurrentThreadId()
  o On POSIX, pthread_self()
  o Unnecessary overhead (this macro will be everywhere)
• Is there a quicker way?
• Can we determine if we're running in a parallel context without needing a function call?
    { \
    +     if (Py_PXCTX) \
    +         _Px_ForgetReference(op); \
    +     else \
    +         _Py_INC_TPFREES(op); \
    +     break; \
    + } while (0)
    +
    +#endif /* WITH_PARALLEL */

• The quicker way: Py_PXCTX == (Py_MainThreadId == __readfsdword(0x48)) – no function call needed
• Overhead is reduced to a couple of extra instructions and an extra branch (the cost of which can be eliminated by branch prediction)
• That's basically free compared to STM or fine-grained locking
Py_PXCTX for normal single-threaded code
  o GIL removal: 40% overhead
  o PyPy's STM: "2x-to-5x slower"
• Only touches a relatively small amount of code
  o No need for intrusive surgery like rewriting a thread-safe bucket memory allocator or garbage collector
• Keeps GIL semantics
  o Important for legacy code
  o 3rd-party libraries, C extension code
• Code executing in a parallel context has full visibility of "main thread objects" (in a read-only capacity, thus no need for locks)
• Parallel contexts are intended to be shared-nothing
  o Full isolation from other contexts
  o No need for locking/mutexes
alternatives
• First step was attacking memory allocation
  o Parallel contexts have localized heaps
  o PyMem_MALLOC, PyObject_NEW etc. all get returned memory backed by this heap
  o Simple block allocator
    - Blocks of page-sized memory allocated at a time (4KB or 2MB)
    - Request for 52 bytes? Current pointer address returned, then advanced 52 bytes
    - Cognizant of alignment requirements
• What about memory deallocation?
  o Didn't want to write a thread-safe garbage collector
  o Or thread-safe reference counting mechanisms
  o And our heap allocator just advances a pointer along in blocks of 4096 bytes
    - Great for fast allocation
    - Pretty useless when you need to deallocate
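A conceptual Python sketch of that bump-pointer scheme (the real allocator is C inside the interpreter; this toy model only shows that allocation is a pointer bump and "deallocation" is throwing away the whole heap at once):

    class BumpHeap:
        """Toy model of a parallel context's block allocator (illustrative only)."""
        BLOCK_SIZE = 4096        # page-sized blocks (could also be 2MB large pages)
        ALIGNMENT = 8

        def __init__(self):
            self.blocks = [bytearray(self.BLOCK_SIZE)]
            self.offset = 0      # the "bump" pointer within the current block

        def malloc(self, size):
            # Round the request up to the alignment requirement.
            size = (size + self.ALIGNMENT - 1) & ~(self.ALIGNMENT - 1)
            if self.offset + size > self.BLOCK_SIZE:
                self.blocks.append(bytearray(self.BLOCK_SIZE))  # grab another block
                self.offset = 0
            block, start = self.blocks[-1], self.offset
            self.offset += size  # allocation = advance the pointer, nothing more
            return memoryview(block)[start:start + size]

        def free(self, ptr):
            pass                 # individual frees are no-ops

        def reset(self):
            # Blow away the entire heap in one step once the context has finished.
            self.blocks, self.offset = [bytearray(self.BLOCK_SIZE)], 0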
blocks are done from a single heap
o Allocated via HeapAlloc()
• These parallel contexts aren't intended to be long-running bits of code/algorithm
• So let's not free() anything…
• …and just blow away the entire heap via HeapFree(), with one call, once the context has finished
than the allocator)
o Good fit for the intent of parallel context callbacks:
  - Execution of stateless Python code
  - No mutation of shared state
  - The lifetime of objects created during the parallel context is limited to the duration of that context
• Cons:
  o You technically couldn't do this:

        def work():
            for x in xrange(0, 1000000000):
                …

  o (Why would you!)
first place?
• Because the memory for an object is released when the object's reference count goes to 0
• But we release all parallel context memory in one fell swoop once the context has completed
• And objects allocated within a parallel context can't "escape" out to the main thread
  o e.g. appending a string created in a parallel context to a list allocated from the main thread
• So… there's no point reference counting objects allocated within parallel contexts!
objects we may interact with?
• Well, all main thread objects are read-only
• So we can't mutate them in any way
• And the main thread doesn't run whilst parallel threads run
• So we don't need to worry about main thread objects being garbage collected whilst we're referencing them
• So… no need for reference counting of main thread objects when they're accessed within a parallel context!
of the parallel context's life
• And we don't do any reference counting anyway
• Then there's no possibility of circular references
• Which means there's no need for garbage collection!
• …things just got a whole lot easier!
incredibly simple
o Bump a pointer
o (Occasionally grab another page-sized block when we run out)
• Simple = fast
• Memory deallocation is done via one call: HeapFree()
• No reference counting necessary
• No garbage collection necessary
• Negligible overhead from the Py_PXCTX macro
• End result: Python code actually executes faster within parallel contexts than main-thread code
• …and it can run concurrently across all cores, too!
was allow the callbacks for completion-oriented protocols to execute concurrently

    import async

    class Disconnect:
        pass

    server = async.server('localhost', 8080)
    async.register(transport=server, protocol=Disconnect)
    async.run()

• Let's review some actual protocol examples
  o Keep in mind that all callbacks are executed in parallel contexts
  o If you have 8 cores and sufficient load, all 8 cores will be saturated
• We use AcceptEx() to pre-allocate sockets ahead of time
  o Reduces initial connection latency
  o Allows use of IOCP and thread pool callbacks to service new connections
  o Not subject to the serialization limits of accept() on POSIX
• And WSAEventSelect(FD_ACCEPT) to notify us when we need to pre-allocate more sockets
time a callback is run, memory is allocated
• Memory is only freed when the context is finished
• Contexts are considered finished when the client disconnects
• …that's not a great combo
PyObject_CallObject)
• It takes a "heap snapshot"
• Each snapshot is paired with a corresponding "heap rollback"
• Can be nested (up to 64 times):

    snapshot1 = heap_snapshot()
    snapshot2 = heap_snapshot()
    # do work
    heap_rollback(snapshot2)
    heap_rollback(snapshot1)
machinery
• A rollback simply rolls the heap's pointers back to where they were before the callback was invoked
• Side effect: very cache- and TLB-friendly
  o Two invocations of data_received(), back to back, essentially get identical memory addresses
  o All memory addresses will already be in the cache
  o And if not, they'll at least be in the TLB (a TLB miss can be just as expensive as a cache miss)
performance requirements/preferences:
o Low latency preferred
o High concurrency preferred
o High throughput preferred
• What control do we have over latency, concurrency and throughput?
• Asynchronous versus synchronous:
  o An async call has higher overhead compared to a synchronous call
    - IOCP involved
    - Thread dispatching upon completion
  o If you can perform a synchronous send/recv at the time, without blocking, that will be faster
• How do you decide when to do sync versus async?
as a connection is made
• Sends the next line as soon as that line has been sent
• …and the line after that as soon as the next line has been sent
• …and so on
• Always wants to send something
• PyParallel term for this: an "I/O hog"
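A sketch of what a chargen-style protocol could look like under this model. It assumes connection_made and send_complete callbacks and the "send by returning" convention described a little later; the exact PyParallel protocol API may differ, and next_line() is a simplified stand-in for real chargen output.

    import async  # requires the PyParallel interpreter

    def next_line(n, width=72):
        # Rotating window over the printable ASCII characters (simplified chargen).
        chars = bytes(range(0x21, 0x7f))
        start = n % len(chars)
        return (chars * 2)[start:start + width] + b"\r\n"

    class Chargen:
        def __init__(self):
            self.n = 0

        def connection_made(self, transport):
            return next_line(self.n)      # returning bytes is what sends them

        def send_complete(self, transport):
            self.n += 1
            return next_line(self.n)      # immediately send again: an "I/O hog"

    server = async.server('localhost', 8080)
    async.register(transport=server, protocol=Chargen)
    async.run()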
for PxSocket_Send, PxSocket_Recv
• Chargen forced a rethink
• If we have four cores but only one client connected, there's no need to do async sends
  o A synchronous send is more efficient
  o Affords lower latency, higher throughput
• But chargen always wants to do another send when the last send has completed
• If we're doing a synchronous send from within PxSocket_Send… doing another send will result in a recursive call to PxSocket_Send
• It won't take long before we exhaust our stack
single method that has all possible socket functionality inlined
• Single function = single stack = no stack exhaustion
• Allows us to dynamically choose the optimal I/O method (sync vs async) at runtime (see the sketch after this list)
  o If active client count < available CPU cores - 1: try sync first, fall back to async after X consecutive sync EWOULDBLOCKs
    - Reduced latency
    - Higher throughput
    - Reduced concurrency
  o If active client count >= available CPU cores - 1: immediately do async
    - Increased latency
    - Lower throughput
    - Better concurrency
• We also detect how many active I/O hogs there are (in total), and whether this protocol is an I/O hog, and factor that into the decision
• Protocols can also provide a hint:

    class HttpServer:
        concurrency = True

    class FtpServer:
        throughput = True
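Expressed as pseudocode, the runtime decision described above might look roughly like the following; the function and parameter names (active_clients, ncpus, active_io_hogs and so on) are placeholders for illustration, not PyParallel's actual internals.

    def choose_io_method(active_clients, ncpus, is_io_hog, active_io_hogs,
                         consecutive_would_blocks, max_would_blocks=8):
        """Illustrative paraphrase of the sync-vs-async heuristic."""
        # Few clients and spare cores: a synchronous call is cheaper
        # (no IOCP round trip, no thread dispatch on completion).
        if active_clients < ncpus - 1:
            # ...unless we keep hitting EWOULDBLOCK, or I/O hogs would
            # monopolize the loop and starve other clients.
            if consecutive_would_blocks >= max_would_blocks:
                return "async"
            if is_io_hog and active_io_hogs >= ncpus - 1:
                return "async"
            return "sync"
        # Otherwise favour concurrency: post asynchronous operations and let
        # the IOCP/thread pool machinery dispatch completions across cores.
        return "async"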
explicit send/write, i.e.
o No transport.write(data) like with Tulip/Twisted
• You "send" by returning a "sendable" Python object from the callback:
  o PyBytesObject
  o PyByteArray
  o PyUnicode
• Supporting only these types allows for a cheeky optimisation:
  o The WSABUF's len and buf members are pointed at the relevant fields of the above types; no copying into a separate buffer needs to take place
all your data at once (not a bad thing), not trickle it out through multiple write()/flush() calls
• Forces you to leverage send_complete() if you want to send data back-to-back (like chargen)
• send_complete() clarification:
  o What it doesn't mean: the other side got it
  o What it does mean: the send buffer is empty (the data became bytes on a wire)
  o What it implies: you're free to send more data if you've got it; it won't block
• Nice side effects of this arrangement:
  o No need to buffer anything internally
  o No need for producer/consumer relationships like in Twisted/Tulip (pause_producing()/stop_consuming())
  o No need to deal with buffer overflows when you're trying to send lots of data to a slow client – the protocol essentially buffers itself automatically
  o Keeps a tight rein on memory use
  o Will automatically adapt, from trickling bytes over a slow link to completely saturating a fast one
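A minimal sketch of the "send by returning" convention, assuming the data_received callback mentioned earlier (details of the real PyParallel API may differ):

    import async  # requires the PyParallel interpreter

    class Echo:
        def data_received(self, transport, data):
            # No transport.write(): returning a bytes/bytearray/str object
            # is what causes it to be sent back to the client.
            return data

    server = async.server('localhost', 8080)
    async.register(transport=server, protocol=Echo)
    async.run()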
demo coming up:
o One python_d.exe process
o Constant memory use
o CPU use proportional to concurrent client count (1 client = 25% CPU use)
o Every 10,000 sends, a status message is printed
• Depicts dynamically switching from synchronous sends to async sends
• Illustrates awareness of active I/O hogs
• Environment:
  o MacBook Pro, 8-core i7 2.2GHz, 8GB RAM
  o 1-5 netcat instances on OS X
  o Windows 7 instance running in Parallels, 4 cores, 3GB RAM
PyParallel…
• You're only sending 73 bytes at a time
• The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes)
  o A good simulator of real-world conditions, where the CPU time to process a client request would dwarf the I/O overhead of communicating the result back to the client
• With a default send socket buffer size of 8192 bytes and a local netcat client, you're never going to block during send()
• Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)
• This highlighted all sorts of problems I needed to solve before moving on to something more useful: the async HTTP server
http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/http/server.py
• The final piece of the async "proof of concept"
• PxSocket_IOLoop modified to optimally support TransmitFile
  o The Windows equivalent of POSIX sendfile()
  o Serves file content directly from the file system cache; very efficient
  o Tight integration with the existing IOCP/threadpool support
work highlighted a flaw in the thread-local redirection of interned strings and the heap snapshot/rollback logic
• I had already ensured that the static global string interning machinery was being intercepted and redirected to a thread-local equivalent when in a parallel context
• However, string interning involves memory allocation, which was being fulfilled from the heap associated with the active parallel context
• Interned strings persist for the life of the thread, though, whereas parallel context heap allocations get blown away when the client disconnects
previously implemented-then-abandoned support for a thread-local heap:

    PyAPI_FUNC(int)  _PyParallel_IsTLSHeapActive(void);
    PyAPI_FUNC(int)  _PyParallel_GetTLSHeapDepth(void);
    PyAPI_FUNC(void) _PyParallel_EnableTLSHeap(void);
    PyAPI_FUNC(void) _PyParallel_DisableTLSHeap(void);

• Prior to interning a string, we check whether we're in a parallel context; if we are, we enable the TLS heap, proceed with string interning, then disable it
• The parallel context's _PyHeap_Malloc() diverts to a thread-local equivalent while the TLS heap is enabled
• This ensures that interned strings are always backed by memory that isn't going to get blown away when a context disappears
    foo = []

    def work():
        timestamp = async.rdtsc()
        foo.append(timestamp)

    async.submit_work(work)
    async.run()

• That is, how do you handle either:
  o Mutating a main-thread object from a parallel context
  o Persisting a parallel context object beyond the life of its context
• That was the big showstopper for the entire three months
• I came up with numerous solutions that all eventually turned out to have flaws
all sorts of things in place all over the code base to try to detect/intercept the previous two occurrences
• I had an epiphany shortly after PyCon 2013 (when this work was first presented)
• The solution is deceptively simple:
  o Suspend the main thread before any parallel threads run
  o Just prior to suspension, write-protect all main thread pages
  o After all the parallel contexts have finished, return the protection to normal, then resume the main thread
• Seems so obvious in retrospect!
(write) to a main-thread allocated object, a general protection fault will be issued
• We can trap that via Structured Exception Handling (SEH)
  o (Equivalent to a SIGSEGV trap on POSIX)
• By placing the SEH __try/__except around the main ceval loop, we can instantly convert the trap into a Python exception and continue normal execution
  o "Normal execution" in this case being propagation of the exception back up through the parallel context's stack frames, like any other exception
• Instant protection against all main-thread mutations without needing to instrument *any* of the existing code
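Seen from Python code in a parallel context, the intended effect is roughly the following (a sketch with placeholder names; the exact exception type PyParallel raises for the trapped fault isn't specified here, so a broad except is used):

    import async

    shared = []                # main-thread object; read-only inside parallel contexts

    def work():
        try:
            shared.append(1)   # write hits a write-protected main-thread page...
        except Exception:
            # ...the fault is trapped around the ceval loop and surfaces here
            # as an ordinary Python exception (exact type depends on PyParallel).
            pass

    async.submit_work(work)
    async.run()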
(which essentially calls malloc() for everything)
• For VirtualProtect() calls to work efficiently, we need to know the base address ranges of main thread memory allocations – and that doesn't fit well with using malloc() for everything
  o Every pointer + size would have to be separately tracked and then fed into VirtualProtect() every time we wanted to protect pages
• Memory protection is a non-trivial expense
  o For each address passed in (base + range), the OS has to walk all affected page tables and alter the protection bits
• I employed two strategies to mitigate the overhead:
  o Separate memory allocation into two phases: reservation and commit
  o Use large pages
separate to committing it
• Reserved memory is free; no actual memory is used until you subsequently commit a range (from within the reserved range)
• This allows you to reserve, say, 1GB, which gives you a single base address pointer that covers the entire 1GB range
• …and only commit a fraction of that initially, say, 256KB
• This allows you to toggle write-protection on all main thread pages via a single VirtualProtect() call against that base address
• Added benefit: we can easily test the origin of an object by masking its address against the known base addresses
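A rough, Windows-only ctypes sketch of the reserve/commit/protect sequence described above (illustrative only: PyParallel does this in C, and a real implementation would track the committed size rather than hard-coding the numbers below):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.VirtualAlloc.restype = ctypes.c_void_p
    kernel32.VirtualAlloc.argtypes = (ctypes.c_void_p, ctypes.c_size_t,
                                      wintypes.DWORD, wintypes.DWORD)
    kernel32.VirtualProtect.argtypes = (ctypes.c_void_p, ctypes.c_size_t,
                                        wintypes.DWORD, ctypes.POINTER(wintypes.DWORD))

    MEM_RESERVE, MEM_COMMIT = 0x2000, 0x1000
    PAGE_READWRITE, PAGE_READONLY = 0x04, 0x02

    RESERVE = 1 << 30      # reserve 1GB of address space: costs nothing yet
    COMMIT = 256 * 1024    # commit only 256KB of it up front

    base = kernel32.VirtualAlloc(None, RESERVE, MEM_RESERVE, PAGE_READWRITE)
    kernel32.VirtualAlloc(ctypes.c_void_p(base), COMMIT, MEM_COMMIT, PAGE_READWRITE)

    # Write-protect every committed "main thread" page with a single call,
    # then flip it back later (only committed pages can be protected).
    old = wintypes.DWORD(0)
    kernel32.VirtualProtect(ctypes.c_void_p(base), COMMIT, PAGE_READONLY, ctypes.byref(old))
    kernel32.VirtualProtect(ctypes.c_void_p(base), COMMIT, PAGE_READWRITE, ctypes.byref(old))

    # Added benefit: testing an object's origin is just a range check against the base.
    def is_main_thread_address(addr):
        return base <= addr < base + RESERVE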
page size for both is 4KB)
• Large pages provide significant performance benefits by minimizing the number of TLB entries required to cover a process's virtual address space
• Fewer TLB entries per address range = the TLB can cover a greater address range = better TLB hit ratios = a direct impact on performance (TLB misses are very costly)
• Large pages also mean the OS has to walk significantly fewer page table entries in response to our VirtualProtect() call
of the current memory model
o Ideas for a new set of interlocked data types
• Continued work on memory management enhancements
  o Use context managers to switch memory allocation protocols within parallel contexts
  o Rust does something similar in this area
• Integration with Numba
  o Parallel callbacks passed off to Numba asynchronously
  o Numba uses LLVM to generate an optimized version
  o PyParallel atomically switches the CPython version with the Numba version when ready
destination
o 1:m support
o Provide similar ZeroMQ-style bridge/fan-out/router functionality
• This would provide a nice short-term option for leveraging PyParallel for computation/parallel task decomposition
  o Bridge different protocols together
  o Each protocol represents a stage in a parallel pipeline
  o Leverage socket I/O for sharing data
  o Increased overhead from copying data everywhere
  o But a vastly simplified memory model
  o (And no need for synchronization primitives)
  o This is how ZeroMQ does "parallel computation"
way for parallel callbacks to efficiently enqueue UI actions (performed by a single UI thread)
• NUMA-aware memory allocators
• CPU/core-aware thread affinity
• Integrating Windows 8's Registered I/O support
• Multiplatform support:
  o MegaPipe for Linux looks promising
  o GCD on OS X/FreeBSD
  o IOCP on AIX
  o Event ports on Solaris
back into the CPython tree
o Although it started as a proof of concept, I believe it is Python's best option for exploiting multiple cores in the short term (without impeding single-threaded performance)
o This is going to be critical over the next 5-10 years
• A lot of work is required before that could take place
• Python 4.x, perhaps?