Parallelizing the Python Interpreter: An Alternate Approach to Async

This is the short overview presentation I gave to Python developers at the PyCon 2013 Language Summit.

Trent Nelson

March 14, 2013

Transcript

  1. PARALLELIZING THE PYTHON INTERPRETER
     An alternate approach to async
     PyCon 2013 Language Summit
     Trent Nelson, trent@snakebite.org, @trentnelson

  2. Background/Motivation
     •  Two parts:
        •  “Parallelizing” the interpreter
           •  Allowing multiple threads to run CPython internals concurrently
        •  True “asynchronicity”
           •  Leverage the parallel facilities above to expose a “truly asynchronous” async API to Python code
           •  (Allow Python code to run concurrently within the same interpreter)
     •  I’ve had a vague idea for how to go about the parallel aspect for a while
     •  The async discussions on python-ideas last year provided the motivation to try to tie both things together
     •  Also figured it would be a great pet project to better familiarize myself with CPython internals

  3. Details
     •  Have been working on it since late-December-ish (full time until mid-Feb-ish)
     •  It all lives in hg.python.org/sandbox/trent, px branch
        •  Include/object.h
        •  Include/pyparallel.h
        •  Python/pyparallel.c & pyparallel_private.h
           •  Vast majority of code
     •  The code quality is… ultra proof-of-concept/prototype/hackfest
        •  Think of how a homeless hoarder would treat his shopping cart
        •  Everything gets added, nothing gets removed
        •  ….so if you’re hoping to quickly grok how it all works by reviewing the code… you’re going to have a bad time.
     •  Only works on Vista+ (for now)
        •  Aim is to get everything working first, then refactor/review, add support for other platforms, etc.

  4. How it’s exposed to Python: The Async Façade
     •  Submission of arbitrary “work” (usage sketch below):

            async.submit_work(func, args, kwds, callback=None, errback=None)

        •  Calls func(args, kwds) from a parallel thread*
     •  Submission of timers:

            async.submit_timer(time, func, args, kwds, …)
            async.submit_timer(interval, func, args, kwds, …)

        •  Calls func(args, kwds) from a parallel thread some ‘time’ in the future, or every ‘interval’.
     •  Submission of “waits”:

            async.submit_wait(obj, func, args, kwds, …)

        •  Calls func(args, kwds) from a parallel thread when ‘obj’ is signalled

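For context, a minimal sketch of how these calls might be used, assuming only the signatures shown on the slide. How args/kwds are passed, what the callback and errback receive, and whether async.run() is needed to pump submitted work are all assumptions here, not confirmed behaviour of the px branch:

    import async

    def compute(n):
        # CPU-bound work; runs on a parallel thread, not the main thread.
        return sum(i * i for i in range(n))

    def on_done(result):
        # Assumption: the callback receives func's return value.
        print('work finished:', result)

    def on_error(error):
        # Assumption: the errback receives the error.
        print('work failed:', error)

    # Mirrors the slide's signature; the args/kwds convention is a guess.
    async.submit_work(compute, 1000000, None,
                      callback=on_done, errback=on_error)

    # Fire compute(1000) once, five seconds from now, on a parallel thread.
    async.submit_timer(5.0, compute, 1000, None)

    # Assumption: run() drives the loop for submitted work, as it does
    # for the client/server demo later in the deck.
    async.run()
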
  5. The Async Façade (cont.)
     •  Asynchronous file I/O (sketch below):

            f = async.open(filename, 'wb')
            async.write(f, buf, callback=…, errback=…)

     •  Asynchronous client/server services:

            client = async.client(host, port)
            server = async.server(host, port)
            async.register(transport=client, protocol=…)
            async.register(transport=server, protocol=…)
            async.run()

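Similarly, a hedged sketch of the file I/O calls; the slide does not specify what arguments the completion callbacks receive, so they are left open here:

    import async

    buf = b'x' * 65536

    def on_write(*args):
        # Assumption: completion-callback arguments are unspecified.
        print('write completed')

    def on_write_error(*args):
        print('write failed')

    # The write is issued from a parallel thread; control returns
    # immediately and the callback fires when the write completes.
    f = async.open('output.bin', 'wb')
    async.write(f, buf, callback=on_write, errback=on_write_error)
    async.run()
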

  6. Chargen: Character Generator

         def chargen(lineno, nchars=72):
             start = ord(' ')
             end = ord('~')
             c = lineno + start
             while c > end:
                 c = (c % end) + start
             b = bytearray(nchars)
             # Fill every slot but the last with consecutive printable
             # characters, wrapping back to ' ' after '~'.
             for i in range(0, nchars - 1):
                 if c > end:
                     c = start
                 b[i] = c
                 c += 1
             b[nchars - 1] = ord('\n')
             return b
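
A quick way to see what chargen() produces, using nothing beyond the function above:

    # Print the first three generated lines; each successive line
    # starts one character later in the printable-ASCII range, the
    # classic rotating RFC 864 chargen pattern.
    for lineno in range(3):
        print(chargen(lineno).decode('ascii'), end='')
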
  7. Demo: An Async Chargen Service

         import async

         class Chargen:
             def initial_bytes_to_send(self):
                 return chargen(0)

             def send_complete(self, transport, send_id):
                 return chargen(send_id)

         server = async.server('10.211.55.3', 20019)
         async.register(transport=server, protocol=Chargen)
         async.run()
  8. Things to note with the demo…
     •  Constant memory use
     •  Only one python_d.exe process…
     •  …exploiting all cores
     •  …without any explicit “thread” usage (i.e. no threading.Thread instances)
        •  (Not that threading.Thread instances would automatically exploit all cores)
  9. How it works…
     •  No GIL removal
     •  No fine-grained locking
     •  Not using STM
     •  No “free threading”
        •  i.e. concurrency isn’t achieved by exposing a threading.Thread-type object to Python code
     •  Negligible overhead to single-threaded execution
  10. “Intercept thread-sensitive calls”
      •  Anything that the GIL is intended to protect:
         •  Py_INCREF/DECREF/CLEAR
            •  (Anything to do with reference counting.)
         •  PyMem_Malloc/Free, PyObject_INIT/NEW
            •  (Anything to do with memory allocation/deallocation.)
         •  Free lists, globals, etc.
  11. “If we’re a parallel thread, do X, if not, do Y”
      •  Y = do what we normally do.
      •  X = a thread-safe, parallel-specific alternative
         •  (The topic for another presentation.)
  12. “If we’re a parallel thread, do X, if not, do Y”
      •  “Thread-sensitive” calls are ubiquitous
      •  The challenge then becomes how quickly you can detect if we’re a parallel thread
      •  The quicker you can detect it, the less overhead is incurred by the whole approach
  13. Py_PXCTX
      •  The magic macro: <Include/pyparallel.h>

             #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())

      •  What’s special about _Py_get_current_thread_id()?
         •  On Windows you could use GetCurrentThreadId()
         •  On POSIX, pthread_self()
         •  But that involves a syscall/libc trip
         •  Is there a quicker way?
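
The same check, sketched in pure Python purely for illustration (the real macro compares raw thread IDs in C with no function call at all):

    import threading

    # Captured once at startup, analogous to Py_MainThreadId.
    _MAIN_THREAD_ID = threading.get_ident()

    def px_ctx():
        # Python analogue of Py_PXCTX: True when running on a thread
        # other than the main interpreter thread.
        return threading.get_ident() != _MAIN_THREAD_ID
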
  14. Windows solution: the TEB

          #ifdef WITH_INTRINSICS
          # ifdef MS_WINDOWS
          #  include <intrin.h>
          #  if defined(MS_WIN64)
          #   pragma intrinsic(__readgsdword)
          #   define _Py_get_current_process_id() (__readgsdword(0x40))
          #   define _Py_get_current_thread_id()  (__readgsdword(0x48))
          #  elif defined(MS_WIN32)
          #   pragma intrinsic(__readfsdword)
          #   define _Py_get_current_process_id() (__readfsdword(0x20))
          #   define _Py_get_current_thread_id()  (__readfsdword(0x24))
          #  endif
          # endif
          #endif
  15. The (amd64) POSIX solution
      •  __read[fg]sbase()
      •  Guaranteed to return a unique value for each thread
         •  (Whether or not that “unique value” is a thread id is another matter)
  16. Py_PXCTX typical usage
      •  Py_PXCTX expands to (Py_MainThreadId != __readgsdword(0x48)) on Win64:

             -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
             +#define _Py_ForgetReference(op)          \
             +    do {                                 \
             +        if (Py_PXCTX)                    \
             +            _Px_ForgetReference(op);     \
             +        else                             \
             +            _Py_INC_TPFREES(op);         \
             +        break;                           \
             +    } while (0)
             +
             +#endif /* WITH_PARALLEL */

      •  So, overhead is reduced to a couple more instructions and an extra branch (the cost of which is minimized by branch prediction, in most cases)
      •  That is basically nothing compared to STM or fine-grained locking
  17. Demo: An Async Chargen Service

         import async

         class Chargen:
             def initial_bytes_to_send(self):
                 return chargen(0)

             def send_complete(self, transport, send_id):
                 return chargen(send_id)

         server = async.server('10.211.55.3', 20019)
         async.register(transport=server, protocol=Chargen)
         async.run()
  18. “Send Complete”: Clarification
      •  Called when a send() (WSASend()) call completes (either synchronously or asynchronously)
      •  What it doesn’t mean:
         •  The other side definitely got it (they could have closed the connection)
      •  What it does mean:
         •  All the data you tried to send successfully became bytes on the wire
         •  The send socket buffer is empty
      •  What it implies:
         •  You’re free to send more data if you’ve got it.
      •  Why it’s useful:
         •  Eliminates the need for a producer/consumer relationship
            •  i.e. pause_producing()/stop_consuming()
         •  No need to buffer anything internally (see the sketch below)
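
A hypothetical illustration of that send-driven style, using only the two protocol methods the demo shows. The chunking scheme, send_id starting at 1 on the first completion, and returning None to mean “nothing more to send” are all assumptions here, not confirmed px-branch API:

    import async

    PAYLOAD = b'x' * (1 << 20)   # 1 MiB to stream, 8 KiB at a time
    CHUNK = 8192

    class StreamPayload:
        def initial_bytes_to_send(self):
            return PAYLOAD[:CHUNK]

        def send_complete(self, transport, send_id):
            # Each completion hands over the next chunk; nothing is
            # buffered internally because data is only produced once
            # the previous send has drained the socket buffer.
            offset = send_id * CHUNK
            if offset >= len(PAYLOAD):
                return None      # assumed: None ends the exchange
            return PAYLOAD[offset:offset + CHUNK]

    server = async.server('10.211.55.3', 20019)
    async.register(transport=server, protocol=StreamPayload)
    async.run()
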
  19. Things to note: def chargen

          def chargen(lineno, nchars=72):
              start = ord(' ')
              end = ord('~')
              c = lineno + start
              while c > end:
                  c = (c % end) + start
              b = bytearray(nchars)
              for i in range(0, nchars - 1):
                  if c > end:
                      c = start
                  b[i] = c
                  c += 1
              b[nchars - 1] = ord('\n')
              return b

      •  No blocking calls
      •  Will happily consume all CPU when called in a loop (see the timing sketch below)
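
To put a number on that, a standard-library timing sketch (figures will vary by machine; the point is that generating a line costs far more than copying 72-odd bytes into a socket buffer):

    import timeit

    # Rough per-call cost of producing one chargen line.
    n = 100000
    cost = timeit.timeit('chargen(42)',
                         setup='from __main__ import chargen',
                         number=n)
    print('per call: %.2f us' % (cost / n * 1e6))
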
  20. Things to note: class Chargen
      •  No explicit send()
      •  “Sending” is achieved by returning a “sendable” object:
         •  PyBytesObject
         •  PyByteArray
         •  PyUnicode
         •  Callable PyObject that returns one of the above

             import async

             class Chargen:
                 def initial_bytes_to_send(self):
                     return chargen(0)

                 def send_complete(self, transport, send_id):
                     return chargen(send_id)

             server = async.server('10.211.55.3', 20019)
             async.register(transport=server, protocol=Chargen)
             async.run()
  21. Things to note: class Chargen
      •  Next “send” initiated from send_complete()
      •  ….which generates another send_complete()
      •  ….and so on
      •  It’s an “IO hog”
         •  Always wants to send something when given the chance

             import async

             class Chargen:
                 def initial_bytes_to_send(self):
                     return chargen(0)

                 def send_complete(self, transport, send_id):
                     return chargen(send_id)

             server = async.server('10.211.55.3', 20019)
             async.register(transport=server, protocol=Chargen)
             async.run()
  22. Why chargen is such a good demo
      •  You’re only sending 73 bytes at a time
      •  The CPU time required to generate those 73 bytes is not negligible (compared to the cost of sending 73 bytes)
         •  A good simulator of real-world conditions, where the CPU time to process a client request would dwarf the IO overhead of communicating the result back to the client
      •  With a default send socket buffer size of 8192 bytes and a local netcat client (a stand-in client sketch follows), you’re never going to block during send()
         •  Thus, processing a single request will immediately throw you into a tight back-to-back send/callback loop, with no opportunity to service other clients (when doing synchronous sends)
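
To poke at the service by hand, any TCP client will do; a minimal stand-in for the netcat client, using only the standard library (host and port taken from the demo):

    import socket

    # Connect to the chargen service and print the first few lines.
    with socket.create_connection(('10.211.55.3', 20019)) as conn:
        for _ in range(5):
            data = conn.recv(4096)
            if not data:
                break
            print(data.decode('ascii'), end='')
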
  23. Summary
      •  I suuuuuuucked at CPython internals when I started
         •  “Why do we bother INCREF/DECREFing Py_None?!”
      •  Probably should have picked an easier “pet project” than parallelizing the interpreter
      •  I’m really happy with progress so far though: exploit multiple cores without impacting single-threaded performance!
      •  Lots of work still to do
         •  The “do X” part (the thread-safe alternatives to “do Y”) is a huge topic that I’m planning on tackling in subsequent presentations
         •  How the async client/server stuff is implemented is a huge separate topic too
  24. Pre-emptive Answers to your Questions
      •  I’ve encountered the problem, spent weeks debugging it, and come up with a solid solution
      •  I’ve got a temporary solution in place and a better long-term solution in mind
      •  I’ve thought about the problem, it can probably be exploited to instantly crash the interpreter, and I’m currently dealing with it by not doing that
         •  “Doctor, it hurts when I do this.”
      •  I have no idea, and the very problem you speak of could threaten the viability of the entire approach