~end 2014
• Did some presentations that were well received
  • https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores
    • (154-slide brain dump)
  • https://speakerdeck.com/trent/parallelism-and-concurrency-with-python
• Started sprinting on code again 18th Dec last year (2014)
• Focused on getting things stable under load for the TechEmpower Frameworks Benchmark
  • https://www.techempower.com/blog/2014/05/01/framework-benchmarks-round-9/
• http://server:8080/json
  • -> HTTP response with JSON body { ‘message’: ‘Hello World!’ }
• http://server:8080/plaintext
  • -> HTTP response with ‘Hello World!’ as the body (both endpoints sketched below)
• Tested via wrk, e.g.:
  % ./wrk --latency --connections 16 --threads 16 --duration 30 http://server:8080/json
• Is it simple and riddled with flaws when you start poking at it? Yes.
• Is it still useful? Also yes.
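For reference, here is a minimal sketch of what those two endpoints return, written against the standard library's http.server purely to illustrate the contract the benchmark expects; this is not PyParallel's API and says nothing about its performance characteristics:

    # Illustrates the /json and /plaintext responses only; not PyParallel code.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == '/json':
                body = json.dumps({'message': 'Hello World!'}).encode('utf-8')
                ctype = 'application/json'
            elif self.path == '/plaintext':
                body = b'Hello World!'
                ctype = 'text/plain'
            else:
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header('Content-Type', ctype)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == '__main__':
        HTTPServer(('0.0.0.0', 8080), Handler).serve_forever()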
• …in particular, maintaining low latency under high load (very good 99th-percentile latency)
• Optimally uses the underlying hardware
  • Exploits all CPU cores, scales linearly with additional cores
• Memory use proportional to concurrent client count
  • 50,000 concurrent clients ~= 3GB
• Very low kernel overhead, e.g. profiling shows ~98% of time in user space when under load, only ~2% kernel overhead
  • (Although that’s just a side effect of optimally using Windows facilities for high-performance I/O, not necessarily anything clever I’ve done in PyParallel.)
• …that hard to do fast…
• Wanted a better demo for PyParallel that leveraged its benefits
• Wikipedia search!
• Downloaded Wikipedia (enwiki-20150205-pages-articles.xml -> 50GB)
• Extracted byte offsets of all the <title>xyz</title> entries (~15 million titles)
• Created a digital search tree (datrie, a Cython module) mapping titles to byte offsets
• Created a NumPy array of 64-bit unsigned ints storing all the offsets
• Wrote a little HTTP server wrapper (index-building step sketched below)
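The slides don't include the index-building code itself; below is a minimal, hypothetical sketch of that step, assuming the datrie and NumPy APIs named above. TITLE_RE, build_index and the one-shot read are illustrative simplifications, not the actual demo code:

    # Hypothetical sketch of the title -> byte-offset index described above.
    import re
    import string

    import datrie          # Cython wrapper around C libdatrie
    import numpy as np

    TITLE_RE = re.compile(rb'<title>(.*?)</title>')

    def build_index(xml_path):
        """Record the byte offset of every <title> element, both in a trie
        (title -> offset) and in a NumPy array of uint64 offsets."""
        # Printable ASCII keeps the sketch simple; real titles would need a
        # wider alphabet.
        trie = datrie.Trie(string.printable)
        offsets = []
        # The real demo would mmap or stream the 50GB dump rather than
        # reading it into memory in one go.
        with open(xml_path, 'rb') as f:
            data = f.read()
        for m in TITLE_RE.finditer(data):
            title = m.group(1).decode('utf-8', 'replace')
            trie[title] = m.start()
            offsets.append(m.start())
        # finditer() scans forward, so the offsets are already ascending,
        # which is what searchsorted() needs later.
        return trie, np.array(offsets, dtype=np.uint64)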
• …starting with ‘Python’
• Get the byte offset of the next page via offsets.searchsorted()
• Adjust them (start - 7 bytes, end - 11 bytes)
• …and return JSON of [ [‘<title>’, starting_byte, ending_byte] ] (search step sketched below)
• Web client then does an HTTP ranged request against /xml to grab the relevant XML fragment
• Or, http://laptop/wiki/Python for an exact lookup
  • Does all of the above, but also issues the range request itself if there’s an exact hit, returning the fragment in one go
• Non-trivial app that does something half useful; a good use case
• Exercises external C modules (NumPy, Cythonized datrie, etc.)
• Not something you could easily do with existing solutions (multiprocessing) (Could you?)
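Again, the slides don't show the handler code; here is a minimal, hypothetical sketch of the prefix-search step, reusing the trie and offsets built above. The search() name and the dump_size parameter are assumptions, and the 7/11-byte adjustments are taken verbatim from the slide:

    import json

    def search(trie, offsets, prefix, dump_size):
        """Return JSON of [[title, starting_byte, ending_byte], ...] for
        every title starting with `prefix`."""
        results = []
        for title, start in trie.items(prefix):
            # The byte offset of the next page's <title> marks the end of
            # this page; fall back to the dump size for the last page
            # (dump_size is assumed to be the file's total byte length).
            i = offsets.searchsorted(start, side='right')
            end = int(offsets[i]) if i < len(offsets) else dump_size
            # Adjustments as per the slide: start - 7 bytes, end - 11 bytes.
            results.append([title, int(start) - 7, end - 11])
        return json.dumps(results)

The ranged request the web client issues afterwards is then an ordinary GET against /xml with a Range: bytes=starting_byte-ending_byte header, so only the relevant XML fragment comes back over the wire.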
• datrie (the Cython wrapper around C libdatrie) is exhibiting odd behavior (crashing) on subsequent trie lookups in parallel contexts
  • Showing signs that make me suspect static memory quirks (similar to the Unicode interning issue); easy enough to work around
• Other than that, everything else works (NumPy will happily work in parallel contexts against arrays allocated by the main thread)
• What else did I break?
  • PyObject struct
  • Generators
• What are the current code restrictions?
• What could we do for core Python in the short term, such that this could possibly be something we adopt in the long term?
  • Extend the new memory allocator API to include reference counting, perhaps