Designed a packet processing framework using heterogeneous processors (CUDA GPUs + Intel Xeon Phi) – 80 Gbps on a single x86 Linux server (Intel DPDK + NBA framework) – about 48K lines of C++ – https://github.com/anlab-kaist/NBA § CTO at Lablup Inc. • Developing a distributed sandboxed code execution service – Python + ZeroMQ + AsyncIO – http://www.lablup.com 2 / 41
of socket programming • Basic experience in building server applications (e.g., an echo server) • Understanding of the network stack (e.g., TCP/IP) • Familiarity with the Python standard library • Understanding of multi-threading in operating systems § Python version for this talk: 3.5.2+ 4 / 41
event loop implementation § Generators, Coroutines, and Python asyncio § Tips for learning and using asyncio § Achieving high performance with asyncio § Alternative approach: PyParallel § Closing 5 / 41
peers § What do we need? • Reliable data communication • Multiplexing multiple communication channels § Who is responsible for multiplexing? • Operating systems • Programming languages & runtimes • You?! 6 / 41
served by a single server § Why difficult? • “This has not yet become popular in Unix, probably because few operating systems support asynchronous I/O, also possibly because it (like non-blocking I/O) requires rethinking your application.” 7 / 41
epoll / kqueue

Method:
• Per-connection contexts: create a new parallel context for every new connection
• epoll / kqueue: get "ready to use" file descriptors from a set of monitored file descriptors

Advantages:
• Per-connection contexts: simple to write programs using blocking calls
• epoll / kqueue: less performance overhead (depending on the underlying kernel implementation)

Disadvantages:
• Per-connection contexts: context-switching overheads & high memory consumption
• epoll / kqueue: difficult to write programs due to manual context tracking

8 / 41
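As a minimal sketch of the readiness-based approach, the stdlib selectors module (which wraps epoll/kqueue where available) can monitor descriptors; the socketpair and the attached callback here are illustrative stand-ins for real client connections:

```python
import selectors
import socket

# Readiness-based multiplexing with the stdlib "selectors" module,
# which uses epoll/kqueue under the hood where available.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Register interest in readability; the attached callback acts as
# the manually tracked per-descriptor context.
sel.register(b, selectors.EVENT_READ, lambda conn: conn.recv(1024))

a.sendall(b"hello")

received = None
for key, events in sel.select(timeout=1.0):
    # The selector reports "ready to use" descriptors; recv() won't block.
    received = key.data(key.fileobj)

a.close()
b.close()
sel.close()
print(received)
```

Note that the application, not the OS, keeps track of which callback belongs to which descriptor, which is exactly the "manual context tracking" disadvantage listed above.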
of per-connection contexts: • The number of bytes sent/received • Which steps to execute § We have to deal with not only sockets but also: • Synchronization primitives (e.g., locks) • Timers & Signals • Communication with subprocesses (IPC) • Non-std asynchronous I/O events (e.g., CUDA stream callback) § Current OSes do not provide a unified interface for all above. 12 / 41
Rebooted: the "asyncio" Module § The Motivation and Goal • Existing solutions: asyncore, asynchat, gevent, Twisted, … – Inextensible APIs in the existing standard library – Lack of compatibility – code tightly coupled to whichever library you use • A reusable and persistent event loop API with pluggable underlying implementations • Better networking abstractions with Transports and Protocols (like in Twisted) 14 / 41
gains • I/O waits must dominate the total execution time. • You should have many I/O channels to wait on. § Advantages of asyncio in Python • Single-threaded Python apps are likely to get performance gains from I/O multiplexing. • Even without performance improvements, it is easier to write programs with concurrent contexts since they look like sequential code. • It provides a unified abstraction for I/O, IPC, timers, and signals. 17 / 41
the key concepts to understand the asyncio ecosystem.

Generator-function version (note that the counter i must be initialized before the loop):

    def srange(n):
        i = 0
        while True:
            time.sleep(0.5)
            if i == n:
                break
            yield i
            i += 1

Equivalent iterator-class version:

    class srange:
        def __init__(self, n):
            self.n = n
            self.i = 0
        def __iter__(self):
            return self
        def __next__(self):
            time.sleep(0.5)
            if self.i == self.n:
                raise StopIteration
            i = self.i
            self.i += 1
            return i

    def run():
        for i in srange(10):
            print(i)

    run()

18 / 41
the key concepts to understand the asyncio ecosystem.

Async-generator version (the counter i must be initialized; note that async generators with yield require Python 3.6+):

    async def arange(n):
        i = 0
        while True:
            await asyncio.sleep(0.5)
            if i == n:
                break
            yield i
            i += 1

Equivalent async-iterator-class version:

    class arange:
        def __init__(self, n):
            self.n = n
            self.i = 0
        def __aiter__(self):
            return self
        async def __anext__(self):
            await asyncio.sleep(0.5)
            if self.i == self.n:
                raise StopAsyncIteration
            i = self.i
            self.i += 1
            return i

    async def run():
        async for i in arange(10):
            print(i)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())

19 / 41
the key concepts to understand the asyncio ecosystem. § await is almost the same as yield from, which was added in Python 3.3. • It distinguishes StopIteration from StopAsyncIteration. • Both allow transparent two-way communication between the coroutine scheduler (the caller of me) and the callee of me.

    @asyncio.coroutine
    def myfunc():
        yield from fetch_data()

    async def myfunc():
        await fetch_data()

20 / 41
over the control to the event loop scheduler. • Generator delegation allows the blocking callee to interact with the outer caller transparently to the current context.

    async def compose_items(arr):
        while arr:
            data = await fetch_data()
            print(arr.pop() + data)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(compose_items([1, 2, 3]))

21 / 41
executing coroutines • Always check which functions are coroutines and which are not. § Remember that coroutines are not running in parallel! • They are non-blocking and interleaved at explicit suspension points (await). • Avoid long-running, non-cooperative blocking calls in coroutines. • You need to explicitly call task.cancel() or loop.stop() to interrupt a coroutine.

    asyncio.ensure_future(some_coro(...))   # non-blocking; returns immediately
    loop.create_task(some_coro(...))        # non-blocking; returns immediately
    await some_coro(...)                    # blocking; returns after the coroutine finishes

22 / 41
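A small sketch contrasting the two scheduling styles (the work coroutine and its tags are made up for illustration):

```python
import asyncio

async def work(tag, results):
    await asyncio.sleep(0.01)
    results.append(tag)

async def main():
    results = []
    # Non-blocking: the task is only scheduled; nothing has run yet.
    task = asyncio.ensure_future(work("scheduled", results))
    assert results == []
    # "Blocking" from this coroutine's perspective: we resume only
    # after the awaited coroutine finishes.
    await work("awaited", results)
    await task  # make sure the scheduled task also completes
    return results

loop = asyncio.new_event_loop()
results = loop.run_until_complete(main())
loop.close()
print(results)  # both tags present; neither ran in parallel
```

Both styles run on the same single thread; ensure_future merely queues the coroutine for the next loop iteration instead of entering it immediately.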
different threads • Use loop.call_soon_threadsafe(loop.stop) where loop is the loop of the target thread. § Debugging unexpected hangs, freezes, etc. • Try asyncio's debug mode (set PYTHONASYNCIODEBUG=1 in the environment and enable logging for asyncio). • Use the latest Python! (3.5.2 at the time of this talk) § Use "async for" / "async with" whenever available for less code complexity • Check out the library manuals (e.g., aiohttp) 23 / 41
for async functions? • Simply wrap them with an event loop. • Or use a 3rd-party package such as https://github.com/Martiusweb/asynctest.

    class MyTest(unittest.TestCase):
        def setUp(self):
            self.loop = asyncio.new_event_loop()
            asyncio.set_event_loop(self.loop)

        def tearDown(self):
            self.loop.close()

        def test_something(self):
            self.loop.run_until_complete(coro_to_test(...))
            self.assertEqual(...)

24 / 41
(e.g., polling) • It may make a huge difference! § Avoid I/O logic written in Python • We all know pure Python loops are slow. • It is likely to incur unwanted extra memory copies, as in this receive loop:

    remaining = 1024
    data = []
    while remaining > 0:
        chunk = conn.recv(remaining)
        if not chunk:
            break  # connection closed before all bytes arrived
        data.append(chunk)
        remaining -= len(chunk)
    data = b''.join(data)

§ Implement asyncio.Protocol instead of using coroutine-based streams. • It may add ~5% throughput. • But don't do this if programming comfort matters (e.g., fast prototyping). § Use up-to-date, latest libraries (e.g., uvloop) • The ecosystem is under active development. 26 / 41
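A sketch of the Protocol approach: callbacks avoid per-read coroutine scheduling on the server's hot path. The echo behavior and the ephemeral port (0) are illustrative choices; the client side uses the coroutine-based stream API for comparison.

```python
import asyncio

class EchoProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.transport.write(data)  # echo back without awaiting
        self.transport.close()

async def main():
    loop = asyncio.get_event_loop()
    server = await loop.create_server(EchoProtocol, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    # Client side: the higher-level, coroutine-based stream API.
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'ping')
    data = await reader.read()  # read until the server closes (EOF)
    writer.close()
    server.close()
    await server.wait_closed()
    return data

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
echoed = loop.run_until_complete(main())
loop.close()
print(echoed)
```

The data_received callback is invoked directly by the event loop with already-read bytes, whereas each stream read suspends and resumes a coroutine.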
ZMQ • aiozmq vs. pyzmq.asyncio • asyncio vs. tornado vs. zmqloop vs. uvloop • Workload: two racing push/pull sockets inside a single thread • Compared configurations: asyncio + aiozmq, tornado + aiozmq, uvloop + aiozmq, zmqloop + pyzmq, tornado + pyzmq

[Chart: relative performance (lower is better) for each configuration, with "Redundant", "Vanilla", and "Optimized" variants]

https://github.com/achimnol/asyncio-zmq-benchmark

ZMQ (ZeroMQ): a socket abstraction library that comes with various networking patterns such as queuing and pub/sub, using a custom transport extension layer. 27 / 41
• Patching pyzmq.asyncio to avoid an extra polling bounce when data is available upon API call.

[Chart: relative performance (lower is better) for the same configurations; the "Optimized" variants reflect the patch]

https://github.com/achimnol/asyncio-zmq-benchmark

Excerpt from pyzmq PR #860 by Min RK 29 / 41
your app is still I/O-bound!) • Try changing threading to multiprocessing to avoid the GIL. • Setting the CPU affinity mask may help. (os.sched_setaffinity) • On *NIX systems: start_server(..., reuse_port=True) § Maybe PyPy can boost your app's performance. (if your app is computation-bound!) • Good news: Mozilla funds Python 3.5 support in PyPy! https://morepypy.blogspot.kr/2016/08/pypy-gets-funding-from-mozilla-for.html § Most important thing: your workload should fit asyncio. 30 / 41
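As a sketch of the affinity tip, os.sched_setaffinity (a Linux-only API) can pin the current process to one core; the choice of the lowest-numbered core here is arbitrary:

```python
import os

# Process id 0 means "the current process".
if hasattr(os, 'sched_getaffinity'):
    original = os.sched_getaffinity(0)   # cores we may currently run on
    one_core = {min(original)}
    os.sched_setaffinity(0, one_core)    # pin to a single core
    pinned = os.sched_getaffinity(0) == one_core
    os.sched_setaffinity(0, original)    # restore the original mask
else:
    pinned = True  # affinity APIs unavailable (e.g., macOS); nothing to do
print(pinned)
```

Pinning each worker process of a multiprocessing setup to its own core avoids cross-core migration of the event loop thread.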
intensive! • Eight 10 GbE ports ➜ ≥ 88M minimum-sized packets per sec. • A 2.4 GHz 8-core CPU ➜ ~210 cycles (87 nsec) available per packet • cf) an x86 lock: ~10 nsec, a system call: 50 ~ 80 nsec § Delivering this performance to userspace apps is still challenging! • Can a "dynamic" language such as Python keep up? • Can the OS network stack (TCP/IP) keep up? § This is already reality: AWS offers 10 Gbps network interfaces now. 31 / 41
other *NIX-based event loops) are synchronous + non-blocking I/O instead of actual asynchronous I/O! § The asyncio API is completion-oriented. § The implementation is readiness-oriented. • Because *NIX systems provide readiness-oriented syscalls for I/O. (select / poll / epoll / kqueue) § On Windows, we can use a completion-oriented, OS-managed API called IOCP (I/O completion ports). • Let's remove the obstructions in Python that keep us from utilizing it. 36 / 41
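The readiness model can be seen directly in a tiny sketch: a non-blocking recv() on a socket with no data fails immediately with EAGAIN/EWOULDBLOCK instead of "completing later". The socketpair is an illustrative stand-in for a network connection.

```python
import errno
import select
import socket

a, b = socket.socketpair()
b.setblocking(False)

# No data yet: the readiness model reports "not ready" via an error
# instead of accepting the request and finishing it in the background.
try:
    b.recv(10)
    got_eagain = False
except BlockingIOError as e:
    got_eagain = e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)

a.sendall(b'data')
# Once select() reports readiness, the same call succeeds without blocking.
select.select([b], [], [], 1.0)
chunk = b.recv(10)
a.close()
b.close()
print(got_eagain, chunk)
```

An event loop on *NIX is essentially this pattern in a loop: wait for readiness, then issue the non-blocking call that is now guaranteed not to block.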
Readiness-oriented (*NIX):
App: Hey, do you have 10 bytes?
OS: I have only 4 bytes. Here they are.
App: Hey, do you have the remaining 6 bytes?
OS: Not yet. (EAGAIN)
App: Hey, do you now have those?
OS: Yes, here are another 4 bytes.
App: Hey, where are the 2 bytes?
...

Completion-oriented (IOCP):
App: Hey, I have a 10-byte buffer. Please fill it with bytes from this socket.
OS: OK.
...
OS: Here, you got the requested 10 bytes.
App: Good!

A modified excerpt from Trent Nelson's talk https://speakerdeck.com/trent/parallelism-and-concurrency-with-python #47 37 / 41
with multiprocessing…) § Separation of the main thread and parallel contexts (PCTX) • Intercept all thread-sensitive code paths. (e.g., Py_INCREF) § GIL & reference-counting avoidance • If in a PCTX, do a thread-safe alternative. – Uses a bump memory allocator, with nested heap snapshots to avoid out-of-memory for long-running PCTX programs. – All main-thread objects are read-only. – The main thread and PCTXs are executed mutually exclusively. • If not in a PCTX, do what the original CPython does. 38 / 41
high-performance. • The advantages come from coroutines enabled by generators + a clean separation of event loop details and async functions. • Pluggable event loops have allowed high-performance 3rd-party implementations such as uvloop. • For even more performance on multi-cores, we need to rethink the underlying OS I/O APIs and Python's GIL together with its memory management. • PyParallel has shown a promising design space on Windows. § The Future? 40 / 41
which queries readable/writable states (no matter how much exactly the app wants to read/write). • IOCP notifies the app when the given read/write request is done. § Thread-agnostic I/O • In contrast to asyncio (and its relatives), where all I/O requests must be completed by the thread that initiated them. • IOCP keeps a set of threads to wake up for completed I/O requests. – The total number of threads is not limited; the number of concurrently awoken threads is limited. – Optionally, we can use thread affinity for a consistent client-thread mapping.