• Seoul, South Korea • CTO & Co-founder of Lablup Inc. • PhD in Computer Science (packet processing acceleration using DPDK and GPUs) • A humble asyncio ecosystem contributor • Hobby pianist
MLDev/MLOps platform with highlights on GPU resource mgmt. – Powering 70+ enterprise customers (incl. a largest GPU cluster in Latin America) on 3 continents with Python asyncio! Used asyncio since 2014 (Python 3.4) / ~65K LoC Sibling open source projects: aiotools, Callosum, aiomonitor-ng Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio – Model: Open core + proprietary enterprise plugins & tech support https://github.com/lablup/backend.ai
CPU hogging – Unexpected termination of critical long-running tasks – Loss of vital error logs – When these are combined with 3rd party codes that I have no control... • How to handle these in asyncio-based apps?
and arguments of the exception – The location where the exception is raised in the stack • Are these sufficient? – What if a 3rd-party code swallows the cancellation? – What if a task exits with an error but its garbage collection is delayed (e.g., due to extra references from long-lived objects)? File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
Who cancelled this task? e.g., Where did my precious task go?! – Who created this task? e.g., When there are unexpectedly many number of tasks... File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
My Code 3rd Party Code My Code My Code task stack Error! Swallows cancellation? raise cancel() await ... ... My Code 3rd Party Code create_task() await await An asyncio app is a giant tree of multiple tasks with their own stacks, interacting with each other. callback task stack
To see the usage of each command, run them with "--help" option. Commands: cancel Cancel an indicated task console Switch to async Python REPL exit (q,quit) Leave the monitor client session help (?,h) Show the list of commands ps (p) Show task table ps-terminated (pst,pt) List recently terminated/cancelled tasks signal Send a Unix signal stacktrace (st,stack) Print a stack trace from the event loop thread where (w) Show stack frames and the task creation chain of a task where-terminated (wt) Show stack frames and the termination/cancellation chain of a task
and its parents? • Implementation – class TracedTrask(asyncio.Task) Thanks to asyncio.set_task_factory() – Store weakref of tasks in a bookkeeper – Traverse through the chain of weakrefs upon where command • Assumption – Parent tasks are alive while child tasks are alive. (Lifetime inclusion) Structured concurrency and asyncio.TaskGroup encourage this. It makes easier to prevent memory leak in the creation chain tracker.
await ...() create_task(...) ...() await ...() await ...() create_task(...) Task D await ...() await ...() weak keys ...() Task A Task B Task C Task D _created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task]) Task creation chain tracker _created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary]) Task A Task B Task B stack A stack B stack C stack D weak values
(even it was a long before...)? • Implementation – Register _trace_termination() as the task done callback – get_trace_id(): Task object's ID is no longer valid after gc. Combine task ID and name (default: Task-XXX) to mangle a unique identifier to refer in the termination logs – Prevent memory leak by limiting the maximum length of history – Add @preserve_termination_logs decorator to mark specific coroutine task's termination log forever
which also may be cancelled by another? • Implementation – Override Task.cancel() to store CancellationChain current_task() in cancel() is the canceller task, while self in cancel() is the cancellee task. We can extract the canceller's stack trace at this point. – Deligate postprocessing to the monitor UI thread which holds the tracker objects. Uses janus for sync/async queue interfacing. Removes tracker objects of cancelled/terminated tasks when necessary (e.g., reached the history limit)
sessions (resolving aio-libs/aiomonitor#140) – Auto-completion of commands and task IDs / filtering the task list • Implementation – Combines several contextvars to keep track of: The monitor instance The current stdout buffer linked with the current telnet connection – Signals Click command completion with asyncio.Event – Defines a custom ClickCompleter adapter
longer user requests processed some time after startup. • Analysis: Agent's process_lifecycle_events() task to receive Docker's container status updates disappears... – Mis-used asyncio.TaskGroup to keep track of multiple fired tasks along a long-running loop, which caused termination of the entire group & loop upon any unhandled child exception This became the motivation behind aiotools.PersistentTaskGroup as well • aiomonitor-ng helped analysis with the task termination logs.
hang up. Happened in a specific customer site only. • Analysis: One user has put too many files in a storage folder (more than millions) which was periodically scanned to get the disk usage and file count recursively. – The latency of a scan task became longer than the interval. – The tasks were indefinitely accumulated (observed up to 6,000), competing for the same executor pool for synchronous I/O. • aiomonitor-ng helped analysis with the live task monitoring.
Memory leak after upgrading grpcio to a specific version • Analysis: There was an internal bug in grpcio when using the secure gRPC channel to acess etcd with explicit credentials, not cleaning up a specific callback task. – grpc/grpc#25449 – There were too many dangling tasks, which were generated as an internal callback of grpcio. • aiomonitor helped analysis with the task creation chain tracker.
& Logging • asyncio needs a unique tooling for debugging! • We should continuously invest at improving our toolchains for debugging. – There may seem no visible direct returns, but when bad things happen, it will save your time.
async app = a tree of task groups – List of persistent task groups (or supervisors) – Per-taskgroup task list view • Top-like activity monitor (e.g., most/least recently yielded tasks) • Adopting subinterpreters for better reliability • Performance optimization / Better pagination – Especially when there are thousands of tasks...