Upgrade to Pro — share decks privately, control downloads, hide ads and more …

[PyCon APAC 2023] Improving Debuggability of Co...

Joongi Kim
October 30, 2023

[PyCon APAC 2023] Improving Debuggability of Complex Asyncio Applications

Joongi Kim

October 30, 2023
Tweet

More Decks by Joongi Kim

Other Decks in Programming

Transcript

  1. About me • PyCon KR/JP/US/APAC speaker • Seoul, South Korea

    • CTO & Co-founder of Lablup Inc. • PhD in Computer Science (packet processing acceleration using DPDK and GPUs & Xeon Phi) • A humble asyncio ecosystem contributor • Hobby pianist APAC 2016 in Seoul US 2023 in Salt Lake City JP 2022 in Tokyo
  2. About Lablup • Co-founded in 2015 • Key product: –

    MLDev/MLOps platform with highlights on GPU resource mgmt. – Powering 70+ enterprise customers (incl. a largest GPU cluster in Latin America) on 3 continents with Python asyncio! Used asyncio since 2014 (Python 3.4) / ~70K LoC Sibling open source projects: aiotools, Callosum, aiomonitor-ng Contributed: aiohttp, aiodocker, janus, async-timeout, asyncio – Model: Open core + proprietary enterprise plugins & tech support https://github.com/lablup/backend.ai
  3. Updates after PyCon US • aiomonitor v0.5 (Jul 21st) –

    Backported aiomonitor-ng – I became the new maintainer! • aiomonitor v0.6 (Aug 27th) – GUI made with htmx + alpine.js + tailwind.css – Reference Breaking Down Barriers: A Unified Approach to Web Development with django-htmx Track 2, Bae KwonHan
  4. Production debugging • The hard problems – Memory leak /

    CPU hogging – Unexpected termination of critical long-running tasks – Loss of vital error logs – When these are combined with 3rd party codes that I have no control... – When these happen in remote, air-gapped clusters... • How to handle these in asyncio-based apps?
  5. Solutions for “hard" problems Prevention via structured concurrency[1] Debugging via

    postmortem asyncio.TaskGroup aio-libs/{aiomonitor, aioconsole} asyncio.all_tasks() Debugging via live inspection achimnol/aiomonitor-ng Task termination chains achimnol/aiomonitor-ng Task creation chains asyncio 3rd parties asyncio.Supervisor asyncio.as_completed_safe() trio.open_nursery() [1] https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ aiotools.PersistentTaskGroup aiotools.as_completed_safe() ? ? Traceback & logging N/A
  6. Traceback in asyncio • Python traceback gives: – The type

    and arguments of the exception – The location where the exception is raised in the stack • Are these sufficient? – What if a 3rd-party code swallows the cancellation? – What if a task exits with an error but its garbage collection is delayed (e.g., due to extra references from long-lived objects)? File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
  7. Traceback in asyncio • We also need to know: –

    Who cancelled this task? e.g., Where did my precious task go?! – Who created this task? e.g., When there are an unexpectedly large number of tasks... File "~/workspace/aiomonitor-ng/examples/simple_aiohttp_srv.py", line 11, in inner2 await asyncio.sleep(100) File "~/.pyenv/versions/3.11.3/lib/python3.11/asyncio/tasks.py", line 639, in sleep return await future ^^^^^^^^^^^^ asyncio.exceptions.CancelledError
  8. The world of async tasks My Code 3rd Party Code

    My Code 3rd Party Code My Code My Code task stack Error! Swallows cancellation? raise cancel() await ... ... My Code 3rd Party Code create_task() await await An asyncio app is a giant tree of multiple tasks with their own stacks, interacting with each other. callback task stack
  9. A brief architecture of aiomonitor asyncio Application A Separate Terminal

    App Thread aiomonitor Thread Telnet client asyncio event loop telnet ui loop REPL with start_monitor():
  10. Quick start import aiomonitor async def main(): loop = asyncio.get_running_loop()

    run_forever = loop.create_future() with aiomonitor.start_monitor(loop): await run_forever try: asyncio.run(main()) except KeyboardInterrupt: pass $ pip install aiomonitor-ng $ edit test.py $ python test.py & $ telnet localhost 50101
  11. Quick start monitor >>> help Usage: help [OPTIONS] COMMAND [ARGS]...

    To see the usage of each command, run them with "--help" option. Commands: cancel Cancel an indicated task console Switch to async Python REPL exit (q,quit) Leave the monitor client session help (?,h) Show the list of commands ps (p) Show task table ps-terminated (pst,pt) List recently terminated/cancelled tasks signal Send a Unix signal stacktrace (st,stack) Print a stack trace from the event loop thread where (w) Show stack frames and the task creation chain of a task where-terminated (wt) Show stack frames and the termination/cancellation chain of a task
  12. Additions in aiomonitor-ng • The task creation chain tracker •

    The task termination tracker • The task cancellation chain tracker • prompt_toolkit + Click integration for rich CLI
  13. Task creation chain tracker • Q: Who created this task

    and its parents? • Implementation – class TracedTrask(asyncio.Task) Thanks to asyncio.set_task_factory() – Store weakref of tasks in a bookkeeper – Traverse through the chain of weakrefs upon where command • Assumption – Parent tasks are alive while child tasks are alive. (Lifetime inclusion) Structured concurrency and asyncio.TaskGroup encourage this. It makes easier to prevent memory leak in the creation chain tracker.
  14. Task C Task B Task A await ...() ...() create_task(...)

    await ...() create_task(...) ...() await ...() await ...() create_task(...) Task D await ...() await ...() weak keys ...() Task A Task B Task C Task D _created_traceback_chain: WeakKeyDictionary[Task, weakref.ReferenceType[Task]) Task creation chain tracker _created_tracebacks: WeakKeyDictionary[Task, list[traceback.FrameSummary]) Task A Task B Task B stack A stack B stack C stack D weak values
  15. Task termination tracker • Q. Why is this task terminated

    (even it was a long before...)? • Implementation – Register _trace_termination() as the task done callback – get_trace_id(): Task object's ID is no longer valid after gc. Combine task ID and name (default: Task-XXX) to mangle a unique identifier to refer in the termination logs – Prevent memory leak by limiting the maximum length of history – Add @preserve_termination_logs decorator to mark specific coroutine task's termination log forever
  16. Task cancellation chain tracker • Q: Who cancelled this task,

    which also may be cancelled by another? • Implementation – Override Task.cancel() to store CancellationChain current_task() in cancel() is the canceller task, while self in cancel() is the cancellee task. We can extract the canceller's stack trace at this point. – Deligate postprocessing to the monitor UI thread which holds the tracker objects. Uses janus for sync/async queue interfacing. Removes tracker objects of cancelled/terminated tasks when necessary (e.g., reached the history limit)
  17. Task B Task X await ...() ...() task_b.cancel(...) await ...()

    _termination_info_queue: janus.Queue[TerminatedTaskInfo] Task termination & cancellation chain tracker _cancellation_chain_queue: janus.Queue[CancellationChain] Monitor UI thread App thread await ...() canceller cancellee cancel() extract_stack(task) filter_stack(task) cleanup_trackers() task done callback TracedTask.cancel()
  18. prompt_toolkit + Click integration • aiomonitor – Multiple concurrent telnet

    sessions (resolving aio-libs/aiomonitor#140) – Auto-completion of commands and task IDs / filtering the task list • Implementation – Combines several contextvars to keep track of: The monitor instance The current stdout buffer linked with the current telnet connection – Signals Click command completion with asyncio.Event – Defines a custom ClickCompleter adapter
  19. A part of task creation chain trace showing that a

    3rd party library (grpcio) creates internal sub-tasks ← my code ← 3rd party code
  20. Technical Goals of Web UI in aiomonitor • Rewritable by

    backend developers – The core logic should live at the server-side code. – Let's use a "good default" design framework! 🌞 • Lightweight dev setup – No additional build chain for Javascript (e.g., transpilers, bundlers, etc.) – Changed codes should just work if I reload the page. • Relatively small download size (under 1 MB) • Clean separation of UI and core logic
  21. v0.5 Architecture asyncio Application A Separate Terminal App Thread aiomonitor

    Thread Telnet client telnet ui loop REPL with start_monitor(): .monitor .task asyncio event loop .telnet .monitor
  22. v0.6: Restructuring frontends asyncio Application A Separate Terminal App Thread

    aiomonitor Thread Telnet client asyncio event loop ui loop REPL with start_monitor(): .termui .webui .monitor .task .telnet Web Browser htmx mustache.js alpine.js tailwind.css 127 KB 365 KB (non-gzipped)
  23. Case: Backend.AI “hangs" • Symptom: No error logs, but no

    longer user requests processed some time after startup. • Analysis: Agent's process_lifecycle_events() task to receive Docker's container status updates disappears... – Mis-used asyncio.TaskGroup to keep track of multiple fired tasks along a long-running loop, which caused termination of the entire group & loop upon any unhandled child exception This became the motivation behind aiotools.PersistentTaskGroup as well • aiomonitor helped analysis with the task termination logs.
  24. Case: Backend.AI “hogs the CPU" • Symptom: Slow-down and eventual

    hang up. Happened in a specific customer site only. • Analysis: One user has put too many files in a storage folder (more than millions) which was periodically scanned to get the disk usage and file count recursively. – The latency of a scan task became longer than the interval. – The tasks were indefinitely accumulated (observed up to 6,000), competing for the same executor pool for synchronous I/O. • aiomonitor helped analysis with the live task monitoring.
  25. Case: Hidden bomb inside a 3rd party dependency • Symptom:

    Memory leak after upgrading grpcio to a specific version • Analysis: There was an internal bug in grpcio when using the secure gRPC channel to acess etcd with explicit credentials, not cleaning up a specific callback task. – grpc/grpc#25449 – There were too many dangling tasks, which were generated as an internal callback of grpcio. • aiomonitor helped analysis with the task creation chain tracker.
  26. Lessons • Debuggability = Observability + Reproducibility – Live inspection

    & Logging • asyncio needs a unique tooling for debugging! • We should continuously invest at improving our toolchains for debugging. – There may seem no visible direct returns, but when bad things happen, it will save your time.
  27. Future Work • Top-like activity monitor – e.g., most/least recently

    yielded tasks – Requires more decent support of tracing in asyncio... • Adopting subinterpreters for better reliability • Performance optimization & better pagination • Scoped live task view – A structured async app = a tree of task groups – List of persistent task groups (or supervisors) – Per-taskgroup task list view
  28. Side Work • Expanding structured concurrency to long-running task groups

    – "TaskScope" / "Supervisor" (experimenting in aiotools) – ExceptionGroup → Exception Generator? (asyncio.as_completed) • Safe synchronous cancellation of tasks – https://github.com/python/cpython/issues/103486 – Graceful shutdown with timeouts? • Still under discussion & exploration with some CPython core devs...