
DTrace at 21: Reflections on Fully-grown Software

Bryan Cantrill
October 24, 2024


Talk given on October 24th, 2024 at P99 CONF. Video: https://www.youtube.com/watch?v=KjQnB9yB9kQ


Transcript

  1. DTrace in adulthood
    • On September 3, 2003, we integrated DTrace into Solaris
    • DTrace became open source in 2005, and has found its way into quite a few systems (e.g., Mac and Windows) – and has influenced many more
    • DTrace is broadly done and firmly in adulthood – and we’ve been lucky enough to be using it more or less continuously for its entire life
    • The benefit of hindsight allows us to reflect on the stuff we got right, the things we figured out – and the ways in which we got lucky
  2. We got right: Focus on production systems
    • From the outset, our focus for DTrace was on production systems: we knew that production systems have performance and other pathologies that cannot be readily reproduced elsewhere!
    • To be acceptable for production, instrumentation has to be absolutely safe above all else: misuse cannot result in system failure!
    • Facility has to always be available – can’t rely on recompiling anything, downloading binaries, downloading symbol tables, etc.
    • These constraints led to deep integration with the operating system
  3. We got right: Dynamic instrumentation
    • We believed from the outset that instrumentation should be dynamic
    • Part of this comes from the production constraint: it was essential that DTrace have zero probe effect when disabled
    • We also felt strongly that we should be able to instrument software that had not been modified to support it – and in arbitrary contexts
    • And because we also wanted to replace several existing tools, we separated the methods of instrumentation from the framework that consumed them
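
As a loose analogy only (this is not how DTrace is implemented), the two properties on this slide can be sketched in Python: a function that knows nothing about tracing is instrumented at run time, and disabling the probe puts the original function object back, so nothing of the instrumentation remains on the call path. The names `enable_probe`, `disable_probe`, and `Service` are hypothetical.

```python
import functools

# Hypothetical bookkeeping: remembers the original function for each enabled probe.
_originals = {}


def enable_probe(owner, name, on_entry):
    """Wrap owner.<name> at run time so on_entry fires before every call."""
    key = (id(owner), name)
    if key in _originals:
        raise RuntimeError(f"probe on {name} is already enabled")
    original = getattr(owner, name)

    @functools.wraps(original)
    def instrumented(*args, **kwargs):
        on_entry(name, args)
        return original(*args, **kwargs)

    _originals[key] = original
    setattr(owner, name, instrumented)


def disable_probe(owner, name):
    """Restore the original function; no wrapper is left behind."""
    setattr(owner, name, _originals.pop((id(owner), name)))


class Service:
    """Stand-in for software written with no knowledge of tracing."""

    def handle(self, request):
        return request.upper()


if __name__ == "__main__":
    seen = []
    enable_probe(Service, "handle", lambda name, args: seen.append(name))
    Service().handle("ping")      # traced
    disable_probe(Service, "handle")
    Service().handle("pong")      # runs the untouched original
    print(seen)
```

The sketch keeps the instrumentation method (wrapping a function) separate from whatever consumes the events (`on_entry`), which mirrors the separation described in the last bullet.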
  4. We got right: Organizational approach
    • We had been thinking about DTrace long before we started – and over the years, integrated the foundation that we knew we would need
    • But it couldn’t be done as a side-project – we needed to focus
    • Instead of attempting to timeline an entire project, we made the case in late 2001 to allow for two of us to focus for six months
    • After six months we had the proof points to add a third engineer – and to allow us to remain focussed on it for several more years
    • The team was always very small (three people!) and not in an office
  5. We figured out: A domain specific language
    • We knew we wanted to have expressive power in the actions taken on instrumentation, but something we figured out early was the need to have our own domain specific language
    • We were heavily inspired in syntax by AWK, a little language that continues to have an outsized influence
    • We dubbed our language “D” (and its intermediate form “DIF”), not knowing that Walter Bright was concurrently working on a language of the same name!
  6. We figured out: A domain specific virtual machine
    • We had assumed that we would transpile DIF to native instructions for execution, but it quickly became clear that executing DIF in a domain specific virtual machine in the kernel would be a huge win
    • Not only did this allow us to move quickly by adding powerful concepts to DIF (e.g., thread-local variables), it allowed us to achieve the safety constraint with extensive run-time checks
    • We guarantee that DIF programs complete by making DIF Turing incomplete – DIF has no backwards branches!
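
The termination argument can be sketched with a toy interpreter. This is not DIF: the instruction set, the tuple encoding, and the names `verify` and `run` are all assumptions made for illustration. What it shows is that if a load-time check rejects every branch whose target is not strictly ahead of it, the program counter can only move forward, so execution takes at most one step per instruction.

```python
def verify(program):
    """Load-time check: every branch must jump strictly forward."""
    for pc, (op, *operands) in enumerate(program):
        if op in ("jmp", "jz"):
            target = operands[-1]
            # A target equal to len(program) means "fall off the end".
            if not (pc < target <= len(program)):
                raise ValueError(
                    f"instruction {pc}: branch to {target} is not forward")


def run(program, regs):
    """Execute a verified program against a dict of named registers."""
    verify(program)
    pc = 0
    steps = 0
    while pc < len(program):
        op, *operands = program[pc]
        steps += 1
        next_pc = pc + 1
        if op == "set":            # set reg, constant
            reg, value = operands
            regs[reg] = value
        elif op == "add":          # add dst, src
            dst, src = operands
            regs[dst] += regs[src]
        elif op == "jz":           # jz reg, target: branch if reg is zero
            reg, target = operands
            if regs[reg] == 0:
                next_pc = target
        elif op == "jmp":          # jmp target
            (target,) = operands
            next_pc = target
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc = next_pc
    # Forward-only control flow bounds the work by the program length.
    assert steps <= len(program)
    return regs
```

Because `pc` strictly increases on every iteration, no loop counter or timeout is needed to bound execution; the bound falls out of the verifier.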
  7. We got right: We used it ourselves
    • We used DTrace heavily ourselves – and used it as early as possible in its development to find issues in the operating system and beyond
    • This led to a more robust system – and one whose emphasis is utility
    • Every feature in DTrace is a direct consequence of concrete need!
    • This has led to many features that may seem arcane – but they are only arcane until you need them: thread-local variables, anonymous tracing, speculative tracing, postmortem tracing, arbitrary instruction tracing, etc.
  8. We figured out: Statically-defined tracing
    • While the origin of DTrace is dynamic (our early tagline was “Concise answers to arbitrary questions”), it is overwhelming to have to deal with the implementation of the system just to ask a question
    • We saw the need for statically-defined tracing (SDT) at points of semantic interest (CPU scheduling, performing I/O, etc.)
    • SDT probes coupled with type information via CTF and structure translators allowed for true interface stability, allowing users to instrument in terms of semantics and not implementation
  9. We figured out: Application-level instrumentation
    • When we set out, it was with a focus on kernel-level instrumentation
    • Kernel-level instrumentation is necessary not just for kernel-level issues, but also to observe system-wide effects of application-level issues…
    • …but also insufficient: we needed application-level instrumentation
    • This is especially valuable with statically-defined tracing; user-level SDT (USDT) allows for programs themselves to be instrumented in a semantically stable fashion by their own users
    • USDT is essential at Oxide! See https://github.com/oxidecomputer/usdt
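
A statically-defined probe can be thought of as a named hook that a program's author places at a point of semantic interest. The sketch below is a Python analogy, not the interface of the `usdt` crate or of DTrace; `Probe`, `fire`, and the `request-start` probe are hypothetical names. It illustrates two ideas: consumers attach to a stable name rather than to implementation details, and (as a design choice in this sketch) probe arguments are computed only while a consumer is enabled.

```python
class Probe:
    """A named probe site owned by a provider (hypothetical API)."""

    def __init__(self, provider, name):
        self.provider = provider
        self.name = name
        self._consumers = []

    def enable(self, consumer):
        self._consumers.append(consumer)

    def disable(self, consumer):
        self._consumers.remove(consumer)

    def fire(self, make_args):
        # With no consumers, make_args is never called, so a disabled
        # probe costs only this check.
        if not self._consumers:
            return
        args = make_args()
        for consumer in list(self._consumers):
            consumer(self.provider, self.name, args)


# The stable, semantic name is the interface; the body of handle_request
# can change freely without breaking consumers of the probe.
request_start = Probe("webapp", "request-start")


def handle_request(path):
    request_start.fire(lambda: {"path": path, "length": len(path)})
    return f"200 OK {path}"
```

A consumer would call `request_start.enable(callback)` and receive the provider name, probe name, and argument dictionary on each request until it calls `disable`.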
  10. We got right: Writing our own documentation
    • In 2003, documentation was really the only way to learn how to use a system – especially one that was sophisticated and proprietary
    • We made the deliberate decision to write all of our own documentation
    • The DTrace documentation (https://illumos.org/books/dtrace) was all written by the three DTrace engineers: the documentation was authoritative, canonical – and we found many bugs in writing the docs!
  11. We figured out: Writing a paper
    • In 2003 (right after the integration!), we attended an academic conference (AADEBUG – RIP!) which inspired us to write a canonical academic paper on DTrace
    • The resulting paper, Dynamic Instrumentation of Production Systems, was presented at the USENIX Annual Technical Conference in 2004
    • We wrote a broader paper for practitioners in ACM Queue in 2006, Hidden in Plain Sight, that made the case for software observability
    • There is tremendous value in rigorously describing your ideas!
  12. We got lucky: Open source
    • DTrace was born an entirely proprietary system, but there had been conversations internally about open sourcing the operating system as early as 1997 – and by 2003, there was urgency around it
    • Open sourcing big, proprietary software is not easy – and if DTrace had remained proprietary, it would have died
    • We got very lucky that Sun not only prioritized open sourcing the operating system, but led that effort with DTrace (in January 2005)
  13. We got lucky: Ports to other systems
    • Porting DTrace is not easy – it is very tightly integrated with the operating system, and depends on many OS facilities
    • Initially ported to FreeBSD by the late John Birrell – and then to macOS, QNX, Linux, and Windows
    • These ports required significant effort by veteran technologists!
    • The different ports have taken different liberties (e.g., the DTrace port to Linux now uses eBPF as a backend), but all share the goal of dynamic instrumentation in production
  14. We got lucky: DTrace endures
    • We feel lucky to still be using DTrace every day – and especially to be using it on our thorniest problems!
    • We feel lucky to be bringing new people into DTrace (older than DTrace itself – but not by much!), new languages (Rust!) and new systems
    • We feel lucky that it’s open source, which granted DTrace eternal life
    • We feel lucky to still be working together – join Adam Leventhal and me on our podcast Oxide and Friends!