Taking Human Performance Seriously

John Allspaw
June 03, 2019

Monitorama PDX 2019

Transcript

  1. Beliefs about safety (1940s-1970s)
     • Safety can be encoded in the design of technology.
     • Accidents can be avoided by having more automation.
     • Procedures can be specified to be objective and comprehensive.
     • Operators just have to follow the procedures to get work done.
     • “Humans Are Better At” versus “Machines Are Better At” list (HABA-MABA)
  2. New beliefs about safety, post-TMI (Three Mile Island)
     • Automation is necessary in modern systems, and it also introduces new forms of challenges and risk.
     • Rules and procedures are always underspecified, so they can’t guarantee safety by themselves; they have to be interpreted in local context.
     • Events in these environments will require operators to make decisions and take actions that cannot be pre-specified.
     • The methods and models for “risk” that rely on “human error” categories, accounting, taxonomies, etc. are fraught.
  3. We study cognitive work by studying incidents, which are marked by:
     • time pressure
     • high (or potentially increasing) consequences
     • uncertainty
     • ambiguity
  4. “…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)
     Methods, approaches, and techniques:
     • cognitive task analysis
     • cognitive work analysis
     • process tracing
     • conversation analysis
     • Critical Decision Method
     • Critical Incident Technique
     • more…
  5. • logs
     • time of year
     • day of year
     • time of day
     • observations and hypotheses others share
     • what has been investigated thus far
     • what’s been happening in the world (news, service provider outages, etc.)
     • time-series data
     • alerts
     • tracing/observability tools
     • recent changes in existing tech
     • new dependencies
     • who is on vacation, at a conference, traveling, etc.
     • status of other ongoing work
  6. “Cues are not primitive events—they are constructions generated by people trying to understand situations. …cues are only ‘objective’ in a limited sense. …rather, the knowledge and expectancies a person has will determine what counts as a cue and whether it will be noticed.”
  7. • DBA: 2 weeks on the job
     • Infra Engineer: 2.5 years
     • Network Engineer: 5 years
     • Product/App Engineer: 3 years
     • Security Engineer: 1 year
  8. • problem detection and identification
     • generating hypotheses
     • diagnostic actions
     • therapeutic actions
     • sacrifice decisions
     • coordinating
     • (re)planning
     • preparing for potential escalation/cascades
     Multiple threads of activity: some productive, some unproductive.
  9. “I mean I could ssh into one of the servers, and I might find something helpful by doing that…but… NO I REFUSE TO DO THAT BECAUSE I SHOULDN’T HAVE TO!!!”
  10. People will pursue what they think will be productive.
     • people: Who are these people? What roles do they play… actually?
     • productive: For “fixing”…? For understanding? For ‘stemming the bleeding’? For customer support? For…?
     • think: Via hypotheses? Via past experience? Via…?
  11. • Anomalous signals and representations
     • Interventions and results
     • Tentative, evolving, shared hypotheses
     • Collective hypotheses ➝ plans acted on
     • Line of certainty and commitment to action
  12. “Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems,” Marisa Grayson, Ohio State University
  13. • Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
     • How do people improvise new tools to help them understand what is happening?
     • What tricks do people or teams use to understand how otherwise opaque third-party services are behaving?
  14. Research on supporting work in complex cognitive domains already exists! It will prove to be a competitive advantage for you.
  15. Summary
     • Understanding cognitive work in software engineering and operations is critically important. (The stakes are already too high, and we’re behind.)
     • Doing this well will mean new language, concepts, paradigms, and practices, some of which may be unintuitive and/or controversial.
     • This must be driven by both research/academia and industry/practitioners.
     • Vendors: if you pay attention, this will be a competitive advantage for you.
  16. “...irony that the more advanced a control system is, so the more crucial may be the contribution of the human operator.” (Lisanne Bainbridge, “Ironies of Automation,” 1983)