Resilience engineering papers

Overview

Alias: http://resiliencepapers.club (thanks to John Allspaw).

This doc contains notes about people active in resilience engineering, as well as some influential researchers who are no longer with us, organized alphabetically. It also includes people and papers from related fields, such as cognitive systems engineering and naturalistic decision-making.

Some papers have a (TWRR) link next to them. This stands for Thai Wood's Resilience Roundup. Thai publishes a newsletter that summarizes resilience engineering papers.

People

For each person, I list concepts that they reference in their writings, along with some publications. The publication lists aren't comprehensive: they're ones I've read or have added to my to-read list.

John Allspaw
Lisanne Bainbridge
Andrea Baker
E. Asher Balkin
Johan Bergström
Matthieu Branlat
Sheuwen Chuang
Todd Conklin
Richard I. Cook
Sidney Dekker
John C. Doyle
Bob Edwards
Anders Ericsson
Paul Feltovich
Pedro Ferreira
Meir Finkel
Marisa Grayson
Ivonne Andrade Herrera
Robert Hoffman
Erik Hollnagel
Leila Johannesen
Gary Klein
Elizabeth Lay
Nancy Leveson
Carl Macrae
Laura Maguire
Christopher Nemeth
Anne-Sophie Nyssen
Elinor Ostrom
Jean Pariès
Emily Patterson
Charles Perrow
Shawna J. Perry
Jens Rasmussen
James Reason
J. Paul Reed
Emilie M. Roth
Nadine Sarter
James C. Scott
Steven Shorrock
Barry Turner
Diane Vaughan
Robert L. Wears
David Woods
John Wreathall

Some big ideas

The adaptive universe (David Woods)
Dynamic safety model (Jens Rasmussen)
Safety-II (Erik Hollnagel)
Graceful extensibility (David Woods)
ETTO: Efficiency-thoroughness tradeoff principle (Erik Hollnagel)
Drift into failure (Sidney Dekker)
Robust yet fragile (John C. Doyle)
STAMP: Systems-Theoretic Accident Model & Process (Nancy Leveson)
Polycentric governance (Elinor Ostrom)

Note: there are now multiple contributors to this repository.

John Allspaw

Allspaw is the former CTO of Etsy. He applies concepts from resilience engineering to the tech industry. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.

Allspaw tweets as @allspaw.

Selected publications

Selected talks

Resilience Engineering: The What and How
Incidents as we Imagine Them Versus How They Actually Are
How your systems keep running day after day
Problem detection (papers we love) (presentation of Problem detection paper)
Common Ground and Coordination in Joint Activity (papers we love) (presentation of Common Ground and Coordination in Joint Activity paper)
Amplifying sources of resilience (presentation about applying Resilience Engineering thinking & paradigms to the world of software engineering)
Incidents: What Is Often Missed & What Can Be Done About That
Incident Analysis: How Learning is Different Than Fixing

Lisanne Bainbridge

Bainbridge is a psychology researcher. She has a website at http://www.complexcognition.co.uk/

Contributions

Ironies of automation

Bainbridge is famous for her 1983 Ironies of automation paper, which continues to be frequently cited.

Concepts

automation
design errors
human factors / ergonomics
cognitive modelling
cognitive architecture
mental workload
situation awareness
cognitive error
skill and training
interface design

Selected publications

Ironies of automation (TWRR)

Andrea Baker

Baker is a practitioner who provides training services in human and organizational performance (HOP) and learning teams.

Baker tweets as @thehopmentor.

Concepts

Human and organizational performance (HOP)
Learning teams
Industrial empathy

Selected publications

A bit about HOP (editorial)
A short introduction to human and organizational performance (HOP) and learning teams (blog post)

E. Asher Balkin

Selected publications

Resiliency Trade Space Study: The Interaction of Degraded C2 Link and Detect and Avoid Autonomy on Unmanned Aircraft

Selected talks

Root cause and the wrong path

Johan Bergström

Bergström is a safety researcher and consultant. He runs the Master Program of Human Factors and Systems Safety at Lund University.

Bergström tweets as @bergstrom_johan.

Concepts

Analytical traps in accident investigation
- Counterfactual reasoning
- Normative language
- Mechanistic reasoning
Generic competencies

Selected publications

Selected talks

Matthieu Branlat

Selected publications

Sheuwen Chuang

Selected publications

Todd Conklin

Conklin's books are on my reading list, but I haven't read anything by him yet. I have listened to his great Preaccident investigation podcast.

Conklin tweets as @preaccident.

Selected publications

Selected talks

Quanta - Risk and Safety Conf 2019

Richard I. Cook

Cook was an anesthesiologist who studied failures in complex systems. He was one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy. He tweeted as @ri_cook.

Concepts

how complex systems fail
degraded mode
sharp end (cf. Reason's blunt end)
Going solid
Cycle of error
"new look"
first vs second stories

Selected publications

Selected talks

How Complex Systems Fail (Velocity 2012)
Resilience in Complex Adaptive Systems: Operating at the Edge of Failure (Velocity 2013)
Lectures on the study of cognitive work (Graduate student lecture-discussions at The Royal Institute of Technology, Huddinge, Sweden, in 2012)
Panel discussion: Safety Culture, Lean, and DevOps (DOES 2017)
Working at the center of the Cyclone (DOES 2018)
A Few Observations on the Marvelous Resilience of Bone & Resilience Engineering (REdeploy 2019)

Sidney Dekker

Dekker is a human factors and safety researcher with a background in aviation. His books aimed at a lay audience (Drift Into Failure, Just Culture, The Field Guide to 'Human Error' investigations) have been enormously influential. He was a founder of the MSc programme in Human Factors & Systems Safety at Lund University. His PhD advisor was David Woods.

Dekker tweets as @sidneydekkercom.

Contributions

Drift into failure

Dekker developed the theory of drift, characterized by five concepts:

Scarcity and competition
Decrementalism, or small steps
Sensitive dependence on initial conditions
Unruly technology
Contribution of the protective structure

Just Culture

Dekker examines how cultural norms defining justice can be re-oriented to minimize the negative impact and maximize learning when things go wrong.

Retributive justice as society's traditional idea of justice: distributing punishment to those responsible based on severity of the violation
Restorative justice as an improvement for both victims and practitioners: distributing obligations of rebuilding trust to those responsible based on who is hurt and what they need
First, second, and third victims: an incident's negative impact is felt by more than just the obvious victims
Learning theory: people break rules when they have learned there are no negative consequences, and there are actually positive consequences - in other words, they break rules to get things done to meet production pressure
Reporting culture: contributing to reports of adverse events is meant to help the organization understand what went wrong and how to prevent recurrence, but accurate reporting requires appropriate and proportionate accountability actions
Complex systems: normal behavior of practitioners and professionals in the context of a complex system can appear abnormal or deviant in hindsight, particularly in the eyes of non-expert juries and reviewers
The nature of practitioners: professionals want to do good work, and therefore want to be held accountable for their mistakes; they generally want to help similarly-situated professionals avoid the same mistake.

Safety Differently

There is a difference between the organization's prescribed processes for completing work and how work is actually completed. (work as imagined vs work as done)
- The difference between work as imagined and work as done is the result of the expertise that exists in your workers from contact with real-life pressures, heuristics, and unexpected conditions.
- Old View: People are the problem to control with process
  - They did something wrong
  - They need more rules and enforcement
  - They need to try harder
  - We need to get rid of "bad apples"
  - Focus on the "sharp end" of the organization - the people closest to the work
- New View: Work is done adaptively in an uncertain world
  - Things go wrong all the time
  - Workers often detect and correct these problems
  - Local adaptations are a source of organizational expertise
  - "What conditions existed that made the selected course of action seem correct to the people involved?"
Traditional safety interventions have diminishing yields with increasing overhead. Accumulated compliance burden and "safety clutter" make it harder to get work done and to do so safely.
- Safety Clutter is accountable to safety bureaucracy and compliance rather than the safety of the workers or the process
- Safety Clutter is produced by the "blunt end" of the organization without local expertise of what is practicable or practical in-situ
- Safety Clutter represents a broader "deprofessionalization" - a removal of trust and confidence in professionals to do their job well, removing their pride, autonomy, and achievement.
- Paradoxically, Safety Clutter can result from government deregulation - organizations need to self-impose risk controls in the absence of external guidelines.
- Sadly for organizations with Safety Clutter, more internal rules do not equal better legal protection.
When a process is relatively safe or stable, measurements of bad outcomes lack statistical significance to understand trends or tie trends to interventions.
- Fundamental Regulator Paradox: regulating a system so well that there are no useful measurements left to understand how the system is performing
- Zero Paradox: A study of construction contractors showed more fatal accidents in firms with "goal zero" safety policies than in those without. Non-fatal accidents were similar.
- Risk Secrecy: "goal zero" commitments result in injury underreporting and hiding of incidents which prevents learning, particularly when tied to financial incentives for leadership.
There are patterns (capacities) that help things go well
- Diversity of opinion - possibility to voice dissent
- Keeping the discussion on risk alive even when things go well
- Deference to expertise that already exists in people at the sharp end
- Psychological safety / "stop" ability
- Low barriers to interaction between organizational groups
- Sharp end improvements to existing systems based on local expertise
- Pride in work - process and results
Rapid problem-solving can prevent effective problem-understanding
Leadership buy-in and practice of New View safety is imperative to its success. It's also difficult to foster.
- Worker buy-in is rapid and fits their existing mental model
- Leadership must abandon the mental model that has governed their past work and decision-making - difficult for anyone.
- Peer discussions are especially helpful for leadership
- Highlighting how local adaptations helped things go well also helps

Concepts

Drift into failure
Safety differently
New view vs old view of human performance & error
Just culture
complexity
broken part
Newton-Descartes
diversity
systems theory
unruly technology
decrementalism
generic competencies
work as imagined vs work as done

Selected publications

Selected talks

Panel discussion: Safety Culture, Lean, and DevOps

John C. Doyle

Doyle is a control systems researcher. He is seeking to identify the universal laws that capture the behavior of resilient systems, and is concerned with the architecture of such systems.

Concepts

Robust yet fragile
layered architectures
constraints that deconstrain
protocol-based architectures
emergent constraints
Universal laws and architectures
conservation laws
universal architectures
Highly optimized tolerance
Doyle's catch

Doyle's catch

Doyle's catch is a term introduced by David Woods, but attributed to John Doyle. Here's how Woods quotes Doyle:

Computer-based simulation and rapid prototyping tools are now broadly available and powerful enough that it is relatively easy to demonstrate almost anything, provided that conditions are made sufficiently idealized. However, the real world is typically far from idealized, and thus a system must have enough robustness in order to close the gap between demonstration and the real thing.

Selected publications

Bob Edwards

Edwards is a practitioner who provides training services in human and organizational performance (HOP).

Edwards tweets as @thehopcoach.

Anders Ericsson

Ericsson introduced the idea of deliberate practice as a mechanism for achieving a high level of expertise.

Ericsson isn't directly associated with the field of resilience engineering. However, Gary Klein's work is informed by his, and I have a particular interest in how people improve in expertise, so I'm including him here.

Concepts

Expertise
Deliberate practice
Protocol analysis

Selected publications

Paul Feltovich

Feltovich is a retired Senior Research Scientist at the Florida Institute for Human & Machine Cognition (IHMC), who has done extensive research in human expertise.

Selected publications

Meir Finkel

Finkel is a Colonel in the Israeli Defense Force (IDF) and the Director of the IDF's Ground Forces Concept Development and Doctrine Department.

Selected publications

On Flexibility: Recovery from Technological and Doctrinal Surprise on the Battlefield

Marisa Grayson

Grayson is a cognitive systems engineer at Mile Two, LLC.

Selected publications

Ivonne Andrade Herrera

Herrera is an associate professor in the department of industrial economics and technology management at NTNU and a senior research scientist at SINTEF. Her areas of expertise include safety management and resilience engineering in avionics and air traffic management.

Selected publications

Organisational accidents and resilient organisations: six perspectives (SINTEF A17034 report)

Robert Hoffman

Hoffman is a senior research scientist at the Florida Institute for Human & Machine Cognition (IHMC), who has done extensive research in human expertise.

Selected publications

Concepts

Seven deadly myths of autonomous systems:

"Autonomy" is unidimensional.
The conceptualization of "levels of autonomy" is a useful scientific grounding for the development of autonomous system roadmaps.
Autonomy is a widget.
Autonomous systems are autonomous.
Once achieved, full autonomy obviates the need for human-machine collaboration.
As machines acquire more autonomy, they will work as simple substitutes (or multipliers) of human capability.
"Full autonomy" is not only possible, but is always desirable.

Erik Hollnagel

Contributions

ETTO principle

Hollnagel proposed that there is always a fundamental tradeoff between efficiency and thoroughness, which he called the ETTO principle.

Safety-I vs. Safety-II

Safety-I: avoiding things that go wrong

looking at what goes wrong
bimodal view of work and activities (acceptable vs unacceptable)
find-and-fix approach
prevent transition from 'normal' to 'abnormal'
causality credo: the belief that adverse outcomes happen because something goes wrong (they have causes that can be found and treated)
it either works or it doesn't
systems are decomposable
functioning is bimodal

Safety-II: performance variability rather than bimodality

the system’s ability to succeed under varying conditions, so that the number of intended and acceptable outcomes (in other words, everyday activities) is as high as possible
performance is always variable
performance variation is ubiquitous
things that go right
focus on frequent events
remain sensitive to possibility of failure
be thorough as well as efficient

FRAM

Hollnagel proposed the Functional Resonance Analysis Method (FRAM) for modeling complex socio-technical systems.

Four abilities necessary for resilient performance

respond
monitor
learn
anticipate

Concepts

ETTO (efficiency thoroughness tradeoff) principle
FRAM (functional resonance analysis method)
Safety-I and Safety-II
things that go wrong vs things that go right
causality credo
performance variability
bimodality
emergence
work-as-imagined vs. work-as-done
joint cognitive systems
systems of the first, second, third, fourth kind

Selected publications

Leila Johannesen

Johannesen is currently a UX researcher and community advocate at IBM. Her PhD dissertation work examined how humans cooperate, including studies of anesthesiologists.

Concepts

common ground

Selected publications

Gary Klein

Klein studies how experts are able to quickly make effective decisions in high-tempo situations.

Klein tweets as @KleInsight.

Concepts

naturalistic decision making (NDM)
intuitive expertise
cognitive task analysis
common ground
problem detection
automation as a "team player"

Selected publications

Selected talks

Problem detection

Elizabeth Lay

Elizabeth Lay is a resilience engineering practitioner. She is currently a director of safety and human performance at Lewis Tree Service.

Selected publications

Nancy Leveson

Nancy Leveson is a computer science researcher with a focus on software safety.

Contributions

STAMP

Leveson developed the accident causality model known as STAMP: the Systems-Theoretic Accident Model and Process.

See STAMP for some more detailed notes of mine.

Concepts

Software safety
STAMP (systems-theoretic accident model and processes)
STPA (system-theoretic process analysis) hazard analysis technique
CAST (causal analysis based on STAMP) accident analysis technique
Systems thinking
hazard
interactive complexity
system accident
dysfunctional interactions
safety constraints
control structure
dead time
time constants
feedback delays

Selected publications

Carl Macrae

Macrae is a social psychology researcher who has done safety research in multiple domains, including aviation and healthcare. He helped set up the new healthcare investigation agency in England. He is currently a professor of organizational behavior and psychology at the Nottingham University Business School.

Macrae tweets as @CarlMacrae.

Concepts

risk resilience

Selected publications

Laura Maguire

Maguire is a cognitive systems engineering researcher with a PhD from Ohio State University. She has done safety work in multiple domains, including forestry, avalanches, and software services. She currently works as a researcher at jeli.io.

Maguire tweets as @LauraMDMaguire.

Selected publications

Selected talks

Christopher Nemeth

Nemeth is a principal scientist at Applied Research Associates, Inc.

Selected publications

Anne-Sophie Nyssen

Nyssen is a psychology professor at the University of Liège, who does research on human error in complex systems, in particular in medicine.

A list of publications can be found on her website linked above.

Elinor Ostrom

Ostrom was a Nobel Prize-winning economics and political science researcher.

Selected publications

Concepts

tragedy of the commons
polycentric governance
social-ecological system framework

Jean Pariès

Pariès is the president of Dédale, a safety and human factors consultancy.

Selected publications

Resilience engineering in practice: a guidebook

Selected talks

Predicting The fatal flaws: The challenge of The unpredictable...

Emily Patterson

Patterson is a researcher who applies human factors engineering to improve patient safety in healthcare.

Selected publications

Charles Perrow

Perrow is a sociologist who studied the Three Mile Island disaster. "Normal Accidents" is cited by numerous other influential systems engineering publications such as Vaughan's "The Challenger Launch Decision".

Concepts

Complex systems: A system of tightly-coupled components with common mode connections that is prone to unintended feedback loops, complex controls, low observability, and poorly-understood mechanisms. They are not always high-risk, and thus their failure is not always catastrophic.
Normal accidents: Complex systems with many components exhibit unexpected interactions in the face of inevitable component failures. When these components are tightly-coupled, failed parts cannot be isolated from other parts, resulting in unpredictable system failures. Crucially, adding more safety devices and automated system controls often makes these coupling problems worse.
Common-mode: The failure of one component that serves multiple purposes results in multiple associated failures, often with high interactivity and low linearity - both ingredients for unexpected behavior that is difficult to control.
Production pressures and safety: Organizations adopt processes and devices to improve safety and efficiency, but production pressure often defeats any safety gained from the additions: the safety devices allow or encourage more risky behavior. As an unfortunate side-effect, the system is now also more complex.

Selected publications

Normal Accidents: Living With High-Risk Technologies

Shawna J. Perry

Perry is a medical researcher who studies emergency medicine.

Concepts

Underground adaptations
Articulated functions vs. important functions
Unintended effects
Apparent success vs real success
Exceptions
Dynamic environments

Selected publications

Other

Interview on Naturalistic Decision Making podcast

Jens Rasmussen

Jens Rasmussen was an enormously influential researcher in human factors and safety systems. In particular, you can see his influence in the work of Sidney Dekker, Nancy Leveson, and David Woods.

Contributions

Skill-rule-knowledge (SRK) model

Rasmussen proposed three models of human performance.

Skill-based behavior doesn't require conscious attention. The prototypical example is riding a bicycle.

Rule-based behavior is based on a set of rules that we have internalized in advance. We select which rule to use based on experience, and then carry it out. An example would be: if threads are blocked, restart the server. You can think of rule-based behavior as a memorized runbook.

Knowledge-based behavior comes into play when facing an unfamiliar situation. The person generates a set of plans based on their understanding of the environment, and then selects which one to use. The challenging incidents are the ones that require knowledge-based behavior to resolve.
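
As a rough illustration of the difference between the rule-based and knowledge-based levels, here is a minimal Python sketch. It treats the "memorized runbook" as a list of condition/action pairs and falls back to knowledge-based reasoning when no stored rule matches. The names (`Rule`, `RUNBOOK`, `respond`), the observation keys, and the second runbook entry are hypothetical and chosen only for illustration; skill-based behavior is not modeled, since it involves no explicit selection step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Rule:
    """One runbook entry: a recognizable condition and a memorized action."""
    name: str
    matches: Callable[[Dict], bool]
    action: str


# Hypothetical runbook. The first entry is the example from the text above;
# the second entry and all observation keys are made up for illustration.
RUNBOOK = [
    Rule("blocked threads",
         lambda obs: obs.get("threads_blocked", False),
         "restart the server"),
    Rule("disk nearly full",
         lambda obs: obs.get("disk_used_pct", 0) >= 95,
         "rotate the logs"),
]


def respond(observations: Dict) -> Tuple[str, str]:
    """Return (level, action) for a set of observations."""
    # Rule-based: pick the first stored rule whose condition is recognized.
    for rule in RUNBOOK:
        if rule.matches(observations):
            return ("rule-based", rule.action)
    # Knowledge-based: no stored rule fits, so the responder has to generate
    # and compare plans from their own understanding of the system.
    return ("knowledge-based", "generate plans from a model of the system")


if __name__ == "__main__":
    print(respond({"threads_blocked": True}))
    print(respond({"latency_ms": 900}))
```

The point of the sketch is only that the fallback branch cannot be written down in advance, which is one way to read the claim that the challenging incidents are the ones requiring knowledge-based behavior.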

He also proposed three types of information that humans process as they perform work.

Signals. Example: weather vane

Signs. Example: stop sign

Symbols. Example: written language

Abstraction hierarchy

Rasmussen proposed a model of how operators reason about the behavior of a system they are supervising called the abstraction hierarchy. The levels in the hierarchy are

functional purpose
abstract functions
general functions
physical functions
physical form

The hierarchy forms a means-ends relationship: proper function is described top-down (ends), and problems are explained bottom-up (means).

Dynamic safety model

Rasmussen proposed a state-based model of a socio-technical system as a system that moves within a region of a state space. The region is surrounded by different boundaries:

economic failure
unacceptable work load
functionally acceptable performance

Source: Risk management in a dynamic society: a modelling problem

Incentives push the system towards the boundary of acceptable performance: accidents happen when the boundary is exceeded.

AcciMaps

The AcciMaps approach is a technique for reasoning about the causes of an accident, using a diagram.

Risk management framework

Rasmussen proposed a multi-layer view of socio-technical systems:

Source: Risk management in a dynamic society: a modelling problem

Concepts

Dynamic safety model
Migration toward accidents
Risk management framework
Boundaries:
- boundary of functionally acceptable performance
- boundary to economic failure
- boundary to unacceptable work load
Cognitive systems engineering
Skill-rule-knowledge (SRK) model
AcciMaps
Means-ends hierarchy
Ecological interface design
Systems approach
Control-theoretic
decisions, acts, and errors
hazard source
anatomy of accidents
energy
systems thinking
trial and error experiments
defence in depth (fallacy)
Role of managers
- Information
- Competency
- Awareness
- Commitment
Going solid

Selected publications

(These are written by others about Rasmussen's work)

Recurring themes in the legacy of Jens Rasmussen - special issue of Applied Ergonomics
Reflecting on Jens Rasmussen’s legacy. A strong program for a hard problem (my notes)
Reflecting on Jens Rasmussen's legacy (2) behind and beyond, a ‘constructivist turn’

James Reason

Reason is a psychology researcher who did work on understanding and categorizing human error.

Contributions

Accident causation model (Swiss cheese model)

Reason developed an accident causation model that is sometimes known as the Swiss cheese model of accidents. In this model, Reason introduced the terms "sharp end" and "blunt end".

Human Error model: Slips, lapses and mistakes

Reason developed a model of the types of errors that humans make:

slips
lapses
mistakes

Concepts

Blunt end
Human error
Slips, lapses and mistakes
Swiss cheese model

Selected publications

Human error

J. Paul Reed

Reed is a Senior Applied Resilience Engineer at Netflix and runs REdeploy, a conference focused on Resilience Engineering in the software development and operations industry.

Reed tweets as @jpaulreed.

Selected publications

[Maps, Context, and Tribal Knowledge: On the Structure and Use of Post-Incident Analysis Artifacts in Software Development and Operations](https://lup.lub.lu.se/student-papers/search/publication/8966930)
Beyond the "Fix-it" Treadmill

Concepts

Blame "Aware" (versus "Blameless") Culture
Postmortem Artifact Archetypes

Emilie M. Roth

Roth is a cognitive psychologist who serves as the principal scientist at Roth Cognitive Engineering, a small company that conducts research and application in the areas of human factors and applied cognitive psychology (cognitive engineering).

Selected publications

Other

Interview on Naturalistic Decision Making podcast

Nadine Sarter

Sarter is a researcher in industrial and operations engineering. She is the director of the Center for Ergonomics at the University of Michigan.

Concepts

cognitive ergonomics
organization safety
human-automation/robot interaction
human error / error management
attention / interruption management
design of decision support systems

Selected publications

James C. Scott

Scott is an anthropologist who also does research in political science. While Scott is not a member of the resilience engineering community, his book Seeing like a state has long been a staple of the cognitive systems engineering and resilience engineering communities.

Concepts

authoritarian high-modernism
legibility
mētis

Selected publications

Seeing like a state: how certain schemes to improve the human condition have failed

Steven Shorrock

Shorrock is a chartered psychologist and a chartered ergonomist and human factors specialist. He is the editor-in-chief of EUROCONTROL HindSight magazine. He runs the excellent Humanistic Systems blog.

Shorrock tweets as @StevenShorrock.

Selected publications

Selected talks

Life After Human Error (Velocity Europe 2014 keynote)

Diane Vaughan

Vaughan is a sociology researcher who did a famous study of the NASA Challenger accident, concluding that it was the result of organizational failure rather than a technical failure. Specifically, production pressure overrode the rigorous scientific safety culture in place at NASA.

Concepts

Structural Secrecy: Organizational structure, processes, and information exchange patterns can systematically undermine the ability to "see the whole picture" and conceal risky decisions.
Social Construction of Risk: Out of the necessity to balance risk with the associated reward, any group of people will develop efficient heuristics to solve the problems they face. The understanding of risk that faces one subgroup may not match that of another subgroup or of the whole group. The ability of an individual to change a social construction of risk, formed over years with good intentions and often with evidence, is limited. (Though the evidence is usually accurate, the conclusion might not be, leading to an inadvertent scientific paradigm.)
Normalization of Deviance: During operation of a complex system, inadvertent deviations from system design may occur and not result in a system failure. Because the initial construction of risk is usually conservative, the deviation is seen as showing that the system and its redundancies "worked", leading to a new accepted safe operating envelope.
Signals of potential danger: Information gained through the operation of a system that may indicate the system does not work as designed. Most risk constructions are based on a comprehensive understanding of the operation of the system, so information to the contrary is a sign that the system could leave the safe operation envelope in unexpected ways - a danger.
Weak signals, mixed signals, missed signals: signals of potential danger that have been interpreted as non-threats or acceptable risk because at the time they didn't represent a clear and present danger sufficient to overcome the Social Construction of Risk. Often, post-hoc, these are seen as causes due to cherry-picking - such signals were ignored before with no negative consequences.
Competition for Scarce Resources: An ongoing need to justify investment to customers leads to Efficiency-Thoroughness Tradeoffs (ETTOs). In NASA's case, justifying the cost of the Space Shuttle program to taxpayers and their congressional representatives meant pressure to quickly develop payload delivery capability at the lowest cost possible.
Belief in Redundancy: Constructing risk from a signal of potential danger such that a redundant subsystem becomes part of the normal operating strategy for a primary subsystem. In NASA's case, signals that the primary O-ring assembly did not operate as expected formed an acceptable risk because a secondary O-ring would contain a failure. Redundancy was eliminated from the design in this construction of risk - the secondary system now became part of the primary system, eliminating system redundancy.

Selected publications

The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA

Barry Turner

Turner was a sociologist who greatly influenced the field of organization studies.

Selected publications

Man-made disasters

Robert L. Wears

Wears was a medical researcher who also had a PhD in industrial safety.

Concepts

Underground adaptations
Articulated functions vs. important functions
Unintended effects
Apparent success vs real success
Exceptions
Dynamic environments
Systems of care are intrinsically hazardous

Selected publications

Selected talks

Design of resilient systems

David Woods

Woods has a research background in cognitive systems engineering and did work researching NASA accidents. He is one of the founders of Adaptive Capacity Labs, a resilience engineering consultancy.

Woods tweets as @ddwoods2.

Contributions

Woods has contributed an enormous number of concepts.

The adaptive universe

Woods uses the adaptive universe as a lens for understanding the behavior of all different kinds of systems.

All systems exist in a dynamic environment, and must adapt to change.

A successful system will need to adapt by virtue of its success.

Systems can be viewed as units of adaptive behavior (UAB) that interact. UABs exist at different scales (e.g., cell, organ, individual, group, organization).

All systems have competence envelopes, which are constrained by boundaries.

The resilience of a system is determined by how it behaves when it comes near to a boundary.

See Resilience Engineering Short Course for more details.

Charting adaptive cycles

Trigger
Units of adaptive behavior
Goals and goal conflicts
Pressure points
Subcycles

Graceful extensibility

From The theory of graceful extensibility: basic rules that govern adaptive systems:

(Longer wording)

Adaptive capacity is finite
Events will produce demands that challenge boundaries on the adaptive capacity of any UAB
Adaptive capacities are regulated to manage the risk of saturating capacity for manoeuvre (CfM)
No UAB can have sufficient ability to regulate CfM to manage the risk of saturation alone
Some UABs monitor and regulate the CfM of other UABs in response to changes in the risk of saturation
Adaptive capacity is the potential for adjusting patterns of action to handle future situations, events, opportunities and disruptions
Performance of a UAB as it approaches saturation is different from the performance of that UAB when it operates far from saturation
All UABs are local
There are bounds on the perspective of any UAB, but these limits are overcome by shifts and contrasts over multiple perspectives.
Reflective systems risk mis-calibration

(Shorter wording)

Boundaries are universal
Surprise occurs, continuously
Risk of saturation is monitored and regulated
Synchronization across multiple units of adaptive behavior in a network is necessary
Risk of saturation can be shared
Pressure changes what is sacrificed when
Pressure for optimality undermines graceful extensibility
All adaptive units are local
Perspective contrast overcomes bounds
Mis-calibration is the norm

For more details, see summary of graceful extensibility theorems.
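
As a rough illustration of two of the shorter statements above ("Risk of saturation is monitored and regulated" and "Risk of saturation can be shared"), here is a toy Python sketch. It is not taken from Woods's paper: the `Unit` class, the `share_load` function, the 0.8 threshold, and the numeric capacities are all hypothetical names and values chosen for the example, and real adaptive capacity is not a single number.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A toy unit of adaptive behavior (UAB) with finite adaptive capacity.

    Hypothetical model: capacity and demand are plain numbers in arbitrary
    units, and capacity is assumed to be greater than zero.
    """
    name: str
    capacity: float      # total adaptive capacity (finite)
    demand: float = 0.0  # demand currently placed on the unit

    @property
    def cfm(self) -> float:
        """Remaining capacity for manoeuvre (CfM)."""
        return max(self.capacity - self.demand, 0.0)

    @property
    def saturation_risk(self) -> float:
        """Fraction of capacity in use: 0.0 is idle, 1.0 is saturated."""
        return min(self.demand / self.capacity, 1.0)


def share_load(stressed: Unit, neighbor: Unit, threshold: float = 0.8) -> float:
    """Move demand from a unit near saturation to a neighbor with spare CfM.

    Returns the amount of demand transferred (0.0 if no transfer was needed).
    """
    if stressed.saturation_risk < threshold:
        return 0.0
    # Demand above the threshold is what the stressed unit wants to offload.
    excess = stressed.demand - threshold * stressed.capacity
    # The neighbor only accepts what keeps it at or below the same threshold.
    headroom = max(threshold * neighbor.capacity - neighbor.demand, 0.0)
    moved = min(excess, headroom)
    stressed.demand -= moved
    neighbor.demand += moved
    return moved


if __name__ == "__main__":
    a = Unit("team A", capacity=10.0, demand=9.5)
    b = Unit("team B", capacity=10.0, demand=3.0)
    moved = share_load(a, b)
    print(f"moved={moved:.1f}, A risk={a.saturation_risk:.2f}, B risk={b.saturation_risk:.2f}")
```

The sketch only captures the idea that a unit can watch how close it is to saturation and that a neighbor with spare capacity can absorb some of the load. It leaves out everything that makes the theory interesting in practice, such as surprise, goal conflicts, and mis-calibration.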

SCAD (Systemic Contributors Analysis and Diagram)

(tbd)

Concepts

Many of these are mentioned in Woods's short course.

adaptive capacity
adaptive universe
unit of adaptive behavior (UAB), adaptive unit
continuous adaptation
graceful extensibility
sustained adaptability
Tangled, layered networks (TLN)
competence envelope
adaptive cycles/histories
precarious present (unease)
resilient future
tradeoffs, five fundamental
efflorescence: the degree to which changes in one area tend to recruit or open up beneficial changes in many other aspects of the network - which opens new opportunities across the network ...
reverberation
adaptive stalls
borderlands
anticipate
synchronize
proactive learning
initiative
reciprocity
SNAFUs
robustness
surprise
dynamic fault management
software systems as "team players"
multi-scale
brittleness
how adaptive systems fail (see: How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective)
- decompensation
- working at cross-purposes
- getting stuck in outdated behaviors
proactive learning vs getting stuck
oversimplification
fixation
fluency law, veil of fluency
capacity for manoeuvre (CfM)
crunches
turnaround test
sharp end, blunt end
adaptive landscapes
law of stretched systems: Every system is continuously stretched to operate at capacity.
cascades
adapt how to adapt
unit working hard to stay in control
you can monitor how hard you're working to stay in control (monitor risk of saturation)
reality trumps algorithms
stand down
time matters
Properties of resilient organizations
- Tangible experience with surprise
- uneasy about the precarious present
- push initiative down
- reciprocity
- align goals across multiple units
goal conflicts, goal interactions (follow them!)
to understand system, must study it under load
adaptive races are unstable
adaptive traps
roles, nesting of
hidden interdependencies
net adaptive value
matching tempos
tilt toward florescence
linear simplification
common ground
problem detection
joint cognitive systems
automation as a "team player"
"new look"
sacrifice judgment
task tailoring
substitution myth
observability
directability
directed attention
inter-predictability
error of the third kind: solving the wrong problem
buffering capacity
context gap
Norbert's contrast
anomaly response
automation surprises
disturbance management
Doyle's catch
Cooperative advocacy

Selected publications

Selected talks

Online courses

Resilience Engineering: An Introductory Short Course

John Wreathall

Wreathall is an expert in human performance in safety. He works at the WreathWood Group, a risk and safety studies consultancy. Wreathall tweets as @wreathall.

Selected publications

Resilience engineering in practice: a guidebook


