Not all failures are equal.
FAILURE.md maps every one.
FAILURE.md is a plain-text Markdown file you place in the root of any repository that contains an AI agent. It defines the four failure modes an AI agent can encounter, how to detect each, and the exact response procedures for each — from graceful degradation to circuit breaking to human review.
What problem does FAILURE.md solve?
AI agents fail in different ways, and not all failures are equal. A non-critical API being unavailable is different from a database cascade. A tool returning an error is different from a tool silently returning wrong data. Without explicit failure mode definitions, agents either over-react (stopping entirely for minor failures) or under-react (silently continuing through serious errors). Either way, behaviour is unpredictable and unauditable.
How does FAILURE.md work?
Drop FAILURE.md in your repo root and define each failure mode: its description, examples, detection signals, and response procedure (action, log level, notification rules, escalation target). Configure health checks, heartbeat intervals, and error pattern matching. Every failure event is logged with full context.
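As a rough illustration of the fields listed above, one failure mode definition could be modelled in an agent like this. It is a sketch only: the class names, field names, log levels, and notification channels are hypothetical, and the actual FAILURE.md template may organise this information differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    GRACEFUL_DEGRADATION = "graceful_degradation"
    PARTIAL_FAILURE = "partial_failure"
    CASCADING_FAILURE = "cascading_failure"
    SILENT_FAILURE = "silent_failure"


@dataclass(frozen=True)
class Response:
    action: str                 # what the agent does when the mode is detected
    log_level: str              # assumed to be a standard level name
    notify: Tuple[str, ...]     # notification rules, simplified to channel names
    escalate_to: Optional[str]  # escalation target file, or None


@dataclass(frozen=True)
class FailureMode:
    mode: Mode
    description: str
    examples: Tuple[str, ...]
    detection_signals: Tuple[str, ...]
    response: Response


# Illustrative entries only; every value here is an assumption.
POLICY = {
    Mode.GRACEFUL_DEGRADATION: FailureMode(
        mode=Mode.GRACEFUL_DEGRADATION,
        description="Continue with reduced capability",
        examples=("non-critical API unavailable",),
        detection_signals=("optional tool returns an error",),
        response=Response("skip_feature", "WARNING", (), None),
    ),
    Mode.CASCADING_FAILURE: FailureMode(
        mode=Mode.CASCADING_FAILURE,
        description="Failures spreading across components",
        examples=("database cascade",),
        detection_signals=("three failures within 60 seconds",),
        response=Response("open_circuit_breaker", "CRITICAL",
                          ("on_call",), "FAILSAFE.md"),
    ),
}
```

Keeping the definitions as frozen data makes it harder for agent code to alter the policy at runtime, which fits the goal of an auditable, version-controlled file.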
What regulations require FAILURE.md?
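The EU AI Act entered into force on 1 August 2024, and most of its obligations for high-risk AI systems are scheduled to apply from 2 August 2026. Article 15 requires high-risk AI systems to be resilient to errors, faults, and inconsistencies. No regulation names FAILURE.md itself; the file is one way to keep the auditable failure mode definitions and response procedures that support this kind of compliance work.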
How do I add FAILURE.md to my project?
Copy the template from GitHub and place it in your project root:
├── AGENTS.md
├── ESCALATE.md
├── FAILURE.md ← add this
├── README.md
└── src/
What did teams use before FAILURE.md?
Before FAILURE.md, failure handling was either hardcoded in agent logic, written in a Notion page no one read, or absent entirely. FAILURE.md makes failure response version-controlled, explicit, and auditable: the file the agent reads is the same file your compliance team reviews.
Who benefits from FAILURE.md?
The AI agent reads it on startup. Your SRE reads it when something goes wrong. Your compliance team reads it during audits. One file serves all three audiences.
A complete protocol.
From slow down to shut down.
FAILURE.md is one file in a complete open specification for AI agent safety. The twelve-file stack provides graduated intervention from proactive slow-down through permanent shutdown and compliance enforcement.
Frequently asked questions.
What is FAILURE.md?
A plain-text Markdown file defining the four failure modes an AI agent can encounter — graceful degradation, partial failure, cascading failure, and silent failure — along with detection signals and per-mode response procedures. Every failure event is logged with timestamp, mode, component, and action taken.
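A failure event carrying those four fields could be emitted as a structured log line. The sketch below uses Python's standard logging module; the function names, the logger name, and the JSON layout are assumptions rather than anything the specification prescribes.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("failure")  # hypothetical logger name


def build_failure_event(mode, component, action, now=None):
    """Return a JSON-serialisable record with the four logged fields."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "mode": mode,
        "component": component,
        "action": action,
    }


def log_failure(mode, component, action, level=logging.WARNING):
    """Build the event, write it as one JSON line, and return it."""
    event = build_failure_event(mode, component, action)
    logger.log(level, json.dumps(event, sort_keys=True))
    return event
```

Writing one JSON object per line keeps the events easy to search and to hand to an auditor.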
What is the difference between graceful degradation and partial failure?
Graceful degradation: the agent continues operating with reduced capability — a non-critical tool is unavailable, so it skips that feature and logs it. Partial failure: one component fails and the agent must actively route around it — retrying with backoff, routing to a replica, or queuing for later. Partial failure triggers retries and potentially escalates; graceful degradation does not.
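The contrast between the two modes can be sketched in a few lines. The helper names, the retry count, and the delays below are hypothetical; the point is only that graceful degradation logs and moves on, while partial failure retries with backoff and then hands over to an escalation path.

```python
import time


class RetriesExhausted(Exception):
    """Signals that a partial failure should go to the escalation path."""


def call_optional(tool, fallback=None, log=print):
    """Graceful degradation: skip a non-critical tool, log it, carry on."""
    try:
        return tool()
    except Exception as exc:
        log(f"degraded: optional tool unavailable ({exc})")
        return fallback


def call_with_backoff(op, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Partial failure: retry with exponential backoff, then escalate."""
    last = None
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            last = exc
            # Wait longer after each failure, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RetriesExhausted(f"gave up after {attempts} attempts") from last
```

A caller would catch `RetriesExhausted` and follow whatever escalation procedure its own policy defines.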
What is a silent failure and why is it dangerous?
A silent failure is when the agent produces output without detecting an underlying error — the API returned stale data, the write partially succeeded, or the validation check was skipped. Because no error is raised, normal escalation paths don't fire. FAILURE.md defines output validation, data freshness checks, and cross-reference consistency checks specifically to catch silent failures.
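Because nothing raises an exception in this mode, the checks have to run on output that looks successful. The following sketch combines the three kinds of check mentioned above; the function name, the record layout, and the five-minute freshness limit are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_silent_failures(record, now, max_age=timedelta(minutes=5),
                         required=("id", "value", "fetched_at"),
                         reference=None):
    """Return a list of problems found in a record that raised no error."""
    problems = []

    # Output validation: every required field must be present.
    missing = [key for key in required if key not in record]
    if missing:
        problems.append(f"missing fields: {missing}")

    # Data freshness: reject data older than the allowed age.
    fetched_at = record.get("fetched_at")
    if fetched_at is not None and now - fetched_at > max_age:
        problems.append("stale data")

    # Cross-reference consistency: compare against a second source if given.
    if reference is not None and record.get("value") != reference:
        problems.append("inconsistent with reference")

    return problems
```

An empty list means no silent failure was detected by these particular checks, not that the output is guaranteed correct.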
What triggers the circuit breaker?
Cascading failure detection: three failures within 60 seconds, two or more health check components failing simultaneously, or resource consumption doubling within 10 minutes. The circuit breaker stops all dependent operations immediately and escalates to FAILSAFE.md to prevent the cascade spreading further.
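The three triggers can be sketched as a small detector with sliding windows. The class and method names are hypothetical, and two details are interpretations rather than part of the text above: the window boundary is treated as inclusive, and "doubling" is measured against the lowest sample seen in the last 10 minutes. The sketch only sets a flag; stopping dependent operations and handing over to FAILSAFE.md is left to the caller.

```python
from collections import deque


class CascadeDetector:
    def __init__(self, max_failures=3, window_s=60.0, max_unhealthy=2,
                 growth_factor=2.0, growth_window_s=600.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.max_unhealthy = max_unhealthy
        self.growth_factor = growth_factor
        self.growth_window_s = growth_window_s
        self._failures = deque()  # failure timestamps, in seconds
        self._usage = deque()     # (timestamp, resource value) samples
        self.open = False         # True once the breaker has tripped

    def record_failure(self, now):
        """Trip when max_failures occur within window_s seconds."""
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self.open = True
        return self.open

    def record_health(self, statuses):
        """Trip when max_unhealthy or more components fail one health check."""
        unhealthy = sum(1 for ok in statuses.values() if not ok)
        if unhealthy >= self.max_unhealthy:
            self.open = True
        return self.open

    def record_usage(self, now, value):
        """Trip when usage reaches growth_factor times the window's lowest sample."""
        self._usage.append((now, value))
        while self._usage and now - self._usage[0][0] > self.growth_window_s:
            self._usage.popleft()
        lowest = min(v for _, v in self._usage)
        if lowest > 0 and value >= self.growth_factor * lowest:
            self.open = True
        return self.open
```

Passing timestamps in explicitly, instead of reading the clock inside the class, keeps the detector easy to test.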
How does FAILURE.md relate to FAILSAFE.md and ESCALATE.md?
FAILURE.md defines what failure modes exist and how to detect and respond to each. Cascading failures escalate to FAILSAFE.md, which defines the safe recovery state. Partial failures that exhaust their retries escalate to ESCALATE.md, which routes to a human. FAILURE.md is the taxonomy; FAILSAFE.md and ESCALATE.md are the recovery paths.
Does FAILURE.md work with all AI frameworks?
Yes. It is framework-agnostic: it defines failure mode policy, and your agent implementation enforces it. It works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that can monitor component health and implement circuit breaker logic.
Own the standard.
Own failure.md
This domain is available for acquisition. It is the canonical home of the FAILURE.md specification — the failure mode protocol layer of the AI agent safety stack, essential for any resilient production AI deployment.