Pipeline Incident Commander – Airflow
Automatically trigger a multi-agent incident triage workflow from Airflow task failures or a manual CLI, then post a unified Slack summary with root cause, impact, owner, and next action.
Automatically triage a data pipeline incident by spawning three specialist sub-agents in parallel — a log analyser, a query profiler, and an owner notifier — then composing their findings into a single structured Slack post. The orchestrator runs the moment a failure is detected, with no manual intervention required.
This tutorial covers two trigger paths: an Airflow on_failure_callback that fires the incident commander automatically when any DAG task fails, and a manual CLI trigger for teams who want to invoke triage on demand.
Before You Start
Paradime
Your Paradime API endpoint, API key, and API secret. Generate these under Workspace Settings → API, and make sure to enable DinoAI agent API capabilities. Requires Admin access.
Recommended reading
Before proceeding, read the Programmable Agents section under Products → DinoAI:
Quick Start
YAML Configuration
Tools Reference
Agent-to-Agent Delegation
This tutorial uses invoke_agent, notify_parent_session, and child_session_ids — all covered in the Agent-to-Agent Delegation guide.
Integrations
The following must already be connected in Paradime:
Slack: the orchestrator and sub-agents post to #incidents via post_slack_message
What You'll Build
By the end of this guide you'll have:
Four DinoAI agent YAMLs: one orchestrator (incident-commander) and three specialists (log-analyzer, query-profiler, owner-notifier)
An Airflow on_failure_callback function that fires the incident commander automatically when any task in a DAG fails, pre-populated with the DAG name, task ID, run ID, and execution date
A manual trigger script for ad-hoc incident triage
What Happens During a Triage
Once the incident commander is triggered:
The commander never investigates itself — it only delegates and composes. None of the sub-agents have invoke_agent in their tool allowlist, so the graph is exactly two levels deep. This keeps incident response predictable and prevents runaway delegation chains.
The Slack Incident Post
Once all three sub-agents have reported back, the incident commander posts a single structured message to #incidents:
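The layout of the real post is determined by the commander's instructions. As a rough illustration only, the sketch below renders the four headline fields named in the introduction (root cause, impact, owner, next action) into one message. The `TriageSummary` class and `format_incident_post` function are hypothetical names introduced for this example and are not part of DinoAI.

```python
from dataclasses import dataclass


@dataclass
class TriageSummary:
    """Hypothetical container for the commander's composed findings."""

    dag_id: str
    task_id: str
    root_cause: str
    impact: str
    owner: str
    next_action: str


def format_incident_post(summary: TriageSummary) -> str:
    """Render the four headline fields as a single Slack-ready message."""
    lines = [
        f"Incident: {summary.dag_id}.{summary.task_id}",
        f"*Root cause:* {summary.root_cause}",
        f"*Impact:* {summary.impact}",
        f"*Owner:* {summary.owner}",
        f"*Next action:* {summary.next_action}",
    ]
    return "\n".join(lines)
```

Keeping the post to one message with fixed field labels makes it easy to scan in a busy channel and simple to search for later.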
Architecture Overview
Create the Agent YAMLs
Create all four agent files. The orchestrator and three sub-agents must all exist in .dinoai/agents/ before the first trigger.
Orchestrator — incident-commander
Sub-agent — log-analyzer
Sub-agent — query-profiler
Sub-agent — owner-notifier
None of the sub-agents have invoke_agent in their tool allowlist. This keeps the delegation graph exactly two levels deep — the commander is the only delegator. Sub-agents cannot spawn further children, which makes incident response predictable and prevents runaway chains.
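If you want to guard this rule in CI, a check along the following lines could work. It is a minimal sketch: the configs are shown as already-parsed dictionaries, and the `tools` key is an assumed name for the allowlist rather than a confirmed part of the DinoAI YAML schema. Only the tool names themselves come from this tutorial.

```python
# Hypothetical parsed agent configs. The "tools" key is an assumption about
# how the allowlist is represented once the YAML has been loaded.
AGENTS = {
    "incident-commander": {"tools": ["invoke_agent", "post_slack_message"]},
    "log-analyzer": {"tools": ["post_slack_message", "notify_parent_session"]},
    "query-profiler": {"tools": ["post_slack_message", "notify_parent_session"]},
    "owner-notifier": {"tools": ["post_slack_message", "notify_parent_session"]},
}


def find_rogue_delegators(agents, orchestrator="incident-commander"):
    """Return the names of non-orchestrator agents that could spawn children."""
    return sorted(
        name
        for name, config in agents.items()
        if name != orchestrator and "invoke_agent" in config.get("tools", [])
    )


# An empty result means the delegation graph is exactly two levels deep.
rogue = find_rogue_delegators(AGENTS)
```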
Trigger Path A — Airflow on_failure_callback
This is the recommended path for teams running Airflow. The callback fires automatically when any task in your DAG fails, extracts the task and DAG context from Airflow's context dict, and hands it directly to the incident commander so the triage message is pre-populated with real incident details.
Create dags/callbacks/incident_commander.py:
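A minimal sketch of what this file could contain. The context keys (`task_instance`, `run_id`, `logical_date`, `execution_date`) are standard Airflow callback context entries. The `start_triage` function is a hypothetical placeholder: replace its body with your actual Paradime agent API call, which is not shown here. The wording of the trigger message is also an assumption.

```python
import logging
import os
import threading

log = logging.getLogger(__name__)


def build_trigger_message(context: dict) -> str:
    """Summarise the failed task from Airflow's callback context."""
    ti = context["task_instance"]
    # Newer Airflow releases expose `logical_date`; older ones `execution_date`.
    when = context.get("logical_date") or context.get("execution_date")
    suspect = os.environ.get("SUSPECT_MODEL", ti.task_id)
    channel = os.environ.get("INCIDENT_SLACK_CHANNEL", "#incidents")
    return (
        f"Airflow task failure. DAG: {ti.dag_id}. Task: {ti.task_id}. "
        f"Run ID: {context.get('run_id')}. Execution date: {when}. "
        f"Suspect model: {suspect}. Post the triage report to {channel}."
    )


def start_triage(message: str) -> None:
    """Hypothetical placeholder: call the Paradime agent API here."""
    raise NotImplementedError("wire this to your Paradime client")


def trigger_incident_commander(context: dict) -> threading.Thread:
    """on_failure_callback entry point; returns immediately."""
    message = build_trigger_message(context)

    def _run() -> None:
        try:
            start_triage(message)
        except Exception:
            # A triage problem must never mask the original task failure.
            log.exception("Incident commander trigger failed")

    thread = threading.Thread(target=_run, name="incident-commander", daemon=True)
    thread.start()
    return thread
```

Airflow ignores the callback's return value; the thread is returned only so the function is easy to exercise in a unit test.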
Then attach the callback to your DAG:
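One way to do this, sketched under the assumption that you use `default_args` so every task in the DAG inherits the callback. The owner, retry settings, and DAG ID are hypothetical, and the function defined here is a stand-in for the one you would import from `callbacks/incident_commander.py`. The Airflow-specific lines are shown as comments so the snippet stays runnable outside a scheduler.

```python
from datetime import timedelta


def trigger_incident_commander(context):
    """Stand-in for the callback imported from callbacks/incident_commander.py."""


# Every operator created inside the DAG inherits these arguments, so a failure
# in any task invokes the callback with that task's context.
default_args = {
    "owner": "data-platform",  # hypothetical owner
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": trigger_incident_commander,
}

# In the DAG file itself (requires Airflow):
#   from airflow import DAG
#   with DAG(dag_id="daily_models", default_args=default_args) as dag:
#       # define your tasks here
```

With retries configured, Airflow calls on_failure_callback only once the task has exhausted its retries, so a transient failure that succeeds on retry does not start a triage.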
The callback runs in a daemon thread so it does not block Airflow's task runner while the agent session runs (which can take several minutes). The original task failure is always surfaced normally in Airflow regardless of whether the triage agent succeeds or fails.
To attach the callback to a single task rather than the whole DAG, set on_failure_callback=trigger_incident_commander on the operator directly. This is useful when only certain high-priority models should trigger a full triage.
PARADIME_API_ENDPOINT, PARADIME_API_KEY, and PARADIME_API_SECRET should be stored as Airflow Variables or in an Airflow Connection, not hardcoded. Retrieve them with Variable.get("PARADIME_API_KEY") if you prefer the Airflow Variables pattern over environment variables.
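A small helper can support both patterns. The sketch below checks the environment first and falls back to an Airflow Variable of the same name; the helper name `get_paradime_setting` and the lookup order are assumptions, so reverse the order if your team treats Airflow Variables as the source of truth.

```python
import os


def get_paradime_setting(name: str) -> str:
    """Resolve a credential from the environment, else from Airflow Variables."""
    value = os.environ.get(name)
    if value:
        return value
    # Imported lazily so this module also loads outside Airflow,
    # for example when running the manual trigger path.
    from airflow.models import Variable

    # Raises if no Variable with this name is defined.
    return Variable.get(name)
```

For example, `get_paradime_setting("PARADIME_API_KEY")` returns the exported value on a laptop and the stored Variable on a worker where the environment variable is absent.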
Watch Sub-Agents in Flight
Both trigger paths block until the full triage is complete. If you want to stream sub-agent progress in real time as each specialist reports back, use the non-blocking trigger_run pattern and poll child_session_ids:
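A minimal sketch of the polling half of that pattern. The Paradime call that returns the current child_session_ids is deliberately left as an injected callable, `fetch_child_ids`, because its exact signature is not shown in this guide. The function name, interval, and timeout are hypothetical defaults.

```python
import time
from typing import Callable, Iterator, List


def watch_child_sessions(
    fetch_child_ids: Callable[[], List[str]],
    expected: int = 3,
    interval_s: float = 5.0,
    timeout_s: float = 120.0,
) -> Iterator[str]:
    """Yield each child session ID the first time it appears."""
    seen = set()
    deadline = time.monotonic() + timeout_s
    while len(seen) < expected and time.monotonic() < deadline:
        for session_id in fetch_child_ids():  # may be empty on early polls
            if session_id not in seen:
                seen.add(session_id)
                yield session_id
        if len(seen) < expected:
            time.sleep(interval_s)
```

Iterate over the generator and print or log each ID as it arrives; the loop ends once all expected children have appeared or the timeout passes.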
child_session_ids populates progressively as the orchestrator spawns each sub-agent. The first poll may return an empty list — this is expected. The three child sessions will appear within the first 30–60 seconds as the commander issues its invoke_agent calls.
Execution Flow
When the incident commander receives the trigger message:
It calls invoke_agent three times in parallel, one for each specialist, passing its own session ID as the callback target
Each sub-agent runs independently, posting updates to the #incidents Slack thread as it works
Each sub-agent calls notify_parent_session with its findings when complete
The commander resumes once all three callbacks have arrived, then composes and posts the unified triage report
The entire triage typically completes in 3–8 minutes depending on log volume and warehouse query history depth.
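The flow above is a fan-out/fan-in pattern. The sketch below mimics its shape with asyncio purely as an analogy: it does not call DinoAI, and `run_specialist` and `run_commander` are hypothetical stand-ins for the sub-agent sessions and the orchestrator.

```python
import asyncio


async def run_specialist(name: str, incident: str) -> dict:
    """Stand-in for one sub-agent: do the work, then report back."""
    await asyncio.sleep(0)  # placeholder for the real investigation
    return {"agent": name, "findings": f"{name} findings for {incident}"}


async def run_commander(incident: str) -> list:
    """Stand-in for the orchestrator: delegate, wait, then compose."""
    specialists = ["log-analyzer", "query-profiler", "owner-notifier"]
    # Fan out: all three start together. Fan in: gather waits for every result.
    return await asyncio.gather(
        *(run_specialist(name, incident) for name in specialists)
    )


reports = asyncio.run(run_commander("daily_models.fct_orders"))
```

Because the three investigations overlap, total triage time is governed by the slowest specialist rather than the sum of all three.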
Set Your Environment Variables
The callback and trigger scripts require three variables. For the Airflow path, set these as Airflow Variables or in your Airflow environment. For the manual path, export them in your shell.
PARADIME_API_ENDPOINT: Your Paradime API endpoint
PARADIME_API_KEY: Your Paradime API key
PARADIME_API_SECRET: Your Paradime API secret
INCIDENT_SLACK_CHANNEL: (optional) Slack channel for incident posts; defaults to #incidents
SUSPECT_MODEL: (optional, Airflow only) Override the suspect model name; defaults to the failed task ID
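To fail fast when a required variable is missing, the scripts could load their settings through a helper like the one below. This is a sketch: the function name `load_settings` and the shape of the returned dictionary are assumptions, while the variable names and defaults are the ones listed above.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("PARADIME_API_ENDPOINT", "PARADIME_API_KEY", "PARADIME_API_SECRET")


def load_settings(
    failed_task_id: str, env: Optional[Mapping[str, str]] = None
) -> dict:
    """Validate required variables and apply the documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    settings["INCIDENT_SLACK_CHANNEL"] = env.get("INCIDENT_SLACK_CHANNEL", "#incidents")
    settings["SUSPECT_MODEL"] = env.get("SUSPECT_MODEL", failed_task_id)
    return settings
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.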
Your Paradime API endpoint, key, and secret are available under Workspace Settings → API. Make sure the key has DinoAI agent API capabilities enabled.
File Structure
Your repository should look like this after completing the setup: