Pipeline Incident Commander: Airflow

Automatically trigger a multi-agent incident triage workflow from Airflow task failures or a manual CLI, then post a unified Slack summary with root cause, impact, owner, and next action.

The incident commander triages a data pipeline incident by spawning three specialist sub-agents in parallel (a log analyzer, a query profiler, and an owner notifier), then composing their findings into a single structured Slack post. On the Airflow path, the orchestrator runs the moment a failure is detected, with no manual intervention required.

This tutorial covers two trigger paths: an Airflow on_failure_callback that fires the incident commander automatically when any DAG task fails, and a manual CLI trigger for teams who want to invoke triage on demand.


Before You Start

Paradime

  • Your Paradime API endpoint, API key, and API secret — generate these under Workspace Settings → API. Make sure to enable DinoAI agent API capabilities. Requires Admin access.

Recommended reading

Before proceeding, read the Programmable Agents section under Products → DinoAI:

  • Quick Start

  • YAML Configuration

  • Tools Reference

  • Agent-to-Agent Delegation

This tutorial uses invoke_agent, notify_parent_session, and child_session_ids — all covered in the Agent-to-Agent Delegation guide.

Integrations

The following must already be connected in Paradime:

  • Slack — the orchestrator and sub-agents post to #incidents via post_slack_message

What You'll Build

By the end of this guide you'll have:

  • Four DinoAI agent YAMLs — one orchestrator (incident-commander) and three specialists (log-analyzer, query-profiler, owner-notifier)

  • An Airflow on_failure_callback function that fires the incident commander automatically when any task in a DAG fails, pre-populated with the DAG name, task ID, run ID, and execution date

  • A manual trigger script for ad-hoc incident triage

What Happens During a Triage

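Once the incident commander is triggered, it delegates the investigation to the three specialists in parallel, waits for each one to report back, and then composes the unified report.

The commander never investigates the incident itself; it only delegates and composes. None of the sub-agents have invoke_agent in their tool allowlist, so the graph is exactly two levels deep. This keeps incident response predictable and prevents runaway delegation chains.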

The Slack Incident Post

Once all three sub-agents have reported back, the incident commander posts a single structured message to #incidents covering root cause, impact, owner, and next action.

Architecture Overview

1. Create the Agent YAMLs

Create all four agent files. The orchestrator and three sub-agents must all exist in .dinoai/agents/ before the first trigger.

Orchestrator — incident-commander

Sub-agent — log-analyzer

Sub-agent — query-profiler

Sub-agent — owner-notifier

None of the sub-agents have invoke_agent in their tool allowlist. This keeps the delegation graph exactly two levels deep — the commander is the only delegator. Sub-agents cannot spawn further children, which makes incident response predictable and prevents runaway chains.
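One way to keep this guarantee from drifting is a small pre-deploy check. The sketch below is not part of Paradime's tooling. It assumes each agent's tool allowlist has already been loaded into a plain dict keyed by agent name; that dict shape, the example tool lists, and the find_illegal_delegators helper are all hypothetical.

```python
# Hypothetical in-memory view of each agent's tool allowlist.
# The real YAML schema is covered in the YAML Configuration guide;
# this dict shape is an assumption made for illustration only.
AGENT_TOOLS = {
    "incident-commander": ["invoke_agent", "post_slack_message"],
    "log-analyzer": ["post_slack_message", "notify_parent_session"],
    "query-profiler": ["post_slack_message", "notify_parent_session"],
    "owner-notifier": ["post_slack_message", "notify_parent_session"],
}

ORCHESTRATOR = "incident-commander"


def find_illegal_delegators(agent_tools, orchestrator=ORCHESTRATOR):
    """Return the names of non-orchestrator agents that list invoke_agent."""
    return sorted(
        name
        for name, tools in agent_tools.items()
        if name != orchestrator and "invoke_agent" in tools
    )


offenders = find_illegal_delegators(AGENT_TOOLS)
if offenders:
    raise SystemExit(f"Sub-agents must not delegate: {', '.join(offenders)}")
print("Delegation graph is two levels deep.")
```

Running a check like this in CI would flag any sub-agent that gains invoke_agent by accident before it reaches production.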

2. Trigger Path A: Airflow on_failure_callback

This is the recommended path for teams running Airflow. The callback fires automatically when any task in your DAG fails, extracts the task and DAG context from Airflow's context dict, and hands it directly to the incident commander so the triage message is pre-populated with real incident details.

Create dags/callbacks/incident_commander.py:
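A minimal sketch of such a callback might look like the following. It reads the DAG name, task ID, run ID, and execution date from Airflow's context dict and hands the message to a daemon thread. The message wording is an assumption, and send_to_commander is a hypothetical placeholder for whatever call your Paradime client uses to start the incident-commander agent (see the Quick Start guide).

```python
import logging
import os
import threading

log = logging.getLogger(__name__)


def build_incident_message(context):
    """Turn Airflow's failure context into a triage prompt for the commander."""
    ti = context["task_instance"]
    # Newer Airflow versions expose logical_date; older ones use execution_date.
    run_date = context.get("logical_date") or context.get("execution_date")
    # SUSPECT_MODEL overrides the suspect model; it defaults to the failed task ID.
    suspect = os.environ.get("SUSPECT_MODEL", ti.task_id)
    channel = os.environ.get("INCIDENT_SLACK_CHANNEL", "#incidents")
    return (
        f"Airflow task failure. DAG: {ti.dag_id}. Task: {ti.task_id}. "
        f"Run ID: {context.get('run_id')}. Execution date: {run_date}. "
        f"Suspect model: {suspect}. Post the triage report to {channel}."
    )


def send_to_commander(message):
    """Hypothetical placeholder: start the incident-commander agent session.

    Replace the body with the client call from the Quick Start guide.
    """
    raise NotImplementedError("wire this to your Paradime client")


def trigger_incident_commander(context, sender=send_to_commander):
    """on_failure_callback entry point; returns without waiting for triage."""
    message = build_incident_message(context)

    def _run():
        try:
            sender(message)
        except Exception:
            # A triage failure is logged, never raised into the task runner.
            log.exception("Incident commander trigger failed")

    thread = threading.Thread(target=_run, name="incident-commander", daemon=True)
    thread.start()
    return thread
```

The sender parameter exists only so the callback can be exercised without network access; Airflow itself calls the function with the context as its single argument.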

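Then attach the callback to your DAG, for example by setting on_failure_callback=trigger_incident_commander in the DAG's default_args so that it applies to every task.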

The callback runs in a daemon thread so it does not block Airflow's task runner while the agent session runs (which can take several minutes). The original task failure is always surfaced normally in Airflow regardless of whether the triage agent succeeds or fails.

To attach the callback to a single task rather than the whole DAG, set on_failure_callback=trigger_incident_commander on the operator directly. This is useful when only certain high-priority models should trigger a full triage.

PARADIME_API_ENDPOINT, PARADIME_API_KEY, and PARADIME_API_SECRET should be stored as Airflow Variables or in an Airflow Connection, not hardcoded. Retrieve them with Variable.get("PARADIME_API_KEY") if you prefer the Airflow Variables pattern over environment variables.

3. Trigger Path B: Manual CLI Trigger

For teams not running Airflow, or for ad-hoc incident triage when a failure is spotted manually, create scripts/trigger_incident.py:
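A minimal sketch of this script, under stated assumptions, could look like the following. The flag names (--dag, --task, --model, --channel) and the message wording are illustrative choices, and send_to_commander is again a hypothetical placeholder for the client call that starts the incident-commander agent.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the incident details supplied on the command line."""
    parser = argparse.ArgumentParser(description="Trigger incident triage on demand.")
    parser.add_argument("--dag", required=True, help="Name of the failed DAG or pipeline")
    parser.add_argument("--task", required=True, help="ID of the failed task")
    parser.add_argument("--model", help="Suspect model (defaults to the task ID)")
    parser.add_argument(
        "--channel",
        default=os.environ.get("INCIDENT_SLACK_CHANNEL", "#incidents"),
        help="Slack channel for the triage report",
    )
    return parser.parse_args(argv)


def build_message(args):
    """Compose the trigger message handed to the incident commander."""
    suspect = args.model or args.task
    return (
        f"Manual incident triage. DAG: {args.dag}. Task: {args.task}. "
        f"Suspect model: {suspect}. Post the triage report to {args.channel}."
    )


def send_to_commander(message):
    """Hypothetical placeholder: start the incident-commander agent session."""
    raise NotImplementedError("wire this to your Paradime client")


def main(argv=None):
    args = parse_args(argv)
    send_to_commander(build_message(args))
```

In a real script, main() would be called under the usual if __name__ == "__main__": guard, after send_to_commander has been wired to your client.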

Run it from your terminal with python scripts/trigger_incident.py, passing whatever arguments your script defines.

4. Watch Sub-Agents in Flight

Both trigger paths wait on a blocking call that returns only when the full triage is complete (on the Airflow path, that call runs inside the callback's daemon thread). To stream sub-agent progress in real time as each specialist reports back, use the non-blocking trigger_run pattern and poll child_session_ids.

child_session_ids populates progressively as the orchestrator spawns each sub-agent. The first poll may return an empty list — this is expected. The three child sessions will appear within the first 30–60 seconds as the commander issues its invoke_agent calls.
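The polling loop can be sketched as follows. This is an illustration, not the Paradime client API: fetch_session is a hypothetical callable you would back with your client, and the response shape (a dict with status and child_session_ids keys) and the status values are assumptions.

```python
import time


def watch_children(fetch_session, session_id, interval=10.0, timeout=900.0,
                   sleep=time.sleep):
    """Poll a parent session and report child sessions as they appear.

    fetch_session(session_id) is a hypothetical callable returning a dict
    such as {"status": "running", "child_session_ids": ["..."]}.
    """
    seen = []
    waited = 0.0
    while True:
        session = fetch_session(session_id)
        # The list may be empty on early polls; that is expected.
        for child_id in session.get("child_session_ids") or []:
            if child_id not in seen:
                seen.append(child_id)
                print(f"Sub-agent session started: {child_id}")
        status = session.get("status")
        if status in ("completed", "failed"):  # assumed terminal states
            return status, seen
        if waited >= timeout:
            raise TimeoutError(f"Session {session_id} still running after {timeout}s")
        sleep(interval)
        waited += interval
```

The 15-minute default timeout is an arbitrary margin over the typical triage duration; adjust it to your own pipelines.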

Execution Flow

When the incident commander receives the trigger message:

  1. It calls invoke_agent three times in parallel — one for each specialist — passing its own session ID as the callback target

  2. Each sub-agent runs independently, posting updates to the #incidents Slack thread as it works

  3. Each sub-agent calls notify_parent_session with its findings when complete

  4. The commander resumes once all three callbacks have arrived, then composes and posts the unified triage report

The entire triage typically completes in 3–8 minutes depending on log volume and warehouse query history depth.
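The fan-out and fan-in shape of this flow can be illustrated locally with a thread pool. The sketch below is only an analogy for the delegation pattern, not how DinoAI runs agents: the three specialist functions are stand-ins that return placeholder findings, and the report layout is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def log_analyzer(incident):
    return {"agent": "log-analyzer", "finding": f"placeholder log finding for {incident}"}


def query_profiler(incident):
    return {"agent": "query-profiler", "finding": f"placeholder query finding for {incident}"}


def owner_notifier(incident):
    return {"agent": "owner-notifier", "finding": f"placeholder owner for {incident}"}


SPECIALISTS = (log_analyzer, query_profiler, owner_notifier)


def run_triage(incident, specialists=SPECIALISTS):
    """Delegate to every specialist in parallel, then compose one report."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        # Fan out: one concurrent call per specialist.
        futures = [pool.submit(specialist, incident) for specialist in specialists]
        # Fan in: result() blocks until that specialist has finished.
        findings = [future.result() for future in futures]
    lines = [f"Incident triage: {incident}"]
    lines += [f"- {item['agent']}: {item['finding']}" for item in findings]
    return "\n".join(lines)


print(run_triage("daily_sales.fct_orders"))
```

As in the real flow, the composing step starts only after every specialist has reported, so the slowest specialist sets the total duration.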

Set Your Environment Variables

The callback and trigger scripts require three variables and accept two optional ones. For the Airflow path, set these as Airflow Variables or in your Airflow environment. For the manual path, export them in your shell.

Variable                  Description
PARADIME_API_ENDPOINT     Your Paradime API endpoint
PARADIME_API_KEY          Your Paradime API key
PARADIME_API_SECRET       Your Paradime API secret
INCIDENT_SLACK_CHANNEL    (optional) Slack channel for incident posts; defaults to #incidents
SUSPECT_MODEL             (optional, Airflow only) Override the suspect model name; defaults to the failed task ID

Your Paradime API endpoint, key, and secret are available under Workspace Settings → API. Make sure the key has DinoAI agent API capabilities enabled.
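To fail fast when a variable is missing, the scripts could load their settings through a small helper like the one sketched below. The TriageSettings class and load_settings function are hypothetical names; only the environment variable names and the #incidents default come from this guide.

```python
import os
from dataclasses import dataclass

REQUIRED = ("PARADIME_API_ENDPOINT", "PARADIME_API_KEY", "PARADIME_API_SECRET")


@dataclass(frozen=True)
class TriageSettings:
    api_endpoint: str
    api_key: str
    api_secret: str
    slack_channel: str = "#incidents"


def load_settings(environ=None):
    """Read the triage settings, raising if a required variable is unset."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return TriageSettings(
        api_endpoint=env["PARADIME_API_ENDPOINT"],
        api_key=env["PARADIME_API_KEY"],
        api_secret=env["PARADIME_API_SECRET"],
        # Optional: fall back to the documented default channel.
        slack_channel=env.get("INCIDENT_SLACK_CHANNEL") or "#incidents",
    )
```

On the Airflow path, the same helper could be given a dict populated from Airflow Variables instead of the process environment.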

File Structure

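After completing the setup, your repository should contain the four agent files in .dinoai/agents/, the Airflow callback at dags/callbacks/incident_commander.py, and the manual trigger at scripts/trigger_incident.py.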
