dbt™ Documentation Backfiller

Automate dbt™ documentation backfills with a DinoAI agent that detects missing or stale YAML descriptions, commits updates to the PR branch, and posts a summary to Slack.

Automatically detect dbt™ models with missing or stale descriptions, draft model-level and column-level YAML documentation based on the SQL definition and upstream sources, and commit the changes directly to the open PR branch — all triggered by a DinoAI agent running inside a GitHub Actions workflow.

The agent runs on every pull request that changes a .sql, .yml, or .yaml file under models/. It writes missing descriptions, flags columns whose docs no longer match the SQL after a recent change, and posts a summary to #analytics-eng, so documentation gaps are caught in the same PR that introduces them.


Before You Start

Paradime

  • Your Paradime API endpoint, API key, and API secret — generate these under Workspace Settings → API. Make sure to enable DinoAI agent API capabilities. Requires Admin access.

GitHub

  • Write access to the repository you want to backfill docs on

  • Ability to add repository secrets and create GitHub Actions workflows

Integrations

The following must already be connected in Paradime:

  • Slack — the agent posts a documentation summary to #analytics-eng via post_slack_message

What You'll Build

By the end of this guide you'll have:

  • A doc-backfiller DinoAI agent YAML that finds missing and stale descriptions, drafts them from the SQL, commits the YAML changes, and posts a summary to Slack

  • A Python driver script that collects changed .sql and .yml files from the PR and triggers the agent

  • A GitHub Actions workflow that runs automatically on every PR that touches a dbt™ model or schema file

What the Agent Does Per PR

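Once triggered, the agent works through four steps:

  1. Find models and columns with missing or stale descriptions by comparing each changed model's SQL against its schema YAML

  2. Draft model-level and column-level descriptions from the SQL definition and upstream sources

  3. Write the YAML changes and commit them directly to the PR branch

  4. Post a summary to #analytics-eng on Slack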

The agent never invents business meaning. When a column's purpose is unclear from the SQL, it writes TODO: confirm with owner rather than guessing. If existing documentation is still accurate after a change, it leaves it untouched.

When Docs Are Considered Stale

The agent flags an existing description as stale when any of the following are true after a SQL change:

  • A column is referenced in the .sql file but has no entry in the schema YAML

  • A column exists in the schema YAML but is no longer selected in the SQL

  • The model's SQL logic has changed substantially enough that the model-level description no longer matches what the model produces (detected by reading both the old and new SQL)

The agent does not delete stale column entries automatically. Instead it adds an inline # STALE: column no longer selected — confirm removal comment in the YAML so a human reviews before merging. This prevents accidental data contract breakage downstream.
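The two column-level staleness rules above amount to a set comparison between the columns the SQL selects and the columns the schema YAML documents. The following is a minimal sketch of that comparison, not the agent's actual implementation: the names `ColumnDrift` and `detect_column_drift` are hypothetical, and it assumes the column names have already been extracted from the SQL and the YAML.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDrift:
    """Hypothetical result of comparing SQL columns with schema YAML columns."""

    missing_docs: list[str] = field(default_factory=list)  # selected in SQL, absent from YAML
    stale_docs: list[str] = field(default_factory=list)    # present in YAML, no longer selected


def detect_column_drift(sql_columns, yaml_columns):
    """Classify columns using the two column-level rules from this guide.

    Assumption: both inputs are plain lists of column names that were
    extracted elsewhere. Names are compared case-insensitively.
    """
    sql_set = {name.lower() for name in sql_columns}
    yaml_set = {name.lower() for name in yaml_columns}
    return ColumnDrift(
        missing_docs=sorted(sql_set - yaml_set),
        stale_docs=sorted(yaml_set - sql_set),
    )


drift = detect_column_drift(
    sql_columns=["order_id", "amount", "created_at"],
    yaml_columns=["order_id", "amount", "legacy_flag"],
)
print(drift.missing_docs)  # columns that need a new description
print(drift.stale_docs)    # columns to mark with a # STALE comment, not delete
```

Columns in `stale_docs` are candidates for the inline # STALE comment rather than removal, which matches the guide's rule that a human confirms before anything is deleted. The third rule (a model-level description that no longer matches the logic) needs a reading of the old and new SQL and is not covered by this sketch.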

Architecture Overview

How It Works

When a PR is opened or updated, GitHub Actions runs doc_backfiller.py, which collects the list of changed .sql and .yml files via git diff and builds a trigger message containing that context.

The DinoAI agent then reads each changed model, compares the SQL definition against the existing schema YAML, drafts any missing or stale descriptions, writes the YAML changes, commits them directly to the PR branch, and posts a summary to Slack. The PR author sees a new commit appear on their branch with all documentation gaps filled.

1. Create the Agent YAML

Create the following file in your repository at .dinoai/agents/doc-backfiller.yml. This defines the agent's role, four-step goal, staleness detection logic, guardrails, and Slack output channel.

tools.mode: allowlist restricts the agent to only the tools listed. Notably run_sql_query is excluded — the agent works entirely from the SQL source files and YAML, not from the warehouse, so no live database connection is needed.

2. Create the Driver Script

Create the Python script at scripts/doc_backfiller.py. This script runs inside GitHub Actions and is responsible for:

  • Reading the PR event payload to get the branch name and PR metadata

  • Collecting the list of changed .sql and .yml files via git diff

  • Triggering the DinoAI agent and posting a "started" comment to the PR immediately

  • Polling until the agent completes, then posting a completion comment to the PR with a per-file breakdown of every description added, flagged, or marked for review
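As an illustration of the first responsibility in the list above, the sketch below reads the branch names and PR number from the event payload that GitHub Actions writes to the path in `GITHUB_EVENT_PATH`. The payload keys used here (`pull_request.number`, `pull_request.head.ref`, `pull_request.base.ref`) are standard for `pull_request` events. The function names and the trigger-message wording are assumptions made for this example, not the format the real driver script uses.

```python
import json
from pathlib import Path


def read_pr_context(event_path):
    """Extract the PR number, title, and branch names from a pull_request event payload."""
    payload = json.loads(Path(event_path).read_text(encoding="utf-8"))
    pr = payload["pull_request"]
    return {
        "number": pr["number"],
        "title": pr["title"],
        "head_branch": pr["head"]["ref"],  # the PR branch the agent commits to
        "base_branch": pr["base"]["ref"],  # the branch the PR targets
    }


def build_trigger_message(context, changed_files):
    """Build a plain-text trigger message (hypothetical wording)."""
    file_list = "\n".join(f"- {path}" for path in changed_files)
    return (
        f"PR #{context['number']} on branch {context['head_branch']} "
        f"(base: {context['base_branch']}).\n"
        f"Changed dbt files:\n{file_list}"
    )
```

Inside the workflow, the driver would call `read_pr_context(os.environ["GITHUB_EVENT_PATH"])` and pass the result, together with the changed-file list, to `build_trigger_message`.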

The script filters changed files to .sql, .yml, and .yaml only. If a PR touches no dbt™ model or schema files — for example a README change — it exits cleanly with code 0 and the agent is never triggered.
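The filtering and clean-exit behaviour described above could look roughly like the sketch below. It is an approximation under stated assumptions rather than the shipped script: the function names are hypothetical, and the three-dot `git diff` range assumes the base branch has been fetched into the runner's checkout (a shallow clone may not contain it).

```python
import subprocess
import sys
from pathlib import PurePosixPath

# Extensions the guide says the driver keeps.
DBT_SUFFIXES = {".sql", ".yml", ".yaml"}


def filter_dbt_files(paths):
    """Keep only .sql, .yml, and .yaml paths, preserving their order."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in DBT_SUFFIXES]


def changed_files(base_ref):
    """List files changed between the merge base with base_ref and HEAD."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def collect_targets(base_ref):
    """Return the dbt files to send to the agent, or exit 0 if there are none."""
    targets = filter_dbt_files(changed_files(base_ref))
    if not targets:
        print("No dbt model or schema files changed; skipping agent run.")
        sys.exit(0)
    return targets
```

For example, `filter_dbt_files(["README.md", "models/a.sql"])` keeps only `models/a.sql`, and a PR that changes only `README.md` makes `collect_targets` exit with code 0 before the agent is triggered.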

3. Add Your Paradime Credentials to GitHub Secrets

The driver script authenticates with Paradime using three values. Add the following as GitHub Actions secrets in your repository under Settings → Secrets and variables → Actions:

  • PARADIME_API_KEY

  • PARADIME_API_SECRET

  • PARADIME_API_ENDPOINT

GITHUB_TOKEN is provided automatically by GitHub Actions — you do not need to add it as a secret.
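A driver script can fail fast when one of these values is missing instead of surfacing an opaque authentication error later. The sketch below assumes the workflow maps each secret to an environment variable of the same name (GitHub Actions does not expose secrets to a step unless the workflow passes them in). The function name `load_credentials` is hypothetical.

```python
import os

REQUIRED_VARS = ("PARADIME_API_KEY", "PARADIME_API_SECRET", "PARADIME_API_ENDPOINT")


def load_credentials(env=None):
    """Return the three Paradime values, or raise if any is missing or empty.

    Assumption: the workflow exports each secret as an environment
    variable with the same name as the secret.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing a dictionary instead of relying on `os.environ` keeps the check easy to test locally without touching real secrets.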

4. Create the GitHub Actions Workflow

Create the workflow file at .github/workflows/doc-backfiller.yml. This triggers the driver script automatically whenever a PR that touches a dbt™ model or schema file is opened, updated, or marked as ready for review.

The workflow uses paths: to filter — it only runs when a file inside models/ with a .sql, .yml, or .yaml extension is changed. PRs that only touch Python scripts, README files, or other non-dbt™ files will not trigger the agent.

contents: write permission is required because the agent commits documentation changes directly to the PR branch via git push. Without this, the push step inside the agent will fail with a permissions error.

What the PR Experience Looks Like

Once the workflow is set up, the experience for a PR author is:

  1. They open a PR adding or modifying a dbt™ model

  2. Within seconds, a comment appears on the PR:

    📝 DinoAI doc backfiller started — session agt_sess_abc123xyz. Checking for missing and stale descriptions across 2 model(s). Any changes will be committed to feat/add-revenue-mart shortly.

  3. A few minutes later, a new commit appears on their branch: docs: backfill missing descriptions [DinoAI]

  4. A second PR comment appears with the full completion summary:

    📝 DinoAI doc backfiller — complete

    • Models checked: 3

    • Model descriptions added: 2

    • Column descriptions added: 11

    • Stale columns flagged: 1 (marked # STALE in YAML)

    • Descriptions needing review: 2 (marked # REVIEW in YAML)

    Files updated:

    • models/marts/_fct_orders.yml — 5 column descriptions added

    • models/staging/_stg_sessions.yml — 6 column descriptions added, 1 stale column flagged

    • models/staging/_stg_users.yml — model description added, 2 descriptions need review

    Committed to branch: feat/add-revenue-mart

    ⚠️ 3 entries marked TODO: confirm with owner — please review before merging.


    Session agt_sess_abc123xyz · Check #analytics-eng for the full summary.

  5. The same summary is posted to #analytics-eng on Slack.

The PR author can then review the generated descriptions, confirm or edit any TODO: confirm with owner entries, and remove any # STALE columns they agree are no longer needed — all before the PR is merged.
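To make that review pass quicker, an author could scan the updated schema files for the three markers this guide mentions: # STALE, # REVIEW, and TODO: confirm with owner. The sketch below is a hypothetical helper, not part of the agent or the driver script, and it assumes the markers appear on the same line as the entry they annotate.

```python
import re

# Markers named in this guide. Their exact placement in the YAML is an assumption.
MARKERS = {
    "stale": re.compile(r"#\s*STALE\b"),
    "review": re.compile(r"#\s*REVIEW\b"),
    "todo": re.compile(r"TODO: confirm with owner"),
}


def find_review_markers(yaml_text):
    """Return {marker_name: [1-based line numbers]} for lines needing human review."""
    hits = {name: [] for name in MARKERS}
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits[name].append(lineno)
    return hits
```

Running this over each file listed in the completion comment gives the author a line-by-line checklist of entries to confirm, edit, or remove before merging.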

File Structure

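Your repository should contain these three new files after completing the setup:

  • .dinoai/agents/doc-backfiller.yml (the agent definition)

  • scripts/doc_backfiller.py (the driver script)

  • .github/workflows/doc-backfiller.yml (the GitHub Actions workflow)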

Unlike the query cost optimizer tutorials, this agent does not require a pyproject.toml — the driver script only depends on paradime-io, which is installed directly via pip install paradime-io in the workflow. If your repo already has a Poetry project set up, you can add paradime-io there instead and swap the install step for poetry install.
