dbt™ Documentation Backfiller
Automate dbt™ documentation backfills with a DinoAI agent that detects missing or stale YAML descriptions, commits updates to the PR branch, and posts a summary to Slack.
Automatically detect dbt™ models with missing or stale descriptions, draft model-level and column-level YAML documentation based on the SQL definition and upstream sources, and commit the changes directly to the open PR branch — all triggered by a DinoAI agent running inside a GitHub Actions workflow.
The agent runs on every pull request that touches a .sql or .yml file. It writes missing descriptions, flags columns whose docs no longer match the SQL after a recent change, and posts a summary to #analytics-eng — so documentation debt never accumulates.
Before You Start
Paradime
Your Paradime API endpoint, API key, and API secret. Generate these under Workspace Settings → API, and make sure to enable DinoAI agent API capabilities. Requires Admin access.
GitHub
Write access to the repository you want to backfill docs on
Ability to add repository secrets and create GitHub Actions workflows
Integrations
The following must already be connected in Paradime:
Slack: the agent posts a documentation summary to #analytics-eng via post_slack_message
What You'll Build
By the end of this guide you'll have:
A doc-backfiller DinoAI agent YAML that finds missing and stale descriptions, drafts them from the SQL, commits the YAML changes, and posts a summary to Slack
A Python driver script that collects changed .sql and .yml files from the PR and triggers the agent
A GitHub Actions workflow that runs automatically on every PR that touches a dbt™ model or schema file
What the Agent Does Per PR
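Once triggered, the agent works through four steps:
Find changed models whose model-level or column-level descriptions are missing or stale
Draft the descriptions from the SQL definition and upstream sources
Write the YAML changes and commit them to the PR branch
Post a summary to #analytics-eng on Slack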
The agent never invents business meaning. When a column's purpose is unclear from the SQL, it writes TODO: confirm with owner rather than guessing. If existing documentation is still accurate after a change, it leaves it untouched.
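The guardrail above can be pictured as a small decision function. This is a minimal sketch and not the agent's actual logic: `choose_description` and its parameters are hypothetical names introduced for illustration, and only the `TODO: confirm with owner` placeholder text comes from this guide.

```python
from typing import Optional

TODO_PLACEHOLDER = "TODO: confirm with owner"  # placeholder text quoted in this guide


def choose_description(existing: Optional[str],
                       drafted: Optional[str],
                       existing_still_accurate: bool) -> str:
    """Pick the description to write for one column (hypothetical helper).

    existing: description currently in the schema YAML, if any.
    drafted: description inferred from the SQL, or None when the purpose is unclear.
    existing_still_accurate: whether the current docs still match the SQL.
    """
    # Accurate existing documentation is left untouched.
    if existing and existing_still_accurate:
        return existing
    # A draft is used only when the SQL makes the meaning clear.
    if drafted:
        return drafted
    # Otherwise defer to a human instead of guessing.
    return TODO_PLACEHOLDER
```

The ordering of the checks is the point of the sketch: existing accurate text wins, then a confident draft, and the placeholder is the fallback.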
When Docs Are Considered Stale
The agent flags an existing description as stale when any of the following are true after a SQL change:
A column is referenced in the .sql file but has no entry in the schema YAML
A column exists in the schema YAML but is no longer selected in the SQL
The model's SQL logic has changed substantially enough that the model-level description no longer matches what the model produces (detected by reading both the old and new SQL)
The agent does not delete stale column entries automatically. Instead it adds an inline # STALE: column no longer selected — confirm removal comment in the YAML so a human reviews before merging. This prevents accidental data contract breakage downstream.
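The first two staleness rules reduce to a set comparison between the columns the SQL selects and the columns the schema YAML documents. The sketch below assumes both column lists have already been extracted; `classify_columns` and the result keys are hypothetical names, and the third rule (a substantially changed model-level description) needs a reading of the old and new SQL that a set difference cannot capture.

```python
from typing import Dict, Iterable, List


def classify_columns(sql_columns: Iterable[str],
                     yaml_columns: Iterable[str]) -> Dict[str, List[str]]:
    """Compare columns selected in the SQL with columns documented in the YAML."""
    sql_set, yaml_set = set(sql_columns), set(yaml_columns)
    return {
        # Selected in the SQL but absent from the schema YAML: needs a description.
        "missing_docs": sorted(sql_set - yaml_set),
        # Documented in the YAML but no longer selected: flag for review, never delete.
        "stale": sorted(yaml_set - sql_set),
    }
```

For example, a model selecting `order_id`, `amount`, and `created_at` against a YAML that documents `order_id`, `amount`, and `legacy_flag` would report `created_at` as missing docs and `legacy_flag` as stale.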
Architecture Overview
How It Works
When a PR is opened or updated, GitHub Actions runs doc_backfiller.py, which collects the list of changed .sql and .yml files via git diff and builds a trigger message containing that context.
The DinoAI agent then reads each changed model, compares the SQL definition against the existing schema YAML, drafts any missing or stale descriptions, writes the YAML changes, commits them directly to the PR branch, and posts a summary to Slack. The PR author sees a new commit appear on their branch with all documentation gaps filled.
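As a rough illustration of the first half of that flow, the sketch below lists changed paths with `git diff --name-only` and turns them into a trigger message. The function names and the message wording are assumptions made for this example, not the contents of the real doc_backfiller.py, and the three-dot diff assumes the base ref has been fetched into the checkout.

```python
import subprocess
from typing import List


def changed_files(base_ref: str, head_ref: str = "HEAD") -> List[str]:
    """Return paths changed between the merge base of base_ref and head_ref."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...{head_ref}"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def build_trigger_message(branch: str, files: List[str]) -> str:
    """Assemble the context handed to the agent (wording is illustrative)."""
    file_list = "\n".join(f"- {path}" for path in files)
    return (
        f"Backfill documentation for branch {branch}.\n"
        f"Changed files:\n{file_list}"
    )
```

Keeping the message builder separate from the git call makes it easy to check the message locally without a repository.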
Create the Agent YAML
Create the following file in your repository at .dinoai/agents/doc-backfiller.yml. This defines the agent's role, four-step goal, staleness detection logic, guardrails, and Slack output channel.
tools.mode: allowlist restricts the agent to only the tools listed. Notably run_sql_query is excluded — the agent works entirely from the SQL source files and YAML, not from the warehouse, so no live database connection is needed.
Create the Driver Script
Create the Python script at scripts/doc_backfiller.py. This script runs inside GitHub Actions and is responsible for:
Reading the PR event payload to get the branch name and PR metadata
Collecting the list of changed .sql and .yml files via git diff
Triggering the DinoAI agent and posting a "started" comment to the PR immediately
Polling until the agent completes, then posting a completion comment to the PR with a per-file breakdown of every description added, flagged, or marked for review
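Two of these responsibilities can be sketched without touching the Paradime SDK. GitHub Actions writes the event payload to the file named by the `GITHUB_EVENT_PATH` environment variable, and a `pull_request` payload carries the PR number and head branch. The polling helper takes a caller-supplied `get_status` function, since the actual status call and the terminal state names (`completed`, `failed`) are assumptions here rather than documented API.

```python
import json
import time
from typing import Callable, Dict, FrozenSet


def read_pr_context(event_path: str) -> Dict[str, object]:
    """Extract PR number, head branch, and base branch from a pull_request event payload."""
    with open(event_path, encoding="utf-8") as handle:
        event = json.load(handle)
    pr = event["pull_request"]
    return {
        "number": pr["number"],
        "branch": pr["head"]["ref"],
        "base": pr["base"]["ref"],
    }


def poll_until_done(get_status: Callable[[], str],
                    interval_s: float = 10.0,
                    timeout_s: float = 900.0,
                    terminal_states: FrozenSet[str] = frozenset({"completed", "failed"})) -> str:
    """Call get_status until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:  # assumed state names, adjust to the real API
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"agent still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)
```

In a workflow run, `read_pr_context` would be called with the value of `GITHUB_EVENT_PATH`, and `get_status` would wrap whatever call reports the agent session's state.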
The script filters changed files to .sql, .yml, and .yaml only. If a PR touches no dbt™ model or schema files — for example a README change — it exits cleanly with code 0 and the agent is never triggered.
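The filter and the clean exit described above might look like the following. This is a sketch under the assumption that the script receives a plain list of changed paths; `filter_dbt_files`, `run`, and the `trigger` callback are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable, List

DBT_SUFFIXES = {".sql", ".yml", ".yaml"}


def filter_dbt_files(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is .sql, .yml, or .yaml."""
    return [p for p in paths if PurePosixPath(p).suffix in DBT_SUFFIXES]


def run(paths: Iterable[str], trigger: Callable[[List[str]], None]) -> int:
    """Return the process exit code; trigger the agent only when relevant files changed."""
    relevant = filter_dbt_files(paths)
    if not relevant:
        # Nothing dbt-related changed (for example a README edit): succeed without triggering.
        print("No .sql/.yml/.yaml changes found; skipping agent run.")
        return 0
    trigger(relevant)
    return 0
```

Returning the exit code instead of calling `sys.exit` inside the helper keeps the logic testable; the script's entry point can pass the result to `sys.exit`.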
Add Your Paradime Credentials to GitHub Secrets
The driver script authenticates with Paradime using three values. Add the following as GitHub Actions secrets in your repository under Settings → Secrets and variables → Actions:
PARADIME_API_KEY
PARADIME_API_SECRET
PARADIME_API_ENDPOINT
GITHUB_TOKEN is provided automatically by GitHub Actions — you do not need to add it as a secret.
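Inside the driver script, a fail-fast check on these values gives a clearer error than a failed API call later on. The sketch assumes the workflow exposes each secret as an environment variable of the same name; `load_credentials` is a hypothetical helper, not part of the paradime-io package.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("PARADIME_API_KEY", "PARADIME_API_SECRET", "PARADIME_API_ENDPOINT")


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the three Paradime values from the environment, failing if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The error message names the variables but never prints their values, so nothing sensitive reaches the workflow log.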
Create the GitHub Actions Workflow
Create the workflow file at .github/workflows/doc-backfiller.yml. This triggers the driver script automatically whenever a PR that touches a dbt™ model or schema file is opened, updated, or marked as ready for review.
The workflow uses paths: to filter — it only runs when a file inside models/ with a .sql, .yml, or .yaml extension is changed. PRs that only touch Python scripts, README files, or other non-dbt™ files will not trigger the agent.
contents: write permission is required because the agent commits documentation changes directly to the PR branch via git push. Without this, the push step inside the agent will fail with a permissions error.
What the PR Experience Looks Like
Once the workflow is set up, the experience for a PR author is:
They open a PR adding or modifying a dbt™ model
Within seconds, a comment appears on the PR:
📝 DinoAI doc backfiller started — session agt_sess_abc123xyz. Checking for missing and stale descriptions across 2 model(s). Any changes will be committed to feat/add-revenue-mart shortly.
A few minutes later, a new commit appears on their branch: docs: backfill missing descriptions [DinoAI]
A second PR comment appears with the full completion summary:
📝 DinoAI doc backfiller — complete
Models checked: 3
Model descriptions added: 2
Column descriptions added: 11
Stale columns flagged: 1 (marked # STALE in YAML)
Descriptions needing review: 2 (marked # REVIEW in YAML)
Files updated:
models/marts/_fct_orders.yml — 5 column descriptions added
models/staging/_stg_sessions.yml — 6 column descriptions added, 1 stale column flagged
models/staging/_stg_users.yml — model description added, 2 descriptions need review
Committed to branch: feat/add-revenue-mart
⚠️ 3 entries marked TODO: confirm with owner — please review before merging.
Session agt_sess_abc123xyz · Check #analytics-eng for the full summary.
The same summary is posted to #analytics-eng on Slack.
The PR author can then review the generated descriptions, confirm or edit any TODO: confirm with owner entries, and remove any # STALE columns they agree are no longer needed — all before the PR is merged.
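PR comments like the ones in this walkthrough can be posted through the GitHub REST API, which treats a pull request comment as an issue comment. The sketch below only builds the request so it can be inspected without network access. `build_comment_request` is a hypothetical helper, and whether the real driver script uses this endpoint directly is an assumption.

```python
import json
import urllib.request


def build_comment_request(repo: str, pr_number: int, body: str,
                          token: str) -> urllib.request.Request:
    """Build a POST request that adds a comment to a pull request.

    repo is "owner/name"; token is the GITHUB_TOKEN available in the workflow.
    """
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    payload = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Sending it is a single `urllib.request.urlopen(request)` call. The token needs permission to write PR comments, which the workflow may have to grant separately from `contents: write`.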
File Structure
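Your repository should look like this after completing the setup, with three new files added:
.dinoai/agents/doc-backfiller.yml
scripts/doc_backfiller.py
.github/workflows/doc-backfiller.yml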
Unlike the query cost optimizer tutorials, this agent does not require a pyproject.toml — the driver script only depends on paradime-io, which is installed directly via pip install paradime-io in the workflow. If your repo already has a Poetry project set up, you can add paradime-io there instead and swap the install step for poetry install.