# dbt™ Documentation Backfiller

Automatically detect dbt™ models with missing or stale descriptions, draft model-level and column-level YAML documentation based on the SQL definition and upstream sources, and commit the changes directly to the open PR branch — all triggered by a DinoAI agent running inside a GitHub Actions workflow.

The agent runs on every pull request that touches a `.sql` or `.yml` file. It writes missing descriptions, flags columns whose docs no longer match the SQL after a recent change, and posts a summary to `#analytics-eng` — so documentation debt never accumulates.

{% hint style="info" icon="compass" %}
**Before You Start**

**Paradime**

* Your Paradime API endpoint, API key, and API secret — generate these under Workspace Settings → API. Make sure to enable `DinoAI agent API` capabilities. Requires Admin access.

**GitHub**

* Write access to the repository you want to backfill docs on
* Ability to add repository secrets and create GitHub Actions workflows

**Integrations**

The following must already be connected in Paradime:

* **Slack** — the agent posts a documentation summary to `#analytics-eng` via `post_slack_message`
  {% endhint %}

## What You'll Build

By the end of this guide you'll have:

* A `doc-backfiller` DinoAI agent YAML that finds missing and stale descriptions, drafts them from the SQL, commits the YAML changes, and posts a summary to Slack
* A Python driver script that collects changed `.sql` and `.yml` files from the PR and triggers the agent
* A GitHub Actions workflow that runs automatically on every PR that touches a dbt™ model or schema file

### What the Agent Does Per PR

Once triggered, the agent works through four steps:

```
STEP 1 — Identify which changed models are missing model-level descriptions
STEP 2 — Identify which changed models have columns missing descriptions,
          or whose existing descriptions no longer match the SQL after edits
STEP 3 — Draft and write all missing and updated descriptions to the schema YAML,
          then commit the changes to the open PR branch
STEP 4 — Post a summary to #analytics-eng with counts and a link to the commit
```

The agent never invents business meaning. When a column's purpose is unclear from the SQL, it writes `TODO: confirm with owner` rather than guessing. If existing documentation is still accurate after a change, it leaves it untouched.

### When Docs Are Considered Stale

The agent flags an existing description as stale when any of the following are true after a SQL change:

* A column is referenced in the `.sql` file but has no entry in the schema YAML
* A column exists in the schema YAML but is no longer selected in the SQL
* The model's SQL logic has changed substantially enough that the model-level description no longer matches what the model produces (detected by reading both the old and new SQL)

{% hint style="info" %}
The agent does not delete stale column entries automatically. Instead it adds an inline `# STALE: column no longer selected — confirm removal` comment in the YAML so a human reviews before merging. This prevents accidental data contract breakage downstream.
{% endhint %}

### Architecture Overview

```mermaid
flowchart TD
    A([Pull Request opened / updated]) --> B[GitHub Actions\ndoc-backfiller workflow]
    B --> C[driver script\ndoc_backfiller.py]
    C --> D[Collect changed .sql\nand .yml files from git diff]
    C --> E[Trigger DinoAI agent\nparadime.dinoai_agents.trigger_run]

    E --> F([doc-backfiller agent])

    F --> F1[read_file · ripgrep_search\nfind missing + stale descriptions]
    F --> F2[write_file\ndraft and write YAML descriptions]
    F --> F3[run_terminal_command\ngit commit · push to PR branch]
    F --> F4[post_slack_message\n→ #analytics-eng]
```

## How It Works

When a PR is opened or updated, GitHub Actions runs `doc_backfiller.py`, which collects the list of changed `.sql` and `.yml` files via `git diff` and builds a trigger message containing that context.

The DinoAI agent then reads each changed model, compares the SQL definition against the existing schema YAML, drafts any missing or stale descriptions, writes the YAML changes, commits them directly to the PR branch, and posts a summary to Slack. The PR author sees a new commit appear on their branch with all documentation gaps filled.

{% stepper %}
{% step %}

### Create the Agent YAML

Create the following file in your repository at `.dinoai/agents/doc-backfiller.yml`. This defines the agent's role, four-step goal, staleness detection logic, guardrails, and Slack output channel.

{% code title=".dinoai/agents/doc-backfiller.yml" lineNumbers="true" %}

```yaml
name: doc-backfiller
version: 1

role: >
  Analytics Engineering Documentation Specialist with deep knowledge of dbt™
  YAML conventions and clear technical writing.

goal: >
  You will be given a list of changed .sql and .yml files from a pull request.
  Your mission is to ensure every changed dbt™ model has complete, accurate
  model-level and column-level descriptions in its schema YAML — and that any
  existing descriptions that are no longer accurate after recent SQL changes
  are flagged or updated.

  ── STEP 1: FIND MISSING MODEL-LEVEL DESCRIPTIONS ─────────────────────────

  For each changed .sql file:
    1. Find its corresponding schema YAML (e.g. _<model_name>.yml or schema.yml
       in the same directory).
    2. Check whether the model has a non-empty `description:` field.
    3. If missing or empty, read the full SQL, understand what the model
       produces, and draft a one or two sentence active-voice description.

  ── STEP 2: FIND MISSING AND STALE COLUMN DESCRIPTIONS ────────────────────

  For each changed .sql file:
    1. Extract the final SELECT column list from the SQL (the outermost SELECT,
       not CTEs). Use ripgrep_search or read_file to read the full file.
    2. Compare against the columns listed in the schema YAML.
    3. For each column in the SQL but missing from the YAML: draft a description
       from its name, expression, and context. If the purpose is unclear, write
       `TODO: confirm with owner` rather than guessing.
    4. For each column in the YAML but no longer in the SQL: add an inline
       comment `# STALE: column no longer selected — confirm removal` on the
       same line as the column name. Do NOT delete the entry.
    5. For each column whose description exists but the SQL expression has
       changed substantially: prepend `# REVIEW: SQL changed — verify this
       description is still accurate` as a comment above the description line.

  ── STEP 3: WRITE CHANGES AND COMMIT ─────────────────────────────────────

  After drafting all descriptions:
    1. Write the updated schema YAML files using write_file. Preserve all
       existing formatting, indentation, and ordering. Only add or annotate —
       never reorder or reformat unchanged sections.
    2. Stage all changed YAML files and commit to the PR branch explicitly:
         git add <changed_yaml_files>
         git commit -m "docs: backfill missing descriptions [DinoAI]"
         git push origin HEAD:<branch_name>
       Always use `git push origin HEAD:<branch_name>` — never plain `git push`
       — to ensure the commit lands on the correct PR branch regardless of
       the local HEAD state inside the runner.
    3. If no descriptions were missing or stale, skip the commit entirely
       and note this in both the Slack message and the PR comment.

  ── STEP 4: POST SUMMARY TO SLACK AND PR ──────────────────────────────────

  Post the same structured summary in two places:

  A) To #analytics-eng via post_slack_message:

    📝 *Doc Backfiller — PR Documentation Summary*
    *PR:* <PR URL>
    *Models checked:* <N>
    *Model descriptions added:* <N>
    *Column descriptions added:* <N>
    *Stale columns flagged:* <N>  (marked # STALE in YAML)
    *Descriptions needing review:* <N>  (marked # REVIEW in YAML)
    <If changes committed:> *Committed to branch:* `<branch_name>`
    <If nothing missing:> ✅ All models in this PR are fully documented.

  B) As a GitHub PR comment via run_terminal_command using the GitHub CLI:
       gh pr comment <pr_number> --body "<same summary, in markdown>"

     The PR comment must include:
     - Counts of what was added, flagged, or left for review
     - A per-file breakdown, e.g.:
         - `models/marts/_fct_orders.yml` — 3 column descriptions added
         - `models/staging/_stg_sessions.yml` — 1 stale column flagged
     - If nothing was missing: a single line confirming full coverage
     - A note that `TODO: confirm with owner` entries need manual review
       before the PR is merged

  ── GUARDRAILS ────────────────────────────────────────────────────────────

  - NEVER invent business meaning. When unsure, write `TODO: confirm with owner`.
  - NEVER delete existing YAML entries, even if a column appears stale.
    Add the # STALE comment instead and let a human decide.
  - NEVER reformat or reorder unchanged YAML sections. Be surgical.
  - NEVER commit if no changes were made — skip the commit and say so.
  - Write descriptions in active voice, one or two sentences maximum.
  - Do not describe implementation details (e.g. "this column is computed
    using a CASE WHEN") — describe what the value means to a business user.

backstory: >
  You are concise and precise. You write descriptions in active voice, one
  or two sentences each. You never invent business meaning — when a column's
  purpose is unclear from the code, you mark it `TODO: confirm with owner`
  rather than guessing. You understand dbt™ YAML schema conventions
  thoroughly, including model-level and column-level descriptions, data
  tests, and meta fields. You treat the PR branch as your workspace and
  commit directly to it so the PR author sees the documentation appear
  as a new commit without any manual steps.

tools:
  mode: allowlist
  list:
    - read_file
    - write_file
    - search_files_and_directories
    - ripgrep_search
    - run_terminal_command
    - post_slack_message

slack:
  channel: "#analytics-eng"
```

{% endcode %}

{% hint style="info" %}
`tools.mode: allowlist` restricts the agent to only the tools listed. Notably `run_sql_query` is excluded — the agent works entirely from the SQL source files and YAML, not from the warehouse, so no live database connection is needed.
{% endhint %}
{% endstep %}

{% step %}

### Create the Driver Script

Create the Python script at `scripts/doc_backfiller.py`. This script runs inside GitHub Actions and is responsible for:

* Reading the PR event payload to get the branch name and PR metadata
* Collecting the list of changed `.sql` and `.yml` files via `git diff`
* Triggering the DinoAI agent and posting a "started" comment to the PR immediately
* Polling until the agent completes, then posting a completion comment to the PR with a per-file breakdown of every description added, flagged, or marked for review

{% code title="scripts/doc:backfiller.py" lineNumbers="true" %}

```python
"""
doc_backfiller.py
-----------------
Trigger script for the `doc-backfiller` DinoAI agent.

Collects changed dbt™ model and schema files from the current PR,
triggers a DinoAI agent session to backfill missing and stale descriptions,
and posts the session ID back to the PR as a comment.

Environment variables (required)
=================================
  PARADIME_API_ENDPOINT  – Paradime API endpoint
  PARADIME_API_KEY       – Paradime API key
  PARADIME_API_SECRET    – Paradime API secret
  GITHUB_EVENT_PATH      – Set automatically by GitHub Actions
  GITHUB_TOKEN           – Set automatically by GitHub Actions
"""

import json
import os
import subprocess
import sys
import urllib.request

from paradime import Paradime

AGENT_NAME = "doc-backfiller"


def _build_client() -> Paradime:
    required = {
        "PARADIME_API_ENDPOINT": os.environ.get("PARADIME_API_ENDPOINT"),
        "PARADIME_API_KEY":      os.environ.get("PARADIME_API_KEY"),
        "PARADIME_API_SECRET":   os.environ.get("PARADIME_API_SECRET"),
    }
    missing = [k for k, v in required.items() if not v]
    if missing:
        print(f"ERROR: Missing env vars: {', '.join(missing)}", file=sys.stderr)
        raise SystemExit(1)
    return Paradime(
        api_endpoint=required["PARADIME_API_ENDPOINT"],
        api_key=required["PARADIME_API_KEY"],
        api_secret=required["PARADIME_API_SECRET"],
    )


def main() -> None:
    paradime = _build_client()

    # 1. Read the PR event payload
    with open(os.environ["GITHUB_EVENT_PATH"]) as f:
        event = json.load(f)

    pr           = event["pull_request"]
    pr_number    = pr["number"]
    pr_title     = pr["title"]
    pr_url       = pr["html_url"]
    branch_name  = pr["head"]["ref"]
    base_sha     = pr["base"]["sha"]
    head_sha     = pr["head"]["sha"]

    # 2. Collect changed .sql and .yml files from the diff
    changed = subprocess.check_output(
        ["git", "diff", "--name-only", f"{base_sha}...{head_sha}"]
    ).decode().splitlines()

    dbt_files = [
        f for f in changed
        if f.endswith(".sql") or f.endswith(".yml") or f.endswith(".yaml")
    ]

    if not dbt_files:
        print("No dbt™ model or schema files changed. Skipping doc backfill.")
        raise SystemExit(0)

    # 3. Build the trigger message
    files_block = "\n".join(f"  - {f}" for f in dbt_files)

    sql_files  = [f for f in dbt_files if f.endswith(".sql")]
    yaml_files = [f for f in dbt_files if f.endswith((".yml", ".yaml"))]

    message = f"""Backfill missing and stale dbt™ model documentation for PR #{pr_number}.

PR Title: {pr_title}
PR URL:   {pr_url}
Branch:   {branch_name}

Changed dbt™ files ({len(dbt_files)} total):
  SQL models  ({len(sql_files)}):
{chr(10).join(f"    - {f}" for f in sql_files) or "    (none)"}

  Schema YAML ({len(yaml_files)}):
{chr(10).join(f"    - {f}" for f in yaml_files) or "    (none)"}

For each changed SQL model:
1. Find its schema YAML and check for missing model-level descriptions.
2. Compare the final SELECT columns against the YAML column list — find
   missing columns, stale columns (in YAML but not in SQL), and columns
   whose descriptions may be outdated after recent SQL changes.
3. Draft all missing descriptions. Write `TODO: confirm with owner` when
   the business purpose is not clear from the SQL. Flag stale entries
   with # STALE comments rather than deleting them.
4. Write the updated YAML files, commit to branch `{branch_name}`, and
   push. Skip the commit if nothing needed updating.
5. Post a summary to #analytics-eng with counts of what was added,
   flagged, or left for review.
"""

    # 4. Trigger the agent and wait for completion
    trigger    = paradime.dinoai_agents.trigger_run(
        agent=AGENT_NAME,
        message=message,
    )
    session_id = trigger.agent_session_id
    print(f"Started doc backfill session: {session_id}")

    # 5. Post a "started" comment to the PR immediately
    gh_token = os.environ.get("GITHUB_TOKEN")
    repo     = os.environ.get("GITHUB_REPOSITORY", "")

    def _post_pr_comment(body: str) -> None:
        if not gh_token:
            return
        req = urllib.request.Request(
            f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
            data=json.dumps({"body": body}).encode(),
            headers={
                "Authorization": f"Bearer {gh_token}",
                "Accept":        "application/vnd.github+json",
            },
            method="POST",
        )
        urllib.request.urlopen(req).read()

    _post_pr_comment(
        f"📝 **DinoAI doc backfiller started** — session `{session_id}`.\n"
        f"Checking for missing and stale descriptions across {len(sql_files)} model(s).\n"
        f"Any changes will be committed to `{branch_name}` shortly."
    )

    # 6. Poll until the agent completes
    import time
    from paradime.apis.dinoai_agents.types import DinoaiAgentRunStatus

    TIMEOUT       = 1800   # 30 minutes
    POLL_INTERVAL = 15
    start         = time.time()

    while True:
        run = paradime.dinoai_agents.get_run(agent_session_id=session_id)

        if run.status == DinoaiAgentRunStatus.COMPLETED:
            print(f"[doc-backfiller] ✅  Session {session_id} completed.")
            break

        if run.status == DinoaiAgentRunStatus.FAILED:
            last_msg = run.messages[-1].content if run.messages else "no output"
            _post_pr_comment(
                f"❌ **DinoAI doc backfiller failed** — session `{session_id}`.\n"
                f"Last message: {last_msg}"
            )
            raise SystemExit(f"Agent session failed: {last_msg}")

        elapsed = time.time() - start
        if elapsed > TIMEOUT:
            _post_pr_comment(
                f"⏱ **DinoAI doc backfiller timed out** — session `{session_id}`.\n"
                f"The agent did not complete within {TIMEOUT // 60} minutes."
            )
            raise SystemExit(f"Timed out after {TIMEOUT}s")

        print(f"[doc-backfiller]   {run.status.value} ({int(elapsed)}s) … retrying in {POLL_INTERVAL}s")
        time.sleep(POLL_INTERVAL)

    # 7. Post the agent's final output back to the PR as a completion comment
    final_output = run.messages[-1].content if run.messages else "(no output returned)"

    _post_pr_comment(
        f"📝 **DinoAI doc backfiller — complete**\n\n"
        f"{final_output}\n\n"
        f"---\n"
        f"*Session `{session_id}` · Check `#analytics-eng` for the full summary.*"
    )

    print("[doc-backfiller] ✅  Done.")



if __name__ == "__main__":
    main()
```

{% endcode %}

{% hint style="info" %}
The script filters changed files to `.sql`, `.yml`, and `.yaml` only. If a PR touches no dbt™ model or schema files — for example a README change — it exits cleanly with code `0` and the agent is never triggered.
{% endhint %}
{% endstep %}

{% step %}

### Add Your Paradime Credentials to GitHub Secrets

The driver script authenticates with Paradime using three values. Add the following as GitHub Actions secrets in your repository under **Settings → Secrets and variables → Actions**:

* `PARADIME_API_KEY`
* `PARADIME_API_SECRET`
* `PARADIME_API_ENDPOINT`

{% hint style="info" %}
`GITHUB_TOKEN` is provided automatically by GitHub Actions — you do not need to add it as a secret.
{% endhint %}
{% endstep %}

{% step %}

### Create the GitHub Actions Workflow

Create the workflow file at `.github/workflows/doc-backfiller.yml`. This triggers the driver script automatically whenever a PR that touches a dbt™ model or schema file is opened, updated, or marked as ready for review.

{% code title=".github/workflows/doc-backfiller.yml" lineNumbers="true" %}

```yaml
name: DinoAI dbt™ doc backfiller

on:
  pull_request:
    types: [opened, synchronize, ready_for_review]
    paths:
      - "models/**/*.sql"
      - "models/**/*.yml"
      - "models/**/*.yaml"

permissions:
  pull-requests: write
  contents: write       # required so the agent can push commits to the PR branch

jobs:
  backfill-docs:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0        # need full history for git diff
          ref: ${{ github.head_ref }}   # check out the PR branch so the agent can push to it

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Paradime SDK
        run: pip install paradime-io

      - name: Configure git for agent commits
        run: |
          git config user.name  "DinoAI Doc Backfiller"
          git config user.email "dinoai-docs@noreply.github.com"

      - name: Run doc backfiller
        env:
          PARADIME_API_ENDPOINT: ${{ secrets.PARADIME_API_ENDPOINT }}
          PARADIME_API_KEY:      ${{ secrets.PARADIME_API_KEY }}
          PARADIME_API_SECRET:   ${{ secrets.PARADIME_API_SECRET }}
          GITHUB_TOKEN:          ${{ secrets.GITHUB_TOKEN }}
        run: python scripts/doc_backfiller.py
```

{% endcode %}

{% hint style="info" %}
The workflow uses `paths:` to filter — it only runs when a file inside `models/` with a `.sql`, `.yml`, or `.yaml` extension is changed. PRs that only touch Python scripts, README files, or other non-dbt™ files will not trigger the agent.
{% endhint %}

{% hint style="info" %}
`contents: write` permission is required because the agent commits documentation changes directly to the PR branch via `git push`. Without this, the push step inside the agent will fail with a permissions error.
{% endhint %}
{% endstep %}
{% endstepper %}

## What the PR Experience Looks Like

Once the workflow is set up, the experience for a PR author is:

1. They open a PR adding or modifying a dbt™ model
2. Within seconds, a comment appears on the PR:

   > 📝 **DinoAI doc backfiller started** — session `agt_sess_abc123xyz`. Checking for missing and stale descriptions across 2 model(s). Any changes will be committed to `feat/add-revenue-mart` shortly.
3. A few minutes later, a new commit appears on their branch: `docs: backfill missing descriptions [DinoAI]`
4. A second PR comment appears with the full completion summary:

   > 📝 **DinoAI doc backfiller — complete**
   >
   > **Models checked:** 3 **Model descriptions added:** 2 **Column descriptions added:** 11 **Stale columns flagged:** 1 (marked `# STALE` in YAML) **Descriptions needing review:** 2 (marked `# REVIEW` in YAML)
   >
   > **Files updated:**
   >
   > * `models/marts/_fct_orders.yml` — 5 column descriptions added
   > * `models/staging/_stg_sessions.yml` — 6 column descriptions added, 1 stale column flagged
   > * `models/staging/_stg_users.yml` — model description added, 2 descriptions need review
   >
   > **Committed to branch:** `feat/add-revenue-mart`
   >
   > ⚠️ 3 entries marked `TODO: confirm with owner` — please review before merging.
   >
   > ***
   >
   > *Session `agt_sess_abc123xyz` · Check `#analytics-eng` for the full summary.*
5. The same summary is posted to `#analytics-eng` on Slack.

The PR author can then review the generated descriptions, confirm or edit any `TODO: confirm with owner` entries, and remove any `# STALE` columns they agree are no longer needed — all before the PR is merged.

### File Structure

Your repository should look like this after completing the setup:

```
your-repo/
├── dbt_project.yml
├── .dinoai/
│   └── agents/
│       └── doc-backfiller.yml
├── .github/
│   └── workflows/
│       └── doc-backfiller.yml
└── scripts/
    └── doc_backfiller.py             ← driver script
```

{% hint style="info" %}
Unlike the query cost optimizer tutorials, this agent does not require a `pyproject.toml` — the driver script only depends on `paradime-io`, which is installed directly via `pip install paradime-io` in the workflow. If your repo already has a Poetry project set up, you can add `paradime-io` there instead and swap the install step for `poetry install`.
{% endhint %}

## Related Docs

* [**Programmable Agents — Quick Start** — getting started with DinoAI agents](/app-help/products/dino-ai/programmable-agents/quick-start.md)
* [**Programmable Agents — YAML Configuration** — full reference for agent config options](/app-help/products/dino-ai/programmable-agents/yaml-configuration.md)
* [**Programmable Agents — Tools Reference** — all available tools including `read_file`, `write_file`, and `post_slack_message`](/app-help/products/dino-ai/programmable-agents/tools-reference.md)
* [**Slack Integration** — connecting Slack to Paradime](/app-help/integrations/slack.md)
* [**Paradime API & Credentials** — where to find your API endpoint, key, and secret](/app-help/developers/generate-api-keys.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.paradime.io/app-help/guides-new/porgrammable-agents/doc-backfiller.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
