# Column-Level Lineage Diff

### **Overview**

The Column-Level Lineage Diff Analysis feature in Paradime enables users to understand the blast radius of their changes directly within pull requests (PRs). By leveraging field-level lineage, this CI check identifies changes to columns in your dbt™ models and creates a report for all impacted downstream objects. This includes renaming or removing columns and changes to the underlying logic of columns in your dbt™ models.

When a PR is opened in GitHub, an automated comment is generated listing all downstream nodes. This allows users to understand the changes introduced at a column level and assess the potential impact on downstream dbt™ models, BI dashboards, and other downstream elements.

### **Key Features**

* **Field-Level Lineage**: Identify changes to columns in your dbt™ models and generate a detailed report of all impacted downstream objects.
* **Automated Comments**: Receive automated comments in your PRs listing all downstream dbt™ models and BI nodes affected by the changes.
* **Impact Assessment**: Understand what nodes and other elements might be impacted by the changes introduced in the PR.

### Use cases

* Assess all downstream nodes impacted by changes, both within a dbt™ project and in downstream applications (for example, BI tools)
* For Data Mesh architectures, see how changes in your current project impact other projects.

### Tutorial

{% @arcade/embed url="https://app.arcade.software/share/QzlG3xuzUh52aZQZDTQy" flowId="QzlG3xuzUh52aZQZDTQy" %}

### Prerequisites

To use the Column-Level Lineage Diff Analysis feature, ensure the following prerequisites are met:

1. **Git Integration**: Install the Paradime GitHub app and authorize access to the dbt™ repository used in Paradime or use alternative methods based on your Git Provider. See setup instructions.
2. **Production Connection**: Add a production connection with access to your sources and models generated when running production jobs. This allows Paradime to run information schema queries and build field-level lineage. See [connection guide](https://docs.paradime.io/app-help/documentation/settings/connections/scheduler-environment) for instructions based on your data warehouse provider.
3. **Turbo CI Schedule**: Have at least one Bolt Turbo CI schedule configured. This is required to generate field-level lineage for your dbt™ project. See [Bolt Scheduler](https://docs.paradime.io/app-help/documentation/bolt/creating-schedules) for configuration.

{% hint style="info" %}
To get the most value out of Lineage Diff Analysis, connect your BI tools ([Tableau](https://docs.paradime.io/app-help/integrations/tableau), [Thoughtspot](https://docs.paradime.io/app-help/integrations/thoughtspot-beta), [Looker](https://docs.paradime.io/app-help/integrations/looker), etc.) to see all downstream nodes impacted.
{% endhint %}

### Setup Instructions

{% tabs %}
{% tab title="Using GitHub" %}

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FDuaBXVdqGBuJk57U3HoB%2Fimage.png?alt=media&#x26;token=dc8009e6-bd01-4edb-b4ae-8decc08ac083" alt=""><figcaption></figcaption></figure>

***

{% hint style="info" %}
**GitHub Integration**: Install the Paradime GitHub app and authorize access to the dbt™ repository used in Paradime. See [installation guide](https://docs.paradime.io/app-help/integrations/github#github-user-level-oauth) for instructions.
{% endhint %}

### Troubleshooting

#### Unable to find public GitHub email address

If a user's GitHub integration is not configured correctly when opening a PR, the user will see the below comment in the Pull Request:

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FQ9OOunfrfexIvjRHAiNo%2FScreenshot%202024-07-22%20at%2002.08.01.png?alt=media&#x26;token=ffbb9738-5d58-4779-8cee-453ae9a51a38" alt=""><figcaption></figcaption></figure>

To fix this issue, make sure the user opening the Pull Request has completed the [GitHub user-level OAuth setup](https://docs.paradime.io/app-help/integrations/ci-cd/github#github-user-level-oauth) in Paradime.
{% endtab %}

{% tab title="Using GitLab" %}

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2Fl3HYjPJu8jk2PXSiOuv9%2Fimage.png?alt=media&#x26;token=1eef3ecf-7218-4125-8205-925f22098123" alt=""><figcaption></figcaption></figure>

***

#### 1. Generate API keys <a href="#generate-api-keys-and-find-you-workspace-token" id="generate-api-keys-and-find-you-workspace-token"></a>

{% hint style="warning" %}
**API keys are generated at a workspace level.**
{% endhint %}

To trigger Bolt using the API, you will first need to generate API keys for your workspace. Go to your account settings, generate your API keys, and make sure to save the following in your password manager:

* API key
* API secret
* API Endpoint

You will need this later when setting up the secrets in GitLab.
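Before wiring up the pipeline, it can help to confirm all three values are actually exposed to the CI job. A minimal, stdlib-only sketch (the variable names match those used in the pipeline below; the helper itself is illustrative and not part of the paradime-io client):

```python
REQUIRED_VARS = ("PARADIME_API_KEY", "PARADIME_API_SECRET", "PARADIME_API_ENDPOINT")

def missing_credentials(env) -> list:
    """Return names of any Paradime credential variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# In CI you would pass os.environ; here a dict stands in for the environment.
missing = missing_credentials({"PARADIME_API_KEY": "pk-example"})
```

If `missing` is non-empty, failing fast with a clear message is easier to debug than an authentication error later in the run.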

#### 2. Create a Pipeline <a href="#create-an-azure-pipeline" id="create-an-azure-pipeline"></a>

Now you will need to create a new `.gitlab-ci.yml` file in your dbt™️ repository. Copy the code below.

This pipeline has two stages:

1. **`paradime_turbo_ci`** — Runs the Turbo CI bolt schedule and exports the `BOLT_RUN_ID` for the next stage.
2. **`lineage_diff_report`** — Uses the `BOLT_RUN_ID` to trigger the lineage diff report and posts the result as a comment on the Merge Request.

```yaml
stages:
  - paradime_turbo_ci
  - lineage_diff_report

paradime_turbo_ci:
  stage: paradime_turbo_ci
  image: python:3.11
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: always
  variables:
    PARADIME_SCHEDULE_NAME: "turbo_ci_run"
  before_script:
    - pip install paradime-io==5.1.0
  script:
    - echo "Merge Request Number: $CI_MERGE_REQUEST_IID"
    - echo "Commit SHA: $CI_COMMIT_SHA"
    - |
      python3 -u << 'EOF'
      import os
      import time
      from paradime import Paradime

      paradime = Paradime(
        api_key=os.environ['PARADIME_API_KEY'],
        api_secret=os.environ['PARADIME_API_SECRET'],
        api_endpoint=os.environ['PARADIME_API_ENDPOINT']
      )

      run_id = paradime.bolt.trigger_run(
        schedule_name=os.environ['PARADIME_SCHEDULE_NAME'],
        branch=os.environ['CI_COMMIT_SHA'],
        pr_number=int(os.environ['CI_MERGE_REQUEST_IID']),
      )
      print(f"Bolt Run ID: {run_id}")

      while True:
        run_status = paradime.bolt.get_run_status(run_id)
        print(f"Status: {run_status.state}")
        if run_status.state in ("success", "error", "failed", "canceled"):
          break
        time.sleep(30)

      with open("bolt.env", "w") as f:
        f.write(f"BOLT_RUN_ID={run_id}\n")

      if run_status.state != "success":
        raise SystemExit(f"Bolt run {run_id} failed with state: {run_status.state}")
      EOF
  artifacts:
    reports:
      dotenv: bolt.env
  timeout: 60 minutes

lineage_diff_report:
  stage: lineage_diff_report
  image: python:3.11
  needs:
    - job: paradime_turbo_ci
      artifacts: true
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: always
  variables:
    GIT_DEPTH: 0
  before_script:
    - pip install paradime-io==5.1.0
  script:
    - echo "Bolt Run ID: $BOLT_RUN_ID"
    - echo "Pull Request Number: $CI_MERGE_REQUEST_IID"
    - echo "Pull Request Author Email: $GITLAB_USER_EMAIL"
    - export CHANGED_FILES=$(git diff --name-only $CI_MERGE_REQUEST_DIFF_BASE_SHA $CI_COMMIT_SHA)
    - echo "Files changed in this MR:"; echo "$CHANGED_FILES"
    - export REPORT_FILE="report_$(openssl rand -hex 12).json"
    - |
      python3 -u << 'EOF'
      import os
      import json
      from pathlib import Path
      from paradime import Paradime

      paradime = Paradime(
        api_key=os.environ['PARADIME_API_KEY'],
        api_secret=os.environ['PARADIME_API_SECRET'],
        api_endpoint=os.environ['PARADIME_API_ENDPOINT']
      )

      report = paradime.lineage_diff.trigger_report_and_wait(
        bolt_run_id=int(os.environ['BOLT_RUN_ID']),
        pull_request_number=int(os.environ['CI_MERGE_REQUEST_IID']),
        user_email=os.environ['GITLAB_USER_EMAIL'],
        changed_file_paths=os.environ['CHANGED_FILES'].split('\n'),
      )

      print("Lineage Diff Report Message:", report.message)

      if report.result_markdown:
        data = json.dumps({'body': report.result_markdown})
        Path(os.environ['REPORT_FILE']).write_text(data)
      EOF
    - |
      if [ -f "$REPORT_FILE" ]; then
        COMMENT_BODY=$(cat "$REPORT_FILE")
        curl --request POST --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
          --header "Content-Type: application/json" \
          --data "$COMMENT_BODY" \
          "https://gitlab.com/api/v4/projects/${CI_PROJECT_ID}/merge_requests/${CI_MERGE_REQUEST_IID}/notes"
      else
        echo "Report file not found, skipping comment."
      fi
  timeout: 60 minutes
```
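The report stage passes `CHANGED_FILES` to the API by splitting the `git diff --name-only` output on newlines. Note that splitting an empty string yields `['']`, so filtering blank entries is a sensible precaution (a hypothetical helper shown for illustration, not part of the paradime-io client):

```python
def parse_changed_files(diff_output: str) -> list:
    """Split `git diff --name-only` output into a clean list of file paths."""
    return [path for path in diff_output.split("\n") if path.strip()]
```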

#### 3. Create a GitLab Access Token

You will need to create a GitLab Access Token with the `api` scope to enable comments to be generated when opening a Merge Request.

**Create a group access token (Only available on GitLab Premium Tier or Higher)**

Navigate to your GitLab project:

1. Go to **Settings > Access Tokens** in the left sidebar
2. Enter a name for your token (e.g., "API Comment Access")
3. Select the `api` scope, which includes the ability to write comments

**Create a GitLab personal access token**

{% hint style="warning" %}
If using a GitLab personal access token, we suggest creating a new user called "Paradime" and creating the access token attached to this user.
{% endhint %}

Log in to your GitLab account:

1. Go to your user profile > Preferences > Access Tokens (or navigate to `https://gitlab.com/-/profile/personal_access_tokens`)
2. Enter a name for your token (e.g., "API Comment Access")
3. Select the `api` scope, which includes the ability to write comments

#### 4. Add the API key and Credential in the GitLab variables <a href="#add-the-api-key-and-credential-in-the-gitlab-variables" id="add-the-api-key-and-credential-in-the-gitlab-variables"></a>

Finally, you need to add the API key and credentials generated in the previous steps in your GitLab CI/CD settings (not in the `.gitlab-ci.yml` file directly). Set the corresponding values for the following variable names:

* `PARADIME_API_KEY`
* `PARADIME_API_SECRET`
* `PARADIME_API_ENDPOINT`
* `GITLAB_TOKEN`

{% @arcade/embed url="https://app.arcade.software/share/UcTn34OhEg5OwXDVOKVu" flowId="UcTn34OhEg5OwXDVOKVu" %}
{% endtab %}

{% tab title="Using Azure Pipelines" %}

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FARJwpjOJrlQ6rMRuO693%2Fimage.png?alt=media&#x26;token=ce564f1e-12a9-48bb-9e2e-cfdbd5cb35e3" alt=""><figcaption></figcaption></figure>

***

#### 1. Generate API keys <a href="#generate-api-keys-and-find-you-workspace-token" id="generate-api-keys-and-find-you-workspace-token"></a>

{% hint style="warning" %}
**API keys are generated at a workspace level.**
{% endhint %}

To trigger Bolt using the API, you will first need to generate API keys for your workspace. Go to your account settings, generate your API keys, and make sure to save the following in your password manager:

* API key
* API secret
* API Endpoint

You will need this later when setting up the secrets in Azure pipelines.

#### 2. Create an Azure Pipeline <a href="#create-an-azure-pipeline" id="create-an-azure-pipeline"></a>

Now you will need to create a new `azure-pipeline.yml` file in your dbt™️ repository. Copy the code below.

This pipeline has two jobs:

1. **`paradime_turbo_ci`** — Runs the Turbo CI bolt schedule and passes the `BOLT_RUN_ID` to the next job.
2. **`lineage_diff_report`** — Uses the `BOLT_RUN_ID` to trigger the lineage diff report and posts the result as a comment on the Pull Request.
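Azure Pipelines passes `BOLT_RUN_ID` between the two jobs through the `##vso[task.setvariable]` logging command with `isOutput=true`, which the second job reads via `dependencies.<job>.outputs`. A small sketch of the exact string the first job's script prints to publish the variable:

```python
def set_variable_command(name: str, value: str, is_output: bool = True) -> str:
    """Build the Azure DevOps logging command that publishes a variable."""
    output_flag = ";isOutput=true" if is_output else ""
    return f"##vso[task.setvariable variable={name}{output_flag}]{value}"
```

Printing this string to stdout is all it takes: the agent scans job output for `##vso[...]` commands.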

```yaml
trigger:
  - none

pr:
  - "*"

jobs:
  - job: paradime_turbo_ci
    displayName: "Run Paradime Turbo CI"
    pool:
      vmImage: ubuntu-latest
    variables:
      PARADIME_SCHEDULE_NAME: "turbo_ci_run"
    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: "3.11"
          addToPath: true
        displayName: Use Python 3.11

      - checkout: self
        fetchDepth: 0
        displayName: Clone repository

      - script: |
          pip install paradime-io==5.1.0

          echo "Pull Request Number: $(System.PullRequest.PullRequestId)"
          echo "Commit SHA: $(System.PullRequest.SourceCommitId)"

          python3 -u <<EOF
          import os
          from paradime import Paradime

          paradime = Paradime(
            api_key=os.environ['PARADIME_API_KEY'],
            api_secret=os.environ['PARADIME_API_SECRET'],
            api_endpoint=os.environ['PARADIME_API_ENDPOINT']
          )

          run_id = paradime.bolt.trigger_run(
            schedule_name=os.environ['PARADIME_SCHEDULE_NAME'],
            branch=os.environ['SOURCE_COMMIT_SHA'],
            pr_number=int(os.environ['PR_NUMBER']),
          )

          print(f"Bolt Run ID: {run_id}")

          with open("bolt_run_id.txt", "w") as f:
            f.write(str(run_id))
          EOF

          BOLT_RUN_ID=$(cat bolt_run_id.txt)
          echo "##vso[task.setvariable variable=BOLT_RUN_ID;isOutput=true]$BOLT_RUN_ID"
        name: runTurboCI
        displayName: "Run Turbo CI and export Bolt Run ID"
        env:
          PARADIME_API_KEY: $(PARADIME_API_KEY)
          PARADIME_API_SECRET: $(PARADIME_API_SECRET)
          PARADIME_API_ENDPOINT: $(PARADIME_API_ENDPOINT)
          SOURCE_COMMIT_SHA: $(System.PullRequest.SourceCommitId)
          PR_NUMBER: $(System.PullRequest.PullRequestId)

  - job: lineage_diff_report
    displayName: "Generate Lineage Diff Report"
    dependsOn: paradime_turbo_ci
    pool:
      vmImage: ubuntu-latest
    variables:
      BOLT_RUN_ID: $[ dependencies.paradime_turbo_ci.outputs['runTurboCI.BOLT_RUN_ID'] ]
    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: "3.11"
          addToPath: true
        displayName: Use Python 3.11

      - checkout: self
        fetchDepth: 0
        displayName: Clone repository

      - script: |
          pip install paradime-io==5.1.0

          echo "Bolt Run ID: $BOLT_RUN_ID"
          echo "Pull Request Number: $(System.PullRequest.PullRequestId)"
          echo "Pull Request Author Email: $(Build.RequestedForEmail)"

          export BASE_COMMIT_SHA=$(git rev-parse origin/$(System.PullRequest.TargetBranchName))
          export HEAD_COMMIT_SHA=$(System.PullRequest.SourceCommitId)
          export CHANGED_FILES=$(git diff --name-only $BASE_COMMIT_SHA $HEAD_COMMIT_SHA)
          echo "Files changed in this PR: $CHANGED_FILES"

          export REPORT_FILE="report_$(openssl rand -hex 12).txt"

          python3 -u <<EOF
          import os
          import json
          from pathlib import Path
          from paradime import Paradime

          paradime = Paradime(
            api_key=os.environ['PARADIME_API_KEY'],
            api_secret=os.environ['PARADIME_API_SECRET'],
            api_endpoint=os.environ['PARADIME_API_ENDPOINT']
          )

          report = paradime.lineage_diff.trigger_report_and_wait(
            bolt_run_id=int(os.environ['BOLT_RUN_ID']),
            pull_request_number=int(os.environ['PR_NUMBER']),
            user_email=os.environ['USER_EMAIL'],
            changed_file_paths=os.environ['CHANGED_FILES'].split('\\n'),
          )

          print("Lineage Diff Report Message:", report.message)

          if report.result_markdown:
            Path(os.environ['REPORT_FILE']).write_text(report.result_markdown)
          EOF

          if [ -f "$REPORT_FILE" ]; then
            COMMENT=$(cat "$REPORT_FILE")
            ADO_API="$(System.CollectionUri)$(System.TeamProject)/_apis/git/repositories/$(Build.Repository.Name)/pullRequests/$(System.PullRequest.PullRequestId)/threads?api-version=7.1-preview.1"
            PR_COMMENT=$(jq -n --arg comment "$COMMENT" '{"comments": [{"parentCommentId": 0, "content": $comment, "commentType": 1}], "status": 1}')

            curl --request POST "$ADO_API" \
              --header "Content-Type: application/json" \
              --header "Accept: application/json" \
              --header "Authorization: Bearer $(System.AccessToken)" \
              --data "$PR_COMMENT"
          else
            echo "Report file not found, skipping comment."
          fi
        displayName: "Generate report and comment on PR"
        env:
          PARADIME_API_KEY: $(PARADIME_API_KEY)
          PARADIME_API_SECRET: $(PARADIME_API_SECRET)
          PARADIME_API_ENDPOINT: $(PARADIME_API_ENDPOINT)
          BOLT_RUN_ID: $(BOLT_RUN_ID)
          PR_NUMBER: $(System.PullRequest.PullRequestId)
          USER_EMAIL: $(Build.RequestedForEmail)
```
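The comment step builds the thread payload with `jq`. The same payload, sketched in Python for clarity (the field values mirror the Azure DevOps Pull Request Threads API call used in the pipeline above):

```python
import json

def build_thread_payload(comment: str) -> str:
    """JSON body for POST .../pullRequests/{id}/threads: one active thread, one comment."""
    return json.dumps({
        "comments": [{"parentCommentId": 0, "content": comment, "commentType": 1}],
        "status": 1,  # 1 marks the thread as active
    })
```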

#### 3. Add the API keys and Credential in the Azure Pipeline variables <a href="#add-the-api-keys-and-credential-in-the-azure-pipeline-variables" id="add-the-api-keys-and-credential-in-the-azure-pipeline-variables"></a>

Finally, you need to add the API key and credentials generated in the previous step in Azure Pipelines.

Set the corresponding values using your credentials for the variable names:

* `PARADIME_API_KEY`
* `PARADIME_API_SECRET`
* `PARADIME_API_ENDPOINT`

{% @arcade/embed url="https://app.arcade.software/share/tCJyMGYsHX7weycpd3kq" flowId="tCJyMGYsHX7weycpd3kq" %}

#### 4. Update Build Service User Permissions

You will need to grant the **Contribute to pull requests** permission to enable comments to be generated when opening the Pull Request.

Go to Project Settings in Azure DevOps

* Navigate to "Repositories" under Repos
* Select your repository
* Click "Security"
* Find "Project Collection Build Service" or "\[Project Name] Build Service"
* Set "Contribute to pull requests" to **allowed**

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FWktZYl8dWiDHinfTQaqi%2Fimage.png?alt=media&#x26;token=585e92cf-202a-4659-9fb2-94ded20c89e2" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Using BitBucket" %}

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FK9gM5FQHGJBt9RRCxSjI%2Fimage%20(25).png?alt=media&#x26;token=3616a64f-c9c6-4ea1-874d-bbb242f41f79" alt=""><figcaption></figcaption></figure>

***

#### 1. Generate API keys <a href="#generate-api-keys-and-find-you-workspace-token" id="generate-api-keys-and-find-you-workspace-token"></a>

{% hint style="warning" %}
**API keys are generated at a workspace level.**
{% endhint %}

To trigger Bolt using the API, you will first need to generate API keys for your workspace. Go to your account settings, generate your API keys, and make sure to save the following in your password manager:

* API key
* API secret
* API Endpoint

You will need this later when setting up the secrets in BitBucket.

#### 2. Create a BitBucket Pipeline <a href="#create-a-bitbucket-pipeline" id="create-a-bitbucket-pipeline"></a>

Now you will need to create a new `bitbucket-pipelines.yml` file in your dbt™️ repository. Copy the code block below.

This pipeline has two steps:

1. **`Run Paradime Turbo CI`** — Runs the Turbo CI bolt schedule and saves the `BOLT_RUN_ID` as a pipeline artifact.
2. **`Generate Lineage Diff Report`** — Reads the `BOLT_RUN_ID` from the artifact and triggers the lineage diff report.

```yaml
image: python:3.11

definitions:
  steps:
    - step: &paradime-turbo-ci
        name: Run Paradime Turbo CI
        size: 2x
        max-time: 60
        script:
          - apt-get update && apt-get install -y git
          - pip install paradime-io==5.1.0
          - export PARADIME_SCHEDULE_NAME="turbo_ci_run"
          - echo "Pull Request Number: $BITBUCKET_PR_ID"
          - echo "Commit SHA: $BITBUCKET_COMMIT"
          - |
            cat > run_turbo_ci.py << 'EOL'
            import os
            from paradime import Paradime

            paradime = Paradime(
              api_key=os.environ['PARADIME_API_KEY'],
              api_secret=os.environ['PARADIME_API_SECRET'],
              api_endpoint=os.environ['PARADIME_API_ENDPOINT']
            )

            run_id = paradime.bolt.trigger_run(
              schedule_name=os.environ['PARADIME_SCHEDULE_NAME'],
              branch=os.environ['BITBUCKET_COMMIT'],
              pr_number=int(os.environ['BITBUCKET_PR_ID']),
            )

            print(f"Bolt Run ID: {run_id}")

            with open("bolt.env", "w") as f:
              f.write(f"BOLT_RUN_ID={run_id}\n")
            EOL
          - python3 run_turbo_ci.py
        artifacts:
          - bolt.env

    - step: &paradime-lineage-diff-report
        name: Generate Lineage Diff Report
        size: 2x
        max-time: 30
        script:
          - apt-get update && apt-get install -y curl jq git
          - pip install paradime-io==5.1.0
          - source bolt.env
          - echo "Bolt Run ID: $BOLT_RUN_ID"
          - git fetch origin $BITBUCKET_PR_DESTINATION_BRANCH
          - export BASE_COMMIT_SHA=$(git rev-parse $BITBUCKET_PR_DESTINATION_COMMIT)
          - export HEAD_COMMIT_SHA=$BITBUCKET_COMMIT
          - export PULL_REQUEST_NUMBER=$BITBUCKET_PR_ID
          - export USER_EMAIL=$(git log -1 --pretty=format:"%ae" $BITBUCKET_COMMIT)
          - export CHANGED_FILES=$(git diff --name-only $BASE_COMMIT_SHA $HEAD_COMMIT_SHA)
          - echo "Pull Request Number $PULL_REQUEST_NUMBER"
          - echo "Pull Request Author Email $USER_EMAIL"
          - echo "Files changed in this PR $CHANGED_FILES"
          - export REPORT_FILE="report_$(openssl rand -hex 12).json"
          - echo "Report file $REPORT_FILE"
          - |
            cat > generate_report.py << 'EOL'
            import os
            import json
            from pathlib import Path
            from paradime import Paradime

            def format_json_to_markdown(report_json):
                """
                Convert the JSON report into formatted markdown for Bitbucket comments.
                """
                formatted_output = []

                # 1. Title and Warning
                formatted_output.append("# Paradime Column-Level Lineage Analysis\n")
                formatted_output.append("⚠️ The changes included in this pull request may have downstream impacts:\n")

                # 2. Table headers and data
                formatted_output.append("| File changed | dbt nodes | dbt projects | Tableau | Looker | ThoughtSpot |")
                formatted_output.append("| :- | -: | -: | -: | -: | -: |")

                # Process each file's impact summary
                for file_path, file_data in report_json["changed_files"].items():
                    summary = file_data["impact_summary"]
                    row = [
                        file_path,
                        f"**{summary['dbt_nodes_count']}**",
                        f"**{summary['dbt_projects_count']}**",
                        f"**{summary['tableau_nodes_count']}**",
                        f"**{summary['looker_nodes_count']}**",
                        f"**{summary['thoughtspot_nodes_count']}**"
                    ]
                    formatted_output.append(f"| {' | '.join(row)} |")

                # 3. Impact Details Section
                formatted_output.append("\n## 📁 Impact by Changed Files\n")
                formatted_output.append("Expand to view the potential downstream impact created by each changed file:\n")

                # Process each file's detailed impacts
                for file_path, file_data in report_json["changed_files"].items():
                    # File header
                    formatted_output.append(f"\n### {file_path}\n")

                    impacted_nodes = file_data["impacted_nodes"]

                    # DBT impacts
                    if impacted_nodes["dbt"]["nodes"]:
                        formatted_output.append(f"**🔶 dbt nodes impacted**: \n")
                        for node in impacted_nodes["dbt"]["nodes"]:
                            formatted_output.append(f"- {node['model_name']}.*{node['column_name']}*\n")

                    # Looker impacts
                    if any([
                        impacted_nodes["looker"]["views"],
                        impacted_nodes["looker"]["explores"],
                        impacted_nodes["looker"]["dashboards"]
                    ]):
                        formatted_output.append(f"**📊 Looker nodes impacted**\n")

                        if impacted_nodes["looker"]["views"]:
                            formatted_output.append("**Views:**\n")
                            for view in impacted_nodes["looker"]["views"]:
                                node_text = f"{view['model_name']}.{view['view_id']}.**{view['field_id']}**"
                                if 'url' in view and view['url']:
                                    formatted_output.append(f"- [{node_text}]({view['url']})\n")
                                else:
                                    formatted_output.append(f"- {node_text}\n")

                        if impacted_nodes["looker"]["explores"]:
                            formatted_output.append("**Explores:**\n")
                            for explore in impacted_nodes["looker"]["explores"]:
                                node_text = f"{explore['model_name']}.{explore['explore_id']}.**{explore['field_id']}**"
                                if 'url' in explore and explore['url']:
                                    formatted_output.append(f"- [{node_text}]({explore['url']})\n")
                                else:
                                    formatted_output.append(f"- {node_text}\n")

                        if impacted_nodes["looker"]["dashboards"]:
                            formatted_output.append("**Dashboards:**\n")
                            for dashboard in impacted_nodes["looker"]["dashboards"]:
                                if 'url' in dashboard and dashboard['url']:
                                    formatted_output.append(f"- [{dashboard['name']}]({dashboard['url']})\n")
                                else:
                                    formatted_output.append(f"- {dashboard['name']}\n")

                    # Tableau impacts
                    if any([
                        impacted_nodes["tableau"]["dashboards"],
                        impacted_nodes["tableau"]["datasources"],
                        impacted_nodes["tableau"]["worksheets"]
                    ]):
                        formatted_output.append(f"**📊 Tableau nodes impacted**\n")

                        if impacted_nodes["tableau"]["dashboards"]:
                            formatted_output.append("**Dashboards:**\n")
                            for dashboard in impacted_nodes["tableau"]["dashboards"]:
                                if 'url' in dashboard and dashboard['url']:
                                    formatted_output.append(f"- [{dashboard['name']}]({dashboard['url']})\n")
                                else:
                                    formatted_output.append(f"- {dashboard['name']}\n")

                        if impacted_nodes["tableau"]["datasources"]:
                            formatted_output.append("**Datasources:**\n")
                            for datasource in impacted_nodes["tableau"]["datasources"]:
                                if 'url' in datasource and datasource['url']:
                                    formatted_output.append(f"- [{datasource['name']}]({datasource['url']})\n")
                                else:
                                    formatted_output.append(f"- {datasource['name']}\n")

                        if impacted_nodes["tableau"]["worksheets"]:
                            formatted_output.append("**Worksheets:**\n")
                            for worksheet in impacted_nodes["tableau"]["worksheets"]:
                                if 'url' in worksheet and worksheet['url']:
                                    formatted_output.append(f"- [{worksheet['name']}]({worksheet['url']})\n")
                                else:
                                    formatted_output.append(f"- {worksheet['name']}\n")

                    # ThoughtSpot impacts
                    if any([
                        impacted_nodes["thoughtspot"]["answers"],
                        impacted_nodes["thoughtspot"]["liveboards"],
                        impacted_nodes["thoughtspot"]["visualizations"],
                        impacted_nodes["thoughtspot"]["worksheets"]
                    ]):
                        formatted_output.append(f"**📊 ThoughtSpot nodes impacted**\n")

                        if impacted_nodes["thoughtspot"]["answers"]:
                            formatted_output.append("**Answers:**\n")
                            for answer in impacted_nodes["thoughtspot"]["answers"]:
                                if 'url' in answer and answer['url']:
                                    formatted_output.append(f"- [{answer['name']}]({answer['url']})\n")
                                else:
                                    formatted_output.append(f"- {answer['name']}\n")

                        if impacted_nodes["thoughtspot"]["liveboards"]:
                            formatted_output.append("**Liveboards:**\n")
                            for liveboard in impacted_nodes["thoughtspot"]["liveboards"]:
                                if 'url' in liveboard and liveboard['url']:
                                    formatted_output.append(f"- [{liveboard['name']}]({liveboard['url']})\n")
                                else:
                                    formatted_output.append(f"- {liveboard['name']}\n")

                        if impacted_nodes["thoughtspot"]["visualizations"]:
                            formatted_output.append("**Visualizations:**\n")
                            for viz in impacted_nodes["thoughtspot"]["visualizations"]:
                                if 'url' in viz and viz['url']:
                                    formatted_output.append(f"- [{viz['name']}]({viz['url']})\n")
                                else:
                                    formatted_output.append(f"- {viz['name']}\n")

                        if impacted_nodes["thoughtspot"]["worksheets"]:
                            formatted_output.append("**Worksheets:**\n")
                            for worksheet in impacted_nodes["thoughtspot"]["worksheets"]:
                                if 'url' in worksheet and worksheet['url']:
                                    formatted_output.append(f"- [{worksheet['name']}]({worksheet['url']})\n")
                                else:
                                    formatted_output.append(f"- {worksheet['name']}\n")

                return '\n'.join(formatted_output)

            paradime = Paradime(
                api_key=os.environ["PARADIME_API_KEY"],
                api_secret=os.environ["PARADIME_API_SECRET"],
                api_endpoint=os.environ["PARADIME_API_ENDPOINT"]
            )

            report = paradime.lineage_diff.trigger_report_and_wait(
                bolt_run_id=int(os.environ["BOLT_RUN_ID"]),
                pull_request_number=int(os.environ["PULL_REQUEST_NUMBER"]),
                user_email=os.environ["USER_EMAIL"],
                changed_file_paths=os.environ["CHANGED_FILES"].split("\n"),
            )

            print("Lineage Diff Report Message: ", report.message)

            if report.result_json:
                # Convert JSON string to Python dict and format as markdown
                report_data = json.loads(report.result_json)
                formatted_markdown = format_json_to_markdown(report_data)
                Path(os.environ["REPORT_FILE"]).write_text(formatted_markdown)
            EOL
          - python3 generate_report.py
          - |
            if [ -f "$REPORT_FILE" ]; then
              # Create the JSON payload with properly escaped content
              FORMATTED_CONTENT=$(cat "$REPORT_FILE")
              JSON_PAYLOAD=$(jq -n \
                --arg content "$FORMATTED_CONTENT" \
                '{content:{raw:$content}}')

              # Send to Bitbucket API
              curl -X POST \
                "https://api.bitbucket.org/2.0/repositories/$BITBUCKET_REPO_FULL_NAME/pullrequests/$BITBUCKET_PR_ID/comments" \
                -H "Authorization: Bearer $BITBUCKET_ACCESS_TOKEN" \
                -H "Content-Type: application/json" \
                -d "$JSON_PAYLOAD"
            else
              echo "Report file not found, skipping comment."
            fi

pipelines:
  pull-requests:
    '**':
      - step:
          <<: *paradime-turbo-ci
      - step:
          <<: *paradime-lineage-diff-report

```
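The `jq` step in the pipeline above wraps the report body in the JSON shape the Bitbucket Cloud comments API expects, `{"content": {"raw": ...}}`, escaping quotes and newlines along the way. For reference only (this is not part of the pipeline), an equivalent in Python:

```python
import json

def build_comment_payload(report_markdown: str) -> str:
    """Wrap a markdown report in the payload shape used by the
    Bitbucket Cloud PR comments API, with proper JSON escaping."""
    return json.dumps({"content": {"raw": report_markdown}})

# Newlines and embedded quotes are escaped automatically by json.dumps,
# just as jq's --arg does in the shell step.
payload = build_comment_payload('Line one\nLine "two"')
print(payload)
```

Either approach produces a valid body for the `POST /2.0/repositories/{repo}/pullrequests/{id}/comments` call shown in the pipeline.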

#### 3. Generate a Bitbucket Access Token

You will need to create a Bitbucket access token with the `pull_request:write` permission so that comments can be posted when a pull request is opened.

Go to [bitbucket.org](https://bitbucket.org/) and sign in:

1. Select your Bitbucket repository
2. Click "Repository settings" in the left sidebar
3. Scroll to the "Security" section
4. Select "Access tokens"
5. Click "Create repository access token"
6. Enter the following details:
   * **Name**: `Paradime`
   * **Permissions**: Select `Pull requests: Write`

<figure><img src="https://2337193041-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHET0AD04uHMgdeLAjptq%2Fuploads%2FQ7o7ENslwA13pgBhwCLZ%2Fimage.png?alt=media&#x26;token=be26a6b6-91d3-4f99-97f2-2f260c3048b7" alt=""><figcaption></figcaption></figure>

#### 4. Add the API keys and Credentials in the BitBucket Pipeline variables <a href="#add-the-api-keys-and-credentials-in-the-bitbucket-pipeline-variables" id="add-the-api-keys-and-credentials-in-the-bitbucket-pipeline-variables"></a>

Finally, add the Paradime API credentials generated in the previous step, along with the Bitbucket access token, as repository variables in Bitbucket Pipelines.

Create the following variables and set each to its corresponding value:

* `PARADIME_API_KEY`
* `PARADIME_API_SECRET`
* `PARADIME_API_ENDPOINT`
* `BITBUCKET_ACCESS_TOKEN`
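If you want to sanity-check the report script locally before wiring it into the pipeline, you can export the same variables in a shell session. The values below are placeholders, not real credentials; substitute your own from the Paradime workspace settings and Bitbucket:

```shell
# Placeholder values -- replace with your own credentials
export PARADIME_API_KEY="your-api-key"
export PARADIME_API_SECRET="your-api-secret"
export PARADIME_API_ENDPOINT="your-api-endpoint"
export BITBUCKET_ACCESS_TOKEN="your-repository-access-token"
```

In the pipeline itself these are injected automatically from the repository variables, so no export lines are needed there.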

{% @arcade/embed url="<https://app.arcade.software/share/F07biP3yTQk37Ntyc095>" flowId="F07biP3yTQk37Ntyc095" %}
{% endtab %}
{% endtabs %}

### Lineage Diff Feature - Supported Use Cases

The lineage diff feature analyzes changes in dbt models to track structural modifications that affect downstream dependencies. It focuses on **structural changes to SELECT statements** that affect the schema and column availability for downstream models; it does not track logic changes, data transformations, or modifications to non-SELECT clauses.

#### Supported Changes

**SQL Structural Changes**

The lineage diff feature detects and tracks the following structural modifications:

* **Column renaming**: When a column is renamed in a SELECT statement
* **Column removal**: When a column is removed from a SELECT statement
* **Column addition**: When a new column is added to a SELECT statement

**Example - Supported Changes**

```sql
-- Before
SELECT
    customer_id,
    customer_name,
    email
FROM customers

-- After (column renamed)
SELECT
    customer_id,
    full_name,  -- renamed from customer_name
    email
FROM customers
```
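To make the supported cases concrete, here is a deliberately simplified sketch of how column-level diffing can be reasoned about: extract the output column names from two versions of a plain SELECT and compare the sets. This is an illustration only, assuming trivially formatted single-table SQL; Paradime's actual lineage engine parses SQL properly and resolves aliases, CTEs, and `*` expansions.

```python
import re

def output_columns(sql: str) -> list[str]:
    """Extract output column names from a simple, single-table SELECT
    (no subqueries). Honors `expr AS alias` and bare `expr alias`."""
    body = re.search(r"select\s+(.*?)\s+from\s", sql,
                     re.IGNORECASE | re.DOTALL).group(1)
    cols = []
    for part in body.split(","):
        part = part.split("--")[0].strip()  # drop inline comments
        # the alias wins if present; otherwise take the last token
        m = re.search(r"\bas\s+(\w+)\s*$", part, re.IGNORECASE)
        cols.append(m.group(1) if m else part.split()[-1])
    return cols

before = "SELECT customer_id, customer_name, email FROM customers"
after = "SELECT customer_id, full_name, email FROM customers"

removed = set(output_columns(before)) - set(output_columns(after))
added = set(output_columns(after)) - set(output_columns(before))
print(removed, added)  # a rename surfaces as one removal plus one addition
```

A rename is indistinguishable from a removal-plus-addition at the schema level, which is why both downstream consumers of the old name and potential consumers of the new name appear in the report.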

#### Non-Supported Changes

**Structural Changes to Non-SELECT Statements**

* **WHERE clause modifications**: Changes to filtering conditions
* **JOIN modifications**: Adding, removing, or changing JOIN conditions
* **GROUP BY changes**: Modifications to grouping logic
* **ORDER BY changes**: Changes to sorting logic

**Structural Changes Used in Non-SELECT Contexts**

Even if a change is structural (like renaming a column), the lineage diff feature does not track usage in:

* **WHERE clauses**: Column references in filtering conditions
* **JOIN conditions**: Column references in table joins
* **GROUP BY clauses**: Column references in grouping logic
* **ORDER BY clauses**: Column references in sorting logic

**Data Changes**

* **Column calculation changes**: Modifications to how a column value is computed
* **NULL handling changes**: Changes in NULL value treatment
* **Data type transformations**: Changes that affect data representation but not structure

**Example - Non-Supported Changes**

```sql
-- Before
SELECT
    customer_id,
    customer_name,
    revenue * 1.1 as adjusted_revenue
FROM customers
WHERE status = 'active'
ORDER BY customer_name

-- After (non-supported changes)
SELECT
    customer_id,
    customer_name,
    revenue * 1.2 as adjusted_revenue  -- calculation change (not detected)
FROM customers
WHERE status IN ('active', 'pending')  -- WHERE clause change (not detected)
ORDER BY full_name  -- ORDER BY with renamed column (not detected)
```

**Note**: Even though `customer_name` was structurally renamed to `full_name`, the lineage diff feature only tracks this change in the SELECT statement itself, not its usage in the ORDER BY clause.

### Summary

Column-Level Lineage Diff helps you review downstream impact before merge. It detects structural column changes in `SELECT` statements, then reports affected dbt™ models and connected BI assets in your PR.

It is best for column additions, removals, and renames. It does not track logic changes in `WHERE`, `JOIN`, `GROUP BY`, `ORDER BY`, or column calculations.

### Next steps

1. Connect your Git provider so Paradime can post PR comments.
2. Add a production connection and confirm at least one Bolt schedule runs.
3. Connect BI tools like Looker, Tableau, or ThoughtSpot for wider impact coverage.
4. Open a test PR with a column change and review the generated lineage report.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.paradime.io/app-help/documentation/bolt/ci-cd/lineage-diff.md?ask=<question>
```
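The question must be URL-encoded before being placed in the `ask` parameter. A small Python sketch of constructing such a request URL (the question text here is just an example):

```python
from urllib.parse import quote

BASE = "https://docs.paradime.io/app-help/documentation/bolt/ci-cd/lineage-diff.md"

def ask_url(question: str) -> str:
    """Build the documentation-query URL with the question URL-encoded."""
    return f"{BASE}?ask={quote(question)}"

url = ask_url("Which warehouses support column-level lineage?")
print(url)
```

The resulting URL can then be fetched with any HTTP client, such as `curl` or `requests.get`.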

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
