PII Anonymization with dbt™ Mesh Setup

Overview

This document shows how to set up a dbt mesh architecture in Paradime in which a parent repository holds PII (Personally Identifiable Information) models and a child dbt project consumes only anonymized subsets of those models.

Architecture

Parent Repo (customer-data-platform)
├── PII Models (private)
├── Anonymized Models (public via mesh)
└── Data transformations

Child Repo (analytics-workspace)
├── Consumes anonymized models from parent
├── Creates analytics models
└── Business intelligence layer

Parent Repository Setup

1. Project Structure

# dbt_project.yml (Parent)
name: 'customer_data_platform'
version: '1.0.0'
config-version: 2

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

models:
  customer_data_platform:
    # Private PII models - not exposed
    staging:
      +materialized: table
      +group: private_data

    # Public anonymized models - exposed via mesh
    marts:
      anonymized:
        +materialized: table
        +group: public_analytics
        +access: public

2. Model Groups Configuration
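
A minimal sketch of the group definitions, assuming they live in a properties file such as models/_groups.yml. The group names private_data and public_analytics come from the dbt_project.yml above; the file name and the owner details are hypothetical placeholders.

```yaml
# models/_groups.yml (hypothetical file name)
groups:
  # Group for models that contain raw PII
  - name: private_data
    owner:
      name: Data Platform Team              # hypothetical owner
      email: data-platform@example.com      # hypothetical address

  # Group for anonymized models shared through the mesh
  - name: public_analytics
    owner:
      name: Analytics Engineering           # hypothetical owner
      email: analytics@example.com          # hypothetical address
```

dbt expects each group to declare an owner, so both entries carry one even though the values here are illustrative.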

3. Private PII Models
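
PII models sit in the private_data group and are never marked public, so they keep dbt's default access level. One way to document such a model in a properties file is sketched below. The model name stg_customers, the file name, the column names, and the contains_pii meta key are all hypothetical.

```yaml
# models/staging/_staging.yml (hypothetical file and model names)
models:
  - name: stg_customers
    description: "Raw customer records. Contains PII and is not exposed to other projects."
    config:
      group: private_data        # also set in dbt_project.yml, repeated here for clarity
      meta:
        contains_pii: true       # hypothetical meta key, handy when auditing models
    columns:
      - name: customer_id
        description: "Natural customer identifier"
      - name: email
        description: "Customer email address (PII)"
      - name: full_name
        description: "Customer full name (PII)"
```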

4. Public Anonymized Models (Exposed via Mesh)
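
Anonymized models are the only ones marked access: public. The sketch below shows how one such model might be declared, assuming a hypothetical model named customers_anonymized with illustrative columns. The group and access values mirror the dbt_project.yml above.

```yaml
# models/marts/anonymized/_anonymized.yml (hypothetical file and model names)
models:
  - name: customers_anonymized
    description: "Customer attributes with direct identifiers removed or hashed."
    config:
      group: public_analytics
      access: public             # makes the model available to consumer projects
    columns:
      - name: customer_key
        description: "Hashed surrogate key that replaces the natural customer identifier"
      - name: signup_month
        description: "Signup date truncated to the month"
      - name: region
        description: "Coarse geographic region"
```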

Child Repository Setup

1. Paradime Mesh Dependencies Configuration
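
The consumer project declares its dependency on the producer in dbt_loom.config.yml. The following is a sketch only: it assumes dbt-loom's paradime manifest type and its environment-variable interpolation, so the keys should be verified against the dbt-loom and Paradime documentation. The schedule name daily_production_run is the example used later in this document, and the environment variable names are hypothetical.

```yaml
# dbt_loom.config.yml (root of the consumer project)
manifests:
  - name: customer_data_platform            # producer project name
    type: paradime
    config:
      schedule_name: daily_production_run   # Bolt schedule that supplies the metadata
      # Hypothetical environment variable names; use whatever names you set
      # at the workspace and user levels.
      api_key: ${PARADIME_API_KEY}
      api_secret: ${PARADIME_API_SECRET}
      api_endpoint: ${PARADIME_API_ENDPOINT}
```

Reading the credentials from environment variables keeps the secrets out of version control.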

2. Project Configuration
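
A minimal sketch of the child project's dbt_project.yml, modeled on the parent configuration above. The project name analytics_workspace is an assumption derived from the child repository name, and the view materialization is an illustrative default.

```yaml
# dbt_project.yml (Child)
name: 'analytics_workspace'    # assumed name, based on the analytics-workspace repo
version: '1.0.0'
config-version: 2

model-paths: ["models"]
test-paths: ["tests"]
macro-paths: ["macros"]

models:
  analytics_workspace:
    # Analytics models built on top of the parent's anonymized models
    +materialized: view        # illustrative default
```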

3. Consuming Parent Models

4. Business Intelligence Models

Paradime Configuration

1. Producer Project Setup

Prerequisites:

  • dbt version 1.7 or greater in both projects

  • At least one successful Bolt schedule run in the producer project

  • Models with access: public configuration

Producer project requirements:

Ensure you have a Bolt schedule running (e.g., daily_production_run). This is required for Paradime to fetch model metadata.

2. Consumer Project API Credentials Setup

Step 1: Generate API credentials in the producer project

  • Navigate to the producer project (customer_data_platform)

  • Go to Settings → API Keys

  • Generate API credentials with "Bolt schedules metadata viewer" capability

  • Note down: API Key, API Secret, and API Endpoint

Step 2: Set workspace-level environment variables (for Bolt schedules). In the consumer project's workspace settings, add environment variables for the API Key, API Secret, and API Endpoint noted in Step 1.

Step 3: Set user-level environment variables (for Code IDE). Each developer in the consumer project must set the same environment variables in their Code IDE settings.

3. Model Referencing in Consumer Project

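Always use the two-argument form of the ref function, {{ ref('customer_data_platform', '<model_name>') }}, when referencing models from the producer project. The first argument is the producer project name and the second is the model name.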

Security Considerations

Access Control

  • PII models are in the private_data group with no public access

  • Only anonymized models in the public_analytics group are exposed

  • Child projects can only access explicitly exposed models

Testing Strategy

1. Parent Project Tests
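
A minimal sketch of tests on a public model, assuming a hypothetical model named customers_anonymized with the columns shown. It uses only dbt's built-in unique and not_null tests. Detecting leaked PII would need a custom test that is not shown here.

```yaml
# models/marts/anonymized/_anonymized_tests.yml (hypothetical file and model names)
models:
  - name: customers_anonymized
    columns:
      - name: customer_key
        tests:
          - unique       # one row per anonymized customer
          - not_null     # every row must carry the hashed key
      - name: signup_month
        tests:
          - not_null
```

On dbt 1.8 and later, the same block can be written under data_tests: instead of tests:.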

2. Child Project Tests
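
Consumer models can be tested the same way. The sketch below assumes a hypothetical child model named customer_segments that is built from the parent's anonymized data. The model, file, and column names are illustrative.

```yaml
# models/analytics/_analytics_tests.yml (hypothetical file and model names)
models:
  - name: customer_segments
    columns:
      - name: customer_key
        tests:
          - unique       # guards against fan-out when joining parent models
          - not_null
      - name: segment
        tests:
          - not_null
```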

Best Practices

  1. Regular Security Audits: Review anonymized models quarterly

  2. Change Management: Use PR reviews for any changes to public models

  3. Documentation: Keep anonymization logic well-documented

  4. Testing: Implement comprehensive tests for PII detection

  5. Monitoring: Set up alerts for mesh model failures

  6. Version Control: Tag releases when exposing new models

Troubleshooting

Common Issues

  1. Model not found in child: Check access configuration and group assignment

  2. PII exposure: Review anonymization logic and add tests

  3. Stale data: Monitor upstream model runs in parent project

  4. Permission errors: Verify Paradime project dependency configuration

Debug Commands

Common Issues and Solutions:

  1. "Model not found" errors

    • Verify dbt_loom.config.yml configuration

    • Check that environment variables are set correctly

    • Ensure the Bolt schedule has run successfully in the producer project

    • Confirm the model has access: public in the producer project

  2. API authentication errors

    • Verify API credentials are correctly set at both workspace and user levels

    • Check API key permissions include "Bolt schedules metadata viewer"

    • Ensure API endpoint URL is correct

  3. Stale metadata

    • Producer project must have successful Bolt schedule runs

    • Paradime fetches metadata from the specified schedule name

    • If producer models change, wait for next Bolt schedule run

  4. Model access denied

    • Check model access configuration in producer project

    • Only public models are available through mesh

    • Verify model is in correct group with appropriate access level

This setup ensures that sensitive PII remains secure in the parent repository while providing rich, anonymized datasets for analytics in the child projects through Paradime's dbt mesh capabilities.
