PII Anonymization with dbt™ Mesh Setup

Overview

This document demonstrates how to set up a dbt mesh architecture using Paradime where a parent repository contains PII (Personally Identifiable Information) models, and a child dbt project consumes anonymized subsets of these models.

Architecture

Parent Repo (customer-data-platform)
├── PII Models (private)
├── Anonymized Models (public via mesh)
└── Data transformations

Child Repo (analytics-workspace)
├── Consumes anonymized models from parent
├── Creates analytics models
└── Business intelligence layer

Parent Repository Setup

1. Project Structure

# dbt_project.yml (Parent)
name: 'customer_data_platform'
version: '1.0.0'
config-version: 2

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

models:
  customer_data_platform:
    # Private PII models - not exposed
    staging:
      +materialized: table
      +group: private_data

    # Public anonymized models - exposed via mesh
    marts:
      anonymized:
        +materialized: table
        +group: public_analytics
        +access: public

2. Model Groups Configuration

# models/_groups.yml (Parent)
version: 2

groups:
  - name: private_data
    description: "Internal PII and sensitive customer data - not exposed via mesh"
    owner:
      name: "Data Platform Team"
      email: "[email protected]"

  - name: public_analytics
    description: "Anonymized and aggregated data safe for analytics consumption"
    owner:
      name: "Analytics Team"
      email: "[email protected]"

3. Private PII Models

-- models/staging/stg_customers.sql (Private - PII)
{{ config(
    materialized='table',
    group='private_data'
) }}

select
    customer_id,
    first_name,           -- PII
    last_name,            -- PII
    email,                -- PII
    phone,                -- PII
    date_of_birth,        -- PII
    created_at,
    updated_at
from {{ source('raw_data', 'customers') }}
-- models/staging/stg_orders.sql (Private - contains PII references)
{{ config(
    materialized='table',
    group='private_data'
) }}

select
    order_id,
    customer_id,          -- Links to PII
    order_date,
    total_amount,
    shipping_address,     -- PII
    billing_address,      -- PII
    created_at
from {{ source('raw_data', 'orders') }}

4. Public Anonymized Models (Exposed via Mesh)

-- models/marts/anonymized/customers_anonymized.sql
{{ config(
    materialized='table',
    group='public_analytics',
    access='public'
) }}

select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'created_at']) }} as customer_key,
    -- Anonymize age instead of DOB
    case
        when date_diff('year', date_of_birth, current_date()) < 18 then 'Under 18'
        when date_diff('year', date_of_birth, current_date()) between 18 and 25 then '18-25'
        when date_diff('year', date_of_birth, current_date()) between 26 and 35 then '26-35'
        when date_diff('year', date_of_birth, current_date()) between 36 and 50 then '36-50'
        else 'Over 50'
    end as age_group,

    -- Geographic aggregation
    regexp_extract(shipping_address, r'([A-Z]{2})\\\\s+\\\\d{5}') as state_code,

    -- Temporal data (safe to expose)
    date_trunc('month', created_at) as signup_month,
    date_trunc('quarter', created_at) as signup_quarter,

    -- Customer lifecycle info
    case
        when date_diff('day', created_at, current_date()) <= 30 then 'New'
        when date_diff('day', created_at, current_date()) <= 365 then 'Active'
        else 'Established'
    end as customer_segment

from {{ ref('stg_customers') }}
-- models/marts/anonymized/order_metrics.sql
{{ config(
    materialized='table',
    group='public_analytics',
    access='public'
) }}

select
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} as customer_day_key,
    date_trunc('day', order_date) as order_date,
    date_trunc('month', order_date) as order_month,
    date_trunc('quarter', order_date) as order_quarter,

    -- Aggregated metrics (no PII)
    count(*) as order_count,
    sum(total_amount) as total_revenue,
    avg(total_amount) as avg_order_value,
    min(total_amount) as min_order_value,
    max(total_amount) as max_order_value,

    -- Geographic aggregation
    regexp_extract(shipping_address, r'([A-Z]{2})\\\\s+\\\\d{5}') as state_code

from {{ ref('stg_orders') }}
group by 1, 2, 3, 4, 8

Child Repository Setup

1. Paradime Mesh Dependencies Configuration

# dbt_loom.config.yml (Child)
manifests:
  - name: customer_data_platform  # The name of your "producer" project
    type: paradime
    config:
      schedule_name: daily_production_run  # The name of the Bolt schedule in your "producer" project
      # Environment variables for API credentials from the "producer" project
      api_key: ${CUSTOMER_PLATFORM_API_KEY}
      api_secret: ${CUSTOMER_PLATFORM_API_SECRET}
      api_endpoint: ${CUSTOMER_PLATFORM_API_ENDPOINT}

2. Project Configuration

# dbt_project.yml (Child)
name: 'analytics_workspace'
version: '1.0.0'
config-version: 2

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

models:
  analytics_workspace:
    marts:
      +materialized: table

    reports:
      +materialized: view

3. Consuming Parent Models

-- models/marts/customer_analytics.sql (Child)
{{ config(materialized='table') }}

with customer_base as (
    -- Reference producer project models using two-argument ref
    select * from {{ ref('customer_data_platform', 'customers_anonymized') }}
),

order_summary as (
    select
        customer_day_key,
        order_month,
        order_quarter,
        state_code,
        sum(total_revenue) as monthly_revenue,
        sum(order_count) as monthly_orders,
        avg(avg_order_value) as avg_monthly_aov
    from {{ ref('customer_data_platform', 'order_metrics') }}
    group by 1, 2, 3, 4
)

select
    c.customer_key,
    c.age_group,
    c.customer_segment,
    c.state_code,
    c.signup_quarter,

    -- Aggregated metrics from parent
    coalesce(sum(o.monthly_revenue), 0) as lifetime_value,
    coalesce(sum(o.monthly_orders), 0) as total_orders,
    coalesce(avg(o.avg_monthly_aov), 0) as average_order_value,

    -- Time-based metrics
    date_diff('month', c.signup_month, current_date()) as customer_tenure_months

from customer_base c
left join order_summary o
    on c.customer_key = o.customer_day_key
group by 1, 2, 3, 4, 5, 9

4. Business Intelligence Models

-- models/reports/customer_segmentation_report.sql (Child)
{{ config(materialized='view') }}

select
    age_group,
    customer_segment,
    state_code,

    count(*) as customer_count,
    sum(lifetime_value) as segment_revenue,
    avg(lifetime_value) as avg_lifetime_value,
    avg(total_orders) as avg_orders_per_customer,
    avg(average_order_value) as avg_order_value,
    avg(customer_tenure_months) as avg_tenure_months,

    -- Performance metrics
    sum(lifetime_value) / count(*) as revenue_per_customer,
    sum(total_orders) / count(*) as orders_per_customer

from {{ ref('customer_analytics') }}
group by 1, 2, 3
order by segment_revenue desc

Paradime Configuration

1. Producer Project Setup

Prerequisites:

  • dbt version 1.7 or greater in both projects

  • At least one successful Bolt schedule run in the producer project

  • Models with access: public configuration

Producer project requirements:

Ensure you have a Bolt schedule running (e.g., daily_production_run) This is required for Paradime to fetch model metadata

2. Consumer Project API Credentials Setup

Step 1: Generate API credentials in the producer project

  • Navigate to the producer project (customer_data_platform)

  • Go to Settings → API Keys

  • Generate API credentials with "Bolt schedules metadata viewer" capability

  • Note down: API Key, API Secret, and API Endpoint

Step 2: Set Workspace-level Environment Variables (for Bolt schedules) In the consumer project workspace settings, add:

CUSTOMER_PLATFORM_API_KEY=<your_producer_api_key>
CUSTOMER_PLATFORM_API_SECRET=<your_producer_api_secret>
CUSTOMER_PLATFORM_API_ENDPOINT=<your_producer_api_endpoint>

Step 3: Set User-level Environment Variables (for Code IDE) Each developer in the consumer project must set the same environment variables in their Code IDE settings:

CUSTOMER_PLATFORM_API_KEY=<your_producer_api_key>
CUSTOMER_PLATFORM_API_SECRET=<your_producer_api_secret>
CUSTOMER_PLATFORM_API_ENDPOINT=<your_producer_api_endpoint>

3. Model Referencing in Consumer Project

Always use the two-argument ref function when referencing models from the producer project:

-- ✅ Correct way to reference producer models
select * from {{ ref('customer_data_platform', 'customers_anonymized') }}

-- ❌ This won't work for cross-project references
select * from {{ ref('customers_anonymized') }}

Security Considerations

Access Control

  • PII models are in private_data group with no public access

  • Only anonymized models in public_analytics group are exposed

  • Child projects can only access explicitly exposed models

Testing Strategy

1. Parent Project Tests

-- tests/assert_no_pii_in_public_models.sql
-- Ensure no PII leaks into public models

select *
from {{ ref('customers_anonymized') }}
where customer_key in (
    select customer_id::string
    from {{ ref('stg_customers') }}
)

2. Child Project Tests

-- tests/validate_mesh_data_quality.sql
-- Ensure mesh data meets quality standards

select *
from {{ ref('customer_analytics') }}
where lifetime_value < 0
   or total_orders < 0
   or customer_tenure_months < 0

Best Practices

  1. Regular Security Audits: Review anonymized models quarterly

  2. Change Management: Use PR reviews for any changes to public models

  3. Documentation: Keep anonymization logic well-documented

  4. Testing: Implement comprehensive tests for PII detection

  5. Monitoring: Set up alerts for mesh model failures

  6. Version Control: Tag releases when exposing new models

Troubleshooting

Common Issues

  1. Model not found in child: Check access configuration and group assignment

  2. PII exposure: Review anonymization logic and add tests

  3. Stale data: Monitor upstream model runs in parent project

  4. Permission errors: Verify Paradime project dependency configuration

Debug Commands

# Check available models from producer project
dbt list --resource-type model --output name --select customer_data_platform.*

# Validate model access and compilation
dbt compile --select +customer_data_platform.customers_anonymized

# Test mesh connectivity with specific producer models
dbt run --select customer_data_platform.customers_anonymized+

Common Issues and Solutions:

  1. "Model not found" errors

    • Verify dbt_loom.config.yml configuration

    • Check that environment variables are set correctly

    • Ensure the Bolt schedule has run successfully in producer project

    • Confirm model has access: public in producer project

  2. API authentication errors

    • Verify API credentials are correctly set at both workspace and user levels

    • Check API key permissions include "Bolt schedules metadata viewer"

    • Ensure API endpoint URL is correct

  3. Stale metadata

    • Producer project must have successful Bolt schedule runs

    • Paradime fetches metadata from the specified schedule name

    • If producer models change, wait for next Bolt schedule run

  4. Model access denied

    • Check model access configuration in producer project

    • Only public models are available through mesh

    • Verify model is in correct group with appropriate access level

This setup ensures that sensitive PII remains secure in the parent repository while providing rich, anonymized datasets for analytics in the child projects through Paradime's dbt mesh capabilities.

Last updated

Was this helpful?