PII Anonymization with dbt™ Mesh Setup
Overview
This document demonstrates how to set up a dbt mesh architecture using Paradime where a parent repository contains PII (Personally Identifiable Information) models, and a child dbt project consumes anonymized subsets of these models.
Architecture
Parent Repo (customer-data-platform)
├── PII Models (private)
├── Anonymized Models (public via mesh)
└── Data transformations
Child Repo (analytics-workspace)
├── Consumes anonymized models from parent
├── Creates analytics models
└── Business intelligence layer
Parent Repository Setup
1. Project Structure
# dbt_project.yml (Parent)
name: 'customer_data_platform'
version: '1.0.0'
config-version: 2
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
models:
customer_data_platform:
# Private PII models - not exposed
staging:
+materialized: table
+group: private_data
# Public anonymized models - exposed via mesh
marts:
anonymized:
+materialized: table
+group: public_analytics
+access: public
2. Model Groups Configuration
# models/_groups.yml (Parent)
version: 2
groups:
- name: private_data
description: "Internal PII and sensitive customer data - not exposed via mesh"
owner:
name: "Data Platform Team"
email: "[email protected]"
- name: public_analytics
description: "Anonymized and aggregated data safe for analytics consumption"
owner:
name: "Analytics Team"
email: "[email protected]"
3. Private PII Models
-- models/staging/stg_customers.sql (Private - PII)
{{ config(
materialized='table',
group='private_data'
) }}
select
customer_id,
first_name, -- PII
last_name, -- PII
email, -- PII
phone, -- PII
date_of_birth, -- PII
created_at,
updated_at
from {{ source('raw_data', 'customers') }}
-- models/staging/stg_orders.sql (Private - contains PII references)
{{ config(
materialized='table',
group='private_data'
) }}
select
order_id,
customer_id, -- Links to PII
order_date,
total_amount,
shipping_address, -- PII
billing_address, -- PII
created_at
from {{ source('raw_data', 'orders') }}
4. Public Anonymized Models (Exposed via Mesh)
-- models/marts/anonymized/customers_anonymized.sql
{{ config(
materialized='table',
group='public_analytics',
access='public'
) }}
select
{{ dbt_utils.generate_surrogate_key(['customer_id', 'created_at']) }} as customer_key,
-- Anonymize age instead of DOB
case
when date_diff('year', date_of_birth, current_date()) < 18 then 'Under 18'
when date_diff('year', date_of_birth, current_date()) between 18 and 25 then '18-25'
when date_diff('year', date_of_birth, current_date()) between 26 and 35 then '26-35'
when date_diff('year', date_of_birth, current_date()) between 36 and 50 then '36-50'
else 'Over 50'
end as age_group,
-- Geographic aggregation
regexp_extract(shipping_address, r'([A-Z]{2})\\\\s+\\\\d{5}') as state_code,
-- Temporal data (safe to expose)
date_trunc('month', created_at) as signup_month,
date_trunc('quarter', created_at) as signup_quarter,
-- Customer lifecycle info
case
when date_diff('day', created_at, current_date()) <= 30 then 'New'
when date_diff('day', created_at, current_date()) <= 365 then 'Active'
else 'Established'
end as customer_segment
from {{ ref('stg_customers') }}
-- models/marts/anonymized/order_metrics.sql
{{ config(
materialized='table',
group='public_analytics',
access='public'
) }}
select
{{ dbt_utils.generate_surrogate_key(['customer_id', 'order_date']) }} as customer_day_key,
date_trunc('day', order_date) as order_date,
date_trunc('month', order_date) as order_month,
date_trunc('quarter', order_date) as order_quarter,
-- Aggregated metrics (no PII)
count(*) as order_count,
sum(total_amount) as total_revenue,
avg(total_amount) as avg_order_value,
min(total_amount) as min_order_value,
max(total_amount) as max_order_value,
-- Geographic aggregation
regexp_extract(shipping_address, r'([A-Z]{2})\\\\s+\\\\d{5}') as state_code
from {{ ref('stg_orders') }}
group by 1, 2, 3, 4, 8
Child Repository Setup
1. Paradime Mesh Dependencies Configuration
# dbt_loom.config.yml (Child)
manifests:
- name: customer_data_platform # The name of your "producer" project
type: paradime
config:
schedule_name: daily_production_run # The name of the Bolt schedule in your "producer" project
# Environment variables for API credentials from the "producer" project
api_key: ${CUSTOMER_PLATFORM_API_KEY}
api_secret: ${CUSTOMER_PLATFORM_API_SECRET}
api_endpoint: ${CUSTOMER_PLATFORM_API_ENDPOINT}
2. Project Configuration
# dbt_project.yml (Child)
name: 'analytics_workspace'
version: '1.0.0'
config-version: 2
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
models:
analytics_workspace:
marts:
+materialized: table
reports:
+materialized: view
3. Consuming Parent Models
-- models/marts/customer_analytics.sql (Child)
{{ config(materialized='table') }}
with customer_base as (
-- Reference producer project models using two-argument ref
select * from {{ ref('customer_data_platform', 'customers_anonymized') }}
),
order_summary as (
select
customer_day_key,
order_month,
order_quarter,
state_code,
sum(total_revenue) as monthly_revenue,
sum(order_count) as monthly_orders,
avg(avg_order_value) as avg_monthly_aov
from {{ ref('customer_data_platform', 'order_metrics') }}
group by 1, 2, 3, 4
)
select
c.customer_key,
c.age_group,
c.customer_segment,
c.state_code,
c.signup_quarter,
-- Aggregated metrics from parent
coalesce(sum(o.monthly_revenue), 0) as lifetime_value,
coalesce(sum(o.monthly_orders), 0) as total_orders,
coalesce(avg(o.avg_monthly_aov), 0) as average_order_value,
-- Time-based metrics
date_diff('month', c.signup_month, current_date()) as customer_tenure_months
from customer_base c
left join order_summary o
on c.customer_key = o.customer_day_key
group by 1, 2, 3, 4, 5, 9
4. Business Intelligence Models
-- models/reports/customer_segmentation_report.sql (Child)
{{ config(materialized='view') }}
select
age_group,
customer_segment,
state_code,
count(*) as customer_count,
sum(lifetime_value) as segment_revenue,
avg(lifetime_value) as avg_lifetime_value,
avg(total_orders) as avg_orders_per_customer,
avg(average_order_value) as avg_order_value,
avg(customer_tenure_months) as avg_tenure_months,
-- Performance metrics
sum(lifetime_value) / count(*) as revenue_per_customer,
sum(total_orders) / count(*) as orders_per_customer
from {{ ref('customer_analytics') }}
group by 1, 2, 3
order by segment_revenue desc
Paradime Configuration
1. Producer Project Setup
Prerequisites:
dbt version 1.7 or greater in both projects
At least one successful Bolt schedule run in the producer project
Models with
access: public
configuration
Producer project requirements:
Ensure you have a Bolt schedule running (e.g., daily_production_run) This is required for Paradime to fetch model metadata
2. Consumer Project API Credentials Setup
Step 1: Generate API credentials in the producer project
Navigate to the producer project (
customer_data_platform
)Go to Settings → API Keys
Generate API credentials with "Bolt schedules metadata viewer" capability
Note down: API Key, API Secret, and API Endpoint
Step 2: Set Workspace-level Environment Variables (for Bolt schedules) In the consumer project workspace settings, add:
CUSTOMER_PLATFORM_API_KEY=<your_producer_api_key>
CUSTOMER_PLATFORM_API_SECRET=<your_producer_api_secret>
CUSTOMER_PLATFORM_API_ENDPOINT=<your_producer_api_endpoint>
Step 3: Set User-level Environment Variables (for Code IDE) Each developer in the consumer project must set the same environment variables in their Code IDE settings:
CUSTOMER_PLATFORM_API_KEY=<your_producer_api_key>
CUSTOMER_PLATFORM_API_SECRET=<your_producer_api_secret>
CUSTOMER_PLATFORM_API_ENDPOINT=<your_producer_api_endpoint>
3. Model Referencing in Consumer Project
Always use the two-argument ref
function when referencing models from the producer project:
-- ✅ Correct way to reference producer models
select * from {{ ref('customer_data_platform', 'customers_anonymized') }}
-- ❌ This won't work for cross-project references
select * from {{ ref('customers_anonymized') }}
Security Considerations
Access Control
PII models are in
private_data
group with no public accessOnly anonymized models in
public_analytics
group are exposedChild projects can only access explicitly exposed models
Testing Strategy
1. Parent Project Tests
-- tests/assert_no_pii_in_public_models.sql
-- Ensure no PII leaks into public models
select *
from {{ ref('customers_anonymized') }}
where customer_key in (
select customer_id::string
from {{ ref('stg_customers') }}
)
2. Child Project Tests
-- tests/validate_mesh_data_quality.sql
-- Ensure mesh data meets quality standards
select *
from {{ ref('customer_analytics') }}
where lifetime_value < 0
or total_orders < 0
or customer_tenure_months < 0
Best Practices
Regular Security Audits: Review anonymized models quarterly
Change Management: Use PR reviews for any changes to public models
Documentation: Keep anonymization logic well-documented
Testing: Implement comprehensive tests for PII detection
Monitoring: Set up alerts for mesh model failures
Version Control: Tag releases when exposing new models
Troubleshooting
Common Issues
Model not found in child: Check access configuration and group assignment
PII exposure: Review anonymization logic and add tests
Stale data: Monitor upstream model runs in parent project
Permission errors: Verify Paradime project dependency configuration
Debug Commands
# Check available models from producer project
dbt list --resource-type model --output name --select customer_data_platform.*
# Validate model access and compilation
dbt compile --select +customer_data_platform.customers_anonymized
# Test mesh connectivity with specific producer models
dbt run --select customer_data_platform.customers_anonymized+
Common Issues and Solutions:
"Model not found" errors
Verify
dbt_loom.config.yml
configurationCheck that environment variables are set correctly
Ensure the Bolt schedule has run successfully in producer project
Confirm model has
access: public
in producer project
API authentication errors
Verify API credentials are correctly set at both workspace and user levels
Check API key permissions include "Bolt schedules metadata viewer"
Ensure API endpoint URL is correct
Stale metadata
Producer project must have successful Bolt schedule runs
Paradime fetches metadata from the specified schedule name
If producer models change, wait for next Bolt schedule run
Model access denied
Check model
access
configuration in producer projectOnly
public
models are available through meshVerify model is in correct group with appropriate access level
This setup ensures that sensitive PII remains secure in the parent repository while providing rich, anonymized datasets for analytics in the child projects through Paradime's dbt mesh capabilities.
Related Docs:
Last updated
Was this helpful?