# Working with Sources

Sources in dbt represent the raw data tables in your warehouse that serve as the foundation for your transformations. Rather than referencing raw tables directly, dbt allows you to define sources in a centralized way, improving maintainability and enabling powerful features like freshness checking.

### What Are Sources?

In dbt, **sources** represent raw data tables loaded from external systems, such as operational databases, CRMs, or third-party APIs. Instead of referencing raw tables directly in models, dbt allows you to define sources in a **centralized YAML file** (commonly `sources.yml`) for better organization, maintainability, and documentation.

The `sources.yml` file is a **crucial component** in dbt projects: it centralizes metadata about raw tables, keeping references consistent across your project and feeding dbt's auto-generated documentation.

{% hint style="info" %}
**Why Use Sources?**

* **Centralizes raw table definitions** – Avoids hardcoded table names across multiple models.
* **Improves maintainability** – If raw table locations change, you only need to update `sources.yml`.
* **Enables freshness checks** – dbt can monitor source data latency.
* **Enhances documentation** – Automatically generates lineage graphs and model dependencies.
{% endhint %}

***

### Defining Sources in YAML

Sources are defined in `.yml` files under the `sources:` key. Here's a typical example:

```yaml
version: 2

sources:
  - name: jaffle_shop  # Logical name of the source
    database: raw      # The database where the source is stored (optional)
    schema: jaffle_shop  # Schema containing the source tables
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']
      - name: customers
```

In this example:

* We've defined a source named `jaffle_shop` that points to tables in the `jaffle_shop` schema of the `raw` database
* We've defined two tables: `orders` and `customers`
* We've added column-level tests to the `orders` table

***

### Using Sources in Models

Once sources are defined, you can reference them using the `source()` function in your dbt models:

```sql
-- models/staging/stg_orders.sql
SELECT
  order_id,
  customer_id,
  order_date,
  status,
  amount
FROM {{ source('jaffle_shop', 'orders') }}
```

This offers several advantages:

* **Consistency**: Source references are standardized across your project
* **Refactoring**: If a source table moves, you only need to update one place
* **Documentation**: dbt automatically builds lineage from sources to models
* **Testing**: You can apply tests to sources for early validation
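
At compile time, `source()` resolves to the fully qualified relation from your YAML definition. Given the `jaffle_shop` source defined earlier (database `raw`, schema `jaffle_shop`), the model above compiles to something like:

```sql
SELECT
  order_id,
  customer_id,
  order_date,
  status,
  amount
FROM raw.jaffle_shop.orders
```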

***

### Best Practices for Source Organization

#### Group Related Sources

Organize sources by system or domain:

```yaml
sources:
  - name: stripe       # Payment processing
    schema: raw_stripe
    tables:
      - name: charges
      - name: customers

  - name: shopify      # E-commerce platform
    schema: raw_shopify
    tables:
      - name: orders
      - name: products
```

#### Document Your Sources

Add descriptions to help your team understand the data:

```yaml
sources:
  - name: google_analytics
    description: "Web analytics data from our marketing site"
    tables:
      - name: sessions
        description: "User sessions with UTM parameters"
        columns:
          - name: session_id
            description: "Unique identifier for the session"
```
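
These descriptions surface in dbt's generated documentation site, which you can build and browse locally:

```bash
dbt docs generate   # compile the project and build the docs artifacts
dbt docs serve      # serve the documentation site locally
```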

#### Apply Tests to Sources

Find data quality issues early by testing your sources:

```yaml
sources:
  - name: crm
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: email
            tests:
              - unique
              - not_null
```
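
You can run just the tests attached to a source using dbt's node selection syntax:

```bash
# Run every test defined on the crm source
dbt test --select "source:crm"

# Or test all sources at once
dbt test --select "source:*"
```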

***

### Source Freshness

One of the most powerful features of sources is the ability to check data freshness: verifying that source data is up to date before you build models on top of it.

#### Configuring Freshness Checks

Add a `freshness` block and specify a `loaded_at_field` in your source definition:

```yaml
sources:
  - name: sales_data
    schema: raw_sales
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: updated_at
    tables:
      - name: transactions
```

This configuration:

* Uses the `updated_at` column to determine when data was last loaded
* Warns if data is more than 12 hours old
* Errors if data is more than 24 hours old
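
Conceptually, when the freshness check runs, dbt issues a query along these lines and compares the result to the thresholds (a simplified sketch; the exact SQL varies by adapter):

```sql
-- Simplified sketch of the freshness query dbt generates
SELECT
  MAX(updated_at)    AS max_loaded_at,   -- most recent load time
  CURRENT_TIMESTAMP  AS snapshotted_at   -- time of the check
FROM raw_sales.transactions
```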

#### Running Freshness Checks

Check freshness with:

```bash
dbt source freshness
```

The output will show the status of each source table:

```
16:35:31 | Freshness of jaffle_shop.orders: PASS (0 seconds)
16:35:32 | Freshness of jaffle_shop.customers: WARN (13 hours)
```
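
Like other dbt commands, `dbt source freshness` accepts node selection, so you can check a single source:

```bash
dbt source freshness --select "source:jaffle_shop"
```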

#### Table-Specific Freshness

You can override source-level freshness settings for specific tables:

```yaml
sources:
  - name: inventory
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: last_updated
    tables:
      - name: daily_stock
      - name: real_time_stock
        freshness:
          warn_after: {count: 15, period: minute}
          error_after: {count: 30, period: minute}
        loaded_at_field: timestamp
```

In this example, `real_time_stock` has stricter freshness requirements than other tables in the source.
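
The override also works in the other direction: a table that should be excluded from checks entirely can set its `freshness` to `null`. A sketch (the `archived_stock` table here is illustrative):

```yaml
sources:
  - name: inventory
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: last_updated
    tables:
      - name: archived_stock
        freshness: null  # exclude this table from freshness checks
```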

***

### Advanced Source Configurations

#### Source Overrides by Environment

You can point a source at a different database or schema per environment by templating those values with Jinja:

```yaml
sources:
  - name: marketing
    database: "{% if target.name == 'prod' %}analytics{% else %}raw_data{% endif %}"
    schema: "{% if target.name == 'prod' %}production{% else %}{{ target.schema }}{% endif %}"
    tables:
      - name: ad_campaigns
```
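
A common alternative is to read the location from an environment variable with dbt's `env_var()` function, falling back to a default. A sketch, where `RAW_DATABASE` is a hypothetical variable name:

```yaml
sources:
  - name: marketing
    database: "{{ env_var('RAW_DATABASE', 'raw_data') }}"  # hypothetical env var
    tables:
      - name: ad_campaigns
```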

#### Filtering Freshness Checks

For large source tables, you can add a `filter` to the `freshness` block so the freshness query only scans recent data instead of the entire table. (The `external` block below describes an external table location; it is used by packages such as dbt-external-tables rather than by dbt Core itself.)

```yaml
sources:
  - name: logs
    tables:
      - name: application_logs
        loaded_at_field: date_column   # required for the freshness query
        external:
          location: "s3://my-bucket/logs/"
          options:
            format: parquet
        freshness:
          warn_after: {count: 1, period: day}
          filter: "date_column >= dateadd('day', -3, current_date)"
```

***

### Automating Source Definitions with DinoAI

With [DinoAI](https://docs.paradime.io/app-help/documentation/dino-ai) (Paradime's AI Agent), you can automatically generate source definitions with appropriate freshness configurations. See the [step-by-step instructions](https://docs.paradime.io/app-help/documentation/dino-ai/agent-mode/use-cases/creating-sources-from-your-warehouse#step-by-step-instructions) for more details.

{% hint style="info" %}
**Benefits of automating Source definitions with DinoAI:**

✅ **Scans your data warehouse** and auto-generates the correct table definitions. \
✅ **Prevents manual errors** in source definitions. \
✅ **Keeps sources up to date** with your evolving data warehouse schema.
{% endhint %}

***

### Common Source Patterns

#### Staging Models for Sources

A common pattern is to create staging models that select from sources. These provide a clean interface between raw data and your transformations:

```sql
-- models/staging/stg_customers.sql
SELECT
  customer_id,
  first_name,
  last_name,
  email,
  created_at,
  updated_at
FROM {{ source('crm', 'customers') }}
```
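
Staging models are also the natural place for light cleanup such as renaming and casting. A sketch, assuming the raw CRM table uses `id` as its key and stores timestamps as strings:

```sql
-- models/staging/stg_customers.sql (illustrative variant)
SELECT
  id AS customer_id,                         -- standardize the key name
  LOWER(email) AS email,                     -- normalize casing
  CAST(created_at AS timestamp) AS created_at
FROM {{ source('crm', 'customers') }}
```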

#### Testing Complex Source Relationships

You can test referential integrity between source tables. The `relationships` test below asserts that every `user_id` in `orders` also exists in `users`:

```yaml
sources:
  - name: application
    tables:
      - name: users
        columns:
          - name: user_id
            tests:
              - unique
              - not_null
      - name: orders
        columns:
          - name: user_id
            tests:
              - relationships:
                  to: source('application', 'users')
                  field: user_id
```

By effectively managing sources in dbt, you build a strong foundation for your analytics pipeline, making it easier to maintain, test, and document the origin of your data.
