# Defining Your Sources in sources.yml

Sources in dbt represent the raw data tables in your data warehouse. Defining sources in a `sources.yml` file allows you to centralize table references, enabling cleaner code, testing, and documentation.&#x20;

***

### What is sources.yml?

The `sources.yml` file is where you define the external data tables that your dbt models will transform. This file centralizes information about your raw data sources, making it easier to:

* Reference raw tables consistently throughout your project
* Test source data for quality and freshness
* Document your data pipeline from beginning to end
* Create clear lineage visualization in dbt docs

***

### Basic Structure

A basic `sources.yml` file follows this structure:

```yaml
version: 2

sources:
  - name: jaffle_shop  # Source system name
    database: raw      # Optional: database where source is stored
    schema: jaffle_shop  # Schema containing the source tables
    tables:
      - name: orders   # Table name as it exists in the database
      - name: customers
```

This defines a source called `jaffle_shop` with two tables: `orders` and `customers`.

***

### Required and Optional Fields

| Field         | Required? | Description                                                           |
| ------------- | --------- | --------------------------------------------------------------------- |
| `name`        | Required  | Logical name for the source (used in the `source()` function)         |
| `schema`      | Required  | Database schema where the tables exist                                |
| `database`    | Optional  | Database where the schema exists (if different from target database)  |
| `tables`      | Required  | List of tables in this source                                         |
| `description` | Optional  | Description of the source for documentation                           |
| `loader`      | Optional  | Information about the tool loading this data (e.g., Fivetran, Stitch) |
| `freshness`   | Optional  | Configuration for freshness checks                                    |
| `quoting`     | Optional  | Settings for quoting identifiers                                      |

***

### Detailed Source Configuration

Here's a more complete example with additional configurations:

```yaml
version: 2

sources:
  - name: stripe
    description: "Payment data from Stripe, loaded by Fivetran connector"
    database: raw_data
    schema: stripe
    loader: fivetran
    loaded_at_field: _fivetran_synced
    
    # Source-level freshness (can be overridden at table level)
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    
    tables:
      - name: charges
        description: "Credit card charges, including status and amount"
        columns:
          - name: id
            description: "Primary key - the Stripe charge ID"
            tests:
              - unique
              - not_null
          - name: amount
            description: "Amount in cents"
            tests:
              - not_null
      
      - name: customers
        description: "Customer information from Stripe"
        # Override source-level freshness for this table
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
```

***

### Referencing Sources in Models

Once defined, you can reference sources in your models using the `source()` function:

```sql
-- models/staging/stg_stripe_charges.sql
SELECT
  id as charge_id,
  customer_id,
  amount / 100.0 as amount_usd,  -- Convert cents to dollars
  status,
  created as created_at
FROM {{ source('stripe', 'charges') }}
```

The `source()` function takes two arguments:

1. The source name (defined as `name` in `sources.yml`)
2. The table name (defined under `tables` in `sources.yml`)

***

### Testing Sources

You can apply tests to your source tables just like you would with models:

```yaml
sources:
  - name: crm
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: email
            tests:
              - unique
              - not_null
```

This allows you to verify data quality at the entry point to your dbt pipeline.

***

### Source Freshness

One of the most powerful features of sources is the ability to check data freshness - ensuring your source data is up-to-date.

#### Configuring Freshness Checks

To set up freshness checks, you need:

1. A column in your source table that indicates when a record was last updated (`loaded_at_field`)
2. Freshness threshold configurations

```yaml
sources:
  - name: stripe
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: charges
      - name: customers
        # Override for specific table
        freshness:
          warn_after: {count: 36, period: hour}
```

{% hint style="info" %}
**Valid Time Periods for Freshness Checks**

When configuring source freshness thresholds, you must specify both a count and a time period. The time period can be:

* `minute` - For data that updates very frequently (e.g., real-time systems)
* `hour` - For data that updates throughout the day (e.g., transaction systems)
* `day` - For data that updates daily or less frequently (e.g., batch uploads)

For example: `warn_after: {count: 6, period: hour}` would warn if data is more than 6 hours old.
{% endhint %}

#### Running Freshness Checks

Execute freshness checks with:

```bash
dbt source freshness
```

The output indicates which sources are fresh, stale (warning), or too old (error).

***

### Advanced Source Configurations

#### Dynamic Schema Resolution

You can use Jinja in your source definitions to handle environments:

```yaml
sources:
  - name: marketing
    schema: "{% if target.name == 'prod' %}marketing_prod{% else %}marketing_{{ target.name }}{% endif %}"
    tables:
      - name: campaigns
```

This creates different schemas depending on your target environment.

#### Quoting Configuration

Control identifier quoting:

```yaml
sources:
  - name: salesforce
    quoting:
      database: true
      schema: true
      identifier: false
    tables:
      - name: Contacts
        quoting:
          identifier: true  # Override source-level setting
```

#### External Tables

For data sources like S3 or GCS:

```yaml
sources:
  - name: clickstream
    tables:
      - name: raw_events
        external:
          location: "s3://my-bucket/clickstream/events/"
          file_format: "parquet"
```

***

### Best Practices for Source Configuration

| Practice                             | Description                                                      |
| ------------------------------------ | ---------------------------------------------------------------- |
| Group logically related sources      | Organize sources by system or domain (e.g., CRM, ERP, Marketing) |
| Document sources thoroughly          | Add descriptions for sources, tables, and columns                |
| Test source data                     | Apply tests to key columns to catch issues early                 |
| Set appropriate freshness thresholds | Define freshness based on business needs and data load frequency |
| Use consistent naming patterns       | Establish a convention for source names and stick to it          |
| Consider file organization           | Place source definitions close to the models that use them       |

***

### Organizing source.yml Files

There are several approaches to organizing your source definitions:

#### Single File Approach

Good for smaller projects:

```
models/
  └── sources.yml    # All sources defined in one file
```

#### Source-by-Source Approach

Better for larger projects:

```
models/
  └── sources/
      ├── salesforce_sources.yml
      ├── stripe_sources.yml
      └── google_analytics_sources.yml
```

#### Alongside Related Models

Place source definitions with the models that use them:

```
models/
  ├── staging/
  │   ├── stripe/
  │   │   ├── stripe_sources.yml
  │   │   ├── stg_stripe_charges.sql
  │   │   └── stg_stripe_customers.sql
  │   └── salesforce/
  │       ├── salesforce_sources.yml
  │       └── stg_salesforce_contacts.sql
```

***

{% hint style="info" %}

### Automating Source Generation with DinoAI

With [DinoAI](/app-help/documentation/dino-ai.md) (Paradime's AI Agent), you can automatically generate source definitions with appropriate freshness configurations. See [step by step instructions](/app-help/documentation/dino-ai/agent-mode/use-cases/creating-sources-from-your-warehouse.md#step-by-step-instructions) for more details.&#x20;
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.paradime.io/app-help/concepts/dbt-fundamentals/configuring-your-dbt-project/define-your-sources.yml.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
