Python Type Hints and Dataclasses for Data Engineers: Writing Code That Does Not Surprise You
Ryan Kirsch · December 3, 2025 · 9 min read
Data engineering Python tends to be loosely typed, dictionary-heavy, and full of implicit contracts between functions. This works until a source API changes a field name, a pipeline function silently accepts the wrong shape, and you find out three stages downstream when something is None that should not be. Type hints and structured data classes are the cheapest way to fix this.
The Problem with Dictionary-Driven Pipelines
Most data engineering Python looks like this:
def process_order(order: dict) -> dict:
    return {
        "order_id": order["order_id"],
        "revenue": order["amount"] / 100,
        "customer": order["customer_id"],
    }

# Called somewhere else, 200 lines away:
result = process_order(api_response["data"])
print(result["customer"])  # KeyError if API renamed the field
This code has no way to tell you at development time that the API response might not have an amount field, or that customer_id was renamed to customerId in the v2 API. The failure surface is the entire pipeline, and the error appears at runtime on production data, not in your IDE.
Type hints do not eliminate this problem entirely -- Python's type system is gradual and optional -- but they move a significant portion of the error surface to development time when combined with a type checker like mypy or pyright, and to ingestion time when combined with a validation library like Pydantic.
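To make the "gradual and optional" point concrete, here is a minimal stdlib-only illustration (function names are mine, not from any real pipeline): annotations are stored on the function object but never enforced when it runs, which is exactly why a static checker is needed.

```python
def double(n: int) -> int:
    """Annotated, but Python does not enforce these types at runtime."""
    return n * 2

# mypy/pyright would flag this call; the interpreter happily runs it,
# because "ab" * 2 is string repetition.
print(double("ab"))  # prints "abab", not a type error

# The annotations are just metadata attached to the function object.
print(double.__annotations__)  # {'n': <class 'int'>, 'return': <class 'int'>}
```

The wrong-type call succeeds silently, which is the failure mode the rest of this post is about eliminating.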
Type Hints: The Foundation
Python type hints are annotations that declare the expected types of function parameters and return values. They do not enforce anything at runtime on their own -- they are checked by static analysis tools and communicate intent to other developers.
from typing import Optional, List, Dict, Any
from datetime import datetime

# Without type hints
def extract_orders(source, start_date, end_date):
    pass

# With type hints -- immediately communicates contract
def extract_orders(
    source: str,
    start_date: datetime,
    end_date: datetime,
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    pass

# Modern Python (3.10+) uses union operator instead of Optional
def get_customer(
    customer_id: str,
    include_deleted: bool = False,
) -> dict | None:
    pass
Run mypy or pyright over a typed codebase and it catches callers passing a string where a datetime is expected, functions returning None where a list is expected, and attribute access on potentially None values. These are the bugs that show up as runtime failures at 2 AM in production.
The investment is low. Adding type hints to function signatures takes minutes per function. The payoff in a team environment is that new engineers understand the API of every function from the signature alone without reading the body.
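One concrete class of bug this catches: forgetting to handle a None return. A small sketch (illustrative names) of the narrowing pattern a type checker forces you to write:

```python
from datetime import datetime
from typing import Optional

def parse_event_time(raw: str) -> Optional[datetime]:
    """Returns None on unparseable input instead of raising -- callers must check."""
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None

ts = parse_event_time("not-a-date")
# mypy flags bare `ts.year` here: ts might be None. The explicit check
# below is the narrowing the type checker insists on.
if ts is not None:
    print(ts.year)
else:
    print("unparseable timestamp")
```

Without the `Optional[datetime]` annotation, the missing None-check is invisible until a malformed record hits production.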
Dataclasses: Structured Data Without Boilerplate
Python dataclasses, introduced in 3.7, give you structured objects with automatic __init__, __repr__, and __eq__ without writing any boilerplate. For data engineering, they are the right abstraction for representing pipeline records, configuration objects, and intermediate computation results.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class OrderRecord:
    order_id: str
    customer_id: str
    amount_usd: float
    status: str
    created_at: datetime
    shipped_at: Optional[datetime] = None
    tags: list[str] = field(default_factory=list)

    def is_delivered(self) -> bool:
        return self.status == "delivered"

    def to_dict(self) -> dict:
        return {
            "order_id": self.order_id,
            "customer_id": self.customer_id,
            "amount_usd": self.amount_usd,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
            "shipped_at": self.shipped_at.isoformat() if self.shipped_at else None,
        }

# Usage -- IDE autocompletes fields, mypy checks types
order = OrderRecord(
    order_id="ord_123",
    customer_id="cust_456",
    amount_usd=49.99,
    status="shipped",
    created_at=datetime(2026, 3, 27, 10, 0, 0),
)
print(order.is_delivered())  # False
print(order.to_dict())
Compare this to a dictionary: you cannot call order.is_delivered() on a dict, your IDE cannot autocomplete order.amount_usd, and nothing tells you at write time that you missed a required field.
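A hand-written to_dict works, but it is worth knowing that dataclasses.asdict can generate the dict automatically, and its dict_factory hook can handle the datetime serialization. A sketch using a reduced two-field version of the same record shape:

```python
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class OrderRecord:
    order_id: str
    created_at: datetime

def _json_safe(pairs: list[tuple[str, object]]) -> dict:
    """dict_factory hook: isoformat any datetime values during conversion."""
    return {k: v.isoformat() if isinstance(v, datetime) else v for k, v in pairs}

order = OrderRecord("ord_123", datetime(2026, 3, 27, 10, 0, 0))
row = asdict(order, dict_factory=_json_safe)
print(row)  # {'order_id': 'ord_123', 'created_at': '2026-03-27T10:00:00'}
```

The trade-off: asdict recurses into nested dataclasses for you, but an explicit to_dict makes the warehouse schema visible at a glance. Either is defensible.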
For frozen (immutable) configuration objects, add frozen=True:
@dataclass(frozen=True)
class PipelineConfig:
    source_table: str
    destination_schema: str
    batch_size: int = 1000
    dry_run: bool = False

config = PipelineConfig(
    source_table="raw.orders",
    destination_schema="silver",
)
# config.batch_size = 500  # Raises FrozenInstanceError
Pydantic: Validation at the Boundary
Dataclasses describe structure but do not validate values. Pydantic does both: it defines typed models and validates incoming data against them at runtime. In data engineering, this is most valuable at the ingestion boundary -- when you receive data from an external API, webhook, or file upload.
from pydantic import BaseModel, Field, field_validator, model_validator
from datetime import datetime
from typing import Literal, Optional

class IncomingOrder(BaseModel):
    order_id: str = Field(..., min_length=1, max_length=50)
    customer_id: str
    amount_cents: int = Field(..., gt=0)
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    status: Literal["pending", "processing", "shipped", "delivered", "cancelled"]
    created_at: datetime
    shipped_at: Optional[datetime] = None

    @field_validator("customer_id")
    @classmethod
    def customer_id_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("customer_id cannot be empty or whitespace")
        return v.strip()

    @model_validator(mode="after")
    def delivered_needs_ship_date(self) -> "IncomingOrder":
        # Cross-field validation: a delivered order must have a ship date
        if self.status == "delivered" and self.shipped_at is None:
            raise ValueError("delivered orders must have a shipped_at timestamp")
        return self

    @property
    def amount_usd(self) -> float:
        return self.amount_cents / 100

# Pydantic raises ValidationError with field-level details on bad data
try:
    order = IncomingOrder(
        order_id="ord_123",
        customer_id="cust_456",
        amount_cents=-50,  # Fails: gt=0
        currency="usd",  # Fails: pattern requires uppercase
        status="unknown",  # Fails: not in Literal
        created_at="2026-03-27T10:00:00Z",
    )
except Exception as e:
    print(e)  # Detailed field-level error messages
The key pattern for pipelines: validate at the source boundary using Pydantic, convert to a dataclass or typed dict for internal processing, and serialize back to a dict or JSON for warehouse loading. This keeps the messy external-world validation logic isolated at the edge and lets your internal pipeline code work with clean, typed objects.
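The boundary pattern itself does not depend on Pydantic -- the shape matters more than the library. A hand-rolled stdlib sketch of the same three steps (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class CleanOrder:
    order_id: str
    amount_usd: float

def validate_raw(raw: dict) -> CleanOrder:
    """Edge validation: reject bad shapes before they enter the pipeline."""
    order_id = raw.get("order_id")
    amount_cents = raw.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return CleanOrder(order_id=order_id, amount_usd=amount_cents / 100)

def to_row(order: CleanOrder) -> dict:
    """Serialize the clean internal object back to a dict for loading."""
    return {"order_id": order.order_id, "amount_usd": order.amount_usd}

row = to_row(validate_raw({"order_id": "ord_1", "amount_cents": 250}))
print(row)  # {'order_id': 'ord_1', 'amount_usd': 2.5}
```

Pydantic earns its place by replacing all of those isinstance checks with declarative field types, but the validate → internal type → serialize flow is the same.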
TypedDict: Typing Without Abandoning Dicts
Sometimes you are working with APIs or libraries that expect dictionaries. TypedDict lets you add type information to dicts without converting them to objects:
from typing import TypedDict, Required, NotRequired

class OrderRow(TypedDict):
    order_id: Required[str]
    customer_id: Required[str]
    amount_usd: Required[float]
    status: Required[str]
    notes: NotRequired[str]  # Optional field

def load_to_snowflake(rows: list[OrderRow]) -> None:
    # mypy knows each row has order_id, customer_id, amount_usd, status
    for row in rows:
        print(row["order_id"])  # Autocompleted + type-checked
        print(row.get("notes", ""))  # NotRequired handled correctly
TypedDict is the right tool when you need compatibility with dict-expecting APIs (Snowflake connectors, Pandas, Spark) but want type-checker visibility into the dict structure.
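One caveat worth internalizing: TypedDict exists only for the type checker. At runtime the "constructor" produces a plain dict and performs no validation -- a quick demonstration (class and field names illustrative):

```python
from typing import TypedDict

class OrderRow(TypedDict):
    order_id: str
    amount_usd: float

row = OrderRow(order_id="ord_1", amount_usd=12.5)
print(type(row))  # <class 'dict'> -- just a dict at runtime

# No runtime enforcement: mypy would flag both arguments here,
# but the interpreter constructs the dict without complaint.
bad = OrderRow(order_id=123, amount_usd="oops")
print(bad)
```

If you need runtime guarantees on dict-shaped data, that is Pydantic's job; TypedDict only buys you static checking and autocomplete.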
Practical Integration: A Typed Ingestion Pipeline
Putting it together into a typed ingestion pipeline:
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Iterator
import httpx

# 1. Pydantic: validates external API response
class ApiOrder(BaseModel):
    id: str
    customerId: str
    totalCents: int
    orderStatus: str
    createdAt: str

# 2. Dataclass: internal clean representation
@dataclass
class OrderRecord:
    order_id: str
    customer_id: str
    amount_usd: float
    status: str
    created_at: str

def parse_order(raw: dict) -> OrderRecord:
    """Validate external data, convert to internal type."""
    validated = ApiOrder.model_validate(raw)
    return OrderRecord(
        order_id=validated.id,
        customer_id=validated.customerId,
        amount_usd=validated.totalCents / 100,
        status=validated.orderStatus.lower(),
        created_at=validated.createdAt,
    )

def fetch_orders(api_url: str, api_key: str) -> Iterator[OrderRecord]:
    """Typed generator -- callers know they get OrderRecord objects."""
    response = httpx.get(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    for raw_order in response.json()["orders"]:
        try:
            yield parse_order(raw_order)
        except Exception as e:
            # Log and skip invalid records rather than failing the batch
            print(f"Skipping invalid order {raw_order.get('id')}: {e}")

def load_to_warehouse(orders: Iterator[OrderRecord]) -> int:
    """Returns row count loaded."""
    rows = [
        {
            "order_id": o.order_id,
            "customer_id": o.customer_id,
            "amount_usd": o.amount_usd,
            "status": o.status,
            "created_at": o.created_at,
        }
        for o in orders
    ]
    # ... warehouse insert
    return len(rows)
Every function has a declared input and output type. mypy can verify that fetch_orders returns the right type, that load_to_warehouse receives the right iterator, and that the field names used in the dict comprehension exist on the dataclass. Schema changes in the API now surface as ValidationError at runtime, not as silent wrong data three stages downstream.
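The log-and-skip behavior in fetch_orders is easy to verify in isolation. Here is a stdlib-only stand-in (no httpx or Pydantic) mimicking the same shape -- parse, skip invalid records, count what loads:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class OrderRecord:
    order_id: str
    amount_usd: float

def parse_order(raw: dict) -> OrderRecord:
    # Stand-in for the Pydantic-backed parse_order above
    return OrderRecord(order_id=raw["id"], amount_usd=raw["totalCents"] / 100)

def iter_orders(payload: list[dict]) -> Iterator[OrderRecord]:
    """Yield valid records; log and skip the rest, as in fetch_orders."""
    for raw in payload:
        try:
            yield parse_order(raw)
        except KeyError as e:
            print(f"Skipping invalid order {raw.get('id')}: missing {e}")

payload = [
    {"id": "ord_1", "totalCents": 250},
    {"id": "ord_2"},  # missing totalCents -- skipped, not fatal
    {"id": "ord_3", "totalCents": 999},
]
loaded = list(iter_orders(payload))
print(len(loaded))  # 2
```

Whether to skip or fail the batch on invalid records is a policy decision; the typed generator makes that policy live in exactly one place.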
Getting Started: The Practical Path
You do not need to type an entire codebase to get value. The highest-leverage starting points:
- Type all function signatures. Return types, parameter types. This alone significantly improves IDE autocomplete and mypy coverage without touching the function bodies.
- Add Pydantic models to ingestion boundaries. Every place you receive external data (API call, file read, webhook payload), add a Pydantic model. This is where the most unexpected runtime errors originate.
- Replace config dicts with dataclasses. Pipeline configuration objects are the easiest refactor -- well-scoped, not in the hot path, and immediately improve readability.
- Run mypy in CI. Even with --ignore-missing-imports and lenient settings, running mypy in your CI pipeline catches regressions as the codebase evolves.
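A lenient mypy configuration is enough to start. A sketch of a pyproject.toml section using documented mypy option names (tune the strictness to your codebase; the Python version shown is an assumption):

```toml
# pyproject.toml -- a permissive starting point; tighten over time
[tool.mypy]
python_version = "3.11"
ignore_missing_imports = true   # third-party libs without stubs won't block CI
check_untyped_defs = true       # still analyze bodies of unannotated functions
warn_return_any = true
warn_unused_ignores = true
```

As coverage grows, you can graduate toward mypy's strict mode module by module rather than all at once.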
The payoff is not just fewer runtime errors. A typed Python codebase is faster to navigate, easier to refactor, and requires less documentation because the types carry the intent. For a data engineering team where pipelines are maintained by multiple people over years, that compounding clarity is worth the upfront annotation work.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.