Python Type Hints and Dataclasses for Data Engineers: Writing Code That Does Not Surprise You
Ryan Kirsch · December 3, 2025 · 9 min read
Data engineering Python tends to be loosely typed, dictionary-heavy, and full of implicit contracts between functions. This works until a source API changes a field name, a pipeline function silently accepts the wrong shape, and you find out three stages downstream when something is None that should not be. Type hints and structured data classes are the cheapest way to fix this.
The Problem with Dictionary-Driven Pipelines
Most data engineering Python looks like this:
def process_order(order: dict) -> dict:
    return {
        "order_id": order["order_id"],
        "revenue": order["amount"] / 100,
        "customer": order["customer_id"],
    }

# Called somewhere else, 200 lines away:
result = process_order(api_response["data"])
print(result["customer"])  # KeyError if API renamed the field
This code has no way to tell you at development time that the API response might not have an amount field, or that customer_id was renamed to customerId in the v2 API. The failure surface is the entire pipeline, and the error appears at runtime on production data, not in your IDE.
Type hints do not eliminate this problem entirely -- Python's type system is gradual and optional -- but they move a significant portion of the error surface to development time when combined with a type checker like mypy or pyright, and to ingestion time when combined with a validation library like Pydantic.
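To make the "gradual and optional" point concrete, here is a minimal stdlib-only illustration (function names are mine, not from any real pipeline): annotations are stored on the function object but never enforced when it runs, which is exactly why a static checker is needed.

```python
def double(n: int) -> int:
    """Annotated, but Python does not enforce these types at runtime."""
    return n * 2

# mypy/pyright would flag this call; the interpreter happily runs it,
# because "ab" * 2 is string repetition.
print(double("ab"))  # prints "abab", not a type error

# The annotations are just metadata attached to the function object.
print(double.__annotations__)  # {'n': <class 'int'>, 'return': <class 'int'>}
```

The wrong-type call succeeds silently, which is the failure mode the rest of this post is about eliminating.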
Type Hints: The Foundation
Python type hints are annotations that declare the expected types of function parameters and return values. They do not enforce anything at runtime on their own -- they are checked by static analysis tools and communicate intent to other developers.
from typing import Optional, List, Dict, Any
from datetime import datetime

# Without type hints
def extract_orders(source, start_date, end_date):
    pass

# With type hints -- immediately communicates contract
def extract_orders(
    source: str,
    start_date: datetime,
    end_date: datetime,
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    pass

# Modern Python (3.10+) uses union operator instead of Optional
def get_customer(
    customer_id: str,
    include_deleted: bool = False,
) -> dict | None:
    pass
Run mypy or pyright over a typed codebase and it catches callers passing a string where a datetime is expected, functions returning None where a list is expected, and attribute access on potentially None values. These are the bugs that show up as runtime failures at 2 AM in production.
The investment is low. Adding type hints to function signatures takes minutes per function. The payoff in a team environment is that new engineers understand the API of every function from the signature alone without reading the body.
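One concrete class of bug this catches: forgetting to handle a None return. A small sketch (illustrative names) of the narrowing pattern a type checker forces you to write:

```python
from datetime import datetime
from typing import Optional

def parse_event_time(raw: str) -> Optional[datetime]:
    """Returns None on unparseable input instead of raising -- callers must check."""
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None

ts = parse_event_time("not-a-date")
# mypy flags bare `ts.year` here: ts might be None. The explicit check
# below is the narrowing the type checker insists on.
if ts is not None:
    print(ts.year)
else:
    print("unparseable timestamp")
```

Without the `Optional[datetime]` annotation, the missing None-check is invisible until a malformed record hits production.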
Dataclasses: Structured Data Without Boilerplate
Python dataclasses, introduced in 3.7, give you structured objects with automatic __init__, __repr__, and __eq__ without writing any boilerplate. For data engineering, they are the right abstraction for representing pipeline records, configuration objects, and intermediate computation results.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class OrderRecord:
    order_id: str
    customer_id: str
    amount_usd: float
    status: str
    created_at: datetime
    shipped_at: Optional[datetime] = None
    tags: list[str] = field(default_factory=list)

    def is_delivered(self) -> bool:
        return self.status == "delivered"

    def to_dict(self) -> dict:
        return {
            "order_id": self.order_id,
            "customer_id": self.customer_id,
            "amount_usd": self.amount_usd,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
            "shipped_at": self.shipped_at.isoformat() if self.shipped_at else None,
        }

# Usage -- IDE autocompletes fields, mypy checks types
order = OrderRecord(
    order_id="ord_123",
    customer_id="cust_456",
    amount_usd=49.99,
    status="shipped",
    created_at=datetime(2026, 3, 27, 10, 0, 0),
)
print(order.is_delivered())  # False
print(order.to_dict())
Compare this to a dictionary: you cannot call order.is_delivered() on a dict, your IDE cannot autocomplete order.amount_usd, and nothing tells you at write time that you missed a required field.
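A hand-written to_dict works, but it is worth knowing that dataclasses.asdict can generate the dict automatically, and its dict_factory hook can handle the datetime serialization. A sketch using a reduced two-field version of the same record shape:

```python
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class OrderRecord:
    order_id: str
    created_at: datetime

def _json_safe(pairs: list[tuple[str, object]]) -> dict:
    """dict_factory hook: isoformat any datetime values during conversion."""
    return {k: v.isoformat() if isinstance(v, datetime) else v for k, v in pairs}

order = OrderRecord("ord_123", datetime(2026, 3, 27, 10, 0, 0))
row = asdict(order, dict_factory=_json_safe)
print(row)  # {'order_id': 'ord_123', 'created_at': '2026-03-27T10:00:00'}
```

The trade-off: asdict recurses into nested dataclasses for you, but an explicit to_dict makes the warehouse schema visible at a glance. Either is defensible.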
For frozen (immutable) configuration objects, add frozen=True:
@dataclass(frozen=True)
class PipelineConfig:
    source_table: str
    destination_schema: str
    batch_size: int = 1000
    dry_run: bool = False

config = PipelineConfig(
    source_table="raw.orders",
    destination_schema="silver",
)
# config.batch_size = 500  # Raises FrozenInstanceError
Pydantic: Validation at the Boundary
Dataclasses describe structure but do not validate values. Pydantic does both: it defines typed models and validates incoming data against them at runtime. In data engineering, this is most valuable at the ingestion boundary -- when you receive data from an external API, webhook, or file upload.
from pydantic import BaseModel, Field, field_validator, model_validator
from datetime import datetime
from typing import Literal, Optional

class IncomingOrder(BaseModel):
    order_id: str = Field(..., min_length=1, max_length=50)
    customer_id: str
    amount_cents: int = Field(..., gt=0)
    currency: str = Field(..., pattern=r"^[A-Z]{3}$")
    status: Literal["pending", "processing", "shipped", "delivered", "cancelled"]
    created_at: datetime
    shipped_at: Optional[datetime] = None

    @field_validator("customer_id")
    @classmethod
    def customer_id_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("customer_id cannot be empty or whitespace")
        return v.strip()

    @model_validator(mode="after")
    def delivered_needs_ship_date(self) -> "IncomingOrder":
        # Cross-field validation: a delivered order must have a ship date
        if self.status == "delivered" and self.shipped_at is None:
            raise ValueError("delivered orders must have a shipped_at timestamp")
        return self

    @property
    def amount_usd(self) -> float:
        return self.amount_cents / 100

# Pydantic raises ValidationError with field-level details on bad data
try:
    order = IncomingOrder(
        order_id="ord_123",
        customer_id="cust_456",
        amount_cents=-50,  # Fails: gt=0
        currency="usd",  # Fails: pattern requires uppercase
        status="unknown",  # Fails: not in Literal
        created_at="2026-03-27T10:00:00Z",
    )
except Exception as e:
    print(e)  # Detailed field-level error messages
The key pattern for pipelines: validate at the source boundary using Pydantic, convert to a dataclass or typed dict for internal processing, and serialize back to a dict or JSON for warehouse loading. This keeps the messy external-world validation logic isolated at the edge and lets your internal pipeline code work with clean, typed objects.
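The boundary pattern itself does not depend on Pydantic -- the shape matters more than the library. A hand-rolled stdlib sketch of the same three steps (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class CleanOrder:
    order_id: str
    amount_usd: float

def validate_raw(raw: dict) -> CleanOrder:
    """Edge validation: reject bad shapes before they enter the pipeline."""
    order_id = raw.get("order_id")
    amount_cents = raw.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return CleanOrder(order_id=order_id, amount_usd=amount_cents / 100)

def to_row(order: CleanOrder) -> dict:
    """Serialize the clean internal object back to a dict for loading."""
    return {"order_id": order.order_id, "amount_usd": order.amount_usd}

row = to_row(validate_raw({"order_id": "ord_1", "amount_cents": 250}))
print(row)  # {'order_id': 'ord_1', 'amount_usd': 2.5}
```

Pydantic earns its place by replacing all of those isinstance checks with declarative field types, but the validate → internal type → serialize flow is the same.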
TypedDict: Typing Without Abandoning Dicts
Sometimes you are working with APIs or libraries that expect dictionaries. TypedDict lets you add type information to dicts without converting them to objects:
from typing import TypedDict, Required, NotRequired

class OrderRow(TypedDict):
    order_id: Required[str]
    customer_id: Required[str]
    amount_usd: Required[float]
    status: Required[str]
    notes: NotRequired[str]  # Optional field

def load_to_snowflake(rows: list[OrderRow]) -> None:
    # mypy knows each row has order_id, customer_id, amount_usd, status
    for row in rows:
        print(row["order_id"])  # Autocompleted + type-checked
        print(row.get("notes", ""))  # NotRequired handled correctly
TypedDict is the right tool when you need compatibility with dict-expecting APIs (Snowflake connectors, Pandas, Spark) but want type-checker visibility into the dict structure.
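One caveat worth internalizing: TypedDict exists only for the type checker. At runtime the "constructor" produces a plain dict and performs no validation -- a quick demonstration (class and field names illustrative):

```python
from typing import TypedDict

class OrderRow(TypedDict):
    order_id: str
    amount_usd: float

row = OrderRow(order_id="ord_1", amount_usd=12.5)
print(type(row))  # <class 'dict'> -- just a dict at runtime

# No runtime enforcement: mypy would flag both arguments here,
# but the interpreter constructs the dict without complaint.
bad = OrderRow(order_id=123, amount_usd="oops")
print(bad)
```

If you need runtime guarantees on dict-shaped data, that is Pydantic's job; TypedDict only buys you static checking and autocomplete.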
Practical Integration: A Typed Ingestion Pipeline
Putting it together into a typed ingestion pipeline:
from pydantic import BaseModel
from dataclasses import dataclass
from typing import Iterator
import httpx

# 1. Pydantic: validates external API response
class ApiOrder(BaseModel):
    id: str
    customerId: str
    totalCents: int
    orderStatus: str
    createdAt: str

# 2. Dataclass: internal clean representation
@dataclass
class OrderRecord:
    order_id: str
    customer_id: str
    amount_usd: float
    status: str
    created_at: str

def parse_order(raw: dict) -> OrderRecord:
    """Validate external data, convert to internal type."""
    validated = ApiOrder.model_validate(raw)
    return OrderRecord(
        order_id=validated.id,
        customer_id=validated.customerId,
        amount_usd=validated.totalCents / 100,
        status=validated.orderStatus.lower(),
        created_at=validated.createdAt,
    )

def fetch_orders(api_url: str, api_key: str) -> Iterator[OrderRecord]:
    """Typed generator -- callers know they get OrderRecord objects."""
    response = httpx.get(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    for raw_order in response.json()["orders"]:
        try:
            yield parse_order(raw_order)
        except Exception as e:
            # Log and skip invalid records rather than failing the batch
            print(f"Skipping invalid order {raw_order.get('id')}: {e}")

def load_to_warehouse(orders: Iterator[OrderRecord]) -> int:
    """Returns row count loaded."""
    rows = [
        {
            "order_id": o.order_id,
            "customer_id": o.customer_id,
            "amount_usd": o.amount_usd,
            "status": o.status,
            "created_at": o.created_at,
        }
        for o in orders
    ]
    # ... warehouse insert
    return len(rows)
Every function has a declared input and output type. mypy can verify that fetch_orders returns the right type, that load_to_warehouse receives the right iterator, and that the field names used in the dict comprehension exist on the dataclass. Schema changes in the API now surface as ValidationError at runtime, not as silent wrong data three stages downstream.
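The log-and-skip behavior in fetch_orders is easy to verify in isolation. Here is a stdlib-only stand-in (no httpx or Pydantic) mimicking the same shape -- parse, skip invalid records, count what loads:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class OrderRecord:
    order_id: str
    amount_usd: float

def parse_order(raw: dict) -> OrderRecord:
    # Stand-in for the Pydantic-backed parse_order above
    return OrderRecord(order_id=raw["id"], amount_usd=raw["totalCents"] / 100)

def iter_orders(payload: list[dict]) -> Iterator[OrderRecord]:
    """Yield valid records; log and skip the rest, as in fetch_orders."""
    for raw in payload:
        try:
            yield parse_order(raw)
        except KeyError as e:
            print(f"Skipping invalid order {raw.get('id')}: missing {e}")

payload = [
    {"id": "ord_1", "totalCents": 250},
    {"id": "ord_2"},  # missing totalCents -- skipped, not fatal
    {"id": "ord_3", "totalCents": 999},
]
loaded = list(iter_orders(payload))
print(len(loaded))  # 2
```

Whether to skip or fail the batch on invalid records is a policy decision; the typed generator makes that policy live in exactly one place.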
Getting Started: The Practical Path
You do not need to type an entire codebase to get value. The highest-leverage starting points:
- Type all function signatures. Return types, parameter types. This alone significantly improves IDE autocomplete and mypy coverage without touching the function bodies.
- Add Pydantic models to ingestion boundaries. Every place you receive external data (API call, file read, webhook payload), add a Pydantic model. This is where the most unexpected runtime errors originate.
- Replace config dicts with dataclasses. Pipeline configuration objects are the easiest refactor -- well-scoped, not in the hot path, and immediately improve readability.
- Run mypy in CI. Even with --ignore-missing-imports and lenient settings, running mypy in your CI pipeline catches regressions as the codebase evolves.
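A lenient mypy configuration is enough to start. A sketch of a pyproject.toml section using documented mypy option names (tune the strictness to your codebase; the Python version shown is an assumption):

```toml
# pyproject.toml -- a permissive starting point; tighten over time
[tool.mypy]
python_version = "3.11"
ignore_missing_imports = true   # third-party libs without stubs won't block CI
check_untyped_defs = true       # still analyze bodies of unannotated functions
warn_return_any = true
warn_unused_ignores = true
```

As coverage grows, you can graduate toward mypy's strict mode module by module rather than all at once.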
The payoff is not just fewer runtime errors. A typed Python codebase is faster to navigate, easier to refactor, and requires less documentation because the types carry the intent. For a data engineering team where pipelines are maintained by multiple people over years, that compounding clarity is worth the upfront annotation work.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.