Python for Data Engineers: pandas, PySpark, Polars, and the Modern Python Data Stack
November 12, 2025 · 10 min read
Python became the lingua franca of data engineering for one reason: ecosystem breadth. It won because every layer of the stack, from orchestration to storage to ML, exposed a Python interface that was easy to adopt. Speed did not make it the default; community momentum did. Spark, Airflow, dbt, Dagster, Prefect, and DuckDB all treat Python as a first-class API, which means most teams can standardize on one language for data work and keep velocity high.
That ecosystem is now mature enough that you can design a full data platform without leaving Python. A production pipeline might read Parquet with PyArrow, validate inputs with Pydantic, transform with Polars, orchestrate with Dagster, and publish metrics to a warehouse through SQLAlchemy. The question is not whether Python can do the job; it is which tools inside the Python ecosystem are the right fit for each stage of the job.
pandas fundamentals and when it breaks down
pandas is the default DataFrame library for a reason. It has the cleanest ergonomics for exploratory work, and it is the best way to express row and column operations quickly. The API is intuitive for anyone who has written SQL or used spreadsheets. You should be able to read data, filter it, aggregate it, and join it without thinking.
Core DataFrame patterns
These are the operations that show up constantly in production scripts and pipelines: reading Parquet and CSV, filtering, grouping, and merges. Keep them fluid:
import pandas as pd
# Read data
orders = pd.read_parquet("s3://warehouse/orders.parquet")
users = pd.read_csv("/data/users.csv")
# Filter and select
recent = orders.loc[orders["created_at"] >= "2025-01-01", ["order_id", "user_id", "total"]]
# Group and aggregate
daily = (
    recent.assign(day=recent["created_at"].str.slice(0, 10))
    .groupby("day", as_index=False)
    .agg(total_revenue=("total", "sum"), order_count=("order_id", "count"))
)
# Join user plans onto the order-level rows (daily has no user_id column to join on)
result = recent.merge(users[["user_id", "plan"]], on="user_id", how="left")
Where pandas breaks
pandas loads everything into memory. If your dataset is larger than RAM, you will hit an out-of-memory error or your machine will thrash itself into unusable swap. The rule of thumb is simple: budget 5 to 10x the dataset size in RAM for comfortable pandas work. That buffer accounts for intermediate copies, groupby state, and join fan-out. If you have a 10GB Parquet dataset, you often need 50GB to 100GB of memory to manipulate it safely in pandas.
That limitation does not make pandas bad; it just makes it the wrong tool for medium to large data. When data grows beyond memory, you either move to a more efficient engine on the same machine or go distributed.
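The 5 to 10x rule above is easy to encode as a capacity-planning sanity check. This helper is an illustrative sketch, not a standard utility; the function name and multipliers simply restate the heuristic:

```python
# Illustrative helper encoding the 5-10x rule of thumb.
# The multipliers are heuristics for planning, not guarantees.
def pandas_ram_estimate(dataset_gb: float) -> tuple[float, float]:
    """Return a (low, high) RAM estimate in GB for comfortable pandas work."""
    return (5 * dataset_gb, 10 * dataset_gb)

low, high = pandas_ram_estimate(10)
print(f"10GB dataset -> plan for {low:.0f}-{high:.0f}GB of RAM")
```

If the high end of that range exceeds the machines you actually have, that is the signal to reach for Polars, DuckDB, or Spark before the pipeline falls over in production.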
PySpark for distributed processing
PySpark exists for the point where a single machine is no longer enough. Spark is a distributed execution engine, and PySpark is the Python interface to it. The win is scale and fault tolerance. The cost is more complexity, more operational overhead, and a steeper learning curve. In practice, PySpark starts to make sense when your datasets are too big for local memory, or when you already operate a Spark cluster for other workloads.
A practical breakpoint is around 50GB of raw data. You can sometimes stretch a single machine with Polars, DuckDB, or Arrow beyond that, but at some point the operational safety of distributed compute wins.
A minimal PySpark job
The core loop is always the same: read, transform, aggregate, write. Keep the I/O and transformation separate so the job is testable and the execution plan stays clear.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("orders").getOrCreate()
orders = spark.read.parquet("s3://warehouse/orders/")
result = (
    orders.filter(F.col("status") == "completed")
    .groupBy("customer_id")
    .agg(
        F.count("order_id").alias("order_count"),
        F.sum("total").alias("lifetime_value"),
    )
)
result.write.mode("overwrite").parquet("s3://warehouse/analytics/customer_ltv/")
Polars: the modern alternative
Polars is the modern in-memory DataFrame engine for Python. It is written in Rust, it uses lazy execution, and it is fast enough to make pandas feel outdated for many workloads. If you want a single-machine tool that scales to tens of gigabytes, this is it. It also has a strong expression API that reduces the need for row-wise Python logic, which is the primary source of slowness in pandas.
For most teams, Polars is the sweet spot for 1GB to 50GB datasets on a developer laptop. It uses less memory, parallelizes operations, and can push predicate filters into file scans. That gives it a real performance advantage without the operational cost of Spark.
Polars vs pandas example
The operations are nearly identical, but Polars encourages you to stay within its expression engine. That pays off when you chain multiple transforms:
import pandas as pd
import polars as pl
# pandas
p_df = pd.read_parquet("/data/orders.parquet")
summary_pd = (
    p_df[p_df["status"] == "completed"]
    .groupby("customer_id", as_index=False)
    .agg(order_count=("order_id", "count"), revenue=("total", "sum"))
)
# Polars
pl_df = pl.read_parquet("/data/orders.parquet")
summary_pl = (
    pl_df.lazy()
    .filter(pl.col("status") == "completed")
    .group_by("customer_id")
    .agg(
        pl.col("order_id").count().alias("order_count"),
        pl.col("total").sum().alias("revenue"),
    )
    .collect()
)
When to use what: decision framework
There is no perfect tool, only the right tool for your data size and operational constraints. Use this decision framework to avoid over-engineering.
Quick comparison
| Tool | Best for data size | Strengths | When it fails |
| --- | --- | --- | --- |
| pandas | < 1GB | Fast iteration, rich API | Memory bound, slow on big joins |
| Polars | 1GB to 50GB | Fast, lazy execution, low memory | Single machine limit |
| PySpark | 50GB+ | Distributed compute, fault tolerant | Operational overhead, slower dev loop |
Decision rules
pandas is best when you are exploring data, writing quick scripts, or doing small batch transforms under 1GB. Polars is the upgrade when you need production grade local processing, larger files, or real performance. PySpark is for data above 50GB, existing Spark infrastructure, or workflows that already live in the Spark ecosystem. If you are on the fence between Polars and PySpark, start with Polars. It will get you 80 percent of the value with 20 percent of the complexity.
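Those rules are mechanical enough to write down as code. This is a hypothetical helper (the function name and thresholds just restate the framework above); real decisions also weigh team skills and existing infrastructure:

```python
# Hypothetical engine chooser encoding the decision rules above.
# Thresholds mirror the article's 1GB and 50GB breakpoints.
def choose_engine(dataset_gb: float, has_spark_cluster: bool = False) -> str:
    """Map dataset size (and existing infra) to a DataFrame engine."""
    if has_spark_cluster or dataset_gb > 50:
        return "pyspark"
    if dataset_gb > 1:
        return "polars"
    return "pandas"

print(choose_engine(0.2))       # pandas
print(choose_engine(20))        # polars
print(choose_engine(20, True))  # pyspark
```

Note the `has_spark_cluster` short-circuit: if the cluster already exists and the team already operates it, the marginal cost of Spark drops and the breakpoint moves down.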
The supporting cast: key Python libraries
Data engineering is not just DataFrames. You need stable database connections, validation, and reliable I/O. These libraries show up everywhere in production stacks.
SQLAlchemy for databases
from sqlalchemy import create_engine, text
engine = create_engine("postgresql+psycopg2://user:pass@host:5432/warehouse")
with engine.begin() as conn:
    conn.execute(text("""
        create table if not exists metrics.daily_revenue (
            day date primary key,
            revenue numeric
        )
    """))
    conn.execute(
        text("insert into metrics.daily_revenue (day, revenue) values (:day, :rev)"),
        [{"day": "2025-07-01", "rev": 12540.50}],
    )
Pydantic for contracts
from pydantic import BaseModel, Field
from datetime import datetime
class OrderEvent(BaseModel):
    order_id: str
    user_id: str
    total: float = Field(gt=0)
    created_at: datetime

payload = {"order_id": "o1", "user_id": "u1", "total": 42.5, "created_at": "2025-07-01T12:34:00"}
validated = OrderEvent.model_validate(payload)
httpx for async APIs
import asyncio
import httpx
async def fetch_rates():
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get("https://api.exchangerate.host/latest?base=USD")
        resp.raise_for_status()
        return resp.json()["rates"]

rates = asyncio.run(fetch_rates())
Orchestration hooks
# Airflow
from airflow.decorators import task

@task
def build_daily_metrics():
    return "metrics-ready"

# Dagster
from dagster import asset

@asset
def daily_metrics():
    return "metrics-ready"

# Prefect (aliased so it does not shadow the Airflow import above)
from prefect import task as prefect_task

@prefect_task
def daily_metrics_prefect():
    return "metrics-ready"
Type hints and testing in DE workflows
Most data bugs are input shape bugs. Python type hints reduce the risk by making contracts explicit and verifiable. mypy or pyright can catch missing fields, wrong types, and illegal return values before the pipeline ever runs.
Type hints on pipeline code
from typing import Iterable, TypedDict

import pandas as pd

# A TypedDict gives type checkers a real row contract;
# subclassing dict with `pass` checks nothing
class MetricRow(TypedDict):
    day: str
    revenue: float
    orders: int

def build_metrics(rows: Iterable[MetricRow]) -> pd.DataFrame:
    df = pd.DataFrame(rows)
    return df[["day", "revenue", "orders"]]
Testing with pytest
from pipeline import build_metrics  # import from wherever build_metrics lives

def test_build_metrics():
    rows = [
        {"day": "2025-07-01", "revenue": 100.0, "orders": 2},
        {"day": "2025-07-02", "revenue": 200.0, "orders": 3},
    ]
    df = build_metrics(rows)
    assert list(df.columns) == ["day", "revenue", "orders"]
    assert df["revenue"].sum() == 300.0
Data quality with Great Expectations
import great_expectations as ge
import pandas as pd

# Legacy-style GE API; newer releases validate through a Data Context instead
ge_df = ge.from_pandas(pd.DataFrame({"order_id": ["o1", "o2"], "total": [10.0, 0.0]}))
ge_df.expect_column_values_to_not_be_null("order_id")
# The 0.0 row fails this expectation, which is exactly what we want to catch
ge_df.expect_column_values_to_be_between("total", min_value=0.01)
Closing thoughts
Python is not getting weaker as a data engineering language; it is getting stronger. Polars and DuckDB are proof that the community is willing to solve its own performance problems. The best part is that you can adopt these tools incrementally without rewriting your stack.
If you are building a modern data platform today, the winning move is to stay fluent in the ecosystem and choose the right execution engine for the job. pandas for speed of thought, Polars for fast local compute, PySpark for distributed scale. The stack is cohesive and it keeps improving.
Questions or pushback on any of this? Find me on LinkedIn.
Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.