Cost-Efficient Data Engineering: How to Spend Less on Infrastructure Without Sacrificing Reliability
Ryan Kirsch · March 18, 2026 · 10-12 min read
Cost control is not about starving your platform. It is about aligning infrastructure spend with real business value. The best cost savings I have seen come from thoughtful defaults, not from panic-driven cuts.
This is a production-focused playbook for spending less without breaking data reliability. It covers warehouse right-sizing, Iceberg + S3 tiered storage, spot instances for Spark, dbt incremental cost math, query patterns that quietly destroy your bill, monitoring for cost spikes, and a cost-per-insight culture that keeps teams aligned.
If your data platform bill is growing faster than your value delivered, this is the path back to sanity.
Right-Size Warehouses With Auto-Suspend
The easiest waste to remove is idle compute. If a warehouse runs 24/7 but your workloads run 8 hours a day, you are paying for compute that serves no queries at all. Auto-suspend and auto-resume should be enabled by default, and the suspend delay should be short enough that you are not paying for the five-minute gaps between jobs.
Cluster size matters too. It is common to start with a large warehouse to make queries fast, then forget to scale down once models stabilize. The pattern I recommend is to keep a smaller default warehouse for BI and ad-hoc analytics, and only scale up for backfills or heavier transforms. That separation stops the long tail of exploratory queries from running on your most expensive compute.
-- Snowflake example
ALTER WAREHOUSE analytics_wh SET
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;
The key mindset is to treat warehouse size as a runtime parameter, not a permanent decision. Use large clusters for scheduled, time-boxed jobs, not as the default for every workload.
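In practice, that means wrapping a heavy job with resize statements. A sketch, assuming a dedicated transform warehouse named etl_wh (the name is illustrative):

```
-- Scale up only for the duration of the heavy job
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';  -- right before the backfill

-- ... run the backfill ...

ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL';   -- immediately after it finishes
```

Put the resize statements in the job's orchestration (pre- and post-hooks), not in someone's memory, so the scale-down never gets forgotten.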
Tiered Storage With Iceberg + S3
For many teams, 80 percent of warehouse data is rarely queried. Keeping that data in the most expensive storage tier is rarely justified. This is where the lakehouse pattern with Iceberg and S3 pays off. You move colder data to object storage while keeping metadata and recent data optimized for fast queries.
The operational shift is simple: data that needs frequent, fast access lives in the warehouse. Data that is mainly for compliance, audits, or rare historical analysis lives in Iceberg tables on S3. Engines like Snowflake, Trino, and Spark can still query those tables when needed, but you are not paying premium storage costs for the entire history.
-- Spark SQL example (column list is illustrative; adjust to your schema)
CREATE TABLE lakehouse.events (
  event_id   string,
  user_id    string,
  event_type string,
  event_ts   timestamp,
  event_date date
)
USING iceberg
LOCATION 's3://data-lake/iceberg/events'
TBLPROPERTIES (
  'format-version' = '2',
  'write.target-file-size-bytes' = '536870912'  -- 512 MB target files
);
The savings come from storage pricing and the ability to scale compute only when you query cold data. Most teams underuse this lever.
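The archive step itself is plain SQL. A sketch, assuming the hot copy lives in a hypothetical warehouse.events table; validate row counts before deleting:

```
-- Copy events older than 12 months into the Iceberg table
INSERT INTO lakehouse.events
SELECT event_id, user_id, event_type, event_ts, event_date
FROM warehouse.events
WHERE event_date < dateadd(month, -12, current_date);

-- Then remove them from premium warehouse storage
DELETE FROM warehouse.events
WHERE event_date < dateadd(month, -12, current_date);
```

Run this as a scheduled monthly job so the archive boundary moves forward automatically instead of becoming a one-off cleanup.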
Spot Instances for Batch Spark
If you run Spark for batch processing, spot instances are the single biggest cost lever. They can cut compute costs by 60 to 80 percent, but only if you design your jobs to handle interruptions. The key is to keep shuffle and checkpoint settings resilient so tasks can be restarted without redoing hours of work.
For nightly batch jobs, the tradeoff is almost always worth it. Your cluster is ephemeral anyway, and the job can retry on interruption. This is harder for low-latency streaming jobs, where interruption is a reliability risk.
# Spark on EMR example: instance fleet mixing on-demand and spot
instanceFleets:
  - type: CORE
    targetOnDemandCapacity: 2
    targetSpotCapacity: 8
    instanceTypeConfigs:
      - instanceType: r6g.2xlarge
        weightedCapacity: 1
        bidPriceAsPercentageOfOnDemandPrice: 60

You do not need to go all-in. A 70/30 mix of spot and on-demand gives you both savings and stability.
dbt Incremental Models vs Full Refresh: The Cost Math
Full refreshes are the hidden killer in many dbt environments. They are convenient, but they turn every run into a full table scan. The cost math is straightforward: if you have 365 days of data and only 1 day of new events, a full refresh makes you pay 365x more than you need to.
Incremental models are the fix, but only if they are idempotent and use a lookback window to capture late data. The pattern below reprocesses the last two days while still cutting the bulk of cost.
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

with source as (

  select *
  from {{ ref('stg_events') }}
  {% if is_incremental() %}
  -- lookback window: reprocess the last 2 days to capture late-arriving events
  where event_date >= dateadd(day, -2, current_date)
  {% endif %}

)

select * from source

If the incremental run scans 2 days instead of 365, you just reduced that model’s compute cost by ~99 percent. Multiply that by dozens of models and the savings compound quickly.
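You can verify the savings rather than assume them. A sketch against Snowflake's query history, assuming the model name shows up in the compiled SQL (fct_events is a hypothetical example):

```
-- Bytes scanned per day for one model, before vs. after the incremental conversion
SELECT
  DATE_TRUNC('day', start_time) AS run_date,
  SUM(bytes_scanned) / POWER(1024, 3) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%fct_events%'
  AND start_time >= dateadd(day, -14, current_date)
GROUP BY 1
ORDER BY 1;
```

If gb_scanned does not drop after the conversion, the incremental filter is probably not pruning partitions the way you expect, which is worth catching early.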
Query Patterns That Destroy Your Bill
The top three cost offenders I see are: wide SELECT * statements, missing filters on large tables, and unnecessary cross joins. These are not just style mistakes. They drive scan volume, which drives cost, and they usually happen in ad-hoc analytics where no one is paying attention to efficiency.
Build guardrails: curated data marts with fewer columns, clustering keys that align with filter patterns, and linting rules in SQL review. Most of the time, simply selecting the columns you need and filtering by date reduces cost more than any infrastructure change.
-- Bad
SELECT * FROM raw_events;

-- Better
SELECT event_id, user_id, event_type, event_ts
FROM raw_events
WHERE event_date >= dateadd(day, -7, current_date);
Encourage analysts to explain why a query needs to be unbounded. If the answer is “because it’s easier,” you are paying for convenience instead of insight.
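On the clustering-key guardrail: in Snowflake, aligning the clustering key with the dominant filter column is a one-statement change. A sketch, reusing the raw_events table above:

```
-- Cluster by the column nearly every query filters on,
-- so date-bounded queries prune micro-partitions instead of scanning everything
ALTER TABLE raw_events CLUSTER BY (event_date);
```

Clustering has its own maintenance cost, so reserve it for large tables with a clear, stable filter pattern rather than applying it everywhere.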
Monitor Cost Spikes Like Reliability Incidents
Cost spikes are often the first sign of a bad deploy or a runaway query. Treat them like reliability incidents. Set budgets, enable alerts, and track spend by project or model. The goal is not to stop people from querying, but to make cost visible in the same way you track uptime.
In practice, I build a daily cost dashboard with a rolling 7-day average and a per-workload breakdown. When the curve diverges, we investigate. Half the time, it is a single new model that is scanning far more data than intended. The other half, it is a batch backfill that someone started without warning.
-- Example: daily warehouse credits by workload
SELECT
  DATE_TRUNC('day', start_time) AS usage_date,
  warehouse_name,
  SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= dateadd(day, -30, current_date)
GROUP BY 1, 2
ORDER BY 1 DESC;
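Alerts can also be enforced at the platform level. A sketch using a Snowflake resource monitor; the quota and warehouse name are illustrative:

```
-- Notify at 80% of the monthly quota, hard-stop at 100%
CREATE RESOURCE MONITOR analytics_monthly
  WITH CREDIT_QUOTA = 500
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = analytics_monthly;
```

Use SUSPEND rather than SUSPEND_IMMEDIATE for analytics warehouses so in-flight queries finish instead of failing mid-run.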
Cost-Per-Insight Culture
You cannot optimize what you cannot explain. Cost-per-insight is a simple mental model: for any pipeline or dashboard, what does it cost per business decision it enables? This reframes spend away from “the warehouse bill” and toward value. It is a cultural tool as much as a financial one.
I like to assign clear ownership for high-cost models and dashboards. If the owner can explain the value, the spend is justified. If not, the model is a candidate for deprecation or redesign. This makes cost control a product conversation rather than a finance audit.
Five Quick Wins That Pay for Themselves
These are the fastest, lowest-risk optimizations I recommend to teams that need results in weeks, not quarters.
- Set auto-suspend to 60 seconds or less on all non-critical warehouses.
- Split heavy transformations into a separate, right-sized warehouse.
- Convert the top five full-refresh dbt models to incremental with a lookback window.
- Archive cold tables older than 12 months into Iceberg on S3.
- Alert on warehouse credits exceeding 2x the 7-day moving average.
None of these require a platform migration. They are configuration and workflow changes that usually pay for themselves in the first month.
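The moving-average alert from the list above can be computed straight from the metering history. A sketch in Snowflake SQL:

```
-- Flag days where credits exceed 2x the trailing 7-day average
WITH daily AS (
  SELECT
    DATE_TRUNC('day', start_time) AS usage_date,
    SUM(credits_used) AS credits
  FROM snowflake.account_usage.warehouse_metering_history
  GROUP BY 1
)
SELECT
  usage_date,
  credits,
  AVG(credits) OVER (
    ORDER BY usage_date
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS trailing_avg
FROM daily
QUALIFY credits > 2 * trailing_avg
ORDER BY usage_date DESC;
```

Schedule it daily and route non-empty results to Slack or your incident tooling, so cost anomalies surface the same way reliability incidents do.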
Spend Less by Being Deliberate, Not Cheap
The teams that spend the least are not the ones who do the least. They are the ones who are deliberate about sizing, storage tiers, and query behavior. That discipline gives them headroom to invest in the pipelines that matter most.
If you adopt even half of the patterns in this post, you will lower your bill without lowering reliability. That is the real goal: a data platform that is cost-efficient and trusted.
Questions or pushback on any of this? Find me on LinkedIn.
Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Airflow, and writes about data engineering patterns from production experience. See his full portfolio.