Blog
BigQuery for Data Engineers: Architecture, Optimization, and When to Use It
Ryan Kirsch · December 24, 2025 · 9 min read
BigQuery and Snowflake are the two dominant cloud data warehouses, and they make fundamentally different architectural choices. If you are coming from Snowflake and starting work on a BigQuery platform, or evaluating which to adopt, the differences matter more than the surface-level SQL compatibility. This guide covers what a data engineer needs to know to use BigQuery well.
How BigQuery Is Different from Snowflake
The two most significant architectural differences:
Serverless compute. BigQuery has no warehouses to manage. You run a query, BigQuery allocates compute automatically, and you pay per byte scanned (on-demand pricing) or per slot-hour (capacity pricing). There is no “warehouse size” decision, no auto-suspend configuration, and no cold-start penalty. Queries start immediately.
Bytes-scanned pricing. On-demand BigQuery charges per terabyte of data scanned, not per second of compute. This fundamentally changes what optimization means. In Snowflake, you optimize for query execution time (credits per second). In BigQuery on-demand, you optimize for bytes scanned. A query that reads 1TB in 10 seconds costs the same as a query that reads 1TB in 2 minutes -- what matters is the data volume, not the duration.
The practical implication: partition pruning and column projection (SELECT only what you need) are the highest-impact optimizations in BigQuery, even more so than in Snowflake.
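To make the pricing model concrete, here is a small cost calculator. The $6.25-per-TiB rate is an assumption for illustration (check current GCP pricing); the point is that cost depends only on bytes scanned, never on runtime:

```python
def on_demand_cost_usd(bytes_scanned: int, price_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost. The rate is an assumed example --
    substitute the current published BigQuery on-demand price."""
    return (bytes_scanned / 2**40) * price_per_tib

# A 1 TiB scan costs the same whether it runs 10 seconds or 2 minutes.
full_scan = on_demand_cost_usd(1 * 2**40)      # SELECT * over everything
projected = on_demand_cost_usd(100 * 2**30)    # project/prune down to ~100 GiB
print(full_scan, round(projected, 4))
```

This is why column projection and partition pruning dominate: cutting bytes scanned by 10x cuts the bill by 10x, while making the same scan finish faster saves nothing on-demand.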
Partitioning: The Most Important Optimization Decision
Partitioning in BigQuery divides a table into segments that can be pruned based on query filters. BigQuery supports three partition types:
-- Date/timestamp partition (most common for time-series data)
CREATE TABLE analytics.orders
PARTITION BY DATE(created_at) -- Daily partitions
OPTIONS (
require_partition_filter = true -- Force callers to filter by date
)
AS SELECT * FROM staging.raw_orders;
-- Integer range partition (for user IDs, batch numbers)
CREATE TABLE analytics.user_events
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 10000000, 100000))
AS SELECT * FROM staging.raw_events;
-- Ingestion-time partition (automatic, based on load time)
-- Note: CREATE TABLE AS SELECT cannot produce an ingestion-time
-- partitioned table; define the schema, then load data into it
CREATE TABLE analytics.log_events (
severity STRING, -- example schema
message STRING
)
PARTITION BY _PARTITIONDATE;
The require_partition_filter = true option is the most important cost guardrail BigQuery offers for large tables. It forces every query to include a partition filter, preventing accidental full-table scans that scan and bill for terabytes of data.
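For integer range partitioning, RANGE_BUCKET assigns each row to the bucket whose boundary range contains its value. As a mental model (not BigQuery's implementation), the documented behavior matches Python's bisect_right over the boundary array:

```python
import bisect

def range_bucket(point: int, boundaries: list[int]) -> int:
    """Mimic BigQuery's RANGE_BUCKET: return the bucket index for point.
    Values below the first boundary map to 0; values at or above the
    last boundary map to len(boundaries)."""
    return bisect.bisect_right(boundaries, point)

# Boundaries as in GENERATE_ARRAY(0, 10000000, 100000): 0, 100000, 200000, ...
boundaries = list(range(0, 10_000_001, 100_000))
print(range_bucket(250_000, boundaries))  # bucket 3, i.e. [200000, 300000)
```

Each bucket becomes one partition, so a `WHERE user_id BETWEEN ...` filter lets BigQuery prune to the few buckets that range touches.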
Check partition pruning after query runs:
-- In the BigQuery console Query Details, check:
-- "Partitions processed" vs "Partitions total"
-- A well-pruned query should process <10% of partitions
-- Or query INFORMATION_SCHEMA for recent job stats
SELECT
job_id,
query,
total_bytes_processed / 1e9 AS gb_processed,
total_bytes_billed / 1e9 AS gb_billed,
total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;
Clustering: Fine-Grained Data Organization
Clustering in BigQuery sorts data within each partition by specified columns. It does not reduce the number of partitions scanned, but it reduces the bytes scanned within each partition by allowing BigQuery to skip blocks of data that do not match filter conditions.
-- Combined partition + clustering (recommended pattern)
CREATE TABLE analytics.orders
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, status -- Up to 4 cluster columns
AS SELECT * FROM staging.raw_orders;
-- Query that benefits from clustering:
SELECT
order_id,
customer_id,
amount_usd
FROM analytics.orders
WHERE DATE(created_at) = '2026-03-27' -- Partition pruning
AND customer_id = 'cust_12345' -- Cluster pruning within partition
AND status = 'delivered'; -- Cluster pruning
The clustering column order matters. BigQuery prunes most efficiently when you filter on the first cluster column, less efficiently on the second, and so on. Put the column you filter on most often first in the cluster definition; among equally common filters, prefer the higher-cardinality column.
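Clustering's benefit comes from block-level metadata: BigQuery keeps value ranges of the cluster columns per storage block and skips blocks whose range cannot match the filter. A toy model of that pruning logic (illustrative only, not BigQuery's actual storage format):

```python
# Each "block" records the min/max of the first cluster column (customer_id
# here) for the rows it contains -- a sketch of block-range metadata.
blocks = [
    {"min": "cust_00000", "max": "cust_09999", "rows": 50_000},
    {"min": "cust_10000", "max": "cust_19999", "rows": 50_000},
    {"min": "cust_20000", "max": "cust_29999", "rows": 50_000},
]

def blocks_to_scan(target: str, blocks: list[dict]) -> list[dict]:
    """Skip any block whose [min, max] range cannot contain the filter value."""
    return [b for b in blocks if b["min"] <= target <= b["max"]]

hit = blocks_to_scan("cust_12345", blocks)
print(len(hit), sum(b["rows"] for b in hit))  # 1 of 3 blocks scanned
```

Because the data is sorted by the cluster columns, equality and range filters on them touch few blocks; filters on non-cluster columns get no such skipping.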
dbt with BigQuery: The Setup That Works
dbt works with BigQuery through the dbt-bigquery adapter. The configuration:
# profiles.yml
myproject:
target: prod
outputs:
prod:
type: bigquery
method: service-account
project: my-gcp-project
dataset: analytics
keyfile: /path/to/service-account.json
location: US
threads: 4
timeout_seconds: 300
dev:
type: bigquery
method: oauth # Use personal credentials in dev
project: my-gcp-project
dataset: analytics_dev_{{ env_var('USER', 'unknown') }}
location: US
threads: 4
The partition and cluster configuration in dbt models:
-- models/marts/fct_orders.sql
{{ config(
materialized='table',
partition_by={
"field": "created_at",
"data_type": "timestamp",
"granularity": "day"
},
cluster_by=["customer_id", "status"],
require_partition_filter=True,
labels={"team": "data-platform", "env": "prod"}
) }}
SELECT
order_id,
customer_id,
amount_usd,
status,
created_at
FROM {{ ref('silver_orders') }}
-- Incremental config with BigQuery-specific merge
-- {{ config(
-- materialized='incremental',
-- unique_key='order_id',
-- incremental_strategy='merge',
-- partition_by={...},
-- cluster_by=['customer_id']
-- ) }}
-- {% if is_incremental() %}
-- WHERE created_at >= (SELECT MAX(created_at) FROM {{ this }})
-- {% endif %}
BigQuery-Specific SQL Patterns
BigQuery has several SQL features worth knowing:
Nested and repeated fields (STRUCTs and ARRAYs): BigQuery is columnar but supports nested structures natively. JSON-like data can be stored without flattening, and queried with dot notation:
-- Query nested fields
SELECT
order_id,
customer.customer_id,
customer.email,
-- Flatten an array
item.product_id,
item.quantity,
item.unit_price
FROM orders,
UNNEST(line_items) AS item -- Flatten the nested array
WHERE DATE(created_at) = CURRENT_DATE()
AND customer.tier = 'premium';
MERGE with BigQuery: BigQuery's MERGE supports upsert patterns for incremental loads:
MERGE analytics.orders AS target
USING (
SELECT * FROM staging.new_orders
WHERE _PARTITIONDATE = CURRENT_DATE()
) AS source
ON target.order_id = source.order_id
AND target.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
WHEN MATCHED THEN
UPDATE SET
status = source.status,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount_usd, status, created_at)
VALUES (source.order_id, source.customer_id, source.amount_usd,
source.status, source.created_at);
Cost Control in BigQuery
BigQuery on-demand pricing can surprise teams that do not set guardrails. A single unconstrained query against a large unpartitioned table can scan and bill for hundreds of gigabytes.
The three controls that prevent surprise bills:
- require_partition_filter on large tables. Forces every query to include a partition filter. No query can full-scan a multi-terabyte table accidentally.
- Maximum bytes billed per query. Set at the connection or session level to cap scan cost per query:
# In your BigQuery client
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=10 * 1024**3  # 10 GB limit per query
)
query_job = client.query(sql, job_config=job_config)
- Custom quotas in IAM. Set project-level or user-level daily byte quotas in the BigQuery IAM settings to cap total daily spend.
For high-query-volume environments, consider BigQuery capacity pricing (reserved slots, now sold through BigQuery editions, which replaced the older flat-rate plans) instead of on-demand. At sufficient query volume, capacity pricing is significantly cheaper -- but it requires estimating and committing to slot capacity in advance.
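A rough break-even sketch helps frame the decision. All prices here are assumptions for illustration (a hypothetical $2,000/month slot commitment and the assumed $6.25/TiB on-demand rate) -- substitute your actual quotes:

```python
def monthly_on_demand_usd(tib_scanned: float, price_per_tib: float = 6.25) -> float:
    """Assumed on-demand rate; check current GCP pricing."""
    return tib_scanned * price_per_tib

def breakeven_tib(slot_commitment_usd_per_month: float,
                  price_per_tib: float = 6.25) -> float:
    """TiB scanned per month above which the slot commitment is cheaper."""
    return slot_commitment_usd_per_month / price_per_tib

# Hypothetical $2,000/month commitment breaks even at 320 TiB scanned/month
print(breakeven_tib(2000))
```

Below the break-even volume, on-demand wins and you pay nothing for idle time; above it, committed slots win, at the cost of capacity planning.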
When to Choose BigQuery vs. Snowflake
The decision is rarely technical -- it is usually organizational:
- GCP shop: BigQuery integrates deeply with Pub/Sub, Dataflow, Vertex AI, and Cloud Composer. If your engineering infrastructure is GCP-native, BigQuery is the obvious choice.
- AWS or Azure shop: Snowflake runs on all three clouds and integrates well with native cloud services. For multi-cloud or AWS/Azure-primary organizations, Snowflake has better native integration.
- Serverless preference: If the team wants to eliminate warehouse management (sizing, auto-suspend), BigQuery's serverless model is appealing. There is genuinely less operational overhead.
- Extremely high query concurrency: BigQuery handles massive concurrency with no configuration. Snowflake requires tuning warehouse concurrency settings and potentially adding multi-cluster warehouses.
Both are excellent platforms. The best choice is usually the one your team already knows, unless there is a specific organizational or technical reason to switch.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.