Blog
BigQuery for Data Engineers: Architecture, Optimization, and When to Use It
Ryan Kirsch · December 24, 2025 · 9 min read
BigQuery and Snowflake are the two dominant cloud data warehouses, and they make fundamentally different architectural choices. If you are coming from Snowflake and starting work on a BigQuery platform, or evaluating which to adopt, the differences matter more than the surface-level SQL compatibility. This guide covers what a data engineer needs to know to use BigQuery well.
How BigQuery Is Different from Snowflake
The two most significant architectural differences:
Serverless compute. BigQuery has no warehouses to manage. You run a query, BigQuery allocates compute automatically, and you pay per byte scanned (on-demand pricing) or per slot-hour (capacity pricing). There is no “warehouse size” decision, no auto-suspend configuration, and no cold-start penalty. Queries start immediately.
Bytes-scanned pricing. On-demand BigQuery charges per terabyte of data scanned, not per second of compute. This fundamentally changes what optimization means. In Snowflake, you optimize for query execution time (credits per second). In BigQuery on-demand, you optimize for bytes scanned. A query that reads 1TB in 10 seconds costs the same as a query that reads 1TB in 2 minutes -- what matters is the data volume, not the duration.
The practical implication: partition pruning and column projection (SELECT only what you need) are the highest-impact optimizations in BigQuery, even more so than in Snowflake.
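To make the pricing model concrete, here is a small cost calculator. The $6.25-per-TiB rate is an assumption for illustration (check current GCP pricing); the point is that cost depends only on bytes scanned, never on runtime:

```python
def on_demand_cost_usd(bytes_scanned: int, price_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost. The rate is an assumed example --
    substitute the current published BigQuery on-demand price."""
    return (bytes_scanned / 2**40) * price_per_tib

# A 1 TiB scan costs the same whether it runs 10 seconds or 2 minutes.
full_scan = on_demand_cost_usd(1 * 2**40)      # SELECT * over everything
projected = on_demand_cost_usd(100 * 2**30)    # project/prune down to ~100 GiB
print(full_scan, round(projected, 4))
```

This is why column projection and partition pruning dominate: cutting bytes scanned by 10x cuts the bill by 10x, while making the same scan finish faster saves nothing on-demand.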
Partitioning: The Most Important Optimization Decision
Partitioning in BigQuery divides a table into segments that can be pruned based on query filters. BigQuery supports three partition types:
-- Date/timestamp partition (most common for time-series data)
CREATE TABLE analytics.orders
PARTITION BY DATE(created_at) -- Daily partitions
OPTIONS (
require_partition_filter = true -- Force callers to filter by date
)
AS SELECT * FROM staging.raw_orders;
-- Integer range partition (for user IDs, batch numbers)
CREATE TABLE analytics.user_events
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 10000000, 100000))
AS SELECT * FROM staging.raw_events;
-- Ingestion-time partition (automatic, based on load time)
-- Note: CREATE TABLE AS SELECT cannot produce an ingestion-time
-- partitioned table; define the schema, then load data into it
CREATE TABLE analytics.log_events (
severity STRING, -- example schema
message STRING
)
PARTITION BY _PARTITIONDATE;
The require_partition_filter = true option is the most important cost guardrail BigQuery offers for large tables. It forces every query to include a partition filter, preventing accidental full-table scans that scan and bill for terabytes of data.
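For integer range partitioning, RANGE_BUCKET assigns each row to the bucket whose boundary range contains its value. As a mental model (not BigQuery's implementation), the documented behavior matches Python's bisect_right over the boundary array:

```python
import bisect

def range_bucket(point: int, boundaries: list[int]) -> int:
    """Mimic BigQuery's RANGE_BUCKET: return the bucket index for point.
    Values below the first boundary map to 0; values at or above the
    last boundary map to len(boundaries)."""
    return bisect.bisect_right(boundaries, point)

# Boundaries as in GENERATE_ARRAY(0, 10000000, 100000): 0, 100000, 200000, ...
boundaries = list(range(0, 10_000_001, 100_000))
print(range_bucket(250_000, boundaries))  # bucket 3, i.e. [200000, 300000)
```

Each bucket becomes one partition, so a `WHERE user_id BETWEEN ...` filter lets BigQuery prune to the few buckets that range touches.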
Check partition pruning after query runs:
-- In the BigQuery console Query Details, check:
-- "Partitions processed" vs "Partitions total"
-- A well-pruned query should process <10% of partitions
-- Or query INFORMATION_SCHEMA for recent job stats
SELECT
job_id,
query,
total_bytes_processed / 1e9 AS gb_processed,
total_bytes_billed / 1e9 AS gb_billed,
total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;
Clustering: Fine-Grained Data Organization
Clustering in BigQuery sorts data within each partition by specified columns. It does not reduce the number of partitions scanned, but it reduces the bytes scanned within each partition by allowing BigQuery to skip blocks of data that do not match filter conditions.
-- Combined partition + clustering (recommended pattern)
CREATE TABLE analytics.orders
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, status -- Up to 4 cluster columns
AS SELECT * FROM staging.raw_orders;
-- Query that benefits from clustering:
SELECT
order_id,
customer_id,
amount_usd
FROM analytics.orders
WHERE DATE(created_at) = '2026-03-27' -- Partition pruning
AND customer_id = 'cust_12345' -- Cluster pruning within partition
AND status = 'delivered'; -- Cluster pruning
The clustering column order matters. BigQuery prunes most efficiently when you filter on the first cluster column, less efficiently on the second, and so on. Put the column you filter on most often first in the cluster definition; among equally common filters, prefer the higher-cardinality column.
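Clustering's benefit comes from block-level metadata: BigQuery keeps value ranges of the cluster columns per storage block and skips blocks whose range cannot match the filter. A toy model of that pruning logic (illustrative only, not BigQuery's actual storage format):

```python
# Each "block" records the min/max of the first cluster column (customer_id
# here) for the rows it contains -- a sketch of block-range metadata.
blocks = [
    {"min": "cust_00000", "max": "cust_09999", "rows": 50_000},
    {"min": "cust_10000", "max": "cust_19999", "rows": 50_000},
    {"min": "cust_20000", "max": "cust_29999", "rows": 50_000},
]

def blocks_to_scan(target: str, blocks: list[dict]) -> list[dict]:
    """Skip any block whose [min, max] range cannot contain the filter value."""
    return [b for b in blocks if b["min"] <= target <= b["max"]]

hit = blocks_to_scan("cust_12345", blocks)
print(len(hit), sum(b["rows"] for b in hit))  # 1 of 3 blocks scanned
```

Because the data is sorted by the cluster columns, equality and range filters on them touch few blocks; filters on non-cluster columns get no such skipping.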
dbt with BigQuery: The Setup That Works
dbt works with BigQuery through the dbt-bigquery adapter. The configuration:
# profiles.yml
myproject:
target: prod
outputs:
prod:
type: bigquery
method: service-account
project: my-gcp-project
dataset: analytics
keyfile: /path/to/service-account.json
location: US
threads: 4
timeout_seconds: 300
dev:
type: bigquery
method: oauth # Use personal credentials in dev
project: my-gcp-project
dataset: analytics_dev_{{ env_var('USER', 'unknown') }}
location: US
threads: 4
The partition and cluster configuration in dbt models:
-- models/marts/fct_orders.sql
{{ config(
materialized='table',
partition_by={
"field": "created_at",
"data_type": "timestamp",
"granularity": "day"
},
cluster_by=["customer_id", "status"],
require_partition_filter=True,
labels={"team": "data-platform", "env": "prod"}
) }}
SELECT
order_id,
customer_id,
amount_usd,
status,
created_at
FROM {{ ref('silver_orders') }}
-- Incremental config with BigQuery-specific merge
-- {{ config(
-- materialized='incremental',
-- unique_key='order_id',
-- incremental_strategy='merge',
-- partition_by={...},
-- cluster_by=['customer_id']
-- ) }}
-- {% if is_incremental() %}
-- WHERE created_at >= (SELECT MAX(created_at) FROM {{ this }})
-- {% endif %}
BigQuery-Specific SQL Patterns
BigQuery has several SQL features worth knowing:
Nested and repeated fields (STRUCTs and ARRAYs): BigQuery is columnar but supports nested structures natively. JSON-like data can be stored without flattening, and queried with dot notation:
-- Query nested fields
SELECT
order_id,
customer.customer_id,
customer.email,
-- Flatten an array
item.product_id,
item.quantity,
item.unit_price
FROM orders,
UNNEST(line_items) AS item -- Flatten the nested array
WHERE DATE(created_at) = CURRENT_DATE()
AND customer.tier = 'premium';
MERGE with BigQuery: BigQuery's MERGE supports upsert patterns for incremental loads:
MERGE analytics.orders AS target
USING (
SELECT * FROM staging.new_orders
WHERE _PARTITIONDATE = CURRENT_DATE()
) AS source
ON target.order_id = source.order_id
AND target.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
WHEN MATCHED THEN
UPDATE SET
status = source.status,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount_usd, status, created_at)
VALUES (source.order_id, source.customer_id, source.amount_usd,
source.status, source.created_at);
Cost Control in BigQuery
BigQuery on-demand pricing can surprise teams that do not set guardrails. A single unconstrained query against a large unpartitioned table can scan and bill for hundreds of gigabytes.
The three controls that prevent surprise bills:
- require_partition_filter on large tables. Forces every query to include a partition filter. No query can full-scan a multi-terabyte table accidentally.
- Maximum bytes billed per query. Set at the connection or session level to cap scan cost per query:
# In your BigQuery client
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=10 * 1024**3  # 10 GB limit per query
)
query_job = client.query(sql, job_config=job_config)
- Custom quotas in IAM. Set project-level or user-level daily byte quotas in the BigQuery IAM settings to cap total daily spend.
For high-query-volume environments, consider BigQuery capacity pricing (reserved slots, now sold through BigQuery editions, which replaced the older flat-rate plans) instead of on-demand. At sufficient query volume, capacity pricing is significantly cheaper -- but it requires estimating and committing to slot capacity in advance.
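A rough break-even sketch helps frame the decision. All prices here are assumptions for illustration (a hypothetical $2,000/month slot commitment and the assumed $6.25/TiB on-demand rate) -- substitute your actual quotes:

```python
def monthly_on_demand_usd(tib_scanned: float, price_per_tib: float = 6.25) -> float:
    """Assumed on-demand rate; check current GCP pricing."""
    return tib_scanned * price_per_tib

def breakeven_tib(slot_commitment_usd_per_month: float,
                  price_per_tib: float = 6.25) -> float:
    """TiB scanned per month above which the slot commitment is cheaper."""
    return slot_commitment_usd_per_month / price_per_tib

# Hypothetical $2,000/month commitment breaks even at 320 TiB scanned/month
print(breakeven_tib(2000))
```

Below the break-even volume, on-demand wins and you pay nothing for idle time; above it, committed slots win, at the cost of capacity planning.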
When to Choose BigQuery vs. Snowflake
The decision is rarely technical -- it is usually organizational:
- GCP shop: BigQuery integrates deeply with Pub/Sub, Dataflow, Vertex AI, and Cloud Composer. If your engineering infrastructure is GCP-native, BigQuery is the obvious choice.
- AWS or Azure shop: Snowflake runs on all three clouds and integrates well with native cloud services. For multi-cloud or AWS/Azure-primary organizations, Snowflake has better native integration.
- Serverless preference: If the team wants to eliminate warehouse management (sizing, auto-suspend), BigQuery's serverless model is appealing. There is genuinely less operational overhead.
- Extremely high query concurrency: BigQuery handles massive concurrency with no configuration. Snowflake requires tuning warehouse concurrency settings and potentially adding multi-cluster warehouses.
Both are excellent platforms. The best choice is usually the one your team already knows, unless there is a specific organizational or technical reason to switch.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.