GCPBigQueryDataflowPub/SubCloud ComposerApache Beam

GCP for Data Engineers: BigQuery, Dataflow, Pub/Sub, and the Modern Google Cloud Data Stack

November 7, 2025 · 12 min read

Introduction

I build systems that scale. At scale, design decisions about data storage, compute, and orchestration show up in latency budgets, operational overhead, and the kinds of questions your analytics team can answer. Google Cloud Platform has a distinct point of view: analytics should be serverless by default, streaming should be first class, and the warehouse should behave like a fully managed data engine, not a database you babysit.

This post is a senior engineer's guide to the GCP data stack. It focuses on the pieces that matter in production, how they fit together, and where the sharp edges are. BigQuery, Dataflow, Pub/Sub, and Cloud Composer are the foundation. BigQuery ML adds a practical ML layer without pushing teams into a separate platform.

BigQuery Deep Dive (architecture, optimization, partitioning)

BigQuery is not just a warehouse. It is a serverless analytics engine built on columnar storage, distributed execution, and aggressive query optimization. The architecture is designed to separate storage from compute, which means you can scale reads and writes independently. It also means you do not manage clusters, storage nodes, or vacuum jobs. You manage datasets and SQL.

The cost model is the first thing senior teams need to internalize. You pay per query on demand or for reserved slots. That changes behavior. Unbounded scans are a budget killer. Enforce partition filters.

Partitioning is the most important design decision for large tables. BigQuery supports ingestion-time partitioning and column-based partitioning. I almost always use column-based partitioning on a trusted event time or load date, then add clustering on the most selective dimensions. This creates predictable scan reduction and avoids the anti-pattern of querying entire history for every dashboard refresh.

Here is a practical table definition and query pattern:

CREATE TABLE analytics.fact_orders (
  order_id STRING NOT NULL,
  customer_id STRING NOT NULL,
  order_ts TIMESTAMP NOT NULL,
  order_date DATE NOT NULL,
  channel STRING,
  revenue NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, channel;
SELECT
  order_date,
  channel,
  SUM(revenue) AS revenue,
  SUM(SUM(revenue)) OVER (
    PARTITION BY channel
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS revenue_7d
FROM analytics.fact_orders
WHERE order_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND CURRENT_DATE()
GROUP BY order_date, channel
ORDER BY order_date;

The partition filter is explicit, and the window function gives you a rolling seven-day metric without a self-join. This is a pattern I use constantly: partitioned scans, compact aggregations, and windowed metrics for trend analysis.

Optimization in BigQuery is about layout and statistics. Clustering is a hint to the storage engine, not a hard index. It works best when your queries repeatedly filter on the same columns. If you need fast point lookups, BigQuery is not a serving database.

Dataflow and Apache Beam

Dataflow is Google's managed runner for Apache Beam. If you need reliable streaming or batch processing with strong semantics, this is the production-grade option in GCP. The key advantage is Beam's unified model. You write a pipeline once, then run it as batch or streaming without rewriting the logic. That means you can test with bounded data locally and deploy as streaming in production with the same code.

For teams that do event processing at scale, Dataflow provides exactly once semantics, event time processing, and late data handling through watermarks and triggers. This is the hard part of streaming, and it is where Dataflow's managed service shines. You get autoscaling workers, integrated monitoring, and a control plane that handles job upgrades without full downtime.

Here is a minimal Beam pipeline that reads from Pub/Sub, applies a sliding window, and writes to BigQuery. In production you would add schema validation, dead letter queues, and idempotent writes, but the shape is the same.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows

options = PipelineOptions(streaming=True, save_main_session=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/myproj/topics/events")
        | "ParseJson" >> beam.Map(parse_event)
        | "Window5m" >> beam.WindowInto(SlidingWindows(size=300, period=60))
        | "CountByType" >> beam.Map(lambda e: (e["event_type"], 1))
        | "SumCounts" >> beam.CombinePerKey(sum)
        | "ToBQRows" >> beam.Map(to_bq_row)
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "analytics.event_metrics",
            schema="event_type:STRING, event_count:INTEGER, window_end:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )

The most important decision in Beam is your windowing strategy. If your product has user sessions, you probably want session windows. If you need stable dashboards, you want fixed windows. Choose your windowing and triggering based on the business metric, not the convenience of code.

Pub/Sub for Event Streaming

Pub/Sub is the backbone for event ingestion on GCP. It is simple to operate and designed for massive throughput. The model is classic publish and subscribe with durable message retention. It does not expose partitions directly, which is a tradeoff compared to Kafka, but it scales without operational burden and integrates tightly with Dataflow and Cloud Functions.

The design pattern I use most often is a single topic per event family with multiple subscriptions. One subscription feeds a Dataflow streaming job, another feeds a BigQuery streaming insert pipeline, and a third feeds a dead letter inspection workflow. This gives you multiple consumers without duplicating ingestion logic.

Pub/Sub is not a long term event store. It is a transport layer. If you need a replayable event log for months, the better practice is to land raw events in BigQuery or Cloud Storage and treat that as the source of truth.

Cloud Composer (managed Airflow comparison)

Cloud Composer is managed Airflow. It provides the familiar DAG scheduling model but offloads the infrastructure work. You get managed workers, integrated logging, and a tight connection to GCP IAM and service accounts. If your team already uses Airflow and you want managed operations without a full rewrite, Composer is the straight-line path.

Composer pairs well with BigQuery and Dataflow. I often use it to orchestrate daily batch loads into BigQuery, then trigger Dataflow for backfill jobs when late data arrives. For teams that need lineage, data quality checks, or asset-aware orchestration, a tool like Dagster may be a better fit. But if you need a known quantity and a large ecosystem of operators, Composer is reliable.

BigQuery ML

BigQuery ML is one of the most underrated features in the GCP data stack. It lets you train models with SQL and keeps the data in place. That eliminates costly data movement and enables analytics teams to build baseline models without spinning up separate ML infrastructure. It is not a replacement for Vertex AI in complex workflows, but it is perfect for forecasting, churn models, and anomaly detection where the data already lives in BigQuery.

Here is an example of a simple regression model:

CREATE OR REPLACE MODEL analytics.revenue_forecast
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['revenue']
) AS
SELECT
  EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
  EXTRACT(MONTH FROM order_date) AS month,
  channel,
  revenue
FROM analytics.fact_orders
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY);
SELECT
  *
FROM ML.EVALUATE(
  MODEL analytics.revenue_forecast,
  (
    SELECT
      EXTRACT(DAYOFWEEK FROM order_date) AS day_of_week,
      EXTRACT(MONTH FROM order_date) AS month,
      channel,
      revenue
    FROM analytics.fact_orders
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  )
);

GCP vs AWS Comparison

Every cloud platform can run modern data pipelines. The choice comes down to defaults and operational posture. AWS gives you more primitives and more control. GCP gives you an opinionated, analytics-first stack that feels cohesive. BigQuery is a stronger native warehouse than Redshift for most use cases, especially when you do not want to manage cluster sizing. Dataflow provides a more mature managed Beam environment than AWS equivalents, and Pub/Sub removes a lot of the operational complexity of managed Kafka.

On the other hand, AWS has broader ecosystem integration, a deeper set of managed services, and a larger talent pool. If your organization already runs most workloads on AWS, the path of least resistance is usually to stay there. If you need an analytical platform that feels like a single integrated service, GCP is compelling.

I often frame this as a question of operational focus. If you want to spend time tuning clusters and maintaining more infrastructure, AWS gives you room to do that. If you want to spend time on data modeling and pipeline reliability, GCP removes a layer of operational distraction.

When to Choose GCP

Choose GCP when analytics is a core product capability and you want to move fast without dedicating a large platform team. BigQuery shortens the cycle between question and answer. Dataflow provides production-grade streaming without you running your own clusters. Pub/Sub is a battle-tested ingestion backbone that fits both event and log pipelines. If these are the center of your data platform, GCP is a strong default.

GCP is less compelling if your workloads are deeply tied to services that do not have strong analogs on Google Cloud or if your enterprise governance and security tooling is built around AWS. In those cases, the integration cost may outweigh the platform benefits. The right choice is the one that minimizes friction for your team and maximizes time spent on data value, not on cloud mechanics.

Conclusion

The modern GCP data stack is built around a clear idea: data teams should spend their energy on insight and reliability, not on infrastructure. BigQuery provides a serverless analytics engine that scales without operational pain. Dataflow and Pub/Sub enable streaming systems that handle messy real world data with strong semantics. Cloud Composer gives you a managed orchestration path when Airflow remains the right abstraction. BigQuery ML adds practical modeling without leaving the warehouse.

If you are evaluating your next data platform or looking to modernize your existing one, GCP is worth serious attention. I have built platforms on multiple clouds, and the combination of BigQuery and Dataflow remains one of the most productive stacks for data engineering teams.

Questions or pushback on any of this? Find me on LinkedIn.

Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.