Career · January 18, 2026 · 10 min read

Data Engineering System Design: How to Approach Architecture Interviews

System design interviews test whether you can think like a senior engineer: requirements first, tradeoffs explicit, failure modes considered. Here is a repeatable framework that works.

Data engineering system design interviews are simultaneously the most important and most misunderstood part of the senior DE hiring process. Candidates who have built excellent pipelines often struggle here because they are used to having the requirements handed to them. In a system design interview, scoping the requirements is part of the test.

This post covers a repeatable framework for data engineering system design interviews, with worked examples for common prompts and a guide to communicating tradeoffs in a way that signals senior thinking.

The Framework: Five Steps

Before drawing any architecture, work through these five steps. They apply to almost every data engineering system design prompt.

Step 1: Clarify requirements. Ask questions before proposing anything. What is the data volume? What is the latency requirement (real-time, near-real-time, hourly, daily)? What are the consumers (dashboards, ML models, APIs, analysts writing SQL)? What are the SLA expectations for freshness and availability? What does failure look like and what is the recovery requirement?

Step 2: Establish the data flow. Sketch the high-level flow: source, ingestion, storage, transformation, serving. Do not jump to specific tools yet. Understand the shape of the data movement first.

Step 3: Choose the architecture pattern. Batch, streaming, or a lambda architecture (batch and streaming in parallel). The choice follows from the latency requirement, not from tool preference.

Step 4: Make technology decisions with justification. For each layer, propose a tool and say why. Not just "Kafka," but "Kafka, because the volume is high, the consumers need independent offsets, and we have existing expertise." Interviewers want to see that you understand what problems tools solve.

Step 5: Address failure modes and scale. What happens when the ingestion layer goes down? How do you handle schema changes in the source? What does backfill look like? How does the design handle 10x volume?

Common Prompt: Design a Real-Time Analytics System

The prompt: design a system to track user events from a web application and make them queryable for dashboards within 60 seconds of occurring.

Clarifying questions: How many events per second at peak? (Answer: 50,000.) How many unique users? (Answer: 10M.) What dashboard queries need to be supported? (Answer: counts and aggregations by event type and user segment, last 30 days.) What is the source? (Answer: web servers sending JSON events over HTTP.)

Architecture:

Web Servers
    ↓ (HTTP POST, batched)
Kafka (event ingestion)
    ↓ (Kafka Consumer)
Flink / Spark Streaming (enrichment + aggregation)
    ↓
Two outputs:
  1. ClickHouse / Druid (real-time serving, last 30 days)
  2. S3 Parquet (cold storage, Athena queryable)
    ↓
BI Tool (Grafana / Superset) reads from ClickHouse

Justification by layer:

Kafka handles the ingestion spike at 50k events/second without dropping events. The web servers batch events into 500ms windows and POST them to a thin ingestion service, which produces to Kafka and handles the buffering. Consumer groups let multiple downstream consumers read independently.
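The web-tier batching described above can be sketched in a few lines. This is a hypothetical illustration, not a production client: `flush_fn` stands in for the HTTP POST to the ingestion service, and the 500ms window and batch cap are the assumed parameters from the design.

```python
import time

class EventBatcher:
    """Buffer events on the web server, flushing every 500 ms or at a size cap."""

    def __init__(self, flush_fn, window_ms=500, max_batch=1000):
        self.flush_fn = flush_fn          # stand-in for the HTTP POST to ingestion
        self.window_s = window_ms / 1000
        self.max_batch = max_batch
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        # Flush on either the time window or the size cap, whichever hits first.
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.window_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()

# Usage: collect flushed batches in memory for illustration.
batches = []
batcher = EventBatcher(batches.append, max_batch=3)
for i in range(7):
    batcher.add({"event_id": i, "type": "click"})
batcher.flush()  # drain the remainder
```

The size cap matters as much as the window: under a traffic spike you flush early rather than letting a 500ms buffer grow unbounded.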

Flink processes the stream with a 30-second window to aggregate event counts by type and segment. It also enriches events with user segment information from a Redis lookup (avoiding a database join in the hot path). Output goes to ClickHouse for the real-time serving layer and S3 for cold storage.
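The windowed aggregation Flink performs can be sketched in pure Python to show the shape of the computation. This is a simplified stand-in, not Flink API code: events are assumed to carry an epoch-seconds `ts`, a `type`, and a `segment` already enriched from the Redis lookup.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_s=30):
    """Count events per (window start, event type, user segment).

    A pure-Python sketch of the keyed tumbling window the Flink job runs;
    real code would use Flink's window API with watermarks for late data.
    """
    counts = defaultdict(int)
    for e in events:
        window_start = (e["ts"] // window_s) * window_s  # align to 30s boundary
        counts[(window_start, e["type"], e["segment"])] += 1
    return dict(counts)

events = [
    {"ts": 100, "type": "click", "segment": "free"},
    {"ts": 110, "type": "click", "segment": "free"},
    {"ts": 135, "type": "view", "segment": "paid"},
]
agg = tumbling_window_counts(events)
# The two clicks at ts 100 and 110 land in the same 30s window (start 90);
# the view at ts 135 lands in the next one (start 120).
```

What the sketch omits, and what the real job must handle, is late-arriving data: Flink's event-time watermarks decide when a window can safely close.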

ClickHouse handles the dashboard queries because it is optimized for analytical aggregations on event data with low-latency reads. For 30 days of data at 50k events/second, that is roughly 130 billion events. ClickHouse handles this at query times under a second with proper table partitioning.
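The 130 billion figure is worth being able to reproduce on a whiteboard; the back-of-envelope arithmetic is:

```python
# Back-of-envelope check on the serving-layer volume.
events_per_second = 50_000
seconds_per_day = 86_400
retention_days = 30

total_events = events_per_second * seconds_per_day * retention_days
print(f"{total_events / 1e9:.1f} billion events")  # → 129.6 billion
```

Doing this math out loud in the interview is itself a signal: it shows the tool choice follows from the numbers rather than from habit.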

S3 Parquet provides the historical archive at low cost. Athena or Trino can query it for ad-hoc analysis beyond the 30-day window that ClickHouse serves.

Failure modes: If Kafka goes down, web servers buffer locally and replay when Kafka recovers. If Flink goes down, Kafka retains events for 7 days, allowing a full replay. If ClickHouse has a node failure, replication handles reads while the node recovers.
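The buffer-and-replay behavior for a Kafka outage can be sketched as follows. `send_fn` is a hypothetical stand-in for the real producer call; the point is the ordering guarantee, not the transport.

```python
class BufferingProducer:
    """Wrap a flaky send function with a local buffer and in-order replay.

    Illustrates the failure-mode claim above: if Kafka is unreachable,
    events queue locally and are replayed in order once sends succeed again.
    """

    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.pending = []

    def send(self, event):
        self.pending.append(event)
        self.replay()

    def replay(self):
        # Drain the buffer in order; stop at the first failure and retry later.
        while self.pending:
            try:
                self.send_fn(self.pending[0])
            except ConnectionError:
                return  # broker still down; keep the buffer intact
            self.pending.pop(0)

# Simulate an outage: the first two send attempts fail, then Kafka recovers.
delivered, attempts = [], [0]

def flaky_send(event):
    attempts[0] += 1
    if attempts[0] <= 2:
        raise ConnectionError("broker unavailable")
    delivered.append(event)

p = BufferingProducer(flaky_send)
for i in range(4):
    p.send(i)
```

A real implementation would also bound the local buffer and spill to disk, since an unbounded in-memory queue turns a Kafka outage into a web-server outage.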

Common Prompt: Design a Data Warehouse Ingestion Pipeline

The prompt: design a system to ingest data from 20 operational databases into a central data warehouse and make it available for analysts daily.

Clarifying questions: What database types? (Mix of Postgres and MySQL.) What is the data volume per source? (5M to 50M rows, 100GB total.) Full load or incremental? (Incremental preferred, full acceptable for small tables.) Target warehouse? (Snowflake.) What is the freshness requirement? (Daily, by 8 AM.)

Architecture:

20 Operational DBs (Postgres, MySQL)
    ↓ (Fivetran or Airbyte)
Snowflake Raw Layer (source-aligned schemas)
    ↓ (dbt)
Snowflake Staging Layer (cleaned, typed)
    ↓ (dbt)
Snowflake Marts Layer (business-ready)
    ↓
BI Tool (Tableau, Looker, etc.)

Fivetran for the ingestion layer if budget allows: managed connectors for Postgres and MySQL, handles schema drift, automatic incremental loads via CDC or watermark. Airbyte is the self-hosted, cost-controlled alternative with broadly similar connector coverage.
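The watermark-based incremental load that these tools perform is simple enough to sketch. This assumes the source table has a reliable `updated_at` column; CDC against the write-ahead log is the alternative when it does not (and the only option if you also need deletes).

```python
def incremental_extract(rows, last_watermark):
    """Watermark-based incremental load, as Fivetran/Airbyte perform it.

    Pull only rows whose `updated_at` is newer than the stored watermark,
    then advance the watermark to the max value seen this run.
    """
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    next_watermark = max(
        (r["updated_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, next_watermark

source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
batch, wm = incremental_extract(source, last_watermark=200)
# batch contains ids 2 and 3; wm advances to 310, so the next run skips them
```

The classic pitfall, worth mentioning in the interview: hard deletes never appear in a watermark query, which is why CDC is preferred when deletes matter.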

Dagster or Airflow orchestrates the dbt runs after ingestion completes. The schedule targets completion by 7 AM to leave margin before the 8 AM SLA.
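The scheduling reasoning above can be made concrete with a small deadline check. This is a sketch, not orchestrator code: the durations are hypothetical, and in practice you would pull them from Airflow or Dagster run history.

```python
from datetime import datetime, timedelta

def meets_sla(start, ingestion_min, dbt_min, sla_hour=8, margin_min=60):
    """Check whether a sequential ingestion + dbt run finishes with margin.

    Models the plan above: complete by 7 AM to leave an hour before the
    8 AM SLA. Durations in minutes are assumed, not measured.
    """
    finish = start + timedelta(minutes=ingestion_min + dbt_min)
    deadline = start.replace(hour=sla_hour, minute=0) - timedelta(minutes=margin_min)
    return finish <= deadline

run_start = datetime(2026, 1, 18, 4, 0)  # 4 AM kickoff
ok = meets_sla(run_start, ingestion_min=90, dbt_min=45)  # finishes 6:15 AM
```

The margin is the whole point: a schedule that lands exactly at the SLA has no room for a retried connector or a slow dbt run.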

dbt handles all transformations with tested, version-controlled SQL. The three-layer structure (raw, staging, marts) means analysts query only the marts layer, which has well-defined grain, validated relationships, and documented columns.

Schema change handling: Fivetran handles additive changes (new columns) automatically. Breaking changes (column rename, type change) trigger an alert in Fivetran's UI. dbt schema tests catch downstream breaks before they reach the marts layer. The runbook for breaking changes: pause the affected connector, coordinate with the source team on timing, update the dbt model, validate, and re-enable.
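The additive-versus-breaking distinction in the runbook can be expressed as a column-set comparison. This is an illustrative sketch (the function and column names are made up), roughly what a dbt source freshness/schema test or a custom check would flag.

```python
def classify_schema_change(expected, actual):
    """Classify source schema drift as additive, breaking, or unchanged.

    Compares the columns a downstream model expects against what the
    connector landed. New columns are additive; missing columns are
    breaking. A rename shows up as one removed plus one added column.
    """
    expected, actual = set(expected), set(actual)
    added = actual - expected
    removed = expected - actual
    if removed:
        return "breaking", {"added": added, "removed": removed}
    if added:
        return "additive", {"added": added, "removed": removed}
    return "unchanged", {"added": set(), "removed": set()}

status, diff = classify_schema_change(
    expected=["id", "email", "created_at"],
    actual=["id", "email_address", "created_at"],  # column renamed upstream
)
# status == "breaking": the rename reads as email removed, email_address added
```

Note that a rename and a drop-plus-add are indistinguishable from the schema alone, which is exactly why the runbook requires coordinating with the source team rather than auto-remediating.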

What Interviewers Are Actually Evaluating

The system design interview is not primarily a test of whether you know the right tools. It is a test of how you think about problems. Interviewers at senior levels are evaluating four things:

Requirements discipline. Did you ask before designing? A candidate who immediately starts drawing architecture without asking questions is a candidate who would build the wrong thing in production. Ask, even if the questions feel obvious.

Tradeoff awareness. Every architectural decision has a cost. Kafka gives you durability and consumer independence but adds operational complexity. ClickHouse gives you fast analytical queries but requires dedicated infrastructure. Saying what you chose and why, including what you gave up, demonstrates senior thinking.

Failure mode reasoning. What breaks? How bad is it when it breaks? How do you recover? A design that has no answer for what happens when the ingestion layer goes down is not a production design.

Communication clarity. Can you explain the system to someone who is not you? Use diagrams if you have a whiteboard. Summarize each layer before moving on. Check whether the interviewer is following. The ability to communicate architecture clearly is itself a senior engineering skill.

Phrases That Signal Senior Thinking

A few phrasings that consistently land well in system design interviews:

"Before I propose anything, let me make sure I understand the requirements." Opens the requirements discussion without asking permission.

"I would choose X here because of Y, and the tradeoff is Z." Every tool choice should follow this pattern.

"The failure mode I am most concerned about here is..." Proactively surfacing risks before being asked shows production experience.

"I would validate this design by..." Showing how you would prove the design works before committing to it.

"If the volume grew 10x, the bottleneck would be X and we would address it by Y." Scale reasoning does not need to be exhaustive; it needs to demonstrate that you thought about it.

System design interviews are more comfortable once you internalize that the goal is not to produce the perfect architecture. The goal is to demonstrate that you are the kind of engineer who asks before building, justifies every decision, and thinks about what goes wrong. The architecture itself is secondary to the thinking process you show to get there.
