Data Engineering Interview Questions: What Senior Roles Actually Ask
Ryan Kirsch · December 14, 2025 · 10 min read
Most data engineering interview prep resources focus on SQL puzzles and system design frameworks. Senior-level interviews go further: they probe how you handle ambiguity, how you reason about failure, and whether your instincts about trade-offs match the judgment a senior engineer is expected to have. Below are the questions that actually separate candidates at the senior level, along with what interviewers are really evaluating.
System Design Questions
System design is the highest-signal section of a senior DE interview. The interviewer is not looking for the right answer -- they are looking for the right questions.
Q: Design a pipeline that ingests clickstream events and produces a daily active users metric available by 7 AM.
What they are evaluating: Do you ask about scale before proposing architecture? Do you identify the late-arrival problem (events submitted after midnight for yesterday's session)? Do you think about failure modes (what if the pipeline fails at 6:55 AM)?
Strong answer structure: clarify scale (events/day), clarify consumer (BI dashboard vs. API), identify the late-data window (typically 2-4 hours for clickstream), propose architecture (Kafka for ingestion, Snowflake/BigQuery for aggregation, dbt or Spark for the DAU model), address failure (monitoring, alerting, on-call runbook, rerun strategy).
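The core of the DAU model can be sketched in a few lines. This is an illustrative stand-in for the warehouse aggregation, not any specific stack's API: the key decision it shows is bucketing by *event* date rather than ingestion date, so rerunning the aggregation after the late-arrival window closes picks up events submitted after midnight.

```python
from collections import defaultdict
from datetime import date, datetime

def dau_by_day(events):
    """events: iterable of (user_id, event_time) pairs. Returns {date: DAU}."""
    users_per_day = defaultdict(set)
    for user_id, event_time in events:
        # Bucket by event date, not ingestion date: a rerun after the
        # late-data window picks up stragglers in the correct day.
        users_per_day[event_time.date()].add(user_id)
    return {day: len(users) for day, users in users_per_day.items()}

events = [
    ("u1", datetime(2026, 1, 15, 9, 0)),
    ("u2", datetime(2026, 1, 15, 23, 50)),
    ("u2", datetime(2026, 1, 15, 23, 59)),  # same user, same day: counted once
]
dau = dau_by_day(events)  # one day, two distinct users
```

In a real pipeline the set-per-day step is the warehouse's `COUNT(DISTINCT user_id) GROUP BY event_date`; the sketch only illustrates the grain decision.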
Q: Your company is switching from Postgres to a new operational database. How do you migrate the data pipelines that depend on it?
What they are evaluating: Do you think about dual-write periods, data validation, stakeholder communication, and rollback? Do you identify the hidden risks (schema differences, type mapping, timezone handling)?
Strong answer: define success criteria before touching anything, run both sources in parallel for a validation period, build a reconciliation check that compares counts and key aggregates between old and new, plan the cutover with a rollback window, communicate the timeline to downstream consumers before the migration.
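The reconciliation check mentioned above can be a very small piece of code. This is a hedged sketch with assumed shapes (lists of `(id, amount)` tuples and a single aggregate); in practice the inputs would be query results from the old and new databases:

```python
def reconcile(old_rows, new_rows, tolerance=0.0):
    """Compare row counts and a key aggregate between the two sources.

    Each argument is a list of (id, amount) tuples; returns a list of
    human-readable mismatch descriptions (empty list means clean).
    """
    issues = []
    if len(old_rows) != len(new_rows):
        issues.append(f"row count: old={len(old_rows)} new={len(new_rows)}")
    old_total = sum(amount for _id, amount in old_rows)
    new_total = sum(amount for _id, amount in new_rows)
    # A small tolerance absorbs float/type-mapping noise between engines.
    if abs(old_total - new_total) > tolerance:
        issues.append(f"amount total: old={old_total} new={new_total}")
    return issues
```

Running this on a schedule during the dual-write period, and alerting on a non-empty result, is what turns "we think the migration is safe" into evidence.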
Q: How would you build a self-serve analytics platform where business users can query data without engineering involvement?
What they are evaluating: Do you think about the semantic layer (consistent metric definitions)? Do you identify the governance problem (who decides what data is accessible and to whom)? Do you know the difference between self-serve that works and self-serve that creates a mess of conflicting numbers?
Pipeline Failure and Production Mindset Questions
Q: Your morning ETL ran successfully at 6 AM but revenue numbers in the dashboard are wrong. Walk me through how you investigate.
What they are evaluating: Do you have a systematic debugging process? Do you know the difference between a pipeline failure (something broke) and a data quality failure (it ran but produced wrong output)? Do you know where to look (dbt tests, row counts, source comparison)?
Strong answer: check if the pipeline ran at all and completed (scheduler logs), check row counts at each layer (bronze/silver/gold), compare source vs. destination totals for the relevant time period, check dbt test results for the relevant models, identify whether the error is in the transformation logic or the source data, communicate the status to stakeholders with an ETA before you have a fix.
Q: A source API you depend on has been returning duplicate records for the last 3 days. How do you handle this?
What they are evaluating: Idempotency instincts. Do you know how to identify the scope of impact? Do you have a deduplication strategy? Do you know how to communicate this to stakeholders without causing panic?
Strong answer: determine the blast radius (which tables were affected, which reports or dashboards consumed those tables), identify the dedup key (what uniquely identifies a record), run a deduplication query on the affected period, validate the fixed counts against source, communicate impact and resolution timeline to stakeholders, add a data quality check to detect this pattern in the future.
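The dedup step itself is usually a window function in SQL; the same keep-the-latest-copy logic, sketched in plain Python with assumed field names (`id` as the dedup key, `ingested_at` as the tiebreaker):

```python
def deduplicate(records):
    """records: list of dicts with 'id' and 'ingested_at' keys.

    Keeps the most recently ingested copy of each id -- the Python
    equivalent of ROW_NUMBER() OVER (PARTITION BY id ORDER BY ingested_at DESC)
    filtered to row 1.
    """
    latest = {}
    for rec in records:
        key = rec["id"]
        if key not in latest or rec["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"id": "a", "ingested_at": 1},
    {"id": "a", "ingested_at": 2},  # duplicate from the broken API
    {"id": "b", "ingested_at": 1},
]
clean = deduplicate(raw)
```

The follow-up data quality check is the cheap part: a uniqueness test on the dedup key in the affected tables would have caught this on day one instead of day three.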
Q: Your incremental dbt model has been running cleanly, but you discover it has been silently dropping late-arriving records for 6 months. What do you do?
What they are evaluating: Do you know what a full refresh costs vs. a targeted remediation? Do you prioritize communicating the issue or fixing it first? Do you know how to prevent it from happening again?
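The usual prevention is an incremental filter with a lookback window: instead of processing only rows newer than the high-water mark, reprocess everything inside the window so late arrivals land on the next run. A minimal sketch of that predicate, with the 3-day window as an assumed value you would tune to the source's actual lateness profile:

```python
from datetime import datetime, timedelta

def rows_to_process(rows, high_water_mark, lookback=timedelta(days=3)):
    """rows: list of (row_id, event_time) pairs.

    A plain `event_time > high_water_mark` filter silently drops late
    arrivals; subtracting a lookback window reprocesses the recent past
    so they are picked up (the model must then upsert, not append).
    """
    cutoff = high_water_mark - lookback
    return [row for row in rows if row[1] > cutoff]

rows = [
    ("r1", datetime(2026, 1, 5)),   # old, already loaded
    ("r2", datetime(2026, 1, 8)),   # late arrival inside the window
    ("r3", datetime(2026, 1, 11)),  # genuinely new
]
picked = rows_to_process(rows, high_water_mark=datetime(2026, 1, 10))
```

In dbt this is the `is_incremental()` block's WHERE clause; the lookback makes runs slightly more expensive in exchange for correctness.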
SQL and Technical Questions That Go Beyond Basics
Q: Write a query to find users who were active in January but not in February.
The basic answer: a LEFT JOIN with an IS NULL filter, or a NOT IN subquery (which silently returns no rows if the February set contains a NULL). The senior answer also asks: what is the grain of the events table? What counts as “active” -- session-level or event-level activity? And then produces the query with a note about the approach:
-- Left join approach (handles NULLs correctly)
SELECT jan.user_id
FROM (
    SELECT DISTINCT user_id
    FROM events
    WHERE event_date >= '2026-01-01' AND event_date < '2026-02-01'
) jan
LEFT JOIN (
    SELECT DISTINCT user_id
    FROM events
    WHERE event_date >= '2026-02-01' AND event_date < '2026-03-01'
) feb ON jan.user_id = feb.user_id
WHERE feb.user_id IS NULL;

-- EXCEPT approach (cleaner, same result)
SELECT DISTINCT user_id FROM events
WHERE event_date >= '2026-01-01' AND event_date < '2026-02-01'
EXCEPT
SELECT DISTINCT user_id FROM events
WHERE event_date >= '2026-02-01' AND event_date < '2026-03-01';

Q: Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER(). When does the difference matter in production?
What they are evaluating: Have you actually hit a real bug caused by using the wrong window function? The answer should include a concrete scenario where the choice matters -- typically deduplication logic where ROW_NUMBER is correct and RANK could include duplicate rows.
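The tie-handling difference is easy to show outside SQL. Plain Python stands in for the window functions here, over a single already-sorted partition: on tied sort keys, ROW_NUMBER breaks ties arbitrarily (so filtering to 1 keeps exactly one row per group), while RANK gives ties the same number (so the same filter keeps every tied row and the "dedup" leaves duplicates behind).

```python
def row_number(sorted_values):
    """ROW_NUMBER over a sorted partition: ties get distinct numbers."""
    return list(range(1, len(sorted_values) + 1))

def rank(sorted_values):
    """RANK: ties share a number; the next distinct value skips ahead."""
    out, prev = [], None
    for i, v in enumerate(sorted_values):
        out.append(out[-1] if v == prev else i + 1)
        prev = v
    return out

def dense_rank(sorted_values):
    """DENSE_RANK: ties share a number; no gaps after a tie."""
    out, r, prev = [], 0, None
    for v in sorted_values:
        if v != prev:
            r += 1
        out.append(r)
        prev = v
    return out

ties = [10, 10, 20]
# row_number(ties) -> [1, 2, 3]: filtering to 1 keeps exactly one row
# rank(ties)       -> [1, 1, 3]: filtering to 1 keeps BOTH tied rows
# dense_rank(ties) -> [1, 1, 2]
```

This is exactly the production bug the interviewer is fishing for: a dedup written with RANK quietly passes every tied duplicate through.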
Q: What is a slowly changing dimension? How would you implement Type 2 in dbt?
What they are evaluating: Do you understand the full lifecycle (valid_from, valid_to, is_current flag)? Can you describe the snapshot strategy in dbt? Do you know the performance implications of Type 2 on a large dimension table?
# dbt_project.yml
snapshots:
  myproject:
    dim_customers_snapshot:
      +target_schema: snapshots
      +strategy: timestamp
      +unique_key: customer_id
      +updated_at: updated_at
-- snapshots/dim_customers_snapshot.sql
{% snapshot dim_customers_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
    invalidate_hard_deletes=True,
) }}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}

Behavioral Questions That Reveal Engineering Judgment
Q: Tell me about a time you had to push back on a stakeholder request. How did you handle it?
What they are evaluating: Can you disagree professionally? Do you explain the trade-offs rather than just saying no? Did you propose an alternative?
Strong structure: describe the request and why you had concerns (technical risk, timeline, scope creep), explain how you communicated the concern (specific trade-offs, not just “it's hard”), describe the outcome (negotiated a simpler version, found a middle path, agreed to phase it).
Q: Describe the most complex data pipeline you have built. What would you do differently now?
What they are evaluating: Self-awareness and growth. The “what would you do differently” is the real question. A candidate who cannot identify something they would change either lacks experience or lacks honesty.
Q: How do you stay current with the data engineering ecosystem? What was the last thing you learned and how did you apply it?
What they are evaluating: Genuine curiosity vs. resume padding. The specific “how did you apply it” follow-up separates people who read newsletters from people who actually experiment. Have a concrete answer ready -- a specific tool, pattern, or approach you learned and used.
Questions You Should Ask the Interviewer
The questions you ask reveal as much as the questions you answer. These signal senior-level thinking:
- “What does a good first 90 days look like on this team? What would make you say someone hit the ground running?”
- “What is the biggest data quality or reliability problem the team is dealing with right now?”
- “How does the data team collaborate with engineering on schema changes and new data sources?”
- “What does on-call look like for this team? What kinds of issues typically come up?”
- “How are technical decisions made? Does the data team own its own tooling choices, or does it flow through a platform org?”
These questions demonstrate that you are evaluating the role as seriously as they are evaluating you -- which is exactly what a senior engineer does.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.