Data engineering has become one of the most in-demand specializations in tech. Every AI company, every analytics-driven startup, and every enterprise undergoing digital transformation needs people who can build reliable data pipelines, model complex domains, and keep data flowing at scale. The interview process reflects this breadth: you will be tested on everything from SQL fundamentals to distributed system design to how you handle a 3 AM data quality incident.

This guide covers 40 real data engineer interview questions across six categories, with detailed answers that go beyond surface-level definitions. Whether you are preparing for a role at a company like Databricks, Snowflake, or a fast-growing AI startup, these questions reflect what hiring managers actually ask in 2026.

What Data Engineering Interviews Look Like in 2026

Before diving into the questions, it helps to understand the typical interview structure. Most data engineering interviews follow a similar pattern across companies, though the emphasis varies depending on whether the company is a data platform vendor, an AI lab, or a product company with a data team.

Typical Rounds4–5 rounds
Timeline2–4 weeks
SQL AssessmentAlmost universal — expect a live SQL coding round
System DesignRequired for senior roles, common for mid-level
Take-HomeSome companies use a data pipeline coding challenge (2–4 hours)
Behavioral1 round focused on stakeholder communication and incident handling
Salary Range$150k – $350k+ TC
40
Questions Covered
6
Topic Categories
$350k+
Senior DE Comp (Top Co.)

SQL & Data Modeling (Questions 1–10)

SQL is the lingua franca of data engineering. Every interview will test your SQL skills, from basic joins to complex window functions and performance optimization. Data modeling questions test whether you can design schemas that serve both analytical queries and operational workloads.

1. Explain the three normal forms and when you would denormalize.

First Normal Form (1NF) requires atomic values in each column — no repeating groups or arrays. Second Normal Form (2NF) adds the requirement that every non-key column depends on the entire primary key, not just part of it. Third Normal Form (3NF) requires that non-key columns depend only on the primary key, not on other non-key columns (no transitive dependencies).

You would denormalize when read performance matters more than write consistency — which is common in analytical workloads. Data warehouses intentionally denormalize into star or snowflake schemas because joins are expensive at scale, and analytical queries benefit from pre-computed wide tables. The trade-off is increased storage and more complex update logic.

2. What is a star schema? How does it differ from a snowflake schema?

A star schema organizes data around a central fact table (transactions, events, measurements) surrounded by dimension tables (customers, products, dates). Dimension tables are denormalized — all attributes are in one table. A snowflake schema normalizes the dimensions further into sub-dimensions. For example, a product dimension might be split into product, category, and subcategory tables.

Star schemas are preferred in most modern warehouses because they are simpler to query and faster for aggregations. Snowflake schemas save storage but add join complexity. With modern columnar storage and compression, the storage savings rarely justify the added query complexity.

3. What are slowly changing dimensions (SCDs)? Describe Type 1, Type 2, and Type 3.

Slowly changing dimensions handle the fact that dimension data changes over time (a customer moves, a product is reclassified). Type 1 overwrites the old value — you lose history but keep things simple. Type 2 creates a new row with version tracking (effective_from, effective_to, is_current flag) — you preserve full history but the table grows and joins become more complex. Type 3 adds a column for the previous value (e.g., current_address and previous_address) — you track one level of history without table growth, but it is limited to a single change.

Type 2 is by far the most common in practice. Most companies want full history for compliance, analytics, and debugging data issues.

4. Write a SQL query using window functions to find the top 3 orders by revenue per customer.

This is a classic interview question that tests your understanding of window functions and ranking:

Sample Answer WITH ranked AS (SELECT customer_id, order_id, revenue, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY revenue DESC) AS rn FROM orders) SELECT customer_id, order_id, revenue FROM ranked WHERE rn <= 3;

Interviewers will often follow up by asking the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). ROW_NUMBER assigns unique sequential numbers even for ties. RANK skips numbers after ties (1, 1, 3). DENSE_RANK does not skip (1, 1, 2). Choose based on whether you want exactly 3 rows per customer or all rows tied for the top 3 values.

5. How do you optimize a slow SQL query?

Start with the execution plan (EXPLAIN/EXPLAIN ANALYZE). Look for full table scans, nested loops on large tables, and missing indexes. Common fixes: add indexes on join and filter columns, rewrite subqueries as CTEs or joins, reduce the dataset early with WHERE clauses, avoid SELECT *, and partition large tables. In warehouse contexts, clustering keys and materialized views are often more effective than traditional indexes.

6. What is the difference between a LEFT JOIN and a LEFT ANTI JOIN?

A LEFT JOIN returns all rows from the left table plus matching rows from the right table (NULLs for non-matches). A LEFT ANTI JOIN returns only rows from the left table that have no match in the right table. Anti joins are useful for finding orphaned records, data quality checks, and incremental loads (what is new since last run). In SQL, you typically express an anti join as LEFT JOIN ... WHERE right.key IS NULL.

7. Explain partitioning vs. clustering in a data warehouse.

Partitioning physically divides a table into segments based on a column (usually date). Queries that filter on the partition column only scan relevant partitions, dramatically reducing I/O. Clustering sorts data within partitions by specified columns, improving query performance for range filters and joins on those columns. In BigQuery, partitioning is your first optimization lever; clustering adds secondary sort order within partitions. In Snowflake, micro-partitioning is automatic, and clustering keys define the sort order.

8. How would you model a many-to-many relationship in a data warehouse?

Use a bridge table (also called a junction or associative table). For example, if students enroll in multiple courses and courses have multiple students, create an enrollment fact table with student_id, course_id, and any attributes of the relationship (enrollment_date, grade). In dimensional modeling, bridge tables sit between fact and dimension tables and are sometimes called factless fact tables when they contain only keys.

9. What is a surrogate key and why use one?

A surrogate key is a system-generated identifier (usually an auto-incrementing integer or UUID) that replaces the natural business key as the primary key. Surrogate keys are used because natural keys can change (a product SKU gets reassigned), can be composite (making joins expensive), or may not exist consistently across source systems. They also enable SCD Type 2 tracking — you need a unique key per version of a dimension record.

10. How do you handle NULL values in aggregate functions?

Most aggregate functions (SUM, AVG, COUNT(column), MIN, MAX) ignore NULLs. COUNT(*) counts all rows including NULLs, while COUNT(column) counts only non-NULL values. This distinction matters for averages — AVG ignores NULLs, which can produce different results than dividing SUM by COUNT(*). Use COALESCE to replace NULLs with defaults when the business logic requires it, and always document your NULL-handling decisions in data contracts or dbt model descriptions.

ETL/ELT Pipelines (Questions 11–18)

Pipeline design is the core competency of data engineering. These questions test whether you can build systems that reliably move and transform data at scale — and recover gracefully when things break.

11. What is the difference between ETL and ELT? When would you use each?

ETL (Extract, Transform, Load) transforms data before loading it into the target system, typically using an external processing engine like Spark or Informatica. ELT (Extract, Load, Transform) loads raw data into the target warehouse first, then transforms it in place using the warehouse's own compute.

ELT has become the dominant pattern in 2026 because modern cloud warehouses (Snowflake, BigQuery, Redshift) have massive, elastic compute. It is cheaper, faster, and more flexible: raw data is always preserved for reprocessing, and transformations are written in SQL (via dbt), making them accessible to analytics engineers and analysts. Use ETL when data must be cleaned before it enters the warehouse (PII scrubbing, compliance requirements) or when the volume exceeds what the warehouse can handle cost-effectively.

Key Insight The shift from ETL to ELT is not just a technical choice — it reflects a philosophical change. ELT says "load everything, decide later." This enables more agile analytics but requires strong data governance and cataloging.

12. How does Apache Airflow work? What are DAGs, tasks, and operators?

Airflow is a workflow orchestration platform that defines, schedules, and monitors data pipelines as directed acyclic graphs (DAGs). A DAG is a collection of tasks with defined dependencies — no cycles allowed. Each task uses an operator (PythonOperator, BashOperator, BigQueryOperator, etc.) to execute a unit of work. The scheduler reads DAG definitions (Python files), determines which tasks are ready to run based on dependencies and schedule, and dispatches them to workers. Key concepts: XComs for passing data between tasks, sensors for waiting on external conditions, and task groups for organizing complex DAGs.

13. What is dbt and how does it fit into the modern data stack?

dbt (data build tool) is a transformation framework that runs SQL-based transformations inside the data warehouse. It sits in the "T" of ELT — after data is loaded by tools like Fivetran, Airbyte, or custom extractors. dbt manages dependencies between models, runs tests (schema tests, custom data tests), generates documentation, and supports version-controlled transformations. It turns your warehouse into a development environment with software engineering practices: version control, CI/CD, testing, and modularity. For data engineering interviews at companies using the modern data stack, dbt fluency is increasingly expected.

14. Explain batch processing vs. stream processing. When do you use each?

Batch processing handles data in discrete chunks on a schedule (hourly, daily). It is simpler to build, easier to debug, and cheaper to run. Stream processing handles data continuously as it arrives, with latencies from milliseconds to seconds. Use batch when freshness requirements are hours or days (daily dashboards, monthly reports, training ML models). Use streaming when the business needs real-time or near-real-time data (fraud detection, live dashboards, event-driven architectures). Many companies use a hybrid: streaming for critical paths and batch for everything else. Tools like Apache Spark support both paradigms.

15. How do you handle schema evolution in a data pipeline?

Schema evolution is inevitable — source systems add columns, change types, and rename fields. Handle it by: (1) using schema-flexible formats like Parquet, Delta Lake, or Apache Iceberg that support column additions and type widening; (2) implementing a schema registry or contract between producers and consumers; (3) designing pipelines to be additive-only (new columns are safe, removing columns requires migration); (4) using dbt's source freshness tests and schema tests to detect unexpected changes; (5) maintaining a staging layer where raw data lands before transformation, giving you a buffer to handle breaking changes.

16. What is idempotency and why is it critical for data pipelines?

An idempotent operation produces the same result whether you run it once or multiple times. In data pipelines, idempotency means re-running a pipeline for the same time window produces the same output without duplicating data. This is critical because pipelines fail and must be retried. Achieve idempotency by: using MERGE/UPSERT instead of INSERT, processing data by time partitions (overwrite the entire partition on re-run), using deduplication keys, and designing each pipeline step to be independently re-runnable. Without idempotency, every retry risks creating duplicate records that cascade through downstream systems.

17. How would you design a pipeline to incrementally load data from a source system?

Track the high-water mark — the maximum value of a monotonically increasing column (timestamp, auto-increment ID) from the last successful run. On each run, extract only records where the tracking column exceeds the high-water mark, transform them, and load them into the target. Store the high-water mark in a metadata table after successful completion. Handle edge cases: late-arriving data (use a lookback window), deletes in the source (implement soft deletes or use CDC), and clock skew between systems. For more robust incremental loads, use Change Data Capture (CDC) via tools like Debezium, which captures every insert, update, and delete from the source database's transaction log.

18. What is Apache Spark and when would you use it over SQL-based transformations?

Spark is a distributed computing engine for large-scale data processing. It processes data in memory across a cluster, supporting batch, streaming (Structured Streaming), ML (MLlib), and graph processing. Use Spark over SQL when: (1) data volumes exceed what a single warehouse can handle efficiently; (2) transformations involve complex logic that is difficult or impossible in SQL (custom ML feature engineering, graph algorithms, iterative processing); (3) you need to process unstructured data (text, images, logs); (4) you are building ML training pipelines. For most analytics transformations on structured data, SQL in the warehouse (via dbt) is simpler and cheaper.

Data Warehousing (Questions 19–24)

Understanding how modern data warehouses work under the hood — and the trade-offs between platforms — is essential for senior data engineering roles.

19. Compare Snowflake, BigQuery, and Redshift. What are their architectural differences?

Snowflake separates compute and storage completely — you can spin up multiple virtual warehouses against the same data, scale them independently, and shut them down when not in use. BigQuery is serverless: no clusters to manage, you pay per query (bytes scanned) or flat-rate. It uses a columnar storage format called Capacitor and a distributed execution engine called Dremel. Redshift is cluster-based (provisioned nodes) with more traditional MPP architecture, though Redshift Serverless now offers a pay-per-query option. Snowflake excels at multi-tenant workload isolation. BigQuery excels at massive ad-hoc queries with zero ops. Redshift integrates tightly with the AWS ecosystem.

20. What is a materialized view and when should you use one?

A materialized view is a precomputed query result stored as a physical table. Unlike a regular view (which re-executes the query every time), a materialized view serves cached results. Use them for expensive aggregations that are queried frequently but do not need to be perfectly real-time. The trade-off is staleness: materialized views must be refreshed (manually, on schedule, or incrementally). Modern warehouses support incremental refresh for certain query patterns, making materialized views practical for dashboards and reporting layers.

21. Explain the medallion architecture (bronze, silver, gold).

The medallion architecture organizes a data lake into three quality tiers. Bronze is the raw, unprocessed data exactly as ingested from sources — it is your source of truth and append-only. Silver applies data cleaning, deduplication, standardization, and joins across sources to create conformed, validated datasets. Gold contains business-level aggregations, metrics, and feature tables optimized for specific consumption patterns (dashboards, ML models, APIs). This layered approach provides clear data lineage, enables reprocessing from raw data, and separates concerns between ingestion, quality, and consumption.

22. How do Delta Lake and Apache Iceberg improve data lakes?

Traditional data lakes (files on S3/GCS/ADLS) lack transactions, schema enforcement, and time travel. Delta Lake and Iceberg add ACID transactions (safe concurrent reads and writes), schema evolution, time travel (query historical versions), and efficient upserts/deletes. They achieve this through metadata layers that track file-level changes. Delta Lake is tightly integrated with Databricks and Spark. Iceberg is more vendor-neutral and has gained broad adoption across engines (Spark, Trino, Flink, Snowflake). Both solve the "data swamp" problem where unmanaged files become unreliable.

23. What is data partitioning and how do you choose a partition key?

Partitioning divides a large table into smaller physical segments, enabling query engines to skip irrelevant data (partition pruning). Choose the partition key based on how data is queried: date is the most common because most analytical queries filter by time range. Good partition keys have moderate cardinality — too few partitions (only 3 regions) provides minimal pruning benefit, while too many partitions (partitioning by user_id with millions of users) creates tiny files that hurt performance. Aim for partition sizes of 100MB to 1GB for optimal file I/O.

24. How would you migrate a data warehouse from one platform to another?

Warehouse migrations are complex multi-month projects. The approach: (1) inventory all objects (tables, views, stored procedures, permissions, scheduled queries); (2) convert SQL dialect differences (every warehouse has syntax quirks); (3) migrate in layers — start with bronze/raw data, then silver, then gold; (4) run dual systems in parallel, comparing outputs for a validation period; (5) migrate downstream consumers (dashboards, reports, APIs) incrementally; (6) handle the long tail of edge cases (timezone handling, NULL behavior, decimal precision differences). The biggest risk is not the migration itself but the downstream systems that depend on undocumented warehouse behavior.

System Design for Data (Questions 25–30)

System design questions test your ability to architect data infrastructure from scratch. Interviewers care less about the "right" answer and more about how you think through trade-offs, estimate scale, and handle failure modes.

25. Design an end-to-end data pipeline for an e-commerce company.

Start by clarifying requirements: what data sources (orders, clickstream, inventory, user profiles), what consumers (analytics dashboards, ML models, product recommendations), and what latency requirements (real-time for fraud, daily for reports). Architecture: ingest transactional data via CDC (Debezium) into a message queue (Kafka), clickstream via an event collector, and third-party data via scheduled API pulls. Land everything in a data lake (S3/GCS) in the bronze layer. Use Spark or dbt to clean and model data into silver and gold layers. Serve dashboards from the warehouse, ML features from a feature store, and real-time signals from a streaming layer. Key considerations: data quality checks at each layer, alerting on pipeline failures, and cost management across compute and storage.

26. How would you design a real-time analytics system?

Real-time analytics requires a streaming architecture. Events flow from producers (application servers, IoT devices) through a message broker (Kafka, Kinesis) to a stream processing engine (Flink, Spark Structured Streaming) that computes aggregations in time windows (tumbling, sliding, session). Results are written to a serving layer optimized for low-latency reads (ClickHouse, Druid, or Redis for small result sets). Key design decisions: exactly-once vs. at-least-once processing semantics, watermark strategies for late-arriving data, and how to handle out-of-order events. The biggest trade-off is between latency and accuracy — tighter time windows give fresher data but miss late events.

27. Design a data quality monitoring framework.

A data quality framework needs five components: (1) schema validation (expected columns, types, nullability); (2) volume checks (row count anomaly detection — did today's load have significantly more or fewer rows than expected?); (3) freshness monitoring (when was the table last updated?); (4) value distribution checks (statistical profiling — did the average order value suddenly change by 50%?); (5) cross-system reconciliation (do counts match between source and target?). Implement checks at each pipeline stage (ingestion, transformation, serving). Alert on failures with severity levels: critical (pipeline halt), warning (investigate), info (log for trending). Tools like Great Expectations, dbt tests, and Monte Carlo automate many of these checks.

28. How would you design a data lake architecture for a company with 100+ data sources?

At this scale, governance and discoverability matter as much as the technical architecture. Design: a centralized data lake on object storage (S3/GCS) with medallion architecture. Each source gets a dedicated bronze namespace. Use a data catalog (DataHub, Amundsen, or cloud-native options) for discovery. Implement a schema registry for all sources. Standardize ingestion patterns: CDC for databases, API connectors for SaaS tools, event streaming for real-time sources. Enforce access controls at the namespace level. Build a self-service silver layer where domain teams own their transformations. The gold layer serves cross-domain use cases. Critical: invest heavily in metadata management and data lineage from day one — at 100+ sources, it is impossible to debug issues without knowing where data came from and how it was transformed.

29. Design a Change Data Capture (CDC) pipeline.

CDC captures row-level changes (inserts, updates, deletes) from a source database's transaction log. Architecture: deploy Debezium (or a similar connector) to read the database's write-ahead log (WAL for PostgreSQL, binlog for MySQL). Publish change events to Kafka topics, one per source table. A stream processor (Flink or Spark Streaming) consumes events and applies them to the target: inserts become INSERTs, updates become UPSERTs (MERGE), and deletes become soft deletes (set is_deleted flag + deleted_at timestamp). Key considerations: initial snapshot (backfill existing data before CDC starts), schema evolution (new columns in source), and exactly-once delivery (use Kafka consumer offsets and transactional writes to the target). CDC enables near-real-time data replication without impacting source system performance.

30. How would you handle a pipeline that processes 10TB of data daily?

At 10TB/day, you need to think carefully about cost, parallelism, and failure recovery. Use columnar formats (Parquet, ORC) for 5-10x compression over CSV/JSON. Partition data by date and a secondary key (customer_id, region) for parallel processing. Use Spark with auto-scaling clusters — process partitions in parallel across workers. Design for incremental processing: only process new/changed data, not the full 10TB. Implement checkpointing so that failures restart from the last checkpoint, not from scratch. Monitor costs closely — at this scale, a misconfigured shuffle or unnecessary full scan costs thousands of dollars. Consider data sampling for development and testing: write pipeline logic against a 1% sample, validate against full data only in staging.

Python & Coding (Questions 31–35)

Most data engineering interviews include a coding round focused on Python. These questions test whether you can write production-quality data processing code, not just scripts that work on small datasets.

Python PySpark pandas SQL Airflow dbt Spark Kafka

31. How do you handle large datasets in pandas that do not fit in memory?

Several approaches: (1) use chunked reading with pd.read_csv(chunksize=N) to process data in batches; (2) use more memory-efficient dtypes (category for low-cardinality strings, int32 instead of int64, float32 instead of float64); (3) read only the columns you need with usecols parameter; (4) switch to a library designed for larger-than-memory data: Polars (Rust-based, much faster), Dask (parallel pandas), or Vaex (lazy evaluation). For truly large datasets (above 10GB), pandas is the wrong tool — use PySpark or process in the warehouse with SQL. Knowing when to stop using pandas and switch to a distributed framework is what distinguishes senior data engineers.

32. Write a Python function that deduplicates records based on a composite key, keeping the most recent.

Sample Answer def deduplicate(records, key_fields, timestamp_field): seen = {}; for record in sorted(records, key=lambda r: r[timestamp_field]): key = tuple(record[f] for f in key_fields); seen[key] = record; return list(seen.values())

Interviewers will follow up on edge cases: what if the timestamp field has NULLs? What about timezone-naive vs. timezone-aware timestamps? How would you scale this to millions of records? (Answer: partition by key hash and deduplicate each partition in parallel, or use a database MERGE statement.)

33. Explain PySpark's lazy evaluation and why it matters.

PySpark uses lazy evaluation: transformations (map, filter, groupBy) are not executed when called. Instead, Spark builds a logical execution plan (DAG). Execution only happens when an action is triggered (collect, count, write). This matters because it allows Spark's Catalyst optimizer to rearrange and optimize the entire computation before running it. For example, if you filter after a join, the optimizer can push the filter before the join, dramatically reducing the amount of data processed. Understanding lazy evaluation helps you write efficient Spark code and debug performance issues using the Spark UI's execution plan view.

34. How do you implement data quality checks in Python?

Build reusable check functions that validate data at each pipeline stage: null_check(df, columns) verifies no unexpected NULLs, uniqueness_check(df, key_columns) confirms primary key integrity, range_check(df, column, min, max) catches outliers, referential_check(df, lookup_df, key) validates foreign keys. Use the Great Expectations library for a more comprehensive framework — it generates expectations from data profiles, runs validations, and produces documentation. In production pipelines, checks should fail loudly: halt the pipeline if critical checks fail, log warnings for non-critical anomalies, and always produce a data quality report for each run.

35. Write a PySpark job that reads a JSON file, filters records, and writes partitioned Parquet.

Sample Answer from pyspark.sql import SparkSession; spark = SparkSession.builder.appName("etl").getOrCreate(); df = spark.read.json("s3://bucket/raw/events/"); filtered = df.filter(df.event_type == "purchase").filter(df.amount > 0); filtered.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/processed/purchases/")

Follow-up questions: why partitionBy("date")? (Common filter column, enables partition pruning.) Why mode("overwrite")? (Idempotency — re-running rewrites the same output.) What if the JSON has nested fields? (Use explode() to flatten arrays, or select("nested.*") to unpack structs.)

Behavioral Questions (Questions 36–40)

Behavioral questions for data engineers focus on stakeholder communication, incident management, and prioritization. These are not afterthoughts — many data engineering hires fail because of communication skills, not technical ones.

36. Tell me about a time you discovered a data quality issue in production. How did you handle it?

Structure your answer using the situation-action-result framework. Describe the specific issue (duplicate records, missing data, incorrect joins), how you discovered it (alerting, user report, QA check), the immediate mitigation (pause downstream consumers, communicate to stakeholders), the root cause analysis (what broke and why), the fix (pipeline change, data backfill), and the prevention measure (new tests, monitoring, schema contracts). Interviewers want to see that you think beyond the fix to prevention. A good answer includes: "After fixing the issue, I added three data quality checks so this class of problem would be caught automatically in the future."

37. How do you prioritize requests from multiple stakeholders?

Acknowledge that prioritization is the hardest part of the job, not the SQL. Framework: assess each request on business impact (revenue, compliance, user-facing), urgency (deadline-driven or flexible), and effort (quick win or multi-week project). Communicate transparently: share your prioritized backlog with stakeholders so they see where their request sits and why. Push back when appropriate — not every "urgent" request is actually urgent. Build self-service tools (dbt models, well-documented data marts) so that common requests do not require data engineering involvement at all.

38. Describe a data pipeline you designed from scratch. What trade-offs did you make?

This is your chance to demonstrate architectural thinking. Walk through the end-to-end design: what problem it solved, what technologies you chose and why, what alternatives you considered and rejected, what trade-offs you made (latency vs. cost, complexity vs. flexibility, build vs. buy). Great answers include specific numbers: "The pipeline processed 2 million events per hour with a p95 latency of 3 minutes" or "We chose daily batch over streaming because the business only needed data refreshed once per day, and streaming would have tripled our infrastructure costs."

39. How do you handle a situation where the source data is unreliable?

Unreliable sources are the norm, not the exception. Strategies: implement defensive schema validation at ingestion (reject malformed records to a dead-letter queue), build reconciliation checks that compare source counts against what landed in your warehouse, add data freshness monitoring to detect when a source stops sending data, design transformations to handle NULLs and unexpected values gracefully, and maintain a direct relationship with source system owners. Document known source issues and their workarounds. The goal is to make your pipeline resilient to source problems without losing data or producing silent errors.

40. What would you do in your first 90 days as a new data engineer on this team?

First 30 days: learn the data architecture, read existing pipeline code, understand the data model, meet stakeholders and learn their top pain points. Days 30-60: pick up a medium-complexity project that addresses a real pain point, establish credibility through delivery. Days 60-90: propose improvements to the data platform based on what you have learned — maybe it is better testing, faster pipelines, or improved documentation. The key signal interviewers want: you listen before you prescribe, you deliver before you architect, and you think about the team's productivity, not just your own projects.

ETL vs. ELT: A Quick Comparison

ApproachETL
Transform LocationExternal engine (Spark, Informatica)
Best ForPII scrubbing, compliance, very large volumes
FlexibilityAny language, any logic
Raw Data Preserved?Only if explicitly designed
ApproachELT
Transform LocationInside the warehouse (SQL, dbt)
Best ForAnalytics, reporting, most modern workloads
FlexibilitySQL-centric, limited to warehouse capabilities
Raw Data Preserved?Always — raw data lands first

For a deeper look at how top data companies approach these challenges, see our profiles of Databricks and Snowflake — two companies building the platforms that data engineers use every day. Our Working at Databricks and Databricks Compensation guides cover what it is like to be a data engineer at one of the most respected data companies in the industry.

Frequently Asked Questions

What are the most important topics for a data engineer interview?+
The most important topics are SQL and data modeling (normalization, star schema, slowly changing dimensions), ETL/ELT pipeline design (Airflow, dbt, Spark), data warehousing (Snowflake, BigQuery, Redshift), system design for data infrastructure, and Python programming (PySpark, pandas). Behavioral questions about working with stakeholders and handling data quality incidents are also common. SQL remains the single most tested skill across all data engineering interviews.
What is the difference between ETL and ELT?+
ETL (Extract, Transform, Load) transforms data before loading it into the target system, typically using an external processing engine. ELT (Extract, Load, Transform) loads raw data into the target warehouse first, then transforms it in place using the warehouse's compute power. ELT has become the dominant pattern because modern cloud warehouses have massive compute capacity, making it cheaper and faster to transform data after loading. ELT also preserves raw data for future reprocessing and supports more agile analytics workflows.
How much do data engineers make in 2026?+
Data engineer salaries vary significantly by company and level. At top tech companies and AI labs, total compensation ranges from $180,000 to $350,000+ for mid-level to senior roles. Well-funded startups typically offer $150,000 to $280,000. Staff and principal data engineers at top companies can exceed $400,000 in total compensation. Use our salary negotiation calculator to estimate your market value based on level, location, and company type.
Should I learn Spark or dbt for data engineering interviews?+
Both are important but serve different purposes. dbt is essential for analytics engineering and SQL-based transformations within the warehouse. Spark is critical for large-scale data processing and ETL pipelines that handle terabyte-scale datasets. For most data engineering roles, you should be comfortable with both. If forced to prioritize, learn dbt for analytics-heavy roles and Spark for infrastructure-heavy or ML pipeline roles.
What system design topics come up in data engineering interviews?+
Common topics include designing end-to-end data pipelines (batch and streaming), data lake architecture (medallion/bronze-silver-gold pattern), real-time analytics systems, change data capture (CDC) pipelines, data quality monitoring frameworks, and schema evolution strategies. Interviewers evaluate your ability to make trade-offs between consistency and latency, choose appropriate storage formats, design for fault tolerance and idempotency, and estimate data volumes.

Find your next data engineering role

Browse data engineering jobs at companies with strong engineering cultures, transparent comp, and real growth paths.

Browse Data Engineering Jobs → Explore Company Cultures →