Data analyst interviews in 2026 test a broad range of skills: SQL fluency, Python data manipulation, statistical reasoning, business acumen, and communication. Whether you are interviewing at a startup or a large enterprise, the core question types are remarkably consistent. This guide covers 50 questions across every category you will encounter, with detailed answers that explain not just the what, but the why.
We organized the questions by category so you can focus your preparation on the areas where you need the most work. If you are short on time, start with SQL (it appears in every data analyst interview) and business case studies (they are the hardest to prepare for on the fly). For company-specific preparation, use our Culture Fit Interview Questions tool to generate targeted questions.
SQL Questions (1–10)
SQL is the single most important skill in any data analyst interview. Every company tests it, and the depth of questions scales with the seniority of the role.
Q1 What is the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN?
INNER JOIN returns only rows where there is a match in both tables. LEFT JOIN returns all rows from the left table and matched rows from the right (NULLs where no match). RIGHT JOIN is the mirror. FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match on either side. In practice, you will use INNER and LEFT JOINs 95% of the time. Key follow-up: "When would a LEFT JOIN give you more rows than an INNER JOIN?" Answer: when the right table has no matching rows for some left-table rows.
Q2 Write a query to find the second-highest salary in an employees table.
The cleanest approach uses DENSE_RANK():
SELECT salary FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
Using DENSE_RANK() instead of RANK() or ROW_NUMBER() handles ties correctly. If two employees share the highest salary, the second-highest is still the next distinct value.
Q3 Explain window functions. Write a query using a running total.
Window functions perform calculations across a set of rows related to the current row, without collapsing them into a single output row. They use the OVER() clause:
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) as running_total
FROM orders;
The ORDER BY inside OVER() defines the frame. Add PARTITION BY to restart the calculation per group (e.g., per customer).
Q4 What is the difference between WHERE and HAVING?
WHERE filters individual rows before aggregation. HAVING filters groups after aggregation. Example: "Show departments with more than 10 employees" requires HAVING COUNT(*) > 10. A common mistake: writing WHERE COUNT(*) > 10 — this errors because WHERE executes before GROUP BY.
Q5 Write a query to find users who made purchases on consecutive days.
This tests LAG() for sequential analysis:
SELECT DISTINCT user_id FROM (
SELECT user_id, purchase_date,
LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date) as prev_date
FROM purchases
) t
WHERE purchase_date - prev_date = 1;
LAG() grabs the previous row's value within each user's partition. Date arithmetic syntax varies by database (this is PostgreSQL-style).
Q6 What is a Common Table Expression (CTE) and when would you use one?
A CTE is a temporary named result set defined with WITH. It makes complex queries more readable by breaking them into logical steps. Use CTEs when you need to reference the same subquery multiple times, when a query has 3+ levels of nesting, or when you want to make logic self-documenting. CTEs are primarily a readability tool — in most databases, the optimizer may inline them.
Q7 How do you handle NULL values in SQL?
NULLs represent missing or unknown data. Any comparison with NULL returns NULL (not TRUE or FALSE), so WHERE col = NULL never returns rows — use WHERE col IS NULL. NULLs are excluded from aggregate functions like COUNT(col), AVG(col), SUM(col). Use COALESCE(col, default) to replace NULLs. Use COUNT(*) vs COUNT(col) deliberately — the former counts all rows, the latter counts non-NULL values only.
Q8 Write a query to calculate month-over-month revenue growth.
Combines date aggregation with window functions:
WITH monthly AS (
SELECT DATE_TRUNC('month', order_date) as month,
SUM(revenue) as total_revenue
FROM orders GROUP BY 1
)
SELECT month, total_revenue,
ROUND(100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY month))
/ LAG(total_revenue) OVER (ORDER BY month), 1) as growth_pct
FROM monthly;
Q9 What is the difference between UNION and UNION ALL?
UNION combines result sets and removes duplicates. UNION ALL keeps all rows including duplicates. UNION ALL is faster because it skips deduplication. Use UNION ALL by default unless you specifically need deduplication.
Q10 Write a query to find the top 3 products by revenue in each category.
SELECT category, product_name, revenue FROM (
SELECT category, product_name, revenue,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rn
FROM products
) ranked
WHERE rn <= 3;
Follow-up: "What if you want ties included?" Use DENSE_RANK() instead of ROW_NUMBER().
Python/Pandas Questions (11–18)
Python has become a standard requirement for data analysts. These questions focus on pandas DataFrame operations and data manipulation.
Q11 How do you handle missing values in a pandas DataFrame?
Use df.isnull().sum() to check for missing values per column. Key methods: df.dropna() removes rows with any NaN, df.fillna(value) replaces NaN with a specified value, df.fillna(method='ffill') forward-fills. For numerical columns, filling with the median is usually better than the mean (less sensitive to outliers). Always document your imputation strategy.
Q12 What is the difference between merge, join, and concat in pandas?
pd.merge() is the most flexible — it works like SQL JOIN with explicit on, how, and left_on/right_on parameters. df.join() joins on the index by default. pd.concat() stacks DataFrames vertically (axis=0) or horizontally (axis=1). Use merge() for key-based joins, concat() for appending datasets with the same structure.
Q13 How would you group data and apply multiple aggregations?
Use groupby() with agg():
df.groupby('category').agg(
total_revenue=('revenue', 'sum'),
avg_order=('revenue', 'mean'),
num_orders=('order_id', 'nunique')
).reset_index()
Always call .reset_index() if you want the grouping column as a regular column rather than the index.
Q14 What is the difference between apply, map, and applymap?
map() works element-wise on a Series. apply() works row-wise or column-wise on a DataFrame. applymap() (renamed to map() on DataFrame in pandas 2.1+) works element-wise on an entire DataFrame. Performance tip: avoid apply() when vectorized operations exist — df['col'] * 2 is faster than df['col'].apply(lambda x: x*2).
Q15 How do you pivot and unpivot data in pandas?
# Wide to long
pd.melt(df, id_vars=['product'], value_vars=['jan','feb','mar'],
var_name='month', value_name='sales')
# Long to wide
df.pivot_table(index='product', columns='month',
values='sales', aggfunc='sum')
Q16 How would you detect and remove duplicate rows?
Use df.duplicated() to identify duplicates and df.drop_duplicates() to remove them. Key parameter: subset lets you define which columns determine uniqueness. keep='first' (default) keeps the first occurrence, keep=False removes all duplicates. Always check duplicates before removing them.
Q17 How do you create a new column based on conditional logic?
# Simple binary
df['tier'] = np.where(df['revenue'] > 1000, 'high', 'low')
# Multiple conditions
conditions = [df['revenue'] > 5000, df['revenue'] > 1000]
choices = ['enterprise', 'mid-market']
df['tier'] = np.select(conditions, choices, default='smb')
Avoid chained .apply() with if/else — vectorized operations are 10–100x faster on large DataFrames.
Q18 How do you optimize pandas operations for large datasets?
Key strategies: (1) Use appropriate dtypes — category for low-cardinality strings. (2) Read only needed columns: pd.read_csv('file.csv', usecols=['col1','col2']). (3) Use chunksize to process large files in batches. (4) Replace iterrows() with vectorized operations. (5) Use .query() for complex filtering. (6) Consider polars for truly large datasets — it is often 5–10x faster than pandas for aggregation-heavy workflows.
Statistics Questions (19–26)
Statistics questions test your ability to reason about data rigorously. The key is explaining concepts clearly with practical examples.
Q19 Explain p-values in plain language. What does a p-value of 0.03 mean?
A p-value is the probability of observing results as extreme as the actual data, assuming the null hypothesis is true. A p-value of 0.03 means: "If there truly were no effect, there is a 3% chance of seeing data this extreme by random chance alone." It does NOT mean there is a 97% chance the effect is real. Common pitfall: a small p-value does not tell you anything about the size or practical importance of the effect. Always pair statistical significance with effect size.
Q20 What is the difference between Type I and Type II errors?
Type I error (false positive): concluding there is an effect when there is not. Example: declaring an A/B test winner when the variant is actually no better. Type II error (false negative): failing to detect an effect that exists. Example: declaring "no significant difference" when your sample size was too small. There is a trade-off: reducing Type I errors (lower alpha) increases Type II errors.
Q21 How do you determine sample size for an A/B test?
Sample size depends on four inputs: (1) baseline conversion rate, (2) minimum detectable effect (MDE), (3) significance level (alpha, typically 0.05), and (4) statistical power (typically 0.80). Key insight: detecting smaller effects requires exponentially more data. Detecting a 1% lift requires roughly 16x the sample size of detecting a 4% lift. Always calculate sample size BEFORE running the test.
Q22 Explain the Central Limit Theorem and why it matters.
The CLT states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the underlying population distribution. This justifies using normal-distribution-based tests (z-tests, t-tests) on data that is not itself normally distributed, as long as the sample is large enough (typically n > 30). Even if your revenue data is heavily skewed, the average revenue across many samples will be approximately normal.
Q23 What is the difference between correlation and causation?
Correlation means two variables move together. Causation means one directly influences the other. Classic example: ice cream sales and drowning deaths are positively correlated, but ice cream does not cause drowning — both increase in summer (confounding variable). To establish causation, you need a randomized controlled experiment (like an A/B test) or careful causal inference techniques. In a data analyst role, your job is often to flag when stakeholders confuse the two.
Q24 When would you use a t-test vs. a chi-squared test vs. ANOVA?
t-test: comparing means of two groups (e.g., A/B test). Chi-squared test: testing relationships between categorical variables (e.g., is conversion rate independent of marketing channel?). ANOVA: comparing means across 3+ groups (e.g., revenue across 5 pricing plans). If ANOVA shows significance, use post-hoc tests (Tukey's HSD) to determine which pairs differ.
Q25 What is selection bias and how do you avoid it?
Selection bias occurs when your sample is not representative of the population. Examples: analyzing only survey completers (survivorship bias), studying treatment effects only for patients who stayed in a trial (attrition bias). Mitigation: randomized assignment, intention-to-treat analysis, stratified sampling, and always asking "who is missing from this dataset and why?"
Q26 Explain what R-squared means.
R-squared measures the proportion of variance in the dependent variable explained by the model. An R-squared of 0.75 means the model explains 75% of the variation. There is no universal "good" value — it depends on the domain. In physical sciences, >0.95 is expected. In behavioral data, 0.30 might be excellent. Caution: R-squared always increases when you add variables, even noise. Use adjusted R-squared to penalize unnecessary complexity.
Business Case Studies (27–34)
Case studies test your ability to structure analysis around a business problem. Interviewers evaluate your framework, metric choices, and how you communicate trade-offs.
Q27 User signups increased 20% last month but revenue stayed flat. What would you investigate?
Framework: (1) Segment new signups by source — are they from a lower-quality channel? (2) Check conversion rates from signup to paid. (3) Analyze revenue per user — are new users selecting cheaper plans? (4) Check if existing customer revenue declined (churn, downgrades). (5) Look at time-to-revenue — new signups may not monetize in month one. Decompose: revenue = users x conversion rate x ARPU, then identify which component underperforms.
Q28 How would you design an A/B test for a new checkout flow?
(1) Primary metric: checkout completion rate. Secondary: time to complete, cart value. (2) Hypothesis: "The new flow will increase checkout completion by at least 3%." (3) Calculate sample size. (4) Randomize users (not sessions). (5) Run for at least 2 full business cycles. (6) Check for novelty effects. (7) Segment results by device, user tenure, and cart value to identify heterogeneous effects.
Q29 DAU dropped 15% yesterday. Walk me through your investigation.
(1) Check if it is a data issue — did the tracking pipeline run? (2) Check calendar effects: holiday, weekend. Compare to same day last week and last year. (3) Check by platform: iOS, Android, web. Platform-specific drops suggest deployment issues. (4) Check by geography. (5) Check by segment: new vs. returning. (6) Check recent feature launches or A/B tests. (7) Check external factors: competitor launch, app store issues. Always start with the simplest explanation.
Q30 How would you measure the success of a recommendation engine?
Multi-layer: Engagement: CTR on recommended items, time spent. Conversion: % of recommended items purchased. Discovery: catalog coverage, recommendation diversity. Satisfaction: user ratings, explicit feedback. Business: incremental revenue (A/B test recs ON vs. OFF). Key trade-off: optimizing purely for CTR leads to clickbait. Include diversity and satisfaction metrics.
Q31 How would you define and measure customer churn for a SaaS product?
Churn = customers who cancel or fail to renew in a given period. Monthly churn rate = customers lost / customers at start. Nuances: (1) Distinguish voluntary vs. involuntary churn. (2) Revenue churn vs. logo churn. (3) Net revenue churn accounts for expansion. (4) Cohort analysis is more useful than aggregate churn. (5) Leading indicators: declining usage, fewer logins, support ticket increase.
Q32 You are given user activity logs. How would you segment users?
Start with RFM (Recency, Frequency, Monetary) as baseline. Beyond RFM: (1) Behavioral clustering with k-means on feature vectors. (2) Engagement tiers: power (daily), regular (weekly), casual (monthly), dormant. (3) Lifecycle stage: new, activated, retained, at-risk. The right segmentation depends on the business question. Always start with "What decision will these segments inform?"
Q33 How would you evaluate whether a pricing change was successful?
(1) Define success criteria before the change. (2) Compare against a holdout group if possible. (3) Look at short-term and long-term effects — a price increase might boost revenue but increase churn over 3–6 months. (4) Segment by customer type. (5) Monitor leading indicators: trial-to-paid conversion, demo requests. (6) Calculate price elasticity: % change in demand / % change in price.
Q34 How would you measure the impact of a marketing campaign?
Gold standard: randomized holdout test. If not possible: (1) Pre/post analysis with seasonality controls. (2) Geographic lift test. (3) Incrementality analysis. (4) Multi-touch attribution. Key metric: incremental cost per acquisition = campaign cost / incremental conversions. Always distinguish correlation from causation.
Behavioral Questions (35–42)
Behavioral questions assess communication, problem-solving, and stakeholder management. Use the STAR method for structured answers.
Q35 Tell me about a time you communicated complex analysis to a non-technical audience.
Good answers show you (1) led with the business insight, not the methodology, (2) used visualizations instead of tables, (3) anticipated questions, (4) confirmed understanding by asking for feedback. The interviewer is assessing whether you can bridge the gap between data and decisions.
Q36 Describe a time your analysis was wrong. What happened?
Strong answers include: what the mistake was, how you discovered it, how you communicated it, and what process changes you made. Interviewers want self-awareness and accountability, not perfection.
Q37 How do you prioritize multiple requests from different stakeholders?
Framework: (1) Understand business impact and urgency. (2) Estimate effort. (3) Communicate timelines transparently. (4) Group quick wins. (5) For conflicts, escalate with a recommendation. The key signal: you prioritize based on business impact, not who shouts loudest.
Q38 Tell me about a time you disagreed with a stakeholder's interpretation of data.
Strong answers show you (1) listened first, (2) identified the specific disagreement, (3) presented evidence, (4) focused on the business decision. If you were wrong, say so — that shows more maturity than only sharing vindication stories.
Q39 How do you approach analysis when data quality is poor?
(1) Quantify the quality issues. (2) Document every assumption. (3) Conduct sensitivity analysis. (4) Communicate uncertainty ranges, not point estimates. (5) Recommend data collection improvements. Never present results from poor data without disclosing limitations.
Q40 What is your process when starting analysis on a new dataset?
Structured exploration: (1) Understand business context. (2) Check schema, types, row counts. (3) Profile each column: distributions, cardinality, null rates. (4) Check for duplicates. (5) Validate against known facts. (6) Look for outliers. (7) Document data quality issues before starting analysis.
Q41 How do you stay current with new tools and techniques?
Interviewers want genuine learning habits, not a list of MOOCs. Strong answers mention: following specific practitioners, building side projects, applying new techniques to work problems, and having opinions about tools. Example: "I switched from pandas to polars for a large dataset and reduced our ETL runtime from 45 minutes to 6 minutes."
Q42 What strengths do you bring to a data team?
Avoid generic answers. Identify 2–3 specific strengths with evidence. Example: "I bridge data engineering and analytics — I can write production-quality SQL that runs in dbt, not just ad-hoc queries." Pick strengths relevant to the specific role.
Take-Home Assignments (43–46)
Many data analyst interviews include a take-home assignment. Here is what to expect and how to present your work.
Q43 What should you include in a take-home data analysis presentation?
Structure: (1) Executive summary — the answer in 2–3 sentences, placed first. (2) Methodology — approach, tools, assumptions. (3) Findings — 3–5 insights with visualizations. (4) Limitations — what could go wrong, what data would improve it. (5) Recommendations — specific, actionable next steps. (6) Appendix — code, additional charts, methodology details.
Q44 How do you handle a take-home dataset that is messy?
This is often intentional. Document every cleaning step. Show your work: "Column X had 12% null values. I verified they were missing at random by checking distributions across segments. I imputed with the segment median." Interviewers who send messy data are specifically evaluating your data cleaning discipline.
Q45 What makes a good data visualization for a take-home?
Rules: (1) Every chart must have a title that states the insight, not just the metric. (2) Label axes, include units. (3) Use color purposefully — highlight the insight. (4) Choose the right chart type: bar for comparisons, line for trends, scatter for relationships. (5) Remove clutter: no 3D effects, minimal legends.
Q46 How much time should you spend on a take-home assignment?
Respect the stated time limit (usually 2–4 hours). Going over shows poor time management. Allocate roughly: 30% exploring and cleaning, 40% analysis and visualization, 30% writing up and polishing. A focused analysis of the most important question beats superficial coverage of everything.
Questions to Ask the Interviewer (47–50)
Thoughtful questions demonstrate genuine interest. Use our Culture Fit Interview Questions tool for more targeted questions for any company.
Q47 What does the data stack look like, and how much influence does the analytics team have over tool selection?
This reveals both the technical environment and the team's organizational influence. Listen for: modern stack (cloud warehouse, dbt, version-controlled SQL) vs. legacy (Excel, manual ETL).
Q48 Can you walk me through a recent analysis that changed a business decision?
Tests whether the team's work actually influences decisions. If the interviewer struggles to give a concrete example, that is a signal about the team's influence.
Q49 How does the team balance ad-hoc requests with proactive analysis?
You want to hear: a process for intake and prioritization, dedicated time for exploratory analysis, and a roadmap for building self-serve tools to reduce ad-hoc volume.
Q50 What does success look like for this role in the first 90 days?
The answer reveals expectations, onboarding process, and whether the role has clear deliverables. Vague answers may indicate the role is not well-defined. Strong answers include specific projects and metrics.
Preparation Tip
"The best data analyst candidates do not just answer questions correctly — they explain their reasoning, acknowledge limitations, and connect technical answers to business impact. Practice explaining your thought process out loud."
For more preparation, explore our guide to structured interviews, our interview prep timer tool, and the full interview prep article library covering 100+ companies. Also see our guides on REST API interview questions and system design interview questions for technical role preparation.
Frequently Asked Questions
What SQL topics are most commonly tested in data analyst interviews?+
The most commonly tested topics are JOINs, GROUP BY with aggregates, window functions (ROW_NUMBER, RANK, LAG, LEAD, running totals), subqueries and CTEs, CASE statements, date manipulation, and NULL handling. Window functions are the number one topic that separates junior from senior candidates.
Do data analyst interviews include coding in Python?+
Yes, most 2026 data analyst interviews include Python questions focused on pandas and data manipulation. Expect DataFrame operations (merge, groupby, pivot, melt), data cleaning, and basic visualization. Senior roles may include more complex transformations.
What statistics concepts should I know?+
Core concepts: hypothesis testing, p-values, Type I/II errors, probability distributions, descriptive statistics, correlation vs. causation, A/B testing methodology, and regression basics. Focus on explaining these in plain language with business examples.
How should I approach case study questions?+
Framework: (1) Clarify the business objective. (2) Identify key metrics. (3) Describe needed data. (4) Outline your analysis approach. (5) Discuss limitations. (6) Present how you would communicate findings. The interviewer cares more about your framework than the right answer.
What tools should a data analyst know in 2026?+
Core: SQL, Python with pandas, a BI tool (Tableau, Looker, Power BI), Excel/Sheets, basic git. Bonus: dbt, Airflow, and cloud warehouses (Snowflake, BigQuery, Redshift).
How long does a typical data analyst interview process take?+
Typically 2 to 4 weeks with 3 to 5 stages: recruiter screen, technical phone screen, take-home or live coding, on-site loop, and sometimes a final presentation. Startups are faster (1–2 weeks), larger companies may take 4–6 weeks.