1) “Walk me through an ML project you shipped end-to-end.”
Answer (structure):
“I start with the business goal + success metric, then define the dataset and leakage risks. I do a simple baseline first, then iterate on features/model with proper validation. I pick metrics that match the cost of mistakes, and I check calibration and slice performance (by region/device/new users). After that: deployment plan, monitoring (data drift + KPI drift), retraining triggers, and a short post-launch review to capture learnings.”
2) “How do you know if you have bias vs variance, and what do you do?”
Answer:
“If train score is high but validation/test is much lower → high variance (overfitting). Fix: more data, stronger regularization, simpler model, early stopping, better CV.
If both train and validation are low → high bias (underfitting). Fix: richer features, more flexible model, reduce regularization, longer training.”
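A minimal sketch of this diagnosis, assuming a scikit-learn workflow; the dataset and model below are illustrative stand-ins:

```python
# Diagnose bias vs variance by comparing train and cross-validation scores.
# Illustrative data/model only -- swap in the real pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0)

scores = cross_validate(model, X, y, cv=5, scoring="roc_auc", return_train_score=True)
print(f"train AUC={scores['train_score'].mean():.3f}  val AUC={scores['test_score'].mean():.3f}")
# Big gap (train >> val) -> high variance: regularize, simplify, add data.
# Both low               -> high bias: richer features, more flexible model.
```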
3) “What’s data leakage? Give a real example.”
Answer:
“Leakage is when training uses information that wouldn’t exist at prediction time. Example: using ‘delivered_date’ to predict ‘late delivery’, or including future aggregates (like next week’s total spend) in features. Prevention: strict time-based splits, clear feature timestamping, and a feature store or pipeline rules that enforce ‘only past data’.”
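A minimal sketch of the ‘only past data’ discipline, assuming a pandas workflow; the column names and toy data are hypothetical:

```python
# Time-based split: train on the past, validate on the most recent slice.
# Column names (order_date, late) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.date_range("2025-01-01", periods=10, freq="W"),
    "late": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
}).sort_values("order_date")

split_at = int(len(df) * 0.8)                    # last ~20% of the timeline is held out
train, test = df.iloc[:split_at], df.iloc[split_at:]

# Features must only use information known at order time, e.g. rolling stats over
# *past* orders -- never post-outcome columns like 'delivered_date'.
print(len(train), len(test))
```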
4) “For an imbalanced classification problem, what do you change first?”
Answer:
“I start with the right metric: PR-AUC / recall at precision, not accuracy. Then I try class weights or focal loss, tune the threshold based on business cost, and validate with stratified or time-aware CV. If needed: smart sampling, better features, and calibration so predicted probabilities are usable.”
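A minimal sketch of that first pass, assuming scikit-learn; the data and the 0.5 precision floor are illustrative choices:

```python
# Class weights + PR-AUC + a cost-aware threshold, on an illustrative imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("PR-AUC:", average_precision_score(y_te, proba))

# Threshold: keep precision above a business floor (0.5 here is a placeholder cost choice).
prec, rec, thr = precision_recall_curve(y_te, proba)
ok = prec[:-1] >= 0.5
threshold = thr[ok][0] if ok.any() else 0.5
print("chosen threshold:", threshold)
```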
5) “Explain precision vs recall like you’re talking to a product manager.”
Answer:
“Precision: when we alert, how often we’re correct. Recall: out of all real cases, how many we catch. If false alarms are expensive, prioritize precision. If missing a case is expensive (fraud/safety), prioritize recall—then use thresholds and guardrails to control damage.”
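A tiny worked example with made-up numbers that makes the trade-off concrete:

```python
# Made-up numbers: the model raised 100 alerts, 80 were real; 200 real cases existed in total.
true_positives, false_positives, false_negatives = 80, 20, 120

precision = true_positives / (true_positives + false_positives)  # 0.80: alerts we can trust
recall = true_positives / (true_positives + false_negatives)     # 0.40: real cases we caught
print(f"precision={precision:.2f}, recall={recall:.2f}")
```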
6) “When would you choose logistic regression over XGBoost?”
Answer:
“When interpretability, stability, and latency matter—and the relationship is roughly linear with good feature engineering. It’s also easier to debug, less likely to overfit on small data, and faster to retrain. If interactions and non-linearity drive performance, boosting usually wins.”
7) “How do you pick features for categorical variables?”
Answer:
“For low-cardinality: one-hot. For high-cardinality: target encoding with leakage-safe CV, hashing, or learned embeddings. I always check how rare and unseen categories are handled and whether the encoding stays stable over time.”
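A minimal sketch of leakage-safe (out-of-fold) target encoding; the column names and toy data are hypothetical:

```python
# Out-of-fold target encoding for a high-cardinality column (leakage-safe).
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "merchant_id": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "target":      [1,   0,   1,   0,   1,   0,   0,   1],
})

global_mean = df["target"].mean()
df["merchant_te"] = global_mean  # fallback for rare/unseen categories

for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Category means come from the training fold only, then get applied to the held-out fold.
    fold_means = df.iloc[tr_idx].groupby("merchant_id")["target"].mean()
    df.loc[df.index[val_idx], "merchant_te"] = (
        df.iloc[val_idx]["merchant_id"].map(fold_means).fillna(global_mean).values
    )
print(df)
```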
8) “Design a model to predict churn. What does your system look like?”
Answer:
- Define the churn window + prediction horizon (e.g., predict churn in the next 30 days).
- Build features from behavior up to ‘today’ only to avoid leakage (see the sketch after this list).
- Train with time-based splits.
- Serve via batch scoring daily/weekly, or real-time if needed.
- Add monitoring: data freshness, feature drift, calibration drift, and KPI impact (retention uplift).
- Retrain monthly or on drift triggers.
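A minimal sketch of the feature/label windows, assuming a pandas events table; the toy data, column names, and dates are hypothetical:

```python
import pandas as pd

# Toy events table; a real pipeline would read this from the warehouse.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 1],
    "event_date": pd.to_datetime([
        "2025-11-02", "2025-12-20", "2025-10-05", "2026-01-10",
        "2025-12-28", "2026-01-25", "2026-02-15",
    ]),
})
cutoff = pd.Timestamp("2026-01-01")               # 'today' for this training snapshot
horizon = pd.Timedelta(days=30)

past = events[events["event_date"] <= cutoff]     # features: past behavior only
future = events[(events["event_date"] > cutoff) &
                (events["event_date"] <= cutoff + horizon)]

features = past.groupby("customer_id").agg(
    n_events_90d=("event_date", lambda s: (s > cutoff - pd.Timedelta(days=90)).sum()),
    days_since_last=("event_date", lambda s: (cutoff - s.max()).days),
)
# Label: no activity in the next 30 days -> churned.
features["churned"] = ~features.index.isin(future["customer_id"].unique())
print(features)
# Repeat for several cutoffs and keep the later cutoffs as validation (time-based split).
```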
9) “Your AUC improved, but business KPI got worse. How is that possible?”
Answer:
“Model metric improvements don’t guarantee KPI improvements. Common reasons: threshold not tuned to costs, distribution shift, worse performance on key slices (new users, high-value customers), feedback loops, or the model is better at easy cases but worse on the cases that matter. I’d audit slices, recalibrate, re-tune threshold, and run an online A/B test with guardrails.”
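A minimal sketch of the slice audit; the segment names and scored data here are synthetic placeholders for a real scored holdout:

```python
# Audit AUC by slice to find where the "better" model is actually worse.
# Columns (segment, y_true, score_old, score_new) are hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["new_users", "returning", "high_value"], size=3000),
    "y_true": rng.integers(0, 2, size=3000),
    "score_old": rng.random(3000),
    "score_new": rng.random(3000),
})

for segment, g in df.groupby("segment"):
    old = roc_auc_score(g["y_true"], g["score_old"])
    new = roc_auc_score(g["y_true"], g["score_new"])
    print(f"{segment:>12}: old={old:.3f} new={new:.3f} delta={new - old:+.3f}")
```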
10) “Explain p-value and confidence interval without hand-waving.”
Answer:
“The p-value is the probability of seeing a result at least this extreme if the null hypothesis is true; it is not the probability that the null is true. A 95% confidence interval is constructed so that, under repeated sampling, about 95% of such intervals would contain the true effect size. In A/B testing I focus on effect size + CI, power, and practical significance, not just p < 0.05.”
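A worked example for the A/B-testing point, with made-up counts: a normal-approximation 95% CI for the difference in conversion rates.

```python
# 95% CI for the lift in an A/B test (two-proportion normal approximation).
# The counts are made up for illustration.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 480, 10_000   # control:   4.8% conversion
conv_b, n_b = 530, 10_000   # treatment: 5.3% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z = norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)
print(f"lift={diff:.4f}, 95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
# A CI that excludes 0 but covers only a tiny lift can be statistically significant
# yet practically irrelevant -- which is why effect size matters, not just p < 0.05.
```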
11) “Write a SQL query: top 3 products by revenue per category.”
Answer (what you’d say):
“I’d aggregate revenue by category/product, then rank within each category with a window function (ROW_NUMBER or DENSE_RANK) and keep rank <= 3. I’d also confirm the revenue definition (net vs gross, refunds handling) and the time window.”
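A sketch of the query itself, run against a tiny in-memory SQLite table; the table and column names (sales, category, product, revenue) are assumptions.

```python
# Top 3 products by revenue per category, via a CTE + window function.
import sqlite3

query = """
WITH product_rev AS (
    SELECT category, product, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY category, product
),
ranked AS (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS rnk
    FROM product_rev
)
SELECT category, product, total_revenue
FROM ranked
WHERE rnk <= 3
ORDER BY category, total_revenue DESC;
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (category TEXT, product TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("toys", "car", 100), ("toys", "ball", 80), ("toys", "doll", 60), ("toys", "kite", 40),
    ("books", "sci-fi", 200), ("books", "poetry", 50),
])
for row in con.execute(query):
    print(row)
```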
12) “How do you handle missing data?”
Answer:
“First I ask why it’s missing (MCAR/MAR/MNAR). For simple baselines: median/mode imputation plus a missing indicator. For tree-based models: gradient boosting libraries often handle missing values natively, and simple imputation is usually fine otherwise. For time series: forward-fill with care. If missingness is informative, the missing flag itself can be a strong feature.”
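A minimal sketch of the baseline approach, assuming scikit-learn:

```python
# Median imputation plus a "was missing" indicator column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan],
              [4.0, 14.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
# Output: imputed features followed by binary missing-indicator columns,
# so informative missingness is preserved as a feature.
print(X_out)
```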
GenAI / LLM questions you’re likely to get in 2026
13) “What is RAG, and when would you use it instead of fine-tuning?”
Answer:
“RAG retrieves relevant documents at query time and feeds them to the LLM, so answers can be grounded in current or private data. I use RAG when facts must be correct and up-to-date, and when the knowledge changes often. I consider fine-tuning when I need consistent style, domain behavior, or task patterns—not just new facts.”
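A minimal sketch of the RAG pattern; the TF-IDF retriever is a stand-in for whatever vector/hybrid search is in place, and the LLM call is left as a placeholder:

```python
# Retrieve, then ground the prompt in what was retrieved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Refunds are processed within 5 business days.",
    "Premium users get 24/7 chat support.",
    "Passwords must be at least 12 characters.",
]
question = "How long do refunds take?"

vec = TfidfVectorizer().fit(docs)
sims = cosine_similarity(vec.transform([question]), vec.transform(docs))[0]
context = docs[sims.argmax()]                  # stand-in for real vector/hybrid retrieval

prompt = (
    "Answer using ONLY the context below. If the answer is not there, say 'not found'.\n"
    f"Context: {context}\nQuestion: {question}"
)
# response = llm.generate(prompt)  # placeholder for whatever LLM client the team uses
print(prompt)
```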
14) “How do you evaluate a RAG system?”
Answer:
“I break it into retrieval + generation.
- Retrieval: recall@k, MRR, ‘did we fetch the right source?’ (see the sketch after this answer).
- Generation: answer correctness vs references, faithfulness/grounding, and refusal quality when docs don’t support an answer.
Then I run an error log: bad chunking, weak queries, wrong filters, or the model ignoring context.”
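A minimal sketch of the retrieval half; the hand-labeled eval structure (query, relevant doc ids, ranked results) is an assumption:

```python
# recall@k and MRR over a small hand-labeled retrieval eval set.
def recall_at_k(relevant, ranked, k=5):
    return len(set(relevant) & set(ranked[:k])) / len(relevant)

def mrr(relevant, ranked):
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

eval_set = [
    {"relevant": ["doc_7"], "ranked": ["doc_2", "doc_7", "doc_9"]},
    {"relevant": ["doc_1"], "ranked": ["doc_1", "doc_4", "doc_5"]},
]
print("recall@5:", sum(recall_at_k(e["relevant"], e["ranked"]) for e in eval_set) / len(eval_set))
print("MRR:     ", sum(mrr(e["relevant"], e["ranked"]) for e in eval_set) / len(eval_set))
```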
15) “Why do LLMs hallucinate, and how do you reduce it?”
Answer:
“Hallucination is often the model filling gaps when context is missing or ambiguous. I reduce it with better retrieval (hybrid search), stricter prompting (instruct the model to cite sources and say ‘not found’ when the context doesn’t contain the answer), constrained decoding/guardrails, and evaluation sets that include ‘unanswerable’ questions. For high-risk use cases, I add verification steps or tool-based checks.”
16) “Prompting vs fine-tuning vs adapters (LoRA)—how do you choose?”
Answer:
“Prompting is fastest for prototyping. Fine-tuning/LoRA is for consistent behavior at scale (format, tone, domain reasoning patterns). RAG is for fresh/private knowledge. In practice, I often do RAG + good prompts first, then add LoRA if we need reliability and reduced token cost.”
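A minimal sketch of the ‘add LoRA later’ step, assuming Hugging Face transformers + peft; the base model name and target modules are placeholders for whatever the team actually uses:

```python
# Attach LoRA adapters so only a small fraction of weights is trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder model
config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of the base weights
```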



