Technical Assessments for Data Scientists That Aren't Just SQL Trivia
What "data scientist" actually means at your company
Before designing the assessment, name the role honestly. The label covers wildly different jobs:
- Analytics-leaning DS. SQL, dashboards, experiment analysis, stakeholder communication.
- ML-leaning DS. Model training, feature engineering, evaluation, sometimes productionization.
- Research-leaning DS. Novel modelling, statistical rigor, publication-quality work.
A single test cannot measure all three, so deciding which flavor you are hiring for is the first decision.
Assessment shapes by role flavor
Analytics DS
Give them a messy dataset (a ~10 MB CSV, intentionally seeded with duplicates, nulls, and a subtle definition mismatch in one column). Ask three business questions of increasing ambiguity:
- Concrete: "What's the 7-day retention rate?"
- Slightly ambiguous: "Has retention changed since feature X launched?"
- Open: "What in this data should the product team know about?"
Score: SQL/Python correctness on Q1 (sketched below), statistical reasoning on Q2, judgment and communication on Q3.
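For grading Q1, a reference computation helps calibrate reviewers. A minimal sketch in pandas, assuming hypothetical columns user_id, signup_date, and event_date; the real take-home columns will differ:

```python
# Reference sketch for Q1, assuming hypothetical columns user_id,
# signup_date, and event_date. Column names are illustrative only.
import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["signup_date", "event_date"])

# Handle the seeded data issues before computing anything.
events = events.drop_duplicates()
events = events.dropna(subset=["user_id", "signup_date", "event_date"])

# Days since signup for each event.
days_since_signup = (events["event_date"] - events["signup_date"]).dt.days

# A user counts as day-7 retained if they have any event exactly 7 days
# after signup (one common definition; candidates should state theirs).
retained_users = events.loc[days_since_signup == 7, "user_id"].unique()
all_users = events["user_id"].unique()

retention_7d = len(retained_users) / len(all_users)
print(f"7-day retention: {retention_7d:.1%}")
```

What to grade here is not the final number but whether the candidate deduplicated, dropped or justified the null rows, and stated which definition of "retained" they used.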
ML DS
A tabular dataset with a target. 90 minutes. Notebook environment.
Score: feature engineering choices, model evaluation methodology (not final metric — how they evaluated), awareness of leakage and overfitting, communication of trade-offs in a short writeup.
The metric does not matter. A candidate who gets 0.82 AUC with a clean cross-validation setup beats a candidate who gets 0.91 by leaking the target through a feature.
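What "clean" looks like in practice: every preprocessing step is fitted inside the training folds, never on the full dataset. A minimal sketch with scikit-learn, using a synthetic stand-in for the take-home data:

```python
# A minimal sketch of the evaluation setup worth rewarding: preprocessing
# lives inside a Pipeline, so imputation and scaling are fit only on the
# training folds, never on held-out data. X and y are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A submission that imputes or scales on the full dataset before splitting, or that builds features from the target, is exactly what this part of the rubric should catch.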
Research DS
A short paper or technical proposal review. Or a methodology critique of a flawed analysis. Either tests rigor and reading skill, both of which matter more than coding for this flavor.
Grading without bias
Anonymize. Always. Names, schools, prior employers — strip them before review.
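A minimal sketch of the stripping step, assuming submissions are stored as records with these hypothetical metadata keys:

```python
# A minimal sketch, assuming submissions arrive as dicts with these
# hypothetical metadata keys; identifying fields are dropped before
# any reviewer sees the work.
IDENTIFYING_FIELDS = {"name", "email", "school", "prior_employers"}

def anonymize(submission: dict) -> dict:
    """Return a copy of the submission with identifying fields removed."""
    return {k: v for k, v in submission.items() if k not in IDENTIFYING_FIELDS}
```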
Use rubric-anchored grading. ClarityHire's grading service does first-pass rubric scoring with an LLM, anonymized; reviewers see the AI score plus the work and override with a reason. For DS submissions specifically, this surfaces issues like a missing cross-validation setup or an improper train/test split that the reviewer can verify quickly.
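Not ClarityHire's actual API, but a minimal sketch of the review record that workflow implies: first-pass rubric scores, reviewer overrides, and a required reason attached to each override.

```python
# A minimal sketch of a rubric-anchored review record; the criteria,
# the 1-4 scale, and all names are illustrative, not a real API.
from dataclasses import dataclass, field

RUBRIC = {
    "correctness": "Numbers are right and reproducible",
    "methodology": "Sound train/test separation, cross-validation, no leakage",
    "communication": "Findings stated with appropriate uncertainty",
}

@dataclass
class Review:
    submission_id: str
    ai_scores: dict[str, int]  # first-pass LLM rubric scores (1-4)
    final_scores: dict[str, int] = field(default_factory=dict)
    override_reasons: dict[str, str] = field(default_factory=dict)

    def override(self, criterion: str, score: int, reason: str) -> None:
        """Reviewers can change a first-pass score, but only with a written reason."""
        if criterion not in RUBRIC:
            raise KeyError(f"unknown rubric criterion: {criterion}")
        if not reason:
            raise ValueError("an override requires a reason")
        self.final_scores[criterion] = score
        self.override_reasons[criterion] = reason
```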
What to never do
- Whiteboard SQL questions. The medium changes the skill: many great analysts cannot write joins from memory but write them fluently against a real database.
- "Implement gradient descent from scratch." Tests memorization of an undergraduate exercise, not job skill.
- Take-homes longer than 3 hours at the screen stage. You're paying for the extra signal in pipeline width: strong candidates with competing offers drop out.
Pair with an interview
Whatever the assessment, follow it with a 45-minute discussion of the candidate's submission. The walkthrough catches almost all the integrity issues the assessment alone misses, and the rubric for the discussion (probing how deeply the candidate understands their own choices) is straightforward.