The Predictive Validity of Hiring Methods: What the Research Actually Says
Why "predictive validity" is the only number that matters
Predictive validity is the correlation, expressed as a coefficient r that in practice runs from 0 to 1, between an assessment score and later on-the-job performance. An r of 0.0 means the method tells you nothing about who will perform well. An r of 0.5 is genuinely useful. An r of 0.7 is approaching the ceiling of what is measurable in noisy real-world settings.
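For intuition, a validity coefficient is just Pearson's r between assessment scores at hire and performance measured later. A minimal sketch in plain Python, with hypothetical numbers:

```python
from statistics import mean, stdev

# Hypothetical data: assessment scores at hire, and performance
# ratings for the same eight hires collected a year later.
assessment = [62, 71, 55, 80, 68, 90, 74, 59]
performance = [3.4, 3.1, 2.9, 3.8, 2.7, 4.2, 3.0, 3.3]

def pearson_r(xs, ys):
    """Pearson correlation: sample covariance over the product of sample SDs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

print(f"predictive validity r = {pearson_r(assessment, performance):.2f}")  # ≈ 0.70
```

Real validation studies then apply corrections for measurement error and range restriction on top of this raw correlation, which is exactly where the controversy in the next section lives.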
Almost every hiring debate — "should we add a take-home?", "are personality tests worth it?", "is the resume screen broken?" — collapses into one question once you have validity numbers: does this method actually predict who will be a strong performer?
This post is the short, authoritative version of that literature.
The seminal source
For decades the reference was Schmidt and Hunter's 1998 meta-analysis, "The Validity and Utility of Selection Methods in Personnel Psychology," which synthesized roughly eighty-five years of accumulated research. It ranked individual predictors by validity and reported each method's incremental contribution beyond general mental ability tests.
In 2022, Sackett, Zhang, Berry, and Lievens published a major correction ("Revisiting Meta-Analytic Estimates of Validity in Personnel Selection") arguing that earlier studies had over-corrected for range restriction and inflated the top of the chart. Their re-estimates pull most methods downward but leave the ordering largely intact.
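To see the mechanism, here is a minimal sketch of the simplest direct-restriction correction (Thorndike's Case II). The u values below are illustrative assumptions, not figures from either paper; the point is that the larger the assumed gap between applicant-pool and incumbent variability, the more an observed correlation gets inflated:

```python
def correct_for_range_restriction(r_obs: float, u: float) -> float:
    """Thorndike Case II correction. u is the ratio of the applicant-pool
    SD to the restricted (incumbent) SD on the predictor."""
    return (u * r_obs) / ((1 + (u**2 - 1) * r_obs**2) ** 0.5)

# The same observed r = 0.25 yields very different "corrected" validities
# depending on the assumed u; Sackett et al. argue the u values behind
# the older estimates were too aggressive.
for u in (1.0, 1.5, 2.0):
    print(f"u = {u:.1f} -> corrected r = {correct_for_range_restriction(0.25, u):.2f}")
# u = 1.0 -> corrected r = 0.25
# u = 1.5 -> corrected r = 0.36
# u = 2.0 -> corrected r = 0.46
```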
The numbers below are the corrected estimates, rounded for memorability. Read them as relative rankings, not gospel.
The chart everyone should know
| Method | Approx. validity (r) |
|---|---|
| Structured interviews | 0.42 |
| Job knowledge tests | 0.40 |
| Work sample tests | 0.33 |
| Cognitive ability tests | 0.31 |
| Integrity tests | 0.31 |
| Conscientiousness (personality) | 0.19 |
| Unstructured interviews | 0.19 |
| Reference checks | 0.13 |
| Years of education | 0.10 |
| Years of experience | 0.09 |
| Graphology / handwriting | 0.02 |
A few things jump out:
- Structured interviews and work samples are at the top. Together, they are the load-bearing pillars of any defensible hiring loop.
- Unstructured interviews — the kind most teams default to — land at 0.19, tied with a single personality trait and well below the structured version of the same conversation. "We met them and they seemed great" carries far less signal than it feels like it does.
- Reference checks are weak. They are useful for disqualifying disasters, not for picking winners.
What the rankings mean operationally
Three implications most teams under-internalize:
1. Structure beats length
A 30-minute structured interview with anchored rating scales outperforms a 90-minute "tell me about yourself" conversation. The structure — same questions, same rubric, same dimensions — does more work than the time investment (a minimal scorecard sketch follows this list).
2. Work samples are the highest-leverage single addition
Most loops include some flavor of interview. Far fewer include a well-designed work sample. Adding one is usually the single biggest validity jump available, and it has the additional benefit of being more legally defensible because it directly samples the job.
3. Cognitive tests work but have adverse impact
Cognitive ability tests predict performance reasonably well across roles. They also tend to produce larger demographic score gaps than work samples do, which is why most modern hiring guidance prefers job-content assessments where possible.
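To make implication 1 concrete, here is a minimal sketch of one anchored-scale scorecard entry. The question, dimension, and anchor wording are hypothetical examples, not ClarityHire defaults:

```python
# Hypothetical anchored rating scale for one structured-interview
# dimension: every candidate gets the same question, and every score
# must map to an observable behavior rather than a gut feeling.
QUESTION = "Walk me through a time you shipped under a hard deadline."
DIMENSION = "execution under constraints"
ANCHORS = {
    1: "No concrete example; speaks only in generalities.",
    2: "Example given, but the candidate's own role is unclear.",
    3: "Clear example with a specific, personally owned decision.",
    4: "Clear example plus an explicit trade-off and its outcome.",
}

def record_score(interviewer: str, level: int) -> dict:
    """Capture a score before any group discussion, and only at a defined anchor."""
    if level not in ANCHORS:
        raise ValueError(f"level must be one of {sorted(ANCHORS)}")
    return {"interviewer": interviewer, "level": level, "anchor": ANCHORS[level]}
```

The forcing function is the anchor lookup: an interviewer cannot submit "seemed great," only a level that maps to a described behavior.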
"Authoritative source" — where to read further
For practitioners who want primary sources, the three to know:
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
- Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107(11), 2040–2068.
- The SIOP Principles (Society for Industrial and Organizational Psychology) — practitioner-facing guidance on validation and fairness.
The Sackett 2022 paper is the most current authoritative source. If you cite one number in a hiring-design doc, take it from there.
What this means for your loop
The shortest defensible recommendation:
- Add a work sample for any role where you can design a representative task in under three hours.
- Make every interview a structured interview — same questions, same rubric, scores collected before recommendations.
- Treat unstructured "vibes check" rounds as social, not predictive. Keep them short and weighted low.
- Drop reference checks from go/no-go decisions. Use them as a final sanity layer.
That single set of changes can move a typical loop from a predicted r in the low 0.2s to somewhere near 0.5 (a rough sketch of the arithmetic is below). The compounding effect over a year of hires is enormous.
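The arithmetic behind that claim can be sanity-checked with the standard two-predictor multiple-correlation formula. A minimal sketch, assuming the corrected validities from the table above and an illustrative 0.30 intercorrelation between interview and work-sample scores (the intercorrelation is an assumption, not a published figure):

```python
from math import sqrt

def composite_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R for two predictors with criterion
    validities r1 and r2 and predictor intercorrelation r12."""
    return sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

# Structured interview (0.42) + work sample (0.33); the 0.30
# intercorrelation is an illustrative assumption.
print(f"combined r ≈ {composite_validity(0.42, 0.33, 0.30):.2f}")  # ≈ 0.47
```

Under these assumptions the combination lands just under 0.5, versus roughly 0.2 for a loop built on unstructured interviews alone.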
How ClarityHire fits
ClarityHire is built around exactly this priority order. The default scorecard template is structured. The assessment templates are work-sample-shaped, not trivia. Integrity signals keep take-home work samples honest in an AI-assisted world. The product opinions are downstream of the research — not the other way around.
Pick methods by validity. Everything else is style.