Bayesian Psychometric Modeling for Hiring Assessment Validity
Alva Labs
The Challenge
Alva Labs faced two challenges that looked different but turned out to be deeply connected. First, their personality assessment needed more sophisticated scoring — they wanted trait estimates that carry honest uncertainty rather than single-number summaries, and that handle the subtleties of how people actually respond to rating scales.
Second, and arguably harder: they needed to prove to enterprise buyers that their assessments actually predict job performance. That proof is tricky to produce. You only observe performance for people who got hired, so you're looking at a selected, range-restricted slice of the candidate pool. Individual clients rarely hire enough people to draw conclusions on their own. And job performance ratings are inherently ordinal: a 4 out of 5 isn't twice a 2 out of 5.
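To see why hired-only data understates validity, here is a small, self-contained simulation (illustrative numbers only, not Alva Labs data): performance correlates with the assessment score across the full applicant pool, but the correlation measured only among top-scoring hires comes out markedly weaker.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical applicant pool: assessment score x, later job performance y.
n = 50_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # true validity is ~0.45 in the full pool

corr_all = np.corrcoef(x, y)[0, 1]

# But performance is only observed for hires -- say, the top 30% on x.
hired = x > np.quantile(x, 0.70)
corr_hired = np.corrcoef(x[hired], y[hired])[0, 1]

print(f"correlation in full pool: {corr_all:.2f}")
print(f"correlation among hires:  {corr_hired:.2f}")  # attenuated by range restriction
```

With this seed the hired-only correlation comes out at roughly half the pool-wide value, even though the assessment is equally predictive for everyone; a validity study that ignores the selection step would understate the product by the same kind of margin.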
Our Approach
Better trait estimation
We followed a careful build-and-validate cycle: simulate data first to stress-test assumptions, then build the model, then prove it works on real data, then benchmark it against what Alva already had. For trait estimation, we used a response model designed specifically for ordered rating scales — the kind used in personality assessments — producing full distributions over each candidate's traits rather than point estimates.
Proving predictive validity
For validity, we built a hierarchical model that pools evidence across client companies and roles. Small clients borrow statistical strength from larger ones, so you can generate meaningful validity estimates even when any single company has hired only a handful of people. The model accounts for the fact that you only observe performance among the hired, correcting for the bias that otherwise makes assessments look less predictive than they actually are.
Results
The new models substantially outperformed Alva Labs' existing system across key metrics, while also running faster and using less memory. On the assessment side, the team got production-ready trait estimates with credible intervals — a genuine upgrade from point scores that pretend certainty they don't have.
On the validity side, they got something arguably more valuable: a rigorous evidence package they could bring into enterprise sales conversations, showing prospective clients that the assessments predict real outcomes even after accounting for the statistical pitfalls that plague most validity studies.
PyMC Labs Team
- Thomas Wiecki
- Morgan
- Tomi
- Christian
- Niall Oulton