title: "How We Built TalentFilter: AI Candidate Screening Across Three Evaluation Layers"
slug: talentfilter-ai-candidate-screening-platform
The client's legal team had one requirement that reshaped the entire evaluation pipeline: every AI scoring decision had to be explainable. Not auditable in the sense of "we can review the logs" — explainable in the sense that, for any candidate score, a recruiter could read a plain-language account of what the AI assessed and why it scored the way it did.
The standard approach to GPT-4o evaluation produces a score. Maybe a brief rationale. That's not sufficient for a bias audit. A bias audit requires that you can examine the reasoning across a candidate cohort — grouped by any demographic dimension — and confirm that the scoring criteria are being applied consistently. If the model says "clear communicator" for some candidates and "struggled to articulate" for others, you need the context to know whether that reflects actual communication quality differences or an inconsistency in how the model applies the criterion.
The constraint forced us to redesign the GPT-4o evaluation pipeline to produce structured JSON with explicit reasoning at the criterion level — content relevance reasoning, communication clarity reasoning, confidence indicators — rather than a composite score with a brief summary. It added processing complexity. It also produced better scores, because the chain-of-thought reasoning in structured output consistently outperforms summary scoring on subjective evaluation tasks.
TalentFilter is an AI-powered recruitment assessment platform for high-volume technical hiring. It has three sequentially gated evaluation layers: an async video interview scored by speech analysis and NLP, an adaptive examination with AI marking, and an interview intelligence brief that gives the final interviewer everything they need to probe the candidate's actual gaps. Each layer serves as a filter for the next: only candidates who clear the previous layer advance.
Who This Was Built For
The client is an enterprise HR technology organisation — name confidential — that operates recruitment processes for high-volume technical roles across software engineering, data science, finance, and operations. The problem they were solving is one that any company hiring at scale encounters: a job posting for a senior software engineer role receives 2,000 applications. Each one deserves honest evaluation. The capacity to evaluate 2,000 applications with the depth and consistency that a strong hire requires — in the time window between posting and offer — simply doesn't exist with human reviewers alone.
The platform is designed for companies where both hard skills and communication competency matter equally. A data scientist who can't explain their modelling decisions to a business stakeholder is a harder hire to justify than a slightly weaker modeller who communicates clearly. The three-layer architecture exists to evaluate both dimensions at scale.
The Problem
The failure modes in high-volume technical hiring are well-documented. CV screening is inconsistent — two reviewers evaluating the same CV on the same criteria often disagree significantly. Phone screens scale poorly — an hour-per-candidate screen with 50 qualified candidates is 50 hours of recruiter time before a single technical interview. Final interviews are the most expensive step and frequently surface problems — communication issues, knowledge gaps in specific areas — that earlier evaluation could have caught.
The AI augmentation brief: automate the highest-volume, most-repetitive evaluation steps without sacrificing evaluation quality. The key phrase is "without sacrificing": a faster process that surfaces worse candidates isn't a win. The platform had to demonstrate that AI-scored candidates who advanced to final interview were higher quality, on average, than the historically human-scored pool on the same roles.
The bias audit requirement came from the client's legal team and HR leadership jointly. The EU AI Act, which explicitly categorises AI systems used in HR and recruitment as high-risk, requires that AI systems used in employment decisions maintain sufficient transparency for non-discrimination monitoring. The client's legal assessment concluded that a system producing composite scores without criterion-level reasoning did not meet that threshold.
What We Built
Three evaluation layers, sequentially gated:
Layer 1 — Async Video Interview: Candidates record responses to 3–5 role-specific questions. AI scores each response on three dimensions: content relevance (40% weight), communication clarity (30%), and delivery confidence (30%). Each dimension is scored with explicit criterion-level reasoning in structured JSON. Composite score: 0–100.
Layer 2 — Adaptive Examination: MCQ bank with IRT-based difficulty adaptation — question difficulty adjusts based on the candidate's running performance — supplemented by short-form written answers. MCQs are auto-graded. Short answers are evaluated by GPT-4o against a human-curated rubric with partial credit scoring and criterion-level reasoning output.
Layer 3 — Interview Intelligence Brief: For candidates who clear Layer 1 and 2, the technical interviewer receives a structured dossier: Layer 1/2 score breakdown, identified knowledge gaps from the exam, and AI-generated probing questions targeted at filling those gaps. Live scoring rubric embedded in the interviewer UI for consistent structured evaluation capture.
The recruiter dashboard shows the full pipeline: per-stage funnel metrics, cohort score distributions, and the bias audit log — all AI decisions with reasoning, searchable and filterable by demographic dimension.
How We Built It
Layer 1: The Video Evaluation Pipeline
Video submissions are recorded via WebRTC (MediaRecorder API) in the browser, uploaded to AWS S3, and processed through a three-stage pipeline.
Stage 1 — Transcription: OpenAI Whisper processes the audio track. We tested several ASR options; Whisper was notably better than Azure Speech on candidates with non-native English accents — a significant factor when hiring globally. Word error rate on accented English averaged 6.2% on our validation set versus 14.8% for a comparable Azure Speech configuration.
Stage 2 — Delivery analysis: Azure Speech SDK analyses the audio independently from Whisper transcription, scoring: speech rate (words per minute deviation from natural range), pause pattern (frequency and placement of pauses — appropriately placed pauses signal organised thinking; excessive fillers signal uncertainty), filler word frequency, and volume consistency. These deliver the 30% confidence/delivery component of the composite score.
Stage 3 — Content evaluation: GPT-4o evaluates the Whisper transcript against the question's evaluation rubric. The rubric is defined per question by the client's hiring team — specific technical criteria, expected answer elements, and clarity indicators. The model returns structured JSON: { "content_score": 82, "content_reasoning": "...", "clarity_score": 74, "clarity_reasoning": "...", "flags": ["mentioned X but did not explain Y"] }. The flags field is the key bias audit signal — it captures specific gaps that can be reviewed across a cohort.
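The structured output and the 40/30/30 weighting can be illustrated with a short sketch. This is a minimal example, not the production code: the field names follow the JSON example above, while `parse_evaluation`, `composite_score`, the type checks, and the assumption that the delivery analysis has already been normalised to a 0-100 `delivery_score` are all hypothetical.

```python
import json

# Weights taken from the Layer 1 description: content 40%, clarity 30%, delivery 30%.
WEIGHTS = {"content": 0.40, "clarity": 0.30, "delivery": 0.30}

# Field names follow the JSON example in the text; the type checks are an assumption.
REQUIRED_FIELDS = {
    "content_score": (int, float),
    "content_reasoning": str,
    "clarity_score": (int, float),
    "clarity_reasoning": str,
    "flags": list,
}


def parse_evaluation(raw):
    """Parse the model's JSON and reject missing fields, wrong types, or out-of-range scores."""
    data = json.loads(raw)
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"wrong type for field: {field}")
    for key in ("content_score", "clarity_score"):
        if not 0 <= data[key] <= 100:
            raise ValueError(f"{key} must be between 0 and 100")
    return data


def composite_score(evaluation, delivery_score):
    """Combine the three dimensions into the 0-100 composite.

    delivery_score is assumed to be a 0-100 value derived from the
    delivery analysis stage (hypothetical normalisation).
    """
    total = (
        WEIGHTS["content"] * evaluation["content_score"]
        + WEIGHTS["clarity"] * evaluation["clarity_score"]
        + WEIGHTS["delivery"] * delivery_score
    )
    return round(total, 1)


raw = (
    '{"content_score": 82, "content_reasoning": "...", '
    '"clarity_score": 74, "clarity_reasoning": "...", '
    '"flags": ["mentioned X but did not explain Y"]}'
)
evaluation = parse_evaluation(raw)
print(composite_score(evaluation, delivery_score=70.0))
```

Rejecting malformed output before it reaches the audit log keeps every stored record comparable, which is the property the cohort-level review depends on.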
Layer 2: IRT-Based Adaptive Examination
Standard MCQ exams have a fixed difficulty level that creates two failure modes: easy exams let weak candidates score well by guessing, hard exams penalise strong candidates with inconsistent failure on items near their ceiling. Item Response Theory (IRT) solves this by modelling each question's difficulty parameter and selecting the next question based on the candidate's estimated ability from prior responses.
We implemented a two-parameter IRT model (difficulty and discrimination). The question bank was initially calibrated by the client's hiring team assigning difficulty estimates to each question; after the first 200 candidate sessions, we recalibrated the difficulty parameters from actual response data using maximum likelihood estimation. The recalibrated parameters were measurably more accurate — the average score variance on the recalibrated bank was 18% lower than on the initial estimates, indicating that the adaptive algorithm was more accurately targeting the candidate's ability level.
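For readers unfamiliar with the two-parameter model, the following sketch shows its core pieces. It is illustrative only: the text does not say how the next question is chosen or how ability is estimated during a session, so maximum-information item selection and the Newton-Raphson ability update are assumptions, and all function names and the sample bank are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability of a correct answer for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of an item at ability theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta, bank, asked):
    """Pick the unasked item that is most informative at the current estimate
    (assumed selection rule)."""
    candidates = [item for item in bank if item["id"] not in asked]
    return max(candidates, key=lambda item: item_information(theta, item["a"], item["b"]))


def estimate_ability(responses, theta=0.0, iterations=20):
    """Maximum likelihood ability estimate via Newton-Raphson.

    responses is a list of (a, b, correct) tuples. The estimate is clamped
    to [-4, 4] because the MLE diverges when every answer is right or wrong.
    """
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            gradient += a * ((1.0 if correct else 0.0) - p)
            information += a * a * p * (1.0 - p)
        if information < 1e-9:
            break
        theta = max(-4.0, min(4.0, theta + gradient / information))
    return theta


# Hypothetical three-item bank.
bank = [
    {"id": "easy", "a": 1.0, "b": -2.0},
    {"id": "mid", "a": 1.0, "b": 0.0},
    {"id": "hard", "a": 1.0, "b": 2.0},
]
print(next_item(0.0, bank, asked=set())["id"])
print(estimate_ability([(1.0, -1.0, True), (1.0, 0.0, True), (1.0, 1.0, False)]))
```

An item is most informative when its difficulty sits close to the candidate's current ability estimate, which is why poorly calibrated difficulty parameters degrade the adaptive targeting.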
Short-answer marking by GPT-4o uses the same structured JSON pattern as Layer 1: a rubric defined by the hiring team, per-criterion scoring with reasoning, and partial credit available for answers that demonstrate correct understanding of core concepts while missing peripheral details.
Layer 3: Interview Intelligence Brief Generation
The brief is generated by a LangChain RAG pipeline that takes the candidate's Layer 1 and 2 data as context and retrieves role-specific probing question frameworks from the question library.
The key design decision: the brief is not a summary of what the candidate got right — it is a targeting document for the interviewer to probe what they might have wrong or shallow. Layer 1's flags field and Layer 2's incorrect answers are the primary inputs. The brief maps identified gaps to specific probing questions from the library.
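The gap-to-question mapping can be sketched in a few lines. This example deliberately swaps the retrieval step for plain keyword overlap so that it stays self-contained; it is a stand-in for the idea, not the LangChain pipeline itself, and the library structure (`topic`, `question`) and the function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set used for the naive overlap match."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_probing_questions(gaps, question_library, top_k=3):
    """Rank library questions by keyword overlap with the identified gaps.

    gaps combines Layer 1 flags and the topics of incorrect Layer 2 answers.
    Keyword overlap is a simplification standing in for real retrieval.
    """
    gap_terms = set()
    for gap in gaps:
        gap_terms |= tokens(gap)
    scored = []
    for entry in question_library:
        overlap = len(gap_terms & tokens(entry["topic"]))
        if overlap:
            scored.append((overlap, entry["question"]))
    # sort() is stable, so ties keep their library order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [question for _, question in scored[:top_k]]


layer1_flags = ["mentioned indexing but did not explain query planning"]
layer2_missed_topics = ["database transactions"]
library = [
    {"topic": "query planning", "question": "Walk me through how a query planner picks an index."},
    {"topic": "transactions isolation", "question": "When would you choose serializable isolation?"},
    {"topic": "frontend rendering", "question": "How does the browser decide when to repaint?"},
]
for question in rank_probing_questions(layer1_flags + layer2_missed_topics, library):
    print(question)
```

Note that only gaps feed the ranking; nothing the candidate answered well contributes to the brief, which matches the targeting intent described above.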
Interviewers who used the brief reported probing into genuine knowledge gaps 67% more often than interviewers using standard question lists on the same roles — measured by post-interview scoring on the rubric dimensions that corresponded to identified gaps. The brief is doing its job when the interviewer's questions reveal something the prior layers correctly predicted might be a weakness.
What Made It Hard
1. Structured JSON Evaluation with Consistent Criterion Naming
The bias audit requirement demanded not just structured output — it demanded consistent structured output. If the reasoning for "communication clarity" is described in different terms across 5,000 candidate evaluations, comparing reasoning across a demographic cohort is impossible. You need consistent terminology to detect pattern differences.
We solved this by including the criterion taxonomy directly in the evaluation prompt: a fixed vocabulary of allowed reasoning phrases per criterion, provided as a constrained output specification. GPT-4o was instructed to select from the defined vocabulary where possible and to flag novel observations as separate additional_notes fields rather than conflating them with the criterion reasoning. The consistency rate — proportion of reasoning outputs that used the defined vocabulary correctly — reached 94% on our validation set, which was sufficient for the bias audit use case.
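A consistency rate of this kind could be measured along the following lines. This is a sketch under stated assumptions: it treats each criterion's reasoning as a list of phrases, the sample vocabulary is invented for illustration, and `is_consistent` and `consistency_rate` are hypothetical names.

```python
# Hypothetical vocabulary; the real taxonomy is not given in this article.
VOCABULARY = {
    "content_relevance": {"addresses core concept", "omits expected element"},
    "communication_clarity": {"structured explanation", "unclear terminology"},
}


def is_consistent(evaluation, vocabulary):
    """True when every reasoning phrase under every criterion is in the allowed set.

    Free-text observations belong in additional_notes and are not checked here.
    """
    for criterion, phrases in evaluation["reasoning"].items():
        allowed = vocabulary.get(criterion, set())
        if any(phrase not in allowed for phrase in phrases):
            return False
    return True


def consistency_rate(evaluations, vocabulary):
    """Proportion of evaluations whose reasoning stays inside the vocabulary."""
    if not evaluations:
        return 0.0
    consistent = sum(1 for e in evaluations if is_consistent(e, vocabulary))
    return consistent / len(evaluations)


evaluations = [
    {
        "reasoning": {"communication_clarity": ["structured explanation"]},
        "additional_notes": ["used a whiteboard analogy"],
    },
    {
        "reasoning": {"communication_clarity": ["rambled a bit"]},
        "additional_notes": [],
    },
]
print(consistency_rate(evaluations, VOCABULARY))
```

Keeping novel observations in `additional_notes` means an off-vocabulary remark does not count against the criterion reasoning, so the rate reflects only how well the fixed taxonomy was followed.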
2. IRT Calibration With Limited Initial Data
The IRT model needs calibrated difficulty parameters to work correctly. With zero historical data at launch, we had to use client-estimated parameters — which are useful but imprecise. The first 200 candidate sessions were effectively calibration data as well as real evaluations.
The practical implication: the first cohort of candidates were not evaluated with the same adaptive precision as later cohorts. We disclosed this to the client, who accepted it as an inherent limitation of launching any psychometric system. The recalibration run at 200 sessions moved average ability estimation accuracy from RMSE 0.31 to RMSE 0.19 on our validation set — a meaningful improvement that justified the two-phased launch approach rather than deploying uncalibrated parameters indefinitely.
3. WebRTC Recording Quality Variance
Candidates record from their own devices, in their own environments, on their own network connections. The variance in recording quality is significant: some candidates record from a professional setup with good audio; others record from a phone in a noisy room on a 3G connection. The audio quality affects ASR accuracy and delivery analysis reliability in ways that are outside the platform's control.
We built a pre-submission quality check: before the candidate submits their responses, the platform runs a local audio quality assessment (signal-to-noise ratio estimate, volume level, connection stability indicator) and alerts candidates whose recording conditions are below the minimum threshold. The alert doesn't block submission; it only informs. Candidates who re-record after a quality alert submit audio with a 22% lower ASR word error rate than those who submitted without seeing the alert.
The more important mitigation: the Whisper model's robustness to noisy audio at standard quality levels is genuinely good. For recordings above the minimum SNR threshold — approximately 85% of submissions — transcription accuracy is sufficient for the evaluation pipeline. The remaining 15% generate lower-confidence transcription outputs that are flagged in the evaluation JSON, which the recruiter dashboard surfaces as a data quality indicator alongside the score.
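One simple way to estimate signal-to-noise ratio without a clean reference is to compare the loudest frames of a recording with the quietest ones. The sketch below illustrates that idea in Python for readability; the actual check runs locally on the candidate's device and its method is not described here, so the frame size, the 10% split, the 15 dB threshold, and the function names are all assumptions.

```python
import math


def frame_powers(samples, frame_size=400):
    """Mean power of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        powers.append(sum(s * s for s in frame) / frame_size)
    return powers


def estimate_snr_db(samples, frame_size=400):
    """Rough SNR: loudest 10% of frames (speech) over quietest 10% (noise floor)."""
    powers = sorted(frame_powers(samples, frame_size))
    if len(powers) < 10:
        raise ValueError("recording too short to estimate SNR")
    k = max(1, len(powers) // 10)
    noise = sum(powers[:k]) / k
    speech = sum(powers[-k:]) / k
    if noise <= 0:
        return float("inf")
    return 10.0 * math.log10(speech / noise)


def needs_quality_alert(samples, min_snr_db=15.0):
    """True when the estimate falls below the (hypothetical) minimum threshold."""
    return estimate_snr_db(samples) < min_snr_db


def synthetic_recording(quiet_amplitude, loud_amplitude, frames=20, frame_size=400):
    """Test signal: alternating halves of quiet background and loud speech-like tone."""
    samples = []
    for f in range(frames):
        amplitude = quiet_amplitude if f < frames // 2 else loud_amplitude
        for i in range(frame_size):
            samples.append(amplitude * math.sin(2 * math.pi * 10 * i / frame_size))
    return samples


clean = synthetic_recording(quiet_amplitude=0.01, loud_amplitude=0.5)
noisy = synthetic_recording(quiet_amplitude=0.3, loud_amplitude=0.5)
print(round(estimate_snr_db(clean), 1), needs_quality_alert(clean))
print(round(estimate_snr_db(noisy), 1), needs_quality_alert(noisy))
```

An estimate like this is only a proxy, which fits the informational role of the alert: it prompts a re-record rather than gating the submission.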
What Changed
The platform is operating in production with a rolling cohort of candidates across multiple client roles. Layer 1's video AI evaluation reduces the recruiter workload on the highest-volume screening step by approximately 80% — a 200-candidate pool is processed overnight. Layer 2's adaptive exam surfaces knowledge depth more accurately than a fixed MCQ format, with interviewers reporting that Layer 2 scores correlate better with final interview performance than CV-based screening scores did historically.
The bias audit log has been used in two post-hoc reviews by the client's HR compliance team and has not produced any findings requiring model or rubric adjustment. The consistent criterion vocabulary approach has been the key enabler — the reasoning outputs are comparable across cohorts in a way that makes demographic parity analysis tractable.
What's Next
The product roadmap includes: role-specific skills tests for Layer 2 — coding challenges in an integrated IDE, case study simulations, and financial modelling tasks as variants on the standard MCQ/short-answer exam; a candidate coaching mode — providing candidates with anonymised feedback on Layer 1 performance areas where their delivery was weak, without revealing the scoring rubric; and predictive performance modelling — using post-hire performance data to refine the Layer 1–3 scoring weights for roles where sufficient hire-outcome data has accumulated.
Common Questions About AI Candidate Screening
Is AI video interview scoring legal under the EU AI Act?
AI systems used in employment selection are classified as high-risk under Annex III of the EU AI Act. The Act entered into force in August 2024, and its obligations for Annex III high-risk systems are scheduled to apply from August 2026. High-risk AI requirements include: transparency to candidates (they must be informed that AI is used in evaluation), human oversight (AI scores must be reviewable by a human before affecting decisions), non-discrimination monitoring (scoring must be auditable for demographic parity), and technical documentation of the system's design and training data. TalentFilter's structured JSON evaluation with criterion-level reasoning is specifically designed to meet the non-discrimination monitoring requirement.
How accurate is AI video interview scoring compared to human evaluation?
Accuracy depends on what you're measuring. On content relevance — does the candidate's answer address the question's technical substance — AI scoring with a well-defined rubric is at least as consistent as human scoring, and more consistent across a large cohort (humans show fatigue and halo effects; the model doesn't). On communication quality, AI scoring correlates with human ratings at approximately 0.72 on our validation set — strong enough to be useful as a filter, not strong enough to be the sole determinant. The platform uses AI scores as a shortlist mechanism, not as a hiring decision.
What is Item Response Theory and why does it improve candidate assessment?
IRT is a psychometric framework that models each test item's statistical properties — primarily difficulty and discrimination power — and uses those properties to estimate a candidate's ability level more accurately than a simple proportion-correct score. Adaptive testing using IRT selects questions at the candidate's estimated ability level, which gives a more precise ability estimate with fewer questions. A fixed-difficulty exam can't distinguish between a candidate who scores 70% because they're a solid mid-level engineer and one who scores 70% because they're strong on some topics and weak on others — IRT's adaptive approach surfaces that distinction.
How do you audit an AI hiring system for demographic bias?
A bias audit on an AI hiring system requires: criterion-level scoring output (not just a composite score) with consistent terminology; demographic data on the candidate cohort (collected separately with appropriate consent); and statistical analysis of score distributions by demographic group across each criterion. If the model systematically scores one group lower on "communication clarity" — controlling for performance on content criteria — that's a detectable signal in the criterion-level reasoning that a composite score audit would miss. The bias audit log in TalentFilter records every criterion score and reasoning output, enabling post-hoc cohort comparison at any demographic dimension the client wants to examine.
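As a first-pass illustration of the cohort comparison, the sketch below computes per-criterion mean scores by group and the largest gap between groups. It is a starting point only: a real audit would add significance testing and sample-size checks, and the record layout, group labels, and function names here are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def criterion_means_by_group(records, criteria):
    """Mean score per criterion for each demographic group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for criterion in criteria:
            grouped[record["group"]][criterion].append(record[criterion])
    return {
        group: {criterion: mean(scores) for criterion, scores in by_criterion.items()}
        for group, by_criterion in grouped.items()
    }


def largest_gaps(means, criteria):
    """Largest between-group difference in mean score for each criterion."""
    return {
        criterion: max(m[criterion] for m in means.values())
        - min(m[criterion] for m in means.values())
        for criterion in criteria
    }


# Invented records: the two groups match on content but differ on clarity.
records = [
    {"group": "A", "content_score": 75, "clarity_score": 70},
    {"group": "A", "content_score": 85, "clarity_score": 80},
    {"group": "B", "content_score": 75, "clarity_score": 60},
    {"group": "B", "content_score": 85, "clarity_score": 70},
]
criteria = ["content_score", "clarity_score"]
means = criterion_means_by_group(records, criteria)
print(largest_gaps(means, criteria))
```

In this invented data the composite would hide most of the difference, while the per-criterion view isolates it to clarity. That is the kind of signal the criterion-level log is meant to expose for human review.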
The bias audit requirement that reshaped this project's architecture turned out to be a technical improvement as well as a compliance improvement. Structured criterion-level reasoning with consistent vocabulary is strictly better than summary scoring for evaluating subjective dimensions — both because the chain-of-thought produces better outputs and because it produces outputs you can review and improve.
If you're building AI systems for employment or other high-stakes decisions, the EU AI Act's transparency and auditability requirements are not burdensome additions to an otherwise clean architecture. They're engineering constraints that produce a more honest and more reviewable system.
We've applied similar explainability-first design to ADAC's RAG chatbot where every retrieval decision is logged with source attribution, and to BSI's government incident reporting platform where every workflow state change is auditable. Our AI integration and automation practice covers AI system design through production deployment.
