The Evaluation Pipeline

No Black Box. Just Evidence.

Every score Storm produces points to a real quote from the candidate’s actual work. The AI does the homework. The hiring manager makes the call. Here’s exactly how it works — step by step, in plain English.


Single Observation · Anatomy eval #4172

Criterion Funnel Metric Definition

Question “Did the candidate distinguish ‘started application’ from ‘submitted application’ in their reasoning?”

Citation “The dashboard’s counting started apps, but conversion is measured against submitted apps — that’s where the drop is coming from.”

Answer Yes

Reasoning Candidate identified the metric mismatch directly and explained the funnel semantics before recommending a fix.

100% Cited or Insufficient

Reproducible Same input → same score

The Promise

Three things every Storm score guarantees

Most AI hiring tools hand you a number and call it done. We hand you the number, the question that produced it, the quote from the candidate’s work that justifies it, and the reasoning that connects them.

Cited

Every “yes” or “no” the AI gives must point to an exact quote from the candidate’s chat, code, spreadsheet, or slide. If the AI can’t find evidence, it must say so — explicitly.

no quote → no score → “Insufficient Evidence”

Deterministic

Same evidence, same prompt, same model: same result. Scores can be reproduced quarters from now, because the prompt version and model snapshot are stored alongside each score.

temperature: 0 · seed: 42 · model: pinned
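As a rough illustration of the settings above, here is a minimal sketch of how they could be frozen and fingerprinted so two runs can be compared. The class name, field names, and hashing step are assumptions made for this example, not Storm's actual code.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings stored with every score."""
    model_snapshot: str       # a pinned snapshot id, never a floating "latest" alias
    prompt_version: str
    temperature: float = 0.0
    seed: int = 42


def config_fingerprint(config: EvalConfig) -> str:
    """Return a stable hash of the settings so two runs can be checked for equality."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = EvalConfig(model_snapshot="<pinned-snapshot>", prompt_version="v1")
run_b = EvalConfig(model_snapshot="<pinned-snapshot>", prompt_version="v1")
run_c = EvalConfig(model_snapshot="<pinned-snapshot>", prompt_version="v2")

print(config_fingerprint(run_a) == config_fingerprint(run_b))  # identical settings
print(config_fingerprint(run_a) == config_fingerprint(run_c))  # prompt version changed
```

If the fingerprints match and the evidence is byte-identical, a replay is comparing like with like. If they differ, the score came from a different configuration.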

Human-decides

The AI scores. You decide. Hiring managers see every observation, every quote, every “Insufficient Evidence” tag — and can override any score with a comment that becomes part of the audit trail.

AI assists. Humans hire.

The Pipeline

Six stages, end to end

From defining what “good” looks like for a role, all the way through to the hiring-manager view — here’s every step the platform takes, and what we do at each one to make the result trustworthy.

Turn the role into testable questions

You define what matters for the role — competencies, weights, importance. The AI takes each criterion and writes a small set of yes/no observation questions a hiring manager could ask while watching the candidate work. “Did they distinguish leading from lagging metrics?” “Did they explain the trade-off before picking a path?”

These aren’t trivia questions. They’re the operational definition of “good” for this role. They get versioned, reviewed, and bound to the position.

You can edit, add, or veto any question

Criterion → Observation Questions

Funnel Metric Definition weight: 0.25 · critical

AI-Generated Questions

Q1 Did the candidate distinguish “started application” from “submitted application” in their reasoning?
Q2 Did the candidate verify the metric definition against the source before recommending a fix?
Q3 Did the candidate connect the funnel mismatch to the post-launch conversion drop?
Q4 Did the candidate name a concrete remediation rather than a generic “needs investigation”?
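The criterion and its questions can be pictured as a small versioned record. This is a sketch only: the class names, the `vetoed` flag, and the `prompt_version` field are hypothetical, chosen to mirror the edit, add, and veto controls described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObservationQuestion:
    qid: str
    text: str
    vetoed: bool = False      # hypothetical flag set when a reviewer vetoes the question


@dataclass
class Criterion:
    name: str
    weight: float
    critical: bool
    prompt_version: str       # version of the prompt that generated the questions
    questions: List[ObservationQuestion] = field(default_factory=list)

    def active_questions(self) -> List[ObservationQuestion]:
        """Questions that will actually be scored (vetoed ones are skipped)."""
        return [q for q in self.questions if not q.vetoed]


funnel = Criterion(
    name="Funnel Metric Definition",
    weight=0.25,
    critical=True,
    prompt_version="v1",
    questions=[
        ObservationQuestion("Q1", "Did the candidate distinguish started from submitted applications?"),
        ObservationQuestion("Q2", "Did the candidate verify the metric definition against the source?"),
        ObservationQuestion("Q3", "Did the candidate connect the funnel mismatch to the conversion drop?"),
        ObservationQuestion("Q4", "Did the candidate name a concrete remediation?"),
    ],
)

funnel.questions[3].vetoed = True   # a reviewer vetoes Q4
print([q.qid for q in funnel.active_questions()])
```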

Capture what they actually did, not what they said they’d do

As the candidate works through the simulation, every artifact gets captured — their voice-call transcript, their code edits, their spreadsheet formulas, their slide content, every line of chat. We bundle it into a structured evidence package tagged by source and timestamp.

Before the bundle ever reaches an AI evaluator, we run it through a redactor that strips the candidate’s name, location, and identifying details. The model scores the work, not the person.

PII-redacted before scoring

Evidence Bundle · sim-uniera-funnel

💬

Chat Transcript

14 turns · stakeholder Q&A

3.2 KB

💻

Code Editor

LMS application count fix

+34 / -12

📊

Spreadsheet

Pre/post conversion analysis

2 sheets

🎤

Voice Call

VP Ops debrief · transcript

8m 14s

Redactor pass: name and location replaced with [CANDIDATE_NAME], [LOCATION] before evaluator handoff
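A redactor pass of this kind can be sketched in a few lines. This example assumes the candidate's name and location are already known from their profile and simply swaps them for the placeholders shown above. The function name is hypothetical, and a production redactor would also need to handle nicknames, partial names, and other identifying details.

```python
import re


def redact(text: str, candidate_name: str, location: str) -> str:
    """Replace known identifying strings with placeholders (case-insensitive)."""
    replacements = {
        candidate_name: "[CANDIDATE_NAME]",
        location: "[LOCATION]",
    }
    for raw, placeholder in replacements.items():
        if raw:  # skip empty values so we never match the empty string
            text = re.sub(re.escape(raw), placeholder, text, flags=re.IGNORECASE)
    return text


sample = "Hi, I'm Jane Doe, dialing in from Austin."
print(redact(sample, "Jane Doe", "Austin"))
```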

Score every question against the evidence

For each criterion’s question, we run an observation check: a focused AI call that gets one yes/no question, the candidate’s full evidence bundle, and a single instruction (“answer the question, but only if you can quote the exact text that justifies your answer”).

The AI must return three things together: an answer, a citation (a literal quote), and a one-sentence reasoning. If it can’t find a quote, the only legal answer is Insufficient Evidence.

One question · one citation · one answer

Observation Checks · Funnel Metric Definition

Q1 · Distinguished started vs. submitted apps?

cite: chat[turn 6]

Yes

Q2 · Verified the metric against the source?

cite: code[diff line 42]

Yes

Q3 · Connected mismatch to post-launch drop?

cite: voice[03:14]

Yes

Q4 · Named a concrete remediation?

cite: not found in bundle

Insufficient
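The "answer, citation, reasoning" contract for an observation check can be expressed as a small record plus one rule. This is a sketch under assumed names (`Observation`, `enforce_citation_rule`), not the platform's real schema.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ANSWERS = {"Yes", "No", "Insufficient Evidence"}


@dataclass(frozen=True)
class Observation:
    question_id: str
    answer: str
    citation: Optional[str]   # literal quote from the evidence bundle, if any
    reasoning: str


def enforce_citation_rule(obs: Observation) -> Observation:
    """A Yes/No without a quote is not allowed; coerce it to Insufficient Evidence."""
    if obs.answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {obs.answer!r}")
    has_quote = bool(obs.citation and obs.citation.strip())
    if obs.answer in {"Yes", "No"} and not has_quote:
        return Observation(obs.question_id, "Insufficient Evidence", None, obs.reasoning)
    return obs


cited = Observation("Q1", "Yes", "conversion is measured against submitted apps", "Named the mismatch.")
uncited = Observation("Q4", "Yes", None, "Seemed to propose a fix.")

print(enforce_citation_rule(cited).answer)
print(enforce_citation_rule(uncited).answer)
```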

Verify every citation. Punish hallucinations.

AI models can fabricate plausible-sounding quotes. We don’t take their word for it. Every citation goes through a deterministic validator — pure code, no AI — that checks the quoted text actually appears in the candidate’s evidence.

If the validator can’t find the quote, the answer is automatically downgraded to Insufficient Evidence and a citation-violation event is logged. We track those rates per evaluation as a leading indicator of prompt drift.

Hallucinated quotes are auto-downgraded

Citation Validator · Pure Code Path

Q1 — citation submitted VERIFIED

“The dashboard’s counting started apps, but conversion is measured against submitted apps.”

✓ matched in chat[turn 6] · answer accepted: Yes

Q4 — citation submitted NOT FOUND

“We should rebuild the funnel attribution from the database up.”

✗ no match in evidence bundle · answer coerced to Insufficient · kind=citation_violation
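The validator idea is simple enough to sketch: normalize both texts, then check that the quote is a literal substring of some artifact in the bundle. The normalization choices here (collapsing whitespace, straightening curly quotes, ignoring case) and the `citation_verified` label are assumptions for illustration. Only `citation_violation` appears in the panel above.

```python
import re
from typing import Dict, Optional

# Map curly quotes to straight ones so typography alone cannot cause a mismatch.
_QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """Straighten quotes, collapse whitespace, and ignore case."""
    return re.sub(r"\s+", " ", text.translate(_QUOTE_MAP)).strip().casefold()


def find_citation(citation: str, bundle: Dict[str, str]) -> Optional[str]:
    """Return the name of the first artifact containing the quote, or None."""
    needle = normalize(citation)
    if not needle:
        return None
    for source, content in bundle.items():
        if needle in normalize(content):
            return source
    return None


def validate(answer: str, citation: str, bundle: Dict[str, str]) -> dict:
    """Accept the answer only if its quote is really in the evidence."""
    source = find_citation(citation, bundle)
    if source is None:
        return {"answer": "Insufficient Evidence", "source": None, "kind": "citation_violation"}
    return {"answer": answer, "source": source, "kind": "citation_verified"}


bundle = {
    "chat[turn 6]": "The dashboard's counting started apps, but conversion is "
                    "measured against submitted apps.",
}
q1 = validate("Yes", "The dashboard\u2019s counting   started apps", bundle)
q4 = validate("Yes", "We should rebuild the funnel attribution from the database up.", bundle)
print(q1["kind"], q4["kind"])
```

No model is involved at any point, so the same quote and the same bundle always give the same verdict.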

Aggregate honestly. Never fake confidence.

We don’t average our way to a clean number when the evidence isn’t there. Each criterion gets one of three states based on what we actually saw:

Assessed means we have enough evidence to give a real score. Partially Assessed means we have some signal but missed pieces of the question. Not Assessed means the simulation didn’t surface evidence one way or the other — and we say so plainly, instead of guessing.

Honesty about what we don’t know

Per-Criterion Aggregation

Assessed

Funnel Metric Definition
3 of 4 questions confirmed from evidence

Partial

Stakeholder Communication
Voice call evidence; deck evidence missing

Not Assessed

—

Production Hardening
No evidence surfaced in this simulation
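One way to picture the three states is a function over the per-question answers. The 75% coverage threshold below is an assumption chosen so that "3 of 4 questions confirmed" lands in Assessed, as in the panel above. The real cut-offs are not stated here.

```python
from typing import List, Optional, Tuple

ASSESSED_COVERAGE = 0.75  # assumption: share of questions that must have a cited Yes/No


def aggregate(answers: List[str]) -> Tuple[str, Optional[float]]:
    """Return (state, score). The score is None when nothing could be assessed."""
    answered = [a for a in answers if a in ("Yes", "No")]
    if not answered:
        # No cited answers at all: report it plainly instead of inventing a number.
        return ("Not Assessed", None)
    coverage = len(answered) / len(answers)
    score = answered.count("Yes") / len(answered)
    state = "Assessed" if coverage >= ASSESSED_COVERAGE else "Partially Assessed"
    return (state, score)


print(aggregate(["Yes", "Yes", "Yes", "Insufficient Evidence"]))
print(aggregate(["Yes", "Insufficient Evidence", "Insufficient Evidence", "Insufficient Evidence"]))
print(aggregate(["Insufficient Evidence", "Insufficient Evidence"]))
```

Missing evidence lowers coverage, not the score, so an Insufficient Evidence answer is never counted as a "No".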

Make every score reproducible — forever

Every evaluation writes a structured audit line: which prompt version generated the questions, which prompt version scored the observations, which model snapshot produced the answer, which seed was used, and how many citation violations occurred.

Six months from now, when someone questions a decision, we can replay the exact evaluation against the exact evidence and produce the same score. That’s not a marketing claim — it’s how the pipeline is built.

Full chain of custody

[GEVAL_AUDIT] · structured log

kind=eval_complete · evaluationId=4172 · positionId=pos-uniera-da-l3
questionGenPrompt=v1 · observationCheckPrompt=v1
model=<pinned-snapshot> · seed=42 · temperature=0
criteria=8 · questions=31 · citationViolations=1
redaction=name+location · samples=1 · mode=shadow
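A structured audit line like the one above could be written and read back with two small helpers. The exact on-disk format is an assumption here: this sketch joins `key=value` pairs with " · " after a `[GEVAL_AUDIT]` prefix, and both function names are hypothetical.

```python
from typing import Dict, Mapping

PREFIX = "[GEVAL_AUDIT] "
SEPARATOR = " · "


def format_audit_line(fields: Mapping[str, object]) -> str:
    """Serialize fields in insertion order as key=value pairs."""
    return PREFIX + SEPARATOR.join(f"{key}={value}" for key, value in fields.items())


def parse_audit_line(line: str) -> Dict[str, str]:
    """Read a line back into a dict. Values come back as strings."""
    if not line.startswith(PREFIX):
        raise ValueError("not an audit line")
    pairs = line[len(PREFIX):].split(SEPARATOR)
    # Split on the first "=" only, so values may themselves contain "=".
    return dict(pair.split("=", 1) for pair in pairs)


fields = {
    "kind": "eval_complete",
    "evaluationId": 4172,
    "seed": 42,
    "temperature": 0,
    "citationViolations": 1,
}
line = format_audit_line(fields)
print(line)
print(parse_audit_line(line)["evaluationId"])
```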
                  

The Guardrails

What we built so the AI can’t drift

Five engineering decisions that turn a clever LLM into a defensible evaluator. None of these are toggles or nice-to-haves — they’re the spine of the pipeline.

Versioned prompts

The instructions we give the AI are file-versioned and the version is stamped onto every score. Changing a prompt means a new version — and a new test gauntlet.

Pinned model snapshot

We don’t ride the latest model — we pin to a specific snapshot, recorded alongside every score in the audit trail, so a vendor update can’t silently change a score you already shipped.

Self-consistency sampling

For high-stakes criteria we can run each observation check N times in parallel and take a majority vote, with each call seeded independently so a flake in one lane can’t bias the result. Sampling ties are surfaced as their own signal, not silently averaged away.
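The majority vote with explicit tie handling might look like the sketch below. Returning no answer on a tie and deriving each lane's seed from a base seed are both assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple


def lane_seeds(base_seed: int, n: int) -> List[int]:
    """Give every sampling lane its own seed (assumed scheme: base, base+1, ...)."""
    return [base_seed + i for i in range(n)]


def majority_vote(samples: List[str]) -> Tuple[Optional[str], bool]:
    """Return (winning answer, tied). On a tie there is no winner."""
    if not samples:
        raise ValueError("need at least one sample")
    ranked = Counter(samples).most_common()
    top_answer, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied:
        # Surface the disagreement instead of picking a side.
        return (None, True)
    return (top_answer, False)


print(lane_seeds(42, 3))
print(majority_vote(["Yes", "Yes", "No"]))
print(majority_vote(["Yes", "No"]))
```

Using an odd N with two possible answers avoids most ties, but Insufficient Evidence is a third outcome, so a tie check is still needed.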

PII redaction

Names and locations are stripped from the evidence bundle before it ever reaches the evaluator. The model scores the candidate’s work, not who they are.

Built for AEDT compliance

AI hiring tools now face explicit US regulation, including NYC Local Law 144, the Illinois AI Video Interview Act, and the Colorado AI Act. Storm is engineered to clear those bars: questions referencing protected characteristics are rejected at generation, the observation-check prompt blocks demographic and stylistic signals, and PII (name, location) is stripped from the evidence bundle before the evaluator sees it. An independent annual bias audit and a published summary of its results, which rules such as NYC Local Law 144 require for automated employment decision tools (AEDTs), are on the roadmap.

Human-in-the-Loop

AI scores. You decide.

The pipeline produces evidence — a structured argument for or against each criterion, with citations attached. It does not produce decisions. Hiring managers see every piece of work, every observation, every quote, and every “Insufficient Evidence” tag.

Every override a hiring manager makes is recorded with a comment and becomes part of the audit trail. Over time, those comments are exactly the data we’d use to improve the questions for the role.

Every observation, citation, and reasoning trail is visible in the evaluation view.
Hiring managers can override any score, with a required comment that’s stored with the evaluation.
“Insufficient Evidence” is a first-class outcome — never coerced into a fake number.
The platform never auto-rejects, auto-advances, or hides candidates from a human reviewer.
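The override rule in the list above (any score can be changed, but only with a comment, and the original is never lost) can be sketched as an append-only record. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Override:
    reviewer: str
    old_score: int
    new_score: int
    comment: str
    at: str                   # UTC timestamp in ISO 8601 format


@dataclass
class Evaluation:
    evaluation_id: int
    ai_score: int             # the AI's score is kept, never overwritten
    overrides: List[Override] = field(default_factory=list)

    @property
    def current_score(self) -> int:
        return self.overrides[-1].new_score if self.overrides else self.ai_score

    def override(self, reviewer: str, new_score: int, comment: str) -> Override:
        if not comment.strip():
            raise ValueError("an override requires a comment")
        entry = Override(reviewer, self.current_score, new_score, comment.strip(),
                         datetime.now(timezone.utc).isoformat())
        self.overrides.append(entry)   # append-only: earlier entries stay in the trail
        return entry


review = Evaluation(evaluation_id=1, ai_score=62)
review.override("Priya", 74, "Slide deck addressed production hardening directly.")
print(review.ai_score, review.current_score, len(review.overrides))
```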

Senior Data Analyst · Hiring Review 3 candidates

Candidate A

87 Assessed · 1 Partial · 0 Not Assessed

87 Advance

Candidate B

62 Partial · communication strong; production weak

62 Override → 74

Priya · Hiring Manager Disagreed with “production hardening” mark. The candidate’s slide deck addressed it directly; the AI missed the slide because the artifact was attached after the voice call. Re-running with full bundle.

Candidate C

— Not Assessed · simulation cut off at chapter 2

— Re-run

Common Questions

Things buyers ask us, with honest answers

If a question you have isn’t here, tell us — we’d rather answer it directly than wave it away.

How do I know it’s not just a black box?

Every score on Storm breaks down into the questions that produced it, the literal quote from the candidate’s work that justified each answer, and the AI’s one-sentence reasoning. If you can read the evaluation page, you can audit the score. We do not show a number without showing the evidence behind it.

What happens when the AI gets it wrong?

Two things. First — because every score is cited, you can see why the AI got it wrong: it cited the wrong passage, missed an artifact, or stopped at one quote when there was a better one. Second — overrides are first-class. The hiring manager corrects the score with a comment, and the override becomes part of the permanent audit trail. Patterns of overrides for a role are how we improve the questions for that role over time.

How do you prevent demographic bias from creeping in?

Three layers. Names and locations are stripped from the evidence bundle before the evaluator sees it. We run a paired-candidate bias audit — same evidence, different surface signals — and require the score delta to stay within tolerance before any prompt change can ship. And the questions themselves are tied to operational competencies, not communication style or self-presentation. The scoring is on what the candidate did, not how they framed it.
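The paired-candidate check can be sketched as a gate: score each pair of evidence that differs only in surface signals, and block the prompt change if the largest score gap exceeds a tolerance. The scorers below are stand-ins invented for the example, and the tolerance value is an assumption.

```python
from typing import Callable, Sequence, Tuple


def paired_audit(
    score_fn: Callable[[str], float],
    pairs: Sequence[Tuple[str, str]],
    tolerance: float,
) -> Tuple[bool, float]:
    """Return (passed, worst_delta) across all evidence pairs."""
    deltas = [abs(score_fn(a) - score_fn(b)) for a, b in pairs]
    worst = max(deltas, default=0.0)
    return (worst <= tolerance, worst)


def content_scorer(evidence: str) -> float:
    """Stand-in scorer that only looks at what was said."""
    return 1.0 if "submitted apps" in evidence else 0.0


def style_sensitive_scorer(evidence: str) -> float:
    """Stand-in scorer that wrongly penalizes a filler word."""
    return content_scorer(evidence) - (0.2 if "um," in evidence else 0.0)


pairs = [
    ("conversion is measured against submitted apps",
     "um, conversion is measured against submitted apps"),
]
print(paired_audit(content_scorer, pairs, tolerance=0.05))
print(paired_audit(style_sensitive_scorer, pairs, tolerance=0.05)[0])
```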

Can I reproduce a score from six months ago?

Yes. Every evaluation persists the prompt version, model snapshot, seed, and evidence bundle. The same inputs through the same pinned model produce the same outputs. If a vendor releases a new model, your old scores are unaffected — they’re bound to the snapshot they were produced under.

Is the AI making the hiring decision?

No. The AI produces an evidence-backed assessment for each criterion. The hiring manager makes every advance / no-advance call. The platform never auto-rejects, auto-progresses, or hides a candidate from a human reviewer. The AI’s job is to make the manager’s job easier — not to replace it.

What if a simulation didn’t surface enough evidence?

That criterion is marked Not Assessed, plainly. We do not average missing evidence into the score, and we do not pretend confidence we don’t have. The evaluation view shows exactly which criteria were assessed, partially assessed, or not assessed, and lets the hiring manager re-run with a different simulation if they need more signal.

Want to see this on a real candidate?

We’ll walk you through a live evaluation — every observation, every citation, every override — for a role you’re hiring for right now.

Book a 30-min walkthrough