
How to Evaluate AI-Assisted Coding Ability in Technical Interviews

ClarityHire Team (Editorial) · 2026-06-20 · 7 min read

The interview question has changed

For two years the loudest question in technical hiring was "Did the candidate use ChatGPT?" That was the wrong question. The real one is "How well do they use it?" By mid-2026 every engineer you hire will spend half their day with an AI assistant open. Selecting against AI use is selecting against the work.

Meta, Shopify, and Google moved their loops first: AI is allowed, sometimes required, and the rubric grades how the candidate prompts, verifies, and recovers. If your interview is still scored on whether the solution compiles cleanly in 30 minutes, you are measuring a skill that no longer maps to the job.

This is the rubric and format we recommend for evaluating AI-assisted coding ability — and how to keep the signal honest with the integrity tools you already have.

Why the old "no AI" interview fails

Three reasons it stopped working:

The on-the-job task is AI-assisted. Your senior engineers ship code paired with Copilot or Claude. Hiring on the unpaired version selects for a narrower skill than the role actually needs.
Detection is asymmetric. Even with code coherence analysis and keystroke biometrics, disciplined cheaters slip through and panicked good candidates get flagged. The cost-benefit only works for high-stakes roles.
Strong candidates self-select out. A senior who works with AI all day is not going to pretend they don't for your 45-minute screen. They will take an offer at a company that lets them work the way they work.

The fix is not to drop integrity — keystroke and code-coherence signals still matter for spotting impersonation. It is to change what you are grading from "wrote it alone" to "drove the AI well."

The four-dimension rubric

Score each dimension 1–5 with the anchors below. Drop them into your existing interview scorecard: at senior level they replace the "code quality" axis, and at junior level they sit alongside it.

1. Prompt quality

What you are measuring: can they get the model to do useful work without a long back-and-forth?

5 — Prompts include the goal, the constraints, the relevant code or schema, and the desired output format. A single prompt gets a usable response.
3 — Prompts get there in 2–3 rounds. Some unnecessary scope or missing context.
1 — One-liners like "fix this." Hopes the model guesses what they mean.

2. Output verification

What you are measuring: do they trust the model or do they check?

5 — Runs the generated code against test cases before reading it. Spots a missing edge case, a subtly wrong default, a deprecated API. Writes one more test to confirm.
3 — Reads the code carefully but mostly trusts it runs. Catches obvious bugs only.
1 — Pastes and submits. When asked "are you sure?" cannot tell you why.

3. Recovery from bad generations

What you are measuring: when the model goes off the rails, can they steer it back?

5 — Recognizes the failure pattern (hallucinated method, wrong framework, made-up env var) within seconds. Reframes the prompt with the missing context or drops to writing the fix themselves.
3 — Notices after a run fails. Tries another prompt. Sometimes gets there.
1 — Stuck in a loop. Keeps re-prompting the same way and pasting the same broken code back.

4. Communication

What you are measuring: can they explain what they asked the AI, what they got, and why they kept or discarded it?

5 — Narrates the choice live: "I gave it the schema and asked for a query that handles the soft-delete column — it returned a join I don't need so I'm trimming it." Owns the result.
3 — Explains after the fact. Mostly accurate.
1 — Cannot articulate why the code looks the way it does. The AI wrote it; they shipped it.

These four axes correlate with on-the-job AI productivity far better than the legacy "writes correct code in 25 minutes" score.
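If your scorecard lives in a spreadsheet or an internal tool, the rubric above could be captured along these lines. This is a minimal sketch, not a prescribed schema: the names AIRubricScore, AI_AXES, and scorecard_axes, and the "code_quality" key, are illustrative assumptions.

```python
from dataclasses import dataclass

# Assumed axis keys for the four dimensions described above.
AI_AXES = ("prompt_quality", "output_verification", "recovery", "communication")


@dataclass(frozen=True)
class AIRubricScore:
    """One interviewer's 1-5 rating on each of the four axes."""
    prompt_quality: int
    output_verification: int
    recovery: int
    communication: int

    def __post_init__(self):
        # Reject anything outside the 1-5 anchor scale.
        for axis in AI_AXES:
            value = getattr(self, axis)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")

    def as_dict(self):
        return {axis: getattr(self, axis) for axis in AI_AXES}


def scorecard_axes(level, existing_axes):
    """Return the axes to score for a role level.

    At senior level the AI axes replace "code_quality";
    at junior level they sit alongside it.
    """
    if level == "senior":
        kept = [a for a in existing_axes if a != "code_quality"]
    elif level == "junior":
        kept = list(existing_axes)
    else:
        raise ValueError(f"unknown level: {level!r}")
    return kept + list(AI_AXES)
```

Validating the 1–5 range at construction time keeps a mistyped score from quietly skewing a later comparison between interviewers.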

The interview format that surfaces these signals

A 60-minute slot built around AI-assisted work, not against it:

0–5 min: brief. Hand the candidate a small ambiguous problem — extend an existing 200-line repo with a new feature, or debug a subtle failure in real code. Tell them an AI assistant is encouraged and the prompts they use are part of the evaluation.
5–45 min: build with AI. Watch them work in a collaborative code editor with their AI assistant of choice. Do not interrupt for the first 10 minutes. Take notes per rubric axis.
45–55 min: defend the result. Run the code together. Ask them to walk through one function they accepted from the AI and one they rewrote. Why each? What was the model wrong about?
55–60 min: extension. "If the input were 100x larger, how would your AI's solution break?" The candidate who actually understood the code will name the problem in 30 seconds.
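If you keep interview templates as data, the four phases above might be written down as follows. This is a sketch under assumptions: Phase, validate_schedule, current_phase, and may_interrupt are hypothetical names, and the quiet window reads "the first 10 minutes" as the first 10 minutes of the build phase (minutes 5 to 15).

```python
from typing import NamedTuple


class Phase(NamedTuple):
    start_min: int
    end_min: int
    name: str


# The 60-minute slot described above.
AI_INTERVIEW_PHASES = [
    Phase(0, 5, "brief"),
    Phase(5, 45, "build with AI"),
    Phase(45, 55, "defend the result"),
    Phase(55, 60, "extension"),
]

# Assumption: the no-interruption window covers the first 10 minutes of the build phase.
QUIET_START_MIN, QUIET_END_MIN = 5, 15


def validate_schedule(phases, slot_minutes=60):
    """Check that phases start at 0, are contiguous, and fill the slot exactly."""
    expected_start = 0
    for phase in phases:
        if phase.start_min != expected_start:
            raise ValueError(f"{phase.name!r} should start at minute {expected_start}")
        if phase.end_min <= phase.start_min:
            raise ValueError(f"{phase.name!r} has no duration")
        expected_start = phase.end_min
    if expected_start != slot_minutes:
        raise ValueError(f"phases end at minute {expected_start}, not {slot_minutes}")
    return True


def current_phase(phases, minute):
    """Return the name of the phase covering a given minute, or None if outside the slot."""
    for phase in phases:
        if phase.start_min <= minute < phase.end_min:
            return phase.name
    return None


def may_interrupt(minute):
    """False during the quiet window at the start of the build phase."""
    return not (QUIET_START_MIN <= minute < QUIET_END_MIN)
```

A check like validate_schedule is mainly useful when someone trims or stretches a phase for a different role and needs the slot to still add up.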

A small twist on a LeetCode-free format: rather than asking the candidate to extend the feature without AI, ask them to do it with AI and grade the collaboration.

Questions that pull AI fluency to the surface

Use these mid-interview, not at the end:

"Show me the prompt you would write for this."
"What is one thing the model just told you that you don't trust? Why?"
"If your AI was offline, how would you solve this differently?"
"Walk me through this function — what did it generate vs. what did you change?"

The last one is the highest-signal question in the loop. A candidate who blanks on a function they "wrote" two minutes ago is shipping AI output they don't understand. That is the failure mode you are hiring against.

Common failure modes and what they tell you

Prompt-and-paste, never run. Low verification score. Will ship broken code in production.
Endless re-prompting on the same broken output. Low recovery score. Cannot debug their tools.
Slick verbal explanation, can't answer a follow-up. Likely pre-rehearsed with the AI before the call. Push harder on the "what did it generate vs. what did you change?" question, similar to the technique for grilling a take-home submission.
Refuses to use the AI at all. Not a fail in itself — but in 2026 it is worth asking how they ship at their current job.

Keep integrity signals on

Allowing AI is not the same as turning off integrity verification. Keep keystroke event capture and face-presence checks on so you still catch impersonation and screen handoff. What changes is what gets flagged: a burst paste from ChatGPT into the editor is no longer a red flag — it is a data point about prompt quality. The integrity report becomes evidence of how the candidate worked, not whether they cheated.
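One way to picture the change in what gets flagged is a triage step over the integrity events. This is only a sketch: the event kinds ("paste_burst", "face_absent", "keystroke_profile_mismatch") and the triage function are hypothetical, and a real integrity tool will use its own event names and report format.

```python
from dataclasses import dataclass

# Hypothetical event kinds; substitute whatever your integrity tool emits.
IMPERSONATION_KINDS = {"face_absent", "keystroke_profile_mismatch"}
WORKFLOW_KINDS = {"paste_burst"}


@dataclass(frozen=True)
class IntegrityEvent:
    kind: str
    minute: int


def triage(events, ai_allowed=True):
    """Split events into (flags, evidence).

    Impersonation signals are always flags. When AI use is allowed,
    a paste burst becomes evidence of how the candidate worked rather
    than a red flag. Unknown kinds are flagged for human review.
    """
    flags, evidence = [], []
    for event in events:
        if event.kind in IMPERSONATION_KINDS:
            flags.append(event)
        elif event.kind in WORKFLOW_KINDS and ai_allowed:
            evidence.append(event)
        else:
            flags.append(event)
    return flags, evidence
```

Sending unknown event kinds to the flag list is a deliberately conservative choice: a new signal gets a human look before anyone decides it is harmless.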

What to do next

Three changes pay off this quarter:

Add the four-dimension AI rubric to your scorecard for one role. Run it on five candidates and compare with old scores.
Rewrite one technical interview prompt to assume AI use. Keep the structured interview format — only the rubric changes.
Calibrate the panel. Watch one recorded session together and rate each axis independently before discussing. Disagreement on the communication axis is usually where you find your strongest signal.
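For the calibration step, a small script can show where the panel disagrees before the discussion starts. This is a rough sketch: the input shape (one dict of axis scores per rater) and the names axis_spread and most_disputed_axis are assumptions, and spread is measured simply as highest minus lowest score.

```python
# Assumed axis keys for the four rubric dimensions.
AXES = ("prompt_quality", "output_verification", "recovery", "communication")


def axis_spread(ratings):
    """Return {axis: highest score - lowest score} across raters.

    ratings maps a rater name to that rater's {axis: score} dict,
    each filled in independently before the panel discusses.
    """
    spread = {}
    for axis in AXES:
        scores = [rater_scores[axis] for rater_scores in ratings.values()]
        spread[axis] = max(scores) - min(scores)
    return spread


def most_disputed_axis(ratings):
    """Return the axis with the largest spread (ties go to the earlier axis in AXES)."""
    spread = axis_spread(ratings)
    return max(AXES, key=lambda axis: spread[axis])


# Illustrative, made-up ratings for one recorded session.
panel = {
    "rater_a": {"prompt_quality": 4, "output_verification": 4, "recovery": 3, "communication": 5},
    "rater_b": {"prompt_quality": 4, "output_verification": 3, "recovery": 3, "communication": 2},
    "rater_c": {"prompt_quality": 5, "output_verification": 4, "recovery": 3, "communication": 4},
}
print(axis_spread(panel))
print(most_disputed_axis(panel))
```

Start the discussion on the axis with the widest spread, since that is where the panel's reading of the anchors differs most.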

Hire for the work the role actually does. In 2026 that work is AI-assisted; your interview should be too.
