How to Write AI-Resistant Coding Interview Questions

ClarityHire Team (Editorial) | 7 min read

The premise has changed

By mid-2026, the interesting question is not whether the candidate has ChatGPT open. It is whether your interview measures anything once they do. Most teams reacted by piling on detection — keystroke biometrics, paste flags, code coherence scoring. Detection helps. Question design helps more. A problem an LLM solves in eight seconds tells you nothing about the human behind the keyboard, no matter how cleanly you flag the paste.

This is a field guide to writing coding interview questions that still produce signal when AI assistance is assumed.

What makes a question easy for an LLM

Three properties stack the deck against you:

  1. It appears in the training corpus. LeetCode, HackerRank, Codeforces, GitHub trending repos — all of it is in the weights. If your question has a Wikipedia-shaped name (Two Sum, Word Ladder, Longest Increasing Subsequence), assume the answer is memorized.
  2. The full spec is handed over upfront. A complete, unambiguous prompt is exactly the shape an LLM is trained to consume. The model needs nothing from the candidate.
  3. The output is closed-form. A function with one input, one output, one correct answer is trivially gradable — and trivially generatable.

LeetCode #1 hits all three. So does most of the question library you inherited.

Design principle 1: Hide the spec, surface the symptom

Instead of "implement a function that returns the longest substring without repeating characters," give the candidate a failing test case and a vague bug report: "Customers are seeing duplicate items in their cart after this endpoint runs. Here is one failing input. Make it stop."

The candidate now has to:

  • Read the existing code
  • Form a hypothesis about what is wrong
  • Decide what "correct" even means

The model can still help — but only after the candidate frames the problem. Framing is the skill you wanted to measure anyway.
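As a rough illustration of this format, the candidate might receive something like the following. Everything here (the `merge_cart` function, the item shape, the helper names) is hypothetical and invented for this sketch. The point is that the task ships as a symptom plus one failing input, not as a spec.

```python
# Hypothetical exercise handout: all names and data shapes are invented
# for illustration and do not come from any real codebase.

def merge_cart(existing, incoming):
    """Merge incoming line items into an existing cart.

    Intentional bug: items are matched on the whole dict, so the same SKU
    with a different quantity is appended as a second line.
    """
    merged = list(existing)
    for item in incoming:
        if item not in merged:
            merged.append(item)
    return merged


def failing_input():
    """The single failing input handed over with the bug report."""
    existing = [{"sku": "A-100", "qty": 1}]
    incoming = [{"sku": "A-100", "qty": 2}]
    return merge_cart(existing, incoming)


def has_duplicate_skus(cart):
    """True when any SKU appears on more than one line of the cart."""
    skus = [item["sku"] for item in cart]
    return len(skus) != len(set(skus))
```

Note what the handout leaves open: whether the merged quantity should be 2 or 3 is something the candidate has to ask about, which is the framing step described above.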

Design principle 2: Hand over the codebase, not the prompt

Replace standalone puzzles with a small repository. 100–300 lines, multiple files, one or two intentional bugs, a sparse README. Ask: "Find the bug that causes the 500 on POST /orders, and fix it."

A model that has not seen your repo is no better at this than the candidate. It will guess plausibly and confidently — and be wrong — because the answer is in the surrounding code, not the prompt.

This format also doubles as a LeetCode-free interview for teams that want to measure real engineering rather than puzzle speed.

Design principle 3: Build in a mid-question pivot

Halfway through, change the requirement. "Now the customer wants to support partial refunds — extend your solution." Or: "This endpoint now needs to be idempotent. What changes?"

The pivot does three things:

  • Forces the candidate to demonstrate they understand their own code, which a fresh paste from a chat window cannot do.
  • Surfaces architecture instincts that a one-shot solution hides.
  • Makes prompting a model mid-interview obvious: the candidate either stops, asks for a few minutes, or produces a second clean paste that looks nothing like the first. Each of these becomes signal in the integrity report.
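To make the idempotency pivot concrete, here is a minimal sketch of one common answer: the client sends an idempotency key, and the server replays the stored result when it sees the same key again. The `OrderService` class, its in-memory storage, and the key format are assumptions made for illustration, not a reference design.

```python
import itertools


class OrderService:
    """Toy order endpoint; storage is an in-memory dict for illustration only."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_key = {}   # idempotency key -> order returned the first time
        self.orders = []

    def create_order(self, idempotency_key, payload):
        # A retried request with the same key gets the original result back
        # instead of creating a second order.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        order = {"id": next(self._next_id), **payload}
        self.orders.append(order)
        self._by_key[idempotency_key] = order
        return order
```

Useful follow-up probes: what should happen when the same key arrives with a different payload, and how long keys should be retained. The sketch deliberately leaves both unanswered.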

Design principle 4: Ask for judgment, not output

Some of the highest-signal questions have no code to write at all:

  • Code review exercises. Hand over a 60-line diff. Ask the candidate to leave comments as if they were the reviewer on a PR. LLMs produce earnest, generic code review ("consider adding a comment here"). Senior engineers produce specific, prioritized critique ("this lock is held across the network call — that will deadlock under load").
  • Tradeoff conversations. "You have two options for this cache invalidation. Which would you pick and why?" The candidate has to defend a choice. The model has no skin in the game and will hedge.
  • Reading an unfamiliar PR. "Walk me through what this change does and what could go wrong."

These cluster better with system design rubrics than with algorithm scorecards. Grade them on the reasoning, not the verdict.
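For a sense of what a review exercise can hide, the sketch below plants the lock-across-a-network-call problem mentioned above and shows one possible revision. `PriceCache` and the injected `fetch` callable (standing in for a remote call) are hypothetical names invented for this example.

```python
import threading


class PriceCache:
    """Hypothetical class under review; `fetch` stands in for a network call."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._prices = {}

    def get_flagged(self, sku):
        # What a reviewer should flag: the lock is held for the whole remote
        # call, so every other caller waits on the network, not just the dict.
        with self._lock:
            if sku not in self._prices:
                self._prices[sku] = self._fetch(sku)
            return self._prices[sku]

    def get_revised(self, sku):
        # Check under the lock, fetch outside it, then store under the lock.
        with self._lock:
            if sku in self._prices:
                return self._prices[sku]
        price = self._fetch(sku)  # two racing callers may both fetch the same sku
        with self._lock:
            return self._prices.setdefault(sku, price)
```

The revision trades a possible duplicate fetch for not blocking other callers. Whether a candidate notices and defends that tradeoff is the kind of reasoning worth grading.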

Design principle 5: Novel constraints over novel problems

You do not need to invent a new problem class. You just need an unusual constraint on a familiar one.

  • "Implement a rate limiter, but the clock can run backwards by up to 30 seconds because of NTP corrections."
  • "Build a queue, but enqueue must be O(1) worst case, not amortized, because we run this in a hard-real-time control loop."
  • "Parse this config file, but the spec evolves. Show me how you would handle backwards compatibility three versions from now."

LLMs can solve these. They will often miss the constraint. The candidate who catches it, asks about it, and designs around it is the candidate you want.
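As one possible shape for the first constraint, here is a minimal token-bucket sketch that tolerates a clock stepping backwards. It is an illustration under stated assumptions, not a reference answer: the `TokenBucket` name, the injected `clock` callable, and the choice to pause refills until the clock catches up are all choices made for this example.

```python
class TokenBucket:
    """Token bucket that never refills twice for the same span of time."""

    def __init__(self, rate_per_sec, capacity, clock):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock            # injected so tests can step time backwards
        self._tokens = float(capacity)
        self._last = clock()           # high-water mark of observed time

    def allow(self):
        now = self._clock()
        # If the clock stepped backwards, elapsed is clamped to zero and the
        # high-water mark is kept, so no interval is ever counted twice.
        elapsed = max(0.0, now - self._last)
        self._last = max(self._last, now)
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The cost of this design is that refills pause until the clock passes its previous high-water mark. A candidate who asks whether a monotonic clock (for example Python's `time.monotonic()`) is available has spotted the constraint and is already negotiating it.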

What not to do

A few common reactions that backfire:

  • Banning AI without changing the question. You will lose top candidates who already use AI well, keep the cheaters who do not care about your rules, and learn nothing new.
  • Making problems gratuitously hard to "out-pace" the model. Adversarial difficulty selects for people willing to grind under pressure, not for people who can do the job. The pass rate collapses and the funnel seizes up.
  • Switching everything to whiteboard / no-laptop interviews. This is a real option, but it has its own tradeoffs — accessibility, candidate stress, and the fact that no engineer codes that way in their actual job.
  • Trusting "AI-resistant" labels from question vendors. Some vendors verify that a model such as GPT-4 scores below a threshold on a question today. Tomorrow's model may score differently on the same question. Treat the label as a starting point, not a guarantee.

Pair question design with the measurement layer

Even a well-designed question benefits from an integrity layer that gives you a second axis of signal. ClarityHire's code coherence AI reads the trajectory of a submission, not just the final code — did the candidate iterate, test, and refactor, or did the solution arrive in one large coherent paste? Combined with keystroke timing and paste/tab events, you get a confidence score that pairs with the design choices above instead of replacing them.
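To give a feel for what a trajectory signal might look like, here is a deliberately crude toy heuristic: the share of all entered characters that arrived in the single largest paste. This is not ClarityHire's algorithm. The event format and the `largest_paste_share` function are assumptions invented for this sketch, and a real system would weigh many more signals.

```python
def largest_paste_share(events):
    """Return the fraction of entered characters from the largest single paste.

    `events` is a hypothetical log: a list of (kind, n_chars) tuples,
    where kind is either "type" or "paste".
    """
    total = sum(n for _, n in events)
    if total == 0:
        return 0.0
    largest_paste = max((n for kind, n in events if kind == "paste"), default=0)
    return largest_paste / total
```

A value near 1.0 means almost everything arrived in one paste, and a value near 0.0 means the code was mostly typed. On its own the number proves nothing, which is why it is best read alongside the question design above.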

The strongest pattern we see: an AI-resistant question for the async stage, followed by a 30-minute live follow-up walk-through where the candidate has to explain decisions they made hours earlier. The combination is significantly harder to fake than either step alone.

What to do next

Pick the three questions you assign most often and grade them against the principles above:

  1. Does each question hide some part of the spec the candidate must elicit?
  2. Does the question require reading code the candidate has not seen before?
  3. Is there a pivot or judgment call that an off-the-shelf model cannot easily own?

For each question that scores poorly, rewrite it once and pilot the new version on five real candidates before rolling it out. Measure two things: the average submission quality from your top quartile, and the pass-through rate to the next round. If quality goes up and pass-through stays steady, you have a better question. If pass-through collapses, the new version is adversarial, not selective — revise and try again.
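The two pilot measurements can be computed with a few lines. The sketch below assumes each pilot result is recorded as a numeric rubric score plus a flag for whether the candidate advanced. The `pilot_metrics` name and the tuple format are illustrative, not part of any tool.

```python
from statistics import mean


def pilot_metrics(results):
    """Return (top-quartile mean quality, pass-through rate).

    `results` is a non-empty list of (quality_score, advanced_to_next_round)
    tuples; the score scale is whatever rubric the team already uses.
    """
    scores = sorted((score for score, _ in results), reverse=True)
    top_n = max(1, len(scores) // 4)   # top quartile, at least one candidate
    top_quartile_quality = mean(scores[:top_n])
    pass_through = sum(1 for _, advanced in results if advanced) / len(results)
    return top_quartile_quality, pass_through
```

Run it on results from the old and new versions of a question and compare the pairs. With a pilot of only five candidates, the top quartile is a single person, so treat the comparison as directional.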

AI in coding interviews is not the problem. AI-shaped questions are. Fix the questions and the rest of the integrity stack starts doing what it was always supposed to do: confirm the signal, not generate it.
