How to Interview LLM Application Engineers (Not ML Engineers)
The role you are actually hiring for
"LLM engineer" or "AI application engineer" is a role that barely existed three years ago and now sits in every product org. It is not the same job as machine learning engineer. The skills overlap maybe 30%.
A machine learning engineer trains and serves models. An LLM application engineer takes pre-trained models and turns them into product features: prompt design, evaluation harnesses, retrieval pipelines, tool-use scaffolding, latency and cost optimization, graceful failure when the model goes off-script. The most senior ones spend their time on the boring half of the work — evals, observability, guardrails — not on prompt cleverness.
If you assess these candidates with the ML engineer interview kit, you will hire researchers who cannot ship and reject shippers who never trained a model. If you assess them with the generic software engineer kit, you miss the entire AI-specific skill surface.
Five competencies worth testing
The job decomposes into five testable areas. Most loops can cover four out of five in three hours of candidate time.
- Prompt design and iteration. Can they write a prompt that reliably produces structured output, and can they iterate when it does not?
- Evaluation thinking. Can they design an eval that catches a regression that a vibes-check would miss?
- Failure-mode reasoning. When the model returns garbage, do they have a model of why and a plan to fix it?
- System integration. Can they wire the model into a real product surface — retrieval, tool calls, streaming, retries, caching — without it falling over in production?
- Cost and latency awareness. Do they know what their choices cost? Do they know what a 30-second wait does to user trust?
Cognitive ability and general software engineering still matter. Treat them as table stakes, not as the focus of the LLM-specific rounds.
A four-stage loop that works
This is the loop we recommend for a mid-to-senior LLM application engineer role.
Stage 1: Async prompt-engineering work sample (90 minutes)
Give the candidate a small dataset — 30 to 50 input-output pairs representing a realistic product task. Examples: extract structured fields from customer support emails, classify a support ticket into one of seven categories with a confidence score, generate a personalized email summary from a customer-event log.
Their task: write a prompt (or short prompt-and-code pipeline) that produces the desired output on the dataset, plus a one-paragraph note on where it still fails and why.
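As an illustration of what a "short prompt-and-code pipeline" might look like for the ticket-classification example, here is a minimal sketch. The category names, the JSON shape, and the function names are assumptions made for this example (the task only specifies seven categories and a confidence score), and the model call itself is left out.

```python
import json

# Hypothetical category names; the task only says "seven categories".
CATEGORIES = {
    "billing", "bug", "feature_request", "account",
    "shipping", "cancellation", "other",
}


def build_prompt(ticket_text: str) -> str:
    """Assemble a prompt that asks for a single JSON object and nothing else."""
    return (
        "Classify the support ticket into exactly one of these categories: "
        + ", ".join(sorted(CATEGORIES))
        + '.\nReply with only a JSON object like '
        + '{"category": "...", "confidence": 0.0}.\n\n'
        + "Ticket:\n"
        + ticket_text
    )


def parse_classification(raw: str) -> dict:
    """Validate one model response; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    confidence = data.get("confidence")
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return {"category": category, "confidence": float(confidence)}
```

A submission that validates output this strictly makes its own failures visible, which is usually what feeds the one-paragraph failure note.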
Allow any AI tool. This is a Policy B round; see our AI usage policy guide for the exact framing. The skill you are measuring is how well they use AI tools to build on top of a model, not whether they can do it unaided.
Grade on:
- Reliability of the output across the dataset
- Quality of the failure-mode analysis (the paragraph is more telling than the prompt itself)
- Whether they reached for evals, or just eyeballed the results
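One way to make the "reliability across the dataset" criterion concrete when grading is to score every submission with the same small harness. The sketch below assumes the candidate's pipeline can be wrapped as a function from input text to output text and that exact match is an acceptable check for the task; both are assumptions, and `score_dataset` is a name invented for this example.

```python
from typing import Callable


def score_dataset(
    predict: Callable[[str], str],
    dataset: list[tuple[str, str]],
) -> dict:
    """Run the pipeline over (input, expected) pairs and summarize the results."""
    failures = []
    for text, expected in dataset:
        try:
            got = predict(text)
        except Exception as exc:  # a crash counts as a failure, not an abort
            got = f"<error: {exc}>"
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
    total = len(dataset)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": failures,
    }
```

The `failures` list is the useful part: comparing it against the candidate's written failure analysis shows whether they found the same weak spots you did.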
Stage 2: Live evaluation design (45 minutes)
Show them the prompt from Stage 1 and a deliberately broken version of it. Ask them to design an evaluation that would catch the regression — not just verbally, but by writing the eval test code in a real editor.
This is the most discriminating round in the loop. Strong candidates will:
- Distinguish exact-match, semantic similarity, and rubric-graded eval styles, and pick the right one for the task
- Build in distinct positive cases, edge cases, and adversarial cases
- Talk about how the eval suite will run in CI, what the failure threshold should be, and how they will update it over time
Weak candidates will write three asserts that test the happy path and call it done.
Run this in a collaborative code editor so you can follow the candidate's thought process keystroke by keystroke. The pauses tell you a lot.
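For calibration, a strong answer in this round often has roughly the following shape: cases tagged by kind, a pass rate per kind, and a threshold that CI can enforce. This is a sketch under assumptions, not a reference solution. The case kinds mirror the list above, while the threshold values and all names (`EvalCase`, `pass_rates`, `regressions`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class EvalCase(NamedTuple):
    kind: str       # "positive", "edge", or "adversarial"
    text: str       # input handed to the system under test
    expected: str   # expected label (exact-match style eval)


# Hypothetical minimum pass rates per kind; a real suite would tune these.
THRESHOLDS = {"positive": 0.95, "edge": 0.80, "adversarial": 0.80}


def pass_rates(
    classify: Callable[[str], str],
    cases: Iterable[EvalCase],
) -> dict[str, float]:
    """Exact-match pass rate for each kind of case."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.kind] += 1
        if classify(case.text) == case.expected:
            passed[case.kind] += 1
    return {kind: passed[kind] / total[kind] for kind in total}


def regressions(
    rates: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Kinds whose pass rate is below threshold; a missing kind counts as 0."""
    return sorted(
        kind for kind, minimum in thresholds.items()
        if rates.get(kind, 0.0) < minimum
    )
```

In CI, the job would fail whenever `regressions(...)` returns a non-empty list. Reporting per kind matters because a broken prompt can keep a perfect happy-path score while failing every adversarial case.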
Stage 3: System design — building with the model (60 minutes)
Hand them a product brief: "Design the backend for a feature that lets a sales rep paste a customer's last 90 days of support tickets and get a one-paragraph 'should we be worried about churn' analysis they can share with the account manager."
Probe on:
- How they would structure the prompt (system, user, tool calls?)
- How they would handle the cost and latency profile (synchronous? async with a webhook? streamed?)
- How they would test it (golden examples, A/B against a baseline, human-in-the-loop?)
- How they would catch a regression after the next model upgrade
- What they would do if the model hallucinates a specific account name
The artifact is the conversation, not a diagram. See our system design rubric for scoring guidance.
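On the hallucinated-account probe, one answer worth listening for is a cheap post-generation check before the analysis is shown to the rep. The sketch below assumes the backend knows which account the pasted tickets belong to and has a list of known account names (for example from a CRM); `foreign_accounts` and its parameters are hypothetical. It only catches mix-ups with other real accounts, not names the model invents outright.

```python
import re


def foreign_accounts(
    summary: str,
    expected_account: str,
    known_accounts: list[str],
) -> list[str]:
    """Return known account names, other than the expected one, found in the summary."""
    found = []
    for name in known_accounts:
        if name.lower() == expected_account.lower():
            continue
        # Whole-name match only, so "Globex" does not match inside "Globexa".
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, summary, flags=re.IGNORECASE):
            found.append(name)
    return found
```

A non-empty result could trigger a retry or route the output to human review; which of the two is appropriate is part of the design conversation.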
Stage 4: Walk-through plus structured behavioral (45 minutes)
A single round split into two halves. First, 25 minutes walking through the Stage 1 work sample with the follow-up question template. Then 20 minutes of structured behavioral questions on collaboration with non-engineering stakeholders — product, design, support — since most failure modes in LLM features are caught by people who do not write code.
Question patterns that produce real signal
Three patterns worth stealing for any LLM application engineering round.
"Show me a prompt you wrote that broke in production. What did you change?" This is the single best question in the kit. Strong candidates have specific stories with specific fixes. Weak candidates talk about prompting in the abstract.
"What's the cheapest model you could use for this feature? Why aren't you using it?" Tests cost awareness and forces them to articulate the quality–cost trade-off they implicitly chose. The answer "I always reach for the biggest model" is concerning at any seniority level.
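A candidate with real cost awareness can do this arithmetic on the spot. As a sketch of the calculation, with prices that are made up for illustration (actual per-token prices vary by provider and change often) and hypothetical model names:

```python
# Hypothetical prices in dollars per million tokens: (input, output).
# Check your provider's current price list before using real numbers.
MODELS = {"small": (0.50, 1.50), "large": (5.00, 15.00)}


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


def monthly_cost(
    model: str,
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """Dollar cost per month for a fixed request size and volume."""
    input_price, output_price = MODELS[model]
    per_request = cost_per_request(
        input_tokens, output_tokens, input_price, output_price
    )
    return requests_per_month * per_request
```

With these made-up numbers, 10,000 requests a month at 20,000 input tokens and 200 output tokens each come to roughly $103 on the small model and $1,030 on the large one. The follow-up question is whether the quality gap justifies the difference.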
"How do you know your prompt is better than the previous version?" Tests whether they actually run evals or just trust vibes. A surprising fraction of candidates have shipped LLM features without ever measuring quality.
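A concrete answer to this question usually involves running both prompt versions on the same eval cases and comparing case by case, not just comparing two averages. A minimal sketch, assuming each version's results are recorded as one pass/fail boolean per case in the same order (`compare_versions` is a hypothetical helper):

```python
def compare_versions(old_results: list[bool], new_results: list[bool]) -> dict:
    """Paired comparison of two prompt versions over the same eval cases."""
    if len(old_results) != len(new_results):
        raise ValueError("both versions must be scored on the same cases")
    if not old_results:
        raise ValueError("no eval cases to compare")
    pairs = list(zip(old_results, new_results))
    total = len(pairs)
    return {
        "old_pass_rate": sum(old_results) / total,
        "new_pass_rate": sum(new_results) / total,
        # Cases the new prompt fixed, and cases it newly broke.
        "fixed": sum(1 for old, new in pairs if new and not old),
        "broken": sum(1 for old, new in pairs if old and not new),
    }
```

The `broken` count is the number to watch: a new prompt can raise the overall pass rate while still breaking cases that used to work, and on a 30-to-50-case dataset a small difference in pass rate may be noise.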
What does not work
A few approaches we have watched teams try and abandon.
- LeetCode-style algorithm interviews. Almost no LLM application work is algorithmic. The signal you get is uncorrelated with the job. See screen developers without LeetCode for the broader argument.
- "Build a chatbot from scratch in 90 minutes." Tests speed-typing, not engineering. The interesting decisions live in eval design and failure handling, not in wiring up an API call.
- Prompt-trivia screens. "What is temperature?" tests vocabulary, not judgment. Test the judgment directly.
- Math whiteboards on transformer internals. Useful if you are hiring a model researcher. Irrelevant for someone who will spend their week on retrieval pipelines and eval suites.
The integrity question for an AI-using role
This is the recurring question every team hits: if the candidate is allowed to use Claude or ChatGPT during the work sample, what stops them from just having the model do the work?
Three things, in combination:
- The Stage 4 walk-through. A candidate who cannot defend their own prompt, eval, or system design did not really do the work — whatever the artifact says.
- The Stage 2 live evaluation design. This round runs in a proctored editor with no chat tools, in 45 minutes. It is hard to fake.
- Code coherence analysis on the Stage 1 submission. AI-generated code often has structural tells, even when the candidate also used AI to disguise it.
The point is not to catch them red-handed. It is to make sure every candidate's score reflects what they can do in a real engineering week, with the tools they will actually use.
What to do next
If you are spinning up LLM hiring this quarter, the highest-leverage move is to stop reusing the senior backend engineer loop. Pick three of the five competencies above for your first version of the rubric, write one work sample and one eval-design exercise, and run it on two internal engineers to calibrate.
Then run it on five external candidates. By candidate three you will already know whether the loop is sharper than what you had before. The roles change. The discipline — pick methods that produce real signal, then verify the signal is the candidate's — does not.