Best Kubernetes Test for Hiring SRE and DevOps Engineers
Why standard Kubernetes tests fail
Most Kubernetes assessments ask: "What is a StatefulSet?" or "Explain taints and tolerations." These measure memorization, not whether a candidate can design a cluster that survives node failure, notice when an app is flapping between healthy and unhealthy, or debug why a deployment never stabilizes.
An SRE or DevOps engineer with deep Kubernetes experience knows that the real questions are about operational stability, cost, and debugging production problems under pressure.
Here's what a signal-bearing Kubernetes assessment looks like.
Core skills to assess
1. Cluster design and resilience
Can they design a cluster that survives zone failures?
Assessment question:
"You're running a production workload that must stay up during zone failures. You have a 3-node cluster in a single AZ. What do you change? What are the trade-offs?"
Good answer:
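- Spread nodes across at least 3 AZs
- Use PodDisruptionBudgets so that voluntary disruptions (node drains, upgrades) never take down too many replicas at once
- Use topology spread constraints (or pod anti-affinity keyed on zone) to avoid putting all replicas in one zone
- Monitor node health so that pods on failed nodes are detected and rescheduled quickly
- Understands the cost trade-off (cross-zone traffic charges, plus enough spare capacity for the surviving zones to absorb a lost zone's load)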
Weak answer:
- "Just run more replicas" (doesn't address the zone failure)
- "Use anti-affinity" (without understanding pod topology spread constraints)
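The good answer above can be sketched as a manifest. This is a minimal illustration, not a definitive configuration: the workload name `web`, the image, the replica count, and the `minAvailable` value are all assumptions chosen for the example.

```yaml
# Sketch only: names, image, and counts are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Keep the per-zone replica count within 1 of every other zone.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.4.2   # hypothetical image
---
# Limits voluntary disruptions (drains, upgrades); it does not protect
# against a zone outage by itself.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: web
```

A candidate who can explain when `DoNotSchedule` is too strict (pods stay Pending if a zone has no capacity) and when `ScheduleAnyway` is the better choice is showing the kind of trade-off thinking this question is after.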
2. Workload configuration and debugging
Can they configure a workload for production and debug it when it fails?
Assessment question:
"You deployed a new microservice. The deployment shows 3 replicas running, but only 2 are handling traffic. Readiness checks pass. Why might this be happening?"
Good answer considers (in order):
- Is the pod missing from the Service's endpoints, for example because its labels do not match the Service selector? (kubectl get endpoints)
- Is the pod network policy blocking traffic?
- Is the pod's network interface misconfigured?
- Is the load balancer health check different from the readiness check?
This tests systematic debugging, not Kubernetes facts.
3. Storage and state management
Do they understand when to use StatefulSets vs. Deployments? When to use node-local vs. persistent storage?
Assessment question:
"We run Elasticsearch on Kubernetes. Do we use StatefulSet or Deployment? What about storage?"
Good answer:
- StatefulSet for stable identity and ordering
- Persistent volumes for data durability
- Anti-affinity to spread across nodes
- Careful upgrade strategy (rolling restarts can cause quorum loss)
Weak answer:
- "Use Deployment, it's simpler"
- No mention of data persistence or quorum
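A strong answer often comes with a rough manifest. The sketch below shows the shape of one, under stated assumptions: the name `search`, the image, the mount path, and the 100Gi volume size are hypothetical, and a real Elasticsearch deployment needs considerably more configuration (discovery settings, resource sizing, or an operator).

```yaml
# Sketch only: name, image, mount path, and sizes are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: search
spec:
  clusterIP: None                # headless: gives each pod a stable DNS name
  selector:
    app: search
  ports:
    - name: transport
      port: 9300
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: search
spec:
  serviceName: search            # must reference the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: search
  template:
    metadata:
      labels:
        app: search
    spec:
      affinity:
        podAntiAffinity:         # never place two data pods on one node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: search
              topologyKey: kubernetes.io/hostname
      containers:
        - name: search
          image: registry.example.com/search-node:1.0.0   # hypothetical image
          ports:
            - name: transport
              containerPort: 9300
          volumeMounts:
            - name: data
              mountPath: /data   # hypothetical data path
  volumeClaimTemplates:          # one PersistentVolumeClaim per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

The points worth probing are the ones the manifest cannot show: how the candidate would sequence an upgrade so the cluster keeps quorum, and what happens to a pod's volume when its node disappears.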
4. Scaling and cost optimization
Do they know when to scale up vs. when to optimize workloads?
Assessment question:
"Your cluster CPU utilization is at 85%. Cost is $10k/month. You have two options: add more nodes, or optimize pod requests. What do you check first?"
Good answer:
- Check if pods are over-requesting resources (requests set well above actual usage)
- Check if there are noisy neighbor issues (one pod using way more than expected)
- Check if the workload naturally peaks (could shift traffic patterns instead)
- Only then add nodes if it's a real resource constraint
This tests operational judgment.
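One way to check for over-requesting, assuming the Vertical Pod Autoscaler is installed in the cluster (it is an add-on, not part of core Kubernetes), is to run it in recommendation-only mode. The target name `api-server` is a placeholder.

```yaml
# Sketch only: requires the VPA add-on; the target Deployment is hypothetical.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"            # only compute recommendations; never evict pods
```

With `updateMode: "Off"` the recommendations show up in the object's status and can be compared against the requests currently set, which gives the candidate data before anyone pays for more nodes.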
5. Observability and debugging production
Can they design monitoring for Kubernetes? Can they diagnose why an app is flaky?
Assessment question:
"Your application in Kubernetes has a 0.1% error rate spike that lasts 30 seconds, then recovers. It happens once per day at random times. How would you investigate?"
Good answer mentions:
- Pod restart events (check kubectl get events, then kubelet and node logs)
- Resource exhaustion (check for OOM kills and CPU throttling)
- Network timeouts (check service endpoints and DNS)
- Application-level retries (is the app recovering from transient errors?)
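Because the spike is short and arrives at random times, a good candidate will usually propose capturing evidence automatically rather than waiting to catch it live. A minimal sketch of Prometheus alerting rules follows, assuming Prometheus scrapes kube-state-metrics and cAdvisor; the group name, windows, and the 0.25 threshold are placeholders to tune.

```yaml
# Sketch only: assumes kube-state-metrics and cAdvisor metrics are scraped.
groups:
  - name: transient-error-investigation   # hypothetical group name
    rules:
      # Fires when any container restarted in the last 10 minutes.
      - alert: ContainerRestarted
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted"
      # Fires when more than a quarter of CPU periods were throttled.
      - alert: ContainerCpuThrottled
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        labels:
          severity: info
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is CPU throttled"
```

Neither rule uses a `for:` clause, since a 30-second event would clear before a longer pending period elapsed. The alerts are low severity by design: their job is to timestamp candidate causes so they can be lined up against the error spike.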
Assessment structure
Part 1: Take-home design exercise (2 hours)
Scenario:
"Design a production Kubernetes cluster for a SaaS application with the following constraints:
- 99.9% uptime SLA
- Stateless frontend and stateful backend (database)
- 50-500 concurrent users (variable load)
- Budget constraint: $5,000/month maximum
- Must survive single zone failure
- Scaling must be automatic"
The candidate should provide:
- Node architecture (how many nodes, instance types, zones)
- Workload configuration (deployments, resource requests, scaling policies)
- Storage strategy (persistent volumes, backups)
- Monitoring and alerting plan
- Cost breakdown
This tests whether they have designed production systems under real availability and budget constraints.
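For the "scaling must be automatic" constraint, a submission might include something like the following HorizontalPodAutoscaler. It is a sketch under assumptions: the Deployment name `frontend`, the replica bounds, and the 70% target are illustrative, and it relies on a metrics source such as metrics-server and on CPU requests being set on the pods.

```yaml
# Sketch only: target name, bounds, and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3                 # enough to keep one replica per zone
  maxReplicas: 12                # cap spend under the monthly budget
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization      # percentage of the pods' CPU requests
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping on variable load
```

Look for candidates who tie `maxReplicas` back to the cost breakdown and who note that pod autoscaling only helps if node capacity can grow as well.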
Part 2: Configuration review (30 minutes)
Give them a Kubernetes manifest with intentional issues:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 1  # Issue: single replica; should be 3 for HA
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: api:latest  # Issue: mutable tag; should pin a specific version
          resources:
            requests:
              cpu: 100m     # Issue: likely too low; little guaranteed CPU under contention
              memory: 64Mi  # Issue: likely too low; early eviction candidate under memory pressure
            # Issue: no memory limit
          # Issue: no readiness probe, no liveness probe
      # Issue: no pod anti-affinity or topology spread; all replicas could land on one node
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
# Discussion point: a headless Service and StatefulSet matter only if the API needs stable identity or ordering
```
Ask them to identify all issues and explain fixes. A strong candidate catches 8+ issues; a weak one catches 1–2 obvious ones.
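For calibration, one possible reference answer for the Deployment is sketched below. The specific values are assumptions rather than the only correct fix: the pinned tag `1.4.2`, the probe paths `/readyz` and `/healthz`, and the resource numbers are hypothetical and would come from the real application's behavior.

```yaml
# Sketch only: tag, probe paths, and resource values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:   # prefer one replica per node
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api
          image: api:1.4.2         # hypothetical pinned tag
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 256Mi        # memory limit equal to request
          readinessProbe:          # gates traffic from the Service
            httpGet:
              path: /readyz        # hypothetical endpoint
              port: 8080
            periodSeconds: 5
          livenessProbe:           # restarts a hung container
            httpGet:
              path: /healthz       # hypothetical endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
```

Candidates do not need to match this exactly. What matters is that each change comes with a reason, for example why the readiness and liveness probes should not share an endpoint that depends on a downstream service.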
Part 3: Troubleshooting conversation (30 minutes)
Scenario:
"You've deployed a new service to Kubernetes. Traffic is hitting it, but latency is 10x normal. Prometheus shows normal CPU and memory usage. Walk me through how you'd debug."
The candidate should ask:
- Is the service endpoint up to date? (kubectl get endpoints)
- Are pods receiving traffic evenly, or is one pod slow?
- Is the network path the issue (kube-proxy, CNI)?
- Is the application blocking on something (external API, database)?
- Did the deploy include a code change that's unoptimized?
This tests whether they think systematically.
Kubernetes assessment rubric
| Skill | Level 1 (Below) | Level 2 (Meets) | Level 3 (Exceeds) |
|---|---|---|---|
| Cluster design | Single-node or single-AZ cluster | Multi-AZ with HA considerations | Designs for zone failure with explicit recovery SLA |
| Workload configuration | No resource requests; no health checks | Appropriate requests; liveness and readiness probes | Sophisticated probes; graceful shutdown; disruption budgets |
| Debugging methodology | Guesses at solutions | Systematic approach; checks kubectl get/describe | Deep observability; can read kubelet logs |
| Scaling strategy | Manually scales or scales on CPU only | HPA with custom metrics | Predictive scaling; understands resource limits |
| Cost awareness | Ignores cost implications | Balances HA and cost | Optimizes resource allocation without sacrificing reliability |
What NOT to test
Don't ask candidates to:
- Write a Kubernetes Operator from scratch. (That's systems programming, not SRE.)
- Memorize API versions. (They'll use the docs in production anyway.)
- Set up a cluster from scratch during the interview. (That's an operations task, not an assessment of judgment.)
Do ask them to:
- Design a cluster for realistic constraints
- Debug why a deployment is failing
- Explain trade-offs between options
- Configure workloads for production
For teams not yet using Kubernetes
If you're hiring for DevOps or SRE but not using Kubernetes yet, test containerization and orchestration concepts without requiring Kubernetes expertise. Many teams over-weight Kubernetes knowledge when they really need systems thinking.
For teams using Kubernetes in production, use this assessment structure to evaluate operational depth before hiring. A candidate who passes this assessment is far more likely to become productive quickly.
Next: understand how to interpret the results of your DevOps assessment.