
Best Kubernetes Test for Hiring SRE and DevOps Engineers

ClarityHire Team (Editorial) · 2026-05-09 · 6 min read

Why standard Kubernetes tests fail

Most Kubernetes assessments ask: "What is a StatefulSet?" or "Explain taints and tolerations." These measure memorization, not whether someone can design a cluster that survives node failure, notice when an application is flapping between healthy and unhealthy, or debug a deployment that never stabilizes.

An SRE or DevOps engineer with deep Kubernetes experience knows that the real questions are about operational stability, cost, and debugging production problems under pressure.

Here's what a signal-bearing Kubernetes assessment looks like.

Core skills to assess

1. Cluster design and resilience

Can they design a cluster that survives zone failures?

Assessment question:

"You're running a production workload that must stay up during zone failures. You have a 3-node cluster in a single AZ. What do you change? What are the trade-offs?"

Good answer:

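Spread nodes across at least 3 AZs
Use pod topology spread constraints (or pod anti-affinity keyed on the zone label) so replicas don't all land in one zone
Use PodDisruptionBudgets so voluntary disruptions such as node drains and upgrades can't take down too many replicas at once
Monitor node health and make sure failed nodes are replaced automatically
Understands the cost trade-off (spare capacity to absorb a lost zone, plus cross-zone traffic charges)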
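
As a rough illustration of what a strong candidate might sketch for the replica-spreading and disruption points, here is a minimal example. The Deployment name `web`, its labels, the image, and the replica counts are all hypothetical; only the `topologySpreadConstraints` and `PodDisruptionBudget` fields are the point.

```yaml
# Hypothetical workload: names, image, and counts are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                 # zones may differ by at most one pod
        topologyKey: topology.kubernetes.io/zone   # well-known zone label on nodes
        whenUnsatisfiable: DoNotSchedule           # refuse to pile pods into one zone
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: registry.example.com/web:1.4.2      # placeholder image
---
# Limits voluntary disruptions (drains, upgrades); it does not prevent a zone outage.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: web
```

With 6 replicas over 3 zones, losing one zone leaves 4 pods, which is why `minAvailable: 4` was picked here. A good follow-up question is what the candidate would change if the cluster only spanned 2 zones.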

Weak answer:

"Just run more replicas" (doesn't address the zone failure)
"Use anti-affinity" (without understanding pod topology spread constraints)

2. Workload configuration and debugging

Can they configure a workload for production and debug it when it fails?

Assessment question:

"You deployed a new microservice. The deployment shows 3 replicas running, but only 2 are handling traffic. Readiness checks pass. Why might this be happening?"

Good answer considers (in order):

Is the pod missing from the Service's endpoints, for example because its labels don't match the Service selector? (kubectl get endpoints)
Is a NetworkPolicy blocking traffic to the pod?
Is the pod's network interface misconfigured?
Is the load balancer health check different from the readiness check?

This tests systematic debugging, not Kubernetes facts.
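
To make the first check concrete, here is one way it could turn out. This is a made-up case, not a claim about the most common cause: a pod is Running and Ready but carries a label that no longer matches the Service selector, so it never appears in the endpoints list. All names, labels, and the image are hypothetical.

```yaml
# Hypothetical names and labels; only the selector/label relationship matters.
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
    track: stable          # every key/value here must be present on the pod
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: orders-7d9c5-x2k4q
  labels:
    app: orders
    track: canary          # does not match "stable", so this pod gets no endpoint
spec:
  containers:
  - name: orders
    image: registry.example.com/orders:2.3.0   # placeholder image
    ports:
    - containerPort: 8080
```

A candidate who compares the Service selector against each pod's labels, rather than only checking that the pod is Ready, is showing the systematic habit this question looks for.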

3. Storage and state management

Do they understand when to use StatefulSets vs. Deployments? When to use node-local vs. persistent storage?

Assessment question:

"We run Elasticsearch on Kubernetes. Do we use StatefulSet or Deployment? What about storage?"

Good answer:

StatefulSet for stable identity and ordering
Persistent volumes for data durability
Anti-affinity to spread across nodes
Careful upgrade strategy (rolling restarts can cause quorum loss)
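
A minimal sketch of the shape of that answer follows. It assumes a 3-node Elasticsearch cluster, uses hypothetical names (`es`, `data`), an illustrative image tag and volume size, and leaves out discovery settings, resource requests, and security configuration, so it is not a working Elasticsearch deployment.

```yaml
# Headless Service: gives each StatefulSet pod a stable DNS name (es-0.es, es-1.es, ...).
apiVersion: v1
kind: Service
metadata:
  name: es
spec:
  clusterIP: None
  selector:
    app: es
  ports:
  - name: transport
    port: 9300
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es
spec:
  serviceName: es
  replicas: 3
  updateStrategy:
    type: RollingUpdate            # pods restart one at a time, highest ordinal first
  selector:
    matchLabels:
      app: es
  template:
    metadata:
      labels:
        app: es
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: es
            topologyKey: kubernetes.io/hostname   # at most one es pod per node
      containers:
      - name: es
        image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4   # illustrative tag
        ports:
        - name: transport
          containerPort: 9300
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:            # one PersistentVolumeClaim per pod, kept across restarts
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi           # illustrative size; uses the default StorageClass
```

The manifest alone does not protect quorum during upgrades. Listen for whether the candidate also talks about waiting for cluster health between pod restarts.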

Weak answer:

"Use Deployment, it's simpler"
No mention of data persistence or quorum

4. Scaling and cost optimization

Do they know when to scale up vs. when to optimize workloads?

Assessment question:

"Your cluster CPU utilization is at 85%. Cost is $10k/month. You have two options: add more nodes, or optimize pod requests. What do you check first?"

Good answer:

Check whether the 85% is actual usage or just requested CPU; if pods request far more than they use, right-size the requests first
Check if there are noisy neighbor issues (one pod using way more than expected)
Check if the workload naturally peaks (could autoscale or shift traffic patterns instead)
Only then add nodes if it's a real resource constraint

This tests operational judgment.
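
One way a candidate might gather evidence about over-requesting is sketched below. It assumes the Vertical Pod Autoscaler add-on is installed, which is not part of core Kubernetes, and it targets a hypothetical Deployment named `api`. With `updateMode: "Off"` the autoscaler only publishes recommendations and never changes running pods.

```yaml
# Assumes the VPA add-on (autoscaling.k8s.io) is installed in the cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-recommendations
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment to analyze
  updatePolicy:
    updateMode: "Off"        # recommend only; do not evict or resize pods
```

Comparing the recommended requests with what the manifests currently ask for gives a data-backed answer to "optimize or add nodes" instead of a guess.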

5. Observability and debugging production

Can they design monitoring for Kubernetes? Can they diagnose why an app is flaky?

Assessment question:

"Your application in Kubernetes has a 0.1% error rate spike that lasts 30 seconds, then recovers. It happens once per day at random times. How would you investigate?"

Good answer mentions:

Pod restart events (check node logs and kubelet)
Resource exhaustion (check memory and CPU)
Network timeouts (check service endpoints and DNS)
Application-level retries (is the app recovering from transient errors?)
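
To catch a 30-second blip that happens once a day, the signals above have to be recorded continuously rather than checked by hand. The following Prometheus rule file is a sketch under assumptions: it presumes kube-state-metrics and cAdvisor metrics are being scraped, and the thresholds, windows, and alert names are illustrative.

```yaml
# Assumes kube-state-metrics and kubelet/cAdvisor metrics are scraped by Prometheus.
groups:
- name: short-lived-incident-signals
  rules:
  - alert: ContainerRestarted
    # Any container restart in the last 10 minutes (kube-state-metrics).
    expr: increase(kube_pod_container_status_restarts_total[10m]) > 0
    labels:
      severity: info
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted"
  - alert: ContainerCPUThrottled
    # Share of CFS periods in which the container was throttled (cAdvisor).
    expr: >-
      rate(container_cpu_cfs_throttled_periods_total[5m])
      / rate(container_cpu_cfs_periods_total[5m]) > 0.25
    labels:
      severity: info
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is CPU throttled"
```

Lining these alerts up against the timestamps of the error spikes is the kind of correlation step a strong answer describes.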

Assessment structure

Part 1: Take-home design exercise (2 hours)

Scenario:

"Design a production Kubernetes cluster for a SaaS application with the following constraints:

99.9% uptime SLA
Stateless frontend and stateful backend (database)
50-500 concurrent users (variable load)
Budget constraint: $5,000/month maximum
Must survive single zone failure
Scaling must be automatic"

The candidate should provide:

Node architecture (how many nodes, instance types, zones)
Workload configuration (deployments, resource requests, scaling policies)
Storage strategy (persistent volumes, backups)
Monitoring and alerting plan
Cost breakdown

This tests whether they can turn availability, budget, and scaling constraints into a concrete, costed design.
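
For the "scaling must be automatic" constraint, a submission might include something like the following. It is only a sketch: the Deployment name `frontend`, the replica bounds, and the 70% target are assumptions, not recommended values.

```yaml
# Hypothetical autoscaling policy for the stateless frontend.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3                 # at least one pod per zone when spread over 3 zones
  maxReplicas: 12                # upper bound keeps spend predictable
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of requested CPU, so requests must be set
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping when load drops briefly
```

Strong submissions tie `maxReplicas` back to the cost breakdown and explain how node capacity grows when the autoscaler asks for more pods than the cluster can hold.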

Part 2: Configuration review (30 minutes)

Give them a Kubernetes manifest with intentional issues:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 1  # Issue: should be 3 for HA
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: api:latest  # Issue: mutable tag; pin a version tag or image digest
        resources:
          requests:
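            cpu: 100m  # Issue: likely too low; pod gets a small CPU share under contention
            memory: 64Mi  # Issue: likely too low; early eviction candidate under node memory pressure
          # Issue: no limits; memory use is unbounded
      # Issue: no readiness probe, no liveness probe
      # Issue: no pod anti-affinity or topology spread; replicas could share one node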
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 8080
  # Distractor, not an issue: a stateless API needs neither a headless Service nor a StatefulSet

Ask them to identify all issues and explain fixes. A strong candidate catches 8+ issues; a weak one catches 1–2 obvious ones.
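
For calibration, here is one possible corrected Deployment that addresses the replica count, image tag, resources, probes, and placement. The version tag, resource numbers, and probe paths (`/readyz`, `/healthz`) are assumptions for illustration; real values depend on the application.

```yaml
# One possible fix; tag, resource sizes, and probe paths are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname   # spread replicas across nodes
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-server
      containers:
      - name: api
        image: api:1.8.0                      # pinned tag instead of latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            memory: 512Mi                     # bound memory; CPU limit left unset on purpose
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
```

Candidates do not need to produce this exact manifest. What matters is that each fix comes with a reason, for example why a memory limit is set while a CPU limit is deliberately omitted.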

Part 3: Troubleshooting conversation (30 minutes)

Scenario:

"You've deployed a new service to Kubernetes. Traffic is hitting it, but latency is 10x normal. Prometheus shows normal CPU and memory usage. Walk me through how you'd debug."

The candidate should ask:

Is the service endpoint up to date? (kubectl get endpoints)
Are pods receiving traffic evenly, or is one pod slow?
Is the network path the issue (kube-proxy, CNI)?
Is the application blocking on something (external API, database)?
Did the deploy include a code change that's unoptimized?

This tests whether they think systematically.

Kubernetes assessment rubric

Skill	Level 1 (Below)	Level 2 (Meets)	Level 3 (Exceeds)
Cluster design	Single-node or single-AZ cluster	Multi-AZ with HA considerations	Designs for zone failure with explicit recovery SLA
Workload configuration	No resource requests; no health checks	Appropriate requests; liveness and readiness probes	Sophisticated probes; graceful shutdown; disruption budgets
Debugging methodology	Guesses at solutions	Systematic approach; checks kubectl get/describe	Deep observability; can read kubelet logs
Scaling strategy	Manually scales or scales on CPU only	HPA with custom metrics	Predictive scaling; understands resource limits
Cost awareness	Ignores cost implications	Balances HA and cost	Optimizes resource allocation without sacrificing reliability

What NOT to test

Don't ask candidates to:

Write a Kubernetes Operator from scratch. (That's systems programming, not SRE.)
Memorize API versions. (They'll use the docs in production anyway.)
Set up a cluster from scratch during the interview. (That's an operations task, not an assessment of judgment.)

Do ask them to:

Design a cluster for realistic constraints
Debug why a deployment is failing
Explain trade-offs between options
Configure workloads for production

For teams not yet using Kubernetes

If you're hiring for DevOps or SRE but not using Kubernetes yet, test containerization and orchestration concepts without requiring Kubernetes expertise. Many teams over-weight Kubernetes knowledge when they really need systems thinking.

For teams using Kubernetes in production, use this assessment structure to evaluate operational depth before hiring. A candidate who does well on it is far more likely to become productive quickly.

Next: understand how to interpret the results of your DevOps assessment.


