
Best Kubernetes Test for Hiring SRE and DevOps Engineers

ClarityHire Team (Editorial) · 6 min read

Why standard Kubernetes tests fail

Most Kubernetes assessments ask: "What is a StatefulSet?" or "Explain taints and tolerations." These measure memorization, not whether someone can design a cluster that survives node failure, notice when an application is flapping between healthy and unhealthy states, or debug why a deployment never stabilizes.

An SRE or DevOps engineer with deep Kubernetes experience knows that the real questions are about operational stability, cost, and debugging production problems under pressure.

Here's what a Kubernetes assessment that produces real hiring signal looks like.

Core skills to assess

1. Cluster design and resilience

Can they design a cluster that survives zone failures?

Assessment question:

"You're running a production workload that must stay up during zone failures. You have a 3-node cluster in a single AZ. What do you change? What are the trade-offs?"

Good answer:

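  • Spread nodes across at least 3 AZs
  • Use pod topology spread constraints (or pod anti-affinity) so replicas don't all land in one zone
  • Use PodDisruptionBudgets so node drains and upgrades don't remove too many replicas at once
  • Make sure failed nodes are detected and replaced (node health checks, auto-repair)
  • Understands the cost trade-off (cross-zone traffic charges, plus spare capacity to absorb the loss of a zone)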
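A strong answer often comes with a concrete manifest fragment. The following is a minimal sketch, not a definitive configuration: the workload name `web`, the image, the replica count, and the `minAvailable` value are all hypothetical. It assumes nodes carry the standard `topology.kubernetes.io/zone` label, which managed cloud clusters normally set.

```yaml
# Hypothetical Deployment: spread replicas evenly across zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                               # zones may differ by at most one replica
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule         # refuse placement that breaks the spread
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: example.com/web:1.0.0             # hypothetical image
---
# Hypothetical PodDisruptionBudget: limits voluntary disruptions
# (drains, upgrades). It does not protect against the zone outage itself.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: web
```

With 6 replicas over 3 zones, losing one zone leaves 4 running, which is why the replica count and `minAvailable` are chosen together here. A candidate who can explain that arithmetic understands the trade-off rather than reciting the feature.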

Weak answer:

  • "Just run more replicas" (doesn't address the zone failure)
  • "Use anti-affinity" (without understanding pod topology spread constraints)

2. Workload configuration and debugging

Can they configure a workload for production and debug it when it fails?

Assessment question:

"You deployed a new microservice. The deployment shows 3 replicas running, but only 2 are handling traffic. Readiness checks pass. Why might this be happening?"

Good answer considers (in order):

  • Do the pod's labels match the Service selector, so the pod appears in the Service's endpoints? (kubectl get endpoints)
  • Is a NetworkPolicy blocking traffic to the pod?
  • Is the pod's networking broken on its node (CNI or network interface misconfigured)?
  • Is the load balancer health check different from the readiness check?

This tests systematic debugging, not Kubernetes facts.
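One way to make the first check concrete is a label mismatch, sketched below. All names here (`orders`, the `track` label, the pod name, the image) are hypothetical. The pod is Running and Ready, still counts toward the Deployment's replicas, and yet is not an endpoint of the Service because one label differs.

```yaml
# Hypothetical Service: selects on two labels.
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
    track: stable
  ports:
  - port: 80
    targetPort: 8080
---
# Hypothetical pod that was relabeled during an earlier debugging session.
# It matches app=orders but not track=stable, so the Service skips it.
apiVersion: v1
kind: Pod
metadata:
  name: orders-example-pod
  labels:
    app: orders
    track: canary
spec:
  containers:
  - name: orders
    image: example.com/orders:2.3.1   # hypothetical image
```

A candidate who compares the Service selector against each pod's labels, rather than assuming all replicas are identical, finds this quickly.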

3. Storage and state management

Do they understand when to use StatefulSets vs. Deployments? When to use node-local vs. persistent storage?

Assessment question:

"We run Elasticsearch on Kubernetes. Do we use StatefulSet or Deployment? What about storage?"

Good answer:

  • StatefulSet for stable identity and ordering
  • Persistent volumes for data durability
  • Anti-affinity to spread across nodes
  • Careful upgrade strategy (rolling restarts can cause quorum loss)
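The shape of a good answer can be sketched as a manifest. This is a skeleton under stated assumptions, not a working Elasticsearch deployment: the image, storage class, volume size, and data path are assumptions, and all Elasticsearch settings (discovery, JVM heap, security) are omitted.

```yaml
# Headless Service: gives each pod a stable DNS name (es-0.es, es-1.es, ...).
apiVersion: v1
kind: Service
metadata:
  name: es
spec:
  clusterIP: None
  selector:
    app: es
  ports:
  - name: transport
    port: 9300
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es
spec:
  serviceName: es
  replicas: 3
  updateStrategy:
    type: RollingUpdate               # one pod at a time, highest ordinal first
  selector:
    matchLabels:
      app: es
  template:
    metadata:
      labels:
        app: es
    spec:
      affinity:
        podAntiAffinity:              # never co-locate two members on one node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: es
            topologyKey: kubernetes.io/hostname
      containers:
      - name: es
        image: example.com/elasticsearch:8             # hypothetical image
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data     # assumed data path
  volumeClaimTemplates:               # one PersistentVolumeClaim per pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard      # assumption; depends on the cluster
      resources:
        requests:
          storage: 100Gi
```

The rolling update only protects quorum if a readiness probe reflects cluster health, so a strong candidate will point out that this skeleton still needs one.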

Weak answer:

  • "Use Deployment, it's simpler"
  • No mention of data persistence or quorum

4. Scaling and cost optimization

Do they know when to scale up vs. when to optimize workloads?

Assessment question:

"Your cluster CPU utilization is at 85%. Cost is $10k/month. You have two options: add more nodes, or optimize pod requests. What do you check first?"

Good answer:

  • Check if pods are over-requesting resources (set requests too high)
  • Check if there are noisy neighbor issues (one pod using way more than expected)
  • Check if the workload naturally peaks (could shift traffic patterns instead)
  • Only then add nodes if it's a real resource constraint

This tests operational judgment.
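For the first check, one option a candidate might mention is running the Vertical Pod Autoscaler in recommendation-only mode. The sketch below assumes the VPA add-on is installed in the cluster (it is not part of core Kubernetes) and uses a hypothetical Deployment name.

```yaml
# Recommendation-only VPA: computes suggested requests but never changes pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server        # hypothetical workload
  updatePolicy:
    updateMode: "Off"       # observe only; do not evict or resize pods
```

Comparing the recommendations in the object's status against the current requests shows whether pods are over-requesting before anyone pays for more nodes.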

5. Observability and debugging production

Can they design monitoring for Kubernetes? Can they diagnose why an app is flaky?

Assessment question:

"Your application in Kubernetes has a 0.1% error rate spike that lasts 30 seconds, then recovers. It happens once per day at random times. How would you investigate?"

Good answer mentions:

  • Pod restarts or evictions (check events, kubelet logs, and node logs)
  • Resource exhaustion (check for OOM kills and CPU throttling, not just average memory and CPU)
  • Network timeouts (check service endpoints and DNS)
  • Application-level retries (is the app recovering from transient errors?)
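Because the spike is short and unpredictable, it helps to record candidate causes continuously and line them up with the error timestamps afterwards. The Prometheus rule file below is a sketch of that idea. It assumes kube-state-metrics is installed (it provides both metrics used) and that the workload runs in a namespace called `prod`, which is hypothetical.

```yaml
groups:
- name: intermittent-error-investigation
  rules:
  # Fires when any container in the namespace restarted in the last 10 minutes.
  - alert: ContainerRestarted
    expr: increase(kube_pod_container_status_restarts_total{namespace="prod"}[10m]) > 0
    labels:
      severity: info
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted"
  # Fires when a container's most recent termination was an OOM kill.
  - alert: ContainerOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{namespace="prod", reason="OOMKilled"} == 1
    labels:
      severity: info
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOM killed"
```

If neither alert lines up with the spike, restarts and memory exhaustion can be ruled out and the investigation moves on to DNS, endpoints, and application retries.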

Assessment structure

Part 1: Take-home design exercise (2 hours)

Scenario:

"Design a production Kubernetes cluster for a SaaS application with the following constraints:

  • 99.9% uptime SLA
  • Stateless frontend and stateful backend (database)
  • 50-500 concurrent users (variable load)
  • Budget constraint: $5,000/month maximum
  • Must survive single zone failure
  • Scaling must be automatic"

The candidate should provide:

  • Node architecture (how many nodes, instance types, zones)
  • Workload configuration (deployments, resource requests, scaling policies)
  • Storage strategy (persistent volumes, backups)
  • Monitoring and alerting plan
  • Cost breakdown

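This tests whether they can turn availability, load, and budget constraints into a concrete design, which is hard to fake without having run production systems.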
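For the automatic scaling requirement, a submission might include something like the following HorizontalPodAutoscaler. It is a sketch only: the Deployment name `frontend`, the replica bounds, and the 70% target are hypothetical values a candidate would need to justify against the 50-500 user range and the budget.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend              # hypothetical stateless frontend
  minReplicas: 3                # at least one replica per zone
  maxReplicas: 12               # cap chosen to stay inside the budget
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # percent of requested CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping when load dips briefly
```

Utilization is measured against CPU requests, so this only works if the Deployment sets them. Noticing that dependency is a good sign.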

Part 2: Configuration review (30 minutes)

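Give them a Kubernetes manifest with intentional issues. The version below is annotated as an answer key; strip the comments before handing it to the candidate: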

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 1  # Issue: single replica; run at least 3 for HA
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: api:latest  # Issue: mutable tag; pin a specific version or digest
        resources:
          requests:
            cpu: 100m  # Issue: likely too low; the pod can be starved of CPU on a busy node
            memory: 64Mi  # Issue: likely too low; the pod is an early eviction candidate under memory pressure
          # Issue: no limits; nothing caps the container's memory use
        # Issue: no readiness probe, no liveness probe
      # Issue: no pod anti-affinity or topology spread; replicas could all land on the same node
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 8080
  # Not an issue: a stateless API needs neither a headless Service nor a StatefulSet.
  # A candidate who flags this is pattern-matching, not reasoning.

Ask them to identify all issues and explain fixes. A strong candidate catches 8+ issues; a weak one catches 1–2 obvious ones.
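For comparison when grading, here is one possible corrected Deployment. Treat it as a sketch rather than a reference answer: the image tag, resource values, probe path `/healthz`, and probe timings are hypothetical and would come from the real application's behavior.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                            # survives the loss of one pod or node
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway  # prefer spreading, but don't block scheduling
        labelSelector:
          matchLabels:
            app: api-server
      containers:
      - name: api
        image: api:1.4.2                 # hypothetical pinned version
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 250m                    # hypothetical; size from observed usage
            memory: 256Mi
          limits:
            memory: 256Mi                # cap memory so a leak cannot take down the node
        readinessProbe:                  # gates traffic from the Service
          httpGet:
            path: /healthz               # hypothetical endpoint
            port: 8080
          periodSeconds: 5
        livenessProbe:                   # restarts the container if it hangs
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
```

Candidates will reasonably disagree on details such as whether to set a CPU limit or use separate readiness and liveness endpoints. The explanation matters more than matching this version.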

Part 3: Troubleshooting conversation (30 minutes)

Scenario:

"You've deployed a new service to Kubernetes. Traffic is hitting it, but latency is 10x normal. Prometheus shows normal CPU and memory usage. Walk me through how you'd debug."

The candidate should ask:

  • Is the service endpoint up to date? (kubectl get endpoints)
  • Are pods receiving traffic evenly, or is one pod slow?
  • Is the network path the issue (kube-proxy, CNI)?
  • Is the application blocking on something (external API, database)?
  • Did the deploy include a code change that's unoptimized?

This tests whether they think systematically.

Kubernetes assessment rubric

| Skill | Level 1 (Below) | Level 2 (Meets) | Level 3 (Exceeds) |
| --- | --- | --- | --- |
| Cluster design | Single-node or single-AZ cluster | Multi-AZ with HA considerations | Designs for zone failure with explicit recovery SLA |
| Workload configuration | No resource requests; no health checks | Appropriate requests; liveness and readiness probes | Sophisticated probes; graceful shutdown; disruption budgets |
| Debugging methodology | Guesses at solutions | Systematic approach; checks kubectl get/describe | Deep observability; can read kubelet logs |
| Scaling strategy | Manually scales or scales on CPU only | HPA with custom metrics | Predictive scaling; understands resource limits |
| Cost awareness | Ignores cost implications | Balances HA and cost | Optimizes resource allocation without sacrificing reliability |

What NOT to test

Don't ask candidates to:

  • Write a Kubernetes Operator from scratch. (That's systems programming, not SRE.)
  • Memorize API versions. (They'll use the docs in production anyway.)
  • Set up a cluster from scratch during the interview. (That's an operations task, not an assessment of judgment.)

Do ask them to:

  • Design a cluster for realistic constraints
  • Debug why a deployment is failing
  • Explain trade-offs between options
  • Configure workloads for production

For teams not yet using Kubernetes

If you're hiring for DevOps or SRE but not using Kubernetes yet, test containerization and orchestration concepts without requiring Kubernetes expertise. Many teams over-weight Kubernetes knowledge when they really need systems thinking.

For teams using Kubernetes in production, use this assessment structure to evaluate operational depth before hiring. A candidate who does well on it is far more likely to become productive quickly than one who can only recite definitions.

Next: understand how to interpret the results of your DevOps assessment.
