The AI Agent That Cleaned Up Our K8s Manifests and Crashed Production
March 15, 2026 · AI · 10 min read


2:47 AM. PagerDuty. P1. Error rate spiked to 34%, p99 latency at 18 seconds, pods restarting every 90 seconds, 847 OOMKill events in the last four hours. The last deploy had been six hours ago and everything had looked fine when I went to bed.


The Tuesday afternoon that felt like progress

Our Kubernetes manifests were a mess. Four years of patches, rushed hotfixes, and copy-paste deployments had left us with 31 YAML files across 4 services. Inconsistent label conventions, stale annotations, duplicate env vars, and resource definitions that nobody could confidently explain anymore.

I decided to use Cursor's agent mode to clean it all up. The prompt felt reasonable.

Refactor these Kubernetes deployment manifests:
1. Standardize label conventions to app.kubernetes.io/* 
2. Remove duplicate and stale environment variables
3. Consolidate probe configurations to match our current health endpoints
4. Clean up any obviously wrong or overly conservative resource settings

The agent ran for about 12 minutes, touching 28 of 31 files. Diff looked clean. Labels standardized, dead env vars gone, probe paths updated to /health. Quick review, PR, thumbs-up from a teammate, merge.
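To be fair, the label standardization was the genuinely good part of the change. A before/after sketch of what that looked like (names and values here are illustrative stand-ins, using the Kubernetes recommended label set):

```yaml
# Before: ad-hoc labels accumulated over four years (illustrative)
metadata:
  labels:
    app: api
    team: backend
    version-label: v2

# After: the app.kubernetes.io/* recommended labels
metadata:
  labels:
    app.kubernetes.io/name: api-service
    app.kubernetes.io/version: "2.0"
    app.kubernetes.io/part-of: backend
```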

CI passed. Staging looked fine. We deployed to production at 8:51 PM. Dashboards nominal. I went to bed at midnight feeling good about paying down real tech debt.


False assumptions at every layer

Every assumption I walked in with turned out to be wrong in a slightly different way.

"The AI only touches what I asked it to touch." The prompt said "clean up obviously wrong or overly conservative resource settings." The agent interpreted conservative as unnecessarily restrictive. It found our memory: "512Mi" limits on the API pods, saw we'd had OOMKill events logged in YAML comments, and "helpfully" removed the limits entirely. No memory ceiling, no CPU throttle, just vibes.

"Staging would catch this." Staging runs at 5% of production traffic. An OOM condition triggered by concurrent request spikes at 2 AM simply never manifested in staging's daytime load patterns.

"A green CI run means the manifests are correct." Our CI validates YAML syntax and runs kubectl diff --dry-run. Neither catches semantic regressions like missing resource limits. The manifests were syntactically perfect.
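To make the gap concrete, here's a minimal sketch of the kind of semantic check our pipeline was missing. The file path and manifest contents are invented for illustration; the point is that a deployment can be perfectly valid YAML, pass a dry-run, and still have no limits block at all:

```shell
# Illustrative deployment fragment: syntactically valid,
# but defines requests without limits.
cat > /tmp/deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
      - name: api
        image: api:1.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
EOF

# The semantic check that syntax validation never performs:
if ! grep -q 'limits:' /tmp/deploy.yaml; then
  echo "SEMANTIC FAIL: container has no resource limits"
fi
```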

"I reviewed the diff." I reviewed 28 files worth of diff in about 8 minutes. At that rate you're not reviewing, you're pattern-matching for obvious disasters. The resource limit removal was buried inside a larger block refactor that changed indentation and key ordering at the same time. My eyes skipped right over it, which is the kind of thing that's easy to type up in hindsight and hard to actually prevent in the moment.


The 2 AM investigation

I pulled up the cluster. The picture was grim:

CLUSTER STATE — 02:47 AM

api-service          847 OOMKill events / 4 hours
                     Memory: 2.1GB consumed (no ceiling)
                     Restart count: 134 pods
                     Status: CrashLoopBackOff on 9/12 replicas

worker-service       Memory: 1.8GB (no ceiling)  
                     Restart count: 89 pods
                     Throttled CPU: 0% (CPU limit also removed)

HPA (Horizontal Pod Autoscaler)
                     Trying to scale from 12 → 40 replicas
                     Pods starting, OOMKilling, terminating
                     Faster than readiness probes can pass

Node memory pressure: 3/5 worker nodes in MemoryPressure=True
                     kubelet evicting pods aggressively

First theory: a memory leak in the same deploy. I pulled the application code diff. Minor changes, a two-line tweak to a database query timeout. No new allocations. Ruled out in 8 minutes.

Second theory: traffic spike. Ingress metrics showed traffic at 2 AM was actually lower than average, about 60% of daytime volume. Ruled out in 3 minutes.

Third theory: something in the manifest diff. I finally opened the full git diff and started reading carefully. Line 847 of the diff.

 resources:
-  limits:
-    memory: "512Mi"
-    cpu: "500m"
-  requests:
-    memory: "256Mi"
-    cpu: "250m"
+  requests:
+    memory: "256Mi"
+    cpu: "250m"

There it was. The limits block, gone. Four services, twelve deployments, all missing their memory and CPU ceilings. The agent had seen old OOMKill comments (# TODO: OOMKilled twice in Nov, investigate memory usage) and concluded the limits were the problem. Classic AI reasoning failure: correlation treated as causation.


Root cause: unbounded memory + Kubernetes Node pressure cascade

Without memory limits, each pod was free to consume as much node RAM as it wanted. Our application has a background job that processes webhooks in batches, normally bounded by the 512Mi limit to about 200 concurrent in-memory payloads. Without the limit, it processed all 1,400 queued webhooks at once, allocating 2.1GB per pod instance.
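Back-of-envelope arithmetic from those numbers (assuming payloads of roughly equal size) shows why the old limit kept things sane:

```shell
# 2.1GB observed across 1400 in-flight payloads ≈ 1.5MB per payload.
awk 'BEGIN { printf "%.1f MB per payload\n", 2100 / 1400 }'

# Under the old 512Mi limit, ~200 concurrent payloads:
# 200 x 1.5MB ≈ 300MB — comfortably inside the ceiling.
awk 'BEGIN { printf "%.0f MB working set under the old cap\n", 200 * 1.5 }'
```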

When node memory pressure crossed the threshold, kubelet started evicting pods. Not just ours, system components too. Eviction triggered the HPA to spin up replacement pods. Replacement pods hit the same webhook queue and immediately consumed 2GB of RAM each. Nodes hit MemoryPressure again. Eviction loop.

THE CASCADE

  Pod starts
      │
      ▼
  Webhook worker: no limit → 2.1GB RAM
      │
      ▼
  Node MemoryPressure threshold crossed
      │
      ▼
  kubelet evicts pod (OOMKill)
      │
      ▼
  HPA sees fewer replicas → spins new pod
      │
      └──────────────────────────────────┐
                                         ▼
                              New pod hits same queue
                              2.1GB RAM in 45 seconds
                              Node MemoryPressure again
                              (loop repeats, 134 times)

The 34% error rate wasn't random. It was exactly the percentage of pods in CrashLoopBackOff at any given moment, unable to serve traffic.


The fix: two parts, 22 minutes

The immediate fix was a targeted rollback of the resource limits, not a full deploy rollback (which would have re-broken the probe paths we'd actually fixed correctly).

# Patch resource limits back in-place without a full rollback
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/limits",
    "value": {"memory": "512Mi", "cpu": "500m"}
  }
]'

# Repeat for all affected deployments
# Then force a rolling restart to kill the OOMing pods
kubectl rollout restart deployment/api-service -n production

Twelve minutes after applying patches to all four services, error rate dropped from 34% to 0.2%. p99 latency recovered from 18,000ms to 180ms. Total incident duration was 94 minutes from first alert.

The longer-term fix was adding a LimitRange to the namespace: a cluster-level safety net that enforces default limits on any pod that doesn't specify them.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:          # applied if limits not specified
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:   # applied if requests not specified
      memory: "256Mi"
      cpu: "250m"
    max:
      memory: "2Gi"
      cpu: "2000m"

Now even if a future AI agent, intern, or sleep-deprived engineer removes resource limits, the namespace-level policy kicks in. Pods can't go fully unbounded.


What we changed in the AI-assisted workflow

I didn't stop using AI for infrastructure refactoring. I stopped using it unsafely.

Explicit negative-space prompting. Every infra refactor prompt now includes a "do not touch" list: "Do not modify resource limits, requests, security contexts, or RBAC rules under any circumstances." Vague instructions produce vague boundaries.

Automated diff validation for critical fields. We added a CI check that diffs the YAML and exits non-zero if resources.limits or resources.requests blocks are removed from any deployment.

# .github/workflows/manifest-guard.yml
- name: Check resource limits not removed
  run: |
    if git diff origin/main...HEAD -- '**/*.yaml' | grep -q '^-\s*limits:'; then
      echo "ERROR: Resource limits removed from manifest. Review required."
      exit 1
    fi
    echo "Resource limits intact"

Scoped AI tasks. "Refactor all 31 files" is too broad for an AI agent touching production infrastructure. We now scope tasks to single files or single concern areas. Never "clean up everything" in one pass.

LimitRange as the last line of defence. Applied namespace-wide now. If a manifest ever ships without limits again, the cluster itself enforces a sane ceiling.


Lessons learned

Metrics from this incident:
— 94 minutes total downtime
— 847 OOMKill events across 4 services
— 134 pod restarts in the crash loop
— p99 latency peak: 18,000ms (baseline: 180ms)
— Error rate peak: 34%
— Recovery after patch: 12 minutes
— Files touched by AI agent: 28 of 31
— Critical removals missed in review: 12 resources blocks across 12 deployments

The failure mode here wasn't that AI is unreliable. It was that I gave an AI agent ambiguous authority over safety-critical configuration and then rubber-stamped a 28-file diff in 8 minutes. Those are human process failures.

The scariest part was that the AI's reasoning was internally consistent. It saw OOMKill history, concluded limits were causing it, and removed them. A junior engineer following the same logic might have done the same thing. The difference is a junior engineer would probably ask before deleting the safety rails. The agent just did it.
