We Deployed on a Friday. Here's What Happened Next.
4:30 PM, Friday. The ticket was small. A config change, two lines of YAML, a deployment that had worked in staging three times. My lead signed off. I clicked deploy. By 10 PM I was still at my desk and the on-call phone had rung seven times.
That was four years ago. I haven't deployed to production on a Friday since.
The setup
Mid-size SaaS platform. About 40,000 active users, a PHP monolith slowly being strangled
by a growing set of Node microservices, glued together behind an Nginx reverse proxy.
Our CI/CD pipeline was functional in the way that a 1997 Toyota is functional. A GitHub
push triggered a Jenkins build, tests ran, a Docker image got tagged, and a shell script
SSHed into the production box and ran docker-compose up -d. Artisanal. Deeply
fragile.
The change I was deploying was a new environment variable pointing the notification
service at a different queue endpoint. Staging worked. UAT worked. The variable was in
.env.production. What could go wrong.
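For context, the change was roughly this shape. The service name, variable name, and endpoint here are illustrative, not the real config:

```yaml
# docker-compose.yml (names illustrative)
services:
  notification-service:
    env_file: .env.production
    environment:
      # the "two lines": point the service at the new queue endpoint
      QUEUE_ENDPOINT: "amqps://queue-v2.internal:5671"
```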
BEFORE DEPLOY (Expected)
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS
│
▼
Build Docker Image
│
▼
SSH → docker-compose up -d
│
▼
✅ Done. Go home.
WHAT ACTUALLY HAPPENED
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS (tests don't read .env.production)
│
▼
Build Docker Image (image bakes in OLD env snapshot)
│
▼
SSH → docker-compose up -d
│
▼
New container starts with MISSING env var
│
▼
Notification service silently swallows queue errors
│
▼
😱 Users stop receiving emails. Nobody knows yet.
The silence that screams
The worst production incidents aren't the loud ones. Loud ones (500s, crashes, pages going white) get caught immediately. Alerts fire, users complain, you know within minutes.
This was the other kind. Everything looked fine. Deploy succeeded. Green checkmark in Jenkins. Response times normal. Error rate zero. CPU nominal. The notification service was running. It was just quietly not delivering anything, logging failures to a file nobody was watching, and returning success codes anyway because the original developer had wrapped the queue call in a broad try-catch that ate the exception.
// What we had
async function enqueueNotification(payload) {
  try {
    await queueClient.send(payload);
    return { success: true };
  } catch (err) {
    // TODO: add proper error handling
    logger.warn('Queue send failed', err.message);
    return { success: true }; // ← lied about success to not break callers
  }
}
// What we needed
async function enqueueNotification(payload) {
  const result = await queueClient.send(payload); // let it throw
  metrics.increment('queue.send.success');
  return result;
}

// And a dead letter queue handler that actually alerts:
queueClient.on('error', (err) => {
  metrics.increment('queue.send.error');
  alerts.fire('QUEUE_SEND_FAILURE', { err, severity: 'critical' });
});
We discovered the issue at 7:15 PM. Not from monitoring. A user emailed support saying their password reset link never arrived. Support checked three more accounts. Same story. Someone pinged me. I checked the logs. Stomach dropped.
"Emails have been failing since 4:47 PM. That's two hours and twenty-eight minutes of silent failure, approximately 1,400 undelivered notifications, and zero alerts fired."
The rollback that wasn't
Here's where it got worse. Our rollback procedure was: SSH into the box, pull the previous Docker image tag, run docker-compose up again. Thirty seconds.
Except the previous image tag was latest. We hadn't been tagging images
with commit SHAs. The "previous" image was whatever had been sitting in the registry
before this build, which turned out to be a build from three weeks ago with a different
database migration state. I'll pause there so you can appreciate that sentence.
OUR "ROLLBACK" PROCESS
─────────────────────────────────────────────────────
Tag Strategy: latest  ← overwrites on every build

Timeline:
Week 1  [build] → :latest (v1)
Week 2  [build] → :latest (v2, overwrites v1)
Week 3  [build] → :latest (v3, overwrites v2)
Friday  [build] → :latest (v4, broken)

Rollback attempt → pulls :latest → gets v4 (same broken build)
❌ No previous image available. Rollback impossible.

WHAT WE SHOULD HAVE HAD
─────────────────────────────────────────────────────
Tag Strategy: commit SHA + semver + latest alias

[build] → :abc1234 + :v2.4.1 + :latest
[build] → :def5678 + :v2.4.2 + :latest

Rollback → docker pull myapp:abc1234
✅ Any previous version instantly available.
We ended up doing a manual hotfix. Patched the env var directly on the server, restarted the container, verified notifications were flowing. Six hours from deploy to resolution.
The rebuild
The following week, I rewrote the entire pipeline. Not because anyone asked. Because I couldn't sleep knowing it could happen again. What changed:
- Image tagging: every build gets tagged with git rev-parse --short HEAD. The :latest tag still exists but only as an alias.
- Environment validation: a startup script reads a .env.required manifest and fails loudly if any variable is missing or empty before the app boots.
- Deployment windows: a GitHub Actions check fails the deploy job if the current time is Friday after 3 PM or the weekend. Enforced, not advisory.
- Post-deploy smoke tests: a script hits 12 critical endpoints and checks response codes and response shape. Any failure triggers auto-rollback.
- Dead letter queues with alerts: any queue failure now fires a PagerDuty alert inside 60 seconds instead of 2.5 hours.
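The environment-validation step can be sketched as a small POSIX sh guard. The .env.required contents and the NOTIFICATION_QUEUE_URL name below are illustrative, not the real manifest:

```shell
#!/usr/bin/env sh
# Sketch of the pre-boot guard: every name listed in .env.required must be
# set and non-empty, or the container refuses to start.
set -u

check_required() {
  manifest="$1"
  missing=0
  while IFS= read -r name; do
    case "$name" in ""|\#*) continue ;; esac  # skip blanks and comments
    eval "val=\${$name:-}"                    # indirect lookup by name
    if [ -z "$val" ]; then
      echo "FATAL: required env var '$name' is missing or empty" >&2
      missing=1
    fi
  done < "$manifest"
  return "$missing"
}

# Demo: a manifest listing one hypothetical variable
printf 'NOTIFICATION_QUEUE_URL\n' > .env.required

if NOTIFICATION_QUEUE_URL='amqps://queue.internal' check_required .env.required; then
  echo "env OK: booting app"
else
  echo "refusing to boot" >&2
  exit 1
fi
```

Running this as the container entrypoint (before exec-ing the app) turns a silent missing-variable failure into an immediate, loud crash at deploy time.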
jobs:
  check-deploy-window:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce deployment window
        run: |
          DAY=$(TZ=Asia/Kolkata date +%u)   # 1=Mon ... 7=Sun, in IST
          HOUR=$(TZ=Asia/Kolkata date +%H)  # 00-23, in IST
          # Block the whole weekend, and Friday from 3 PM onward
          if [ "$DAY" -ge 6 ] || { [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 15 ]; }; then
            echo "❌ Deploys blocked: Friday after 3 PM IST or weekend."
            echo "   Open a break-glass PR to override (requires 2 approvals)."
            exit 1
          fi
          echo "✅ Deploy window is open."

  deploy:
    needs: [check-deploy-window, test]
    runs-on: ubuntu-latest
    steps:
      - name: Build and tag image
        run: |
          SHA=$(git rev-parse --short HEAD)
          docker build -t myapp:$SHA -t myapp:latest .
          docker push myapp:$SHA
          docker push myapp:latest
          echo "IMAGE_TAG=$SHA" >> "$GITHUB_ENV"
      - name: Deploy
        run: ./scripts/deploy.sh ${{ env.IMAGE_TAG }}
      - name: Smoke test
        run: ./scripts/smoke-test.sh
        # On failure, this job fails and the deploy.sh rollback hook fires
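The post doesn't show scripts/smoke-test.sh itself; here is a minimal sketch of the idea. BASE_URL and the endpoint list are illustrative (the real script checks 12 endpoints and validates response shape as well as status codes):

```shell
#!/usr/bin/env sh
# Sketch of a post-deploy smoke test: hit critical endpoints and fail the
# CI job (triggering the rollback hook) on any non-2xx response.
set -u

BASE_URL="${BASE_URL:-https://app.example.com}"
ENDPOINTS="/healthz /api/login /api/notifications/status"

smoke() {
  fail=0
  for path in $ENDPOINTS; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL$path") || code=000
    case "$code" in
      2*) echo "OK   $path ($code)" ;;
      *)  echo "FAIL $path ($code)" >&2; fail=1 ;;
    esac
  done
  return "$fail"
}

# In CI this is simply `smoke || exit 1`; guarded here so sourcing is safe
if [ "${RUN_SMOKE:-0}" = "1" ]; then
  smoke || exit 1
fi
```

Because the function returns non-zero on any failed endpoint, the Smoke test step fails the job, which is what lets the pipeline auto-rollback instead of waiting for a support ticket.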
What I learned that couldn't come from a book
The Friday rule gets mocked. Engineers call it superstition. It isn't. It's a forcing function. It makes you ask "is this urgent enough to deploy now, or can it wait until Monday?" Almost always, the answer is Monday. If it genuinely can't wait, you have the break-glass process, and you've made the risk explicit.
Silent failures are the dangerous ones. Noisy failures are feedback. Silent failures erode trust in your system and are harder to diagnose because by the time you find them, you've lost proximity to the change that caused them.
Rollbacks are only real if you can execute them in under five minutes without a runbook. If your rollback procedure requires careful thought, it will fail you under pressure.