How a DigitalOcean Firewall Rule Silently Dropped 23% of Production Traffic for 11 Days
March 13, 2026 · Docker · 9 min read


For 11 days, roughly 1 in 4 users hitting our platform got a timeout instead of a response. CPU at 18%. Memory at 34%. Nginx access logs showed nothing unusual. APM error rate sat at 0.2%, well inside normal. The failing requests never actually reached our servers. They were being silently thrown away by a firewall rule I had written myself three weeks earlier, on a Friday afternoon while setting up a new Droplet. I remember thinking "this won't take long."

Production failure

Symptom reports started on a Wednesday. Sporadic. Some users would get a timeout, refresh, and it would load fine. Support received 14 tickets over two days, all variations of "the site is slow sometimes." Not slow enough to trip our synthetic monitoring, which tested every 5 minutes from a fixed IP. Not consistent enough to reproduce on demand.

The affected users had one thing in common that we didn't notice for a week: they were all on mobile carriers. Specifically, carriers using CGNAT (Carrier-Grade NAT), where thousands of users share a single public IP and are distinguished by high-numbered ephemeral source ports (typically 32768 to 60999).
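The arithmetic of why CGNAT users were hit is worth making explicit. The ephemeral range below is the common Linux default (`/proc/sys/net/ipv4/ip_local_port_range`); actual carrier gateways vary, but they live in the same neighborhood:

```python
# Port ranges from this incident (both ends inclusive).
CGNAT_EPHEMERAL = range(32768, 61000)  # typical ephemeral source-port range
RULE_RANGE = range(1024, 65536)        # the rogue firewall rule's port range

# Every CGNAT ephemeral port falls inside the rule's range, so every
# return packet headed to a CGNAT client was a candidate for that rule.
overlap = set(CGNAT_EPHEMERAL) & set(RULE_RANGE)
print(len(overlap) == len(CGNAT_EPHEMERAL))  # True: full coverage
print(min(overlap), max(overlap))            # 32768 60999
```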

11 days issue in production
23% of requests silently dropped
0 server-side errors logged
~4,300 affected sessions estimated

False assumptions: everything we blamed first

Week one was a tour through every wrong answer:

  • Nginx worker connections. Checked worker_processes and worker_connections. Both correctly sized. Active connections never exceeded 40% of capacity.
  • Docker bridge network. I was convinced the inter-container bridge was dropping under load. Added latency metrics between containers. Clean.
  • Node.js keep-alive. The API server's keep-alive timeout was shorter than the load balancer's, which is a known source of premature resets. Corrected it. Timeouts kept happening.
  • DigitalOcean Droplet network limits. Bandwidth and PPS nowhere near the ceiling.

Every metric we knew how to measure was normal.

Which, in retrospect, should have been the clue. We were measuring the server, and whatever was happening wasn't happening on the server.

Finding it: tcpdump at the edge

The break came from an orchestrated support call. We got a user on their phone, on a mobile connection that had been timing out, and ran tcpdump on the Droplet's public interface while they hit refresh. The TCP SYN arrived at the NIC and a SYN-ACK went back out, but the client never received it. Everything on the Droplet was behaving; the reply was dying somewhere between our NIC and the user's phone.
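The same triage can be scripted from the client side. A reachable host with a closed port answers a SYN immediately with a RST, which surfaces as "connection refused"; a silent firewall drop produces nothing until the timeout fires. This is a generic sketch, not part of our actual tooling:

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt: 'open', 'refused' (a RST came
    back), or 'timeout' (SYN or SYN-ACK silently dropped en route)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # host reachable, port closed: not a silent drop
    except (socket.timeout, TimeoutError):
        return "timeout"  # no answer at all: the silent-drop signature
    except OSError:
        return "error"    # unreachable network, DNS failure, etc.
```

A "timeout" against a port you know is listening, from a vantage point that matters (a mobile connection, in our case), is the smoking gun that server-side monitoring can't give you.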

That pointed squarely at the firewall. I opened the DigitalOcean Cloud Firewall rules for the Droplet and stared at something I had written three weeks earlier:

DigitalOcean Cloud Firewall — inbound rules (the broken config)
# What I intended: allow inbound HTTP and HTTPS from anywhere
Inbound Rules:
  TCP  port 80   sources: All IPv4, All IPv6   ✓
  TCP  port 443  sources: All IPv4, All IPv6   ✓
  TCP  port 22   sources: [my office IP]       ✓

# What I had actually added while "cleaning up" the rules:
  TCP  port 1024-65535  sources: [specific IP range]   ← this one

# What that rule does:
# DigitalOcean Cloud Firewalls are STATELESS for inbound rules.
# A TCP reply from the server back to a CGNAT client uses the client's
# ephemeral source port as the DESTINATION port on the return path.
# Ports 32768–60999 (CGNAT ephemeral range) fall within 1024–65535.
# The rule was RESTRICTING return traffic to a specific IP range,
# silently dropping TCP ACK/data packets back to mobile clients.

Root cause: stateless firewall + CGNAT ephemeral ports

DigitalOcean Cloud Firewalls evaluate each packet independently. They don't track TCP connection state. An inbound rule covering port range 1024–65535 applies to incoming packets destined for those ports on the Droplet, and it also implicitly affects the return path of connections originating from clients whose source port falls in that range. I did not know this when I wrote the rule. The docs say it, just not as loudly as they should.

CGNAT clients use ephemeral ports in the 32768–60999 range as their source port. When our server sent back a TCP response, the destination port was the client's source port, which landed inside the range the rule was restricting. The firewall dropped the response. The client saw a timeout. The server logged nothing, because as far as it was concerned the packet had left the building just fine.
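To make the failure mode concrete, here is a toy model of a stateless packet filter loaded with rules shaped like our broken config. It is a sketch of the behavior described above, not of DigitalOcean's actual implementation, and all the IPs (`OFFICE_IP`, the server IP, the "specific IP range") are illustrative stand-ins:

```python
import ipaddress

OFFICE_IP = "203.0.113.10"  # hypothetical admin IP

# Each rule: (destination port range, allowed source networks).
# Default policy: drop anything no rule allows.
RULES = [
    (range(80, 81),      ["0.0.0.0/0"]),
    (range(443, 444),    ["0.0.0.0/0"]),
    (range(22, 23),      [OFFICE_IP + "/32"]),
    (range(1024, 65536), ["198.51.100.0/24"]),  # the rogue rule
]

def allowed(src_ip: str, dst_port: int) -> bool:
    """Stateless check: every packet judged alone, no connection tracking."""
    src = ipaddress.ip_address(src_ip)
    for ports, sources in RULES:
        if dst_port in ports and any(
            src in ipaddress.ip_network(net) for net in sources
        ):
            return True
    return False  # no rule matched: silently dropped

# Inbound SYN to port 443: allowed for anyone.
print(allowed("100.64.7.7", 443))      # True
# A packet toward a CGNAT client's ephemeral port 44821: only the rogue
# rule covers that port, and the source isn't in 198.51.100.0/24.
print(allowed("192.0.2.5", 44821))     # False
```

With connection tracking, the second check would never run: the packet would match an established flow and pass. Without it, the port-range rule gets a vote on traffic it was never meant to govern.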

  NORMAL CONNECTION (non-CGNAT client, source port 55000)
  ─────────────────────────────────────────────────────────────────

  Client              Firewall             Droplet / Nginx
  (port 55000)
      │                   │                     │
      │── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
      │                   │                     │── SYN-ACK ──▶ return path
      │◀──────────────────┼─────────────────────│  (dst port 55000)
      │                   │                     │
      │  Connection established ✓               │


  CGNAT CLIENT (source port 44821 — within rule range 1024-65535)
  ─────────────────────────────────────────────────────────────────

  Client              Firewall             Droplet / Nginx
  (CGNAT port 44821)
      │                   │                     │
      │── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
      │                   │                     │── SYN-ACK ──▶
      │                   │◀──── return packet ──│  (dst port 44821)
      │                   │                     │
      │                   │ port 44821 — matches rule 1024-65535
      │                   │ source: server IP — NOT in allowed range
      │                   │ DROPPED silently ✗  │
      │                   │                     │
      ×  Timeout after 30s │                     │
         (client never     │                     │
          gets SYN-ACK)    │                     │

Architecture fix: remove the rule, understand the firewall model

The fix was a single rule deletion. The rogue 1024–65535 rule had no legitimate purpose. I had added it during a "security hardening" session where I misread the docs and mentally treated the Cloud Firewall like stateful iptables. Classic case of applying the wrong mental model to a product whose name sounds familiar.

We considered moving to stateful iptables with conntrack inside the Droplet and decided against it. The Cloud Firewall is easier to audit across multiple Droplets from one place, and the real problem was a human one. Every firewall change now gets a second engineer to read it against the DigitalOcean statefulness docs before it ships.

  FIREWALL RULE CHANGE PROCESS (after incident)
  ─────────────────────────────────────────────────────────────────

  Engineer proposes firewall change
          │
          ▼
  Document: What port/range? What source? What protocol?
          │
          ▼
  Ask: Is this Cloud Firewall (stateless) or iptables (stateful)?
          │
          ├── Cloud Firewall ──▶ Does this rule affect RETURN TRAFFIC
          │                      from legitimate connections?
          │                          │
          │                          ├── YES → redesign or use iptables
          │                          └── NO  → second engineer review → apply
          │
          └── iptables ──────▶ Standard review → apply


  CORRECTED INBOUND RULES (after fix)
  ─────────────────────────────────────────────────────────────────

  TCP  port 80    sources: All IPv4, All IPv6  ✓
  TCP  port 443   sources: All IPv4, All IPv6  ✓
  TCP  port 22    sources: [trusted IPs only]  ✓
  (no port-range rules — ever)
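The "does this rule affect return traffic" question in the checklist can be partially automated. The sketch below lints inbound rules in the JSON shape DigitalOcean's `/v2/firewalls` API returns (`inbound_rules` entries with a `ports` string like `"80"`, `"1024-65535"`, or `"all"`); it assumes you have already fetched that JSON with an API token, and the ephemeral-range constant is the Linux default, not a universal truth:

```python
EPHEMERAL = (32768, 60999)  # typical client ephemeral range (assumption)

def parse_ports(spec: str) -> tuple[int, int]:
    """'80' -> (80, 80); '1024-65535' -> (1024, 65535); 'all' -> (1, 65535)."""
    if spec in ("all", "0"):
        return (1, 65535)
    lo, _, hi = spec.partition("-")
    return (int(lo), int(hi or lo))

def risky_inbound_rules(firewall: dict) -> list[dict]:
    """Flag inbound TCP rules whose port range overlaps the client
    ephemeral range -- candidates for clobbering return traffic on a
    stateless filter."""
    flagged = []
    for rule in firewall.get("inbound_rules", []):
        if rule.get("protocol") != "tcp":
            continue
        lo, hi = parse_ports(rule["ports"])
        if lo <= EPHEMERAL[1] and hi >= EPHEMERAL[0]:
            flagged.append(rule)
    return flagged

# Against a firewall dict shaped like our broken config:
fw = {"inbound_rules": [
    {"protocol": "tcp", "ports": "443"},
    {"protocol": "tcp", "ports": "1024-65535"},  # the rogue rule
]}
print(risky_inbound_rules(fw))  # flags only the 1024-65535 rule
```

Anything this flags goes straight to the "redesign or use iptables" branch of the process above.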

Lessons learned

  • Stateless firewalls need a different mental model than iptables. With conntrack, an established connection's return traffic is automatically allowed. Cloud Firewalls look at every packet in isolation. A port-range rule that reads as "allow high ports inbound" is also a rule about return traffic to those ports.
  • CGNAT is the default for mobile now. A huge chunk of mobile users share IPs and sit on high ephemeral ports. Any rule touching 1024–65535 hits them disproportionately.
  • Silent drops are invisible to server-side monitoring. Packets dropped outside the host never show up in Nginx logs, APM, or error tracking. Client-side error monitoring (JS error boundaries, mobile crash reporting) is the only way you'll see the hole.
  • tcpdump on the public interface is the last-resort oracle. When logs show nothing and clients report timeouts, run it during a live failure. If the SYN never arrives or no SYN-ACK leaves, suspect the OS or a host firewall; if the SYN-ACK leaves and the client still times out, the drop is happening outside the host, not in the app.

The rule has been gone for months. I still check the firewall tab first when anything mobile-shaped breaks.
