How an AWS SQS Visibility Timeout Caused 847 Orders to Be Processed Twice
Monday morning, 7:42 AM. The on-call engineer gets paged. "Multiple customers reporting duplicate charges." I pull up the dashboard. Order count is 3x what it should be. Stripe shows the same order IDs appearing multiple times. In under four hours, 847 orders had been processed twice. Some customers billed three times. We had a queue. We had idempotency keys. We thought we were safe.
Production failure
Our e-commerce checkout pipeline looked textbook clean. The checkout service pushed order
messages to an SQS queue. A fleet of Node.js workers polled the queue, processed each order
(database writes, Stripe charge, confirmation email), then deleted the message. At-least-once
delivery was handled with an idempotency key: a unique order_id stored in
PostgreSQL. Before touching Stripe, every worker ran this check.
async function processOrder(orderId: string, payload: OrderPayload) {
  // Idempotency check — skip if already processed
  const existing = await db.query(
    'SELECT processed FROM orders WHERE order_id = $1',
    [orderId]
  );
  if (existing.rows[0]?.processed) {
    console.log(`Order ${orderId} already processed, skipping`);
    return;
  }

  // Process: charge Stripe, write DB, send email (~35–45s total)
  const charge = await stripe.charges.create({ amount: payload.amount, currency: 'usd', source: payload.token });
  await db.query('INSERT INTO orders (order_id, charge_id, processed) VALUES ($1, $2, true)', [orderId, charge.id]);
  await sendConfirmationEmail(payload.email, orderId);

  // Delete the SQS message — but are we still within the visibility window?
  await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: payload.receiptHandle }).promise();
}
This looked safe. It wasn't. The bug wasn't in the code. It was in the interaction between our processing latency and a single SQS configuration value nobody had thought to question in about 18 months.
False assumptions
We'd made three compounding assumptions. Each reasonable in isolation, all three together lethal.
- A pre-flight SELECT would catch duplicates. It would, if only one worker ever ran the check at a time.
- Workers were fast enough. In development, order processing took 8 to 12 seconds. Nobody profiled it under production load with real Stripe latency and a busy PostgreSQL cluster.
- SQS visibility timeout was a safety net, not a deadline. We set it to 30 seconds and never looked at it again. It is, in fact, a hard deadline with a sharp edge.
The SQS documentation is explicit: if a consumer doesn't delete a message before the visibility timeout expires, SQS makes the message visible again for redelivery. I'd read that sentence more than once. I just never connected it to our actual processing latency.
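That redelivery semantic is easy to model with a toy in-memory queue. This is a sketch for illustration only, not the real SQS API (which also has receive counts, DLQ policies, and per-receive receipt handles); time is an explicit parameter so the timeline is visible:

```typescript
// Toy model of SQS visibility-timeout semantics (illustration only).
interface ToyMessage { id: string; visibleAt: number }

class ToyQueue {
  private messages: ToyMessage[] = [];

  send(id: string) { this.messages.push({ id, visibleAt: 0 }); }

  // Receiving hides the message until now + visibilityTimeout.
  receive(now: number, visibilityTimeout: number): string | undefined {
    const msg = this.messages.find(m => m.visibleAt <= now);
    if (!msg) return undefined;
    msg.visibleAt = now + visibilityTimeout;
    return msg.id;
  }

  // Real SQS accepts deletes idempotently; here we just remove the message.
  delete(id: string) { this.messages = this.messages.filter(m => m.id !== id); }
}

const q = new ToyQueue();
q.send('order-847');

const a = q.receive(0, 30);      // t=0: Worker A receives; hidden until t=30
const early = q.receive(10, 30); // t=10: still hidden, nothing returned
const b = q.receive(31, 30);     // t=31: timeout expired, SAME message again

console.log(a, early, b); // 'order-847', undefined, 'order-847'
```

Nothing errors and nothing is lost; the message simply becomes receivable again, which is exactly what at-least-once delivery promises.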
Reproducing the bottleneck
Before diving into CloudWatch, we timed actual order processing in production over 200 samples:

    Stage                            Production latency
    ─────────────────────────────    ──────────────────
    Stripe charge                    8–15s
    PostgreSQL writes (real load)    10–12s
    SendGrid confirmation email      5–8s
    Total                            routinely 40+s

Production was 3 to 4x slower than staging, and the total was well past our 30-second visibility timeout.
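The per-stage numbers came from plain wall-clock timing around each stage. A minimal sketch of the kind of helper we used (the helper and the stub stage functions are illustrative, not our production tracer):

```typescript
// Wrap each pipeline stage and record its wall-clock duration.
// The sleep() calls stand in for the real Stripe/DB/email calls.
const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

async function timed<T>(
  stage: string,
  timings: Record<string, number>,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    timings[stage] = Date.now() - start; // recorded even if the stage throws
  }
}

async function processWithTimings(): Promise<Record<string, number>> {
  const timings: Record<string, number> = {};
  await timed('stripe', timings, () => sleep(30)); // real call: 8–15s
  await timed('db', timings, () => sleep(20));     // real call: 10–12s
  await timed('email', timings, () => sleep(10));  // real call: 5–8s
  return timings;
}

processWithTimings().then(t => console.log(JSON.stringify(t)));
```

Logging the stage breakdown per message, not just the total, is what made the Stripe and PostgreSQL contributions obvious.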
BEFORE: Visibility Timeout Race Condition
─────────────────────────────────────────────────────────────────────
SQS Queue
┌─────────────────────────────────────────────────────────┐
│ [Order #847] visibility_timeout = 30s │
└─────────────────────────────────────────────────────────┘
│
│ t=0s: Worker A pulls message, message hidden
▼
┌──────────────┐
│ Worker A │ t=0s: SELECT processed → false ✓
│ │ t=5s: Stripe charge begins
│ │ t=20s: Stripe responds
│ PROCESSING │ t=28s: DB write begins
│ │
└──────────────┘
·
· t=30s: VISIBILITY TIMEOUT EXPIRES
· Message becomes visible again!
·
▼
┌──────────────┐
│ Worker B │ t=31s: Pulls same message
│ │ t=31s: SELECT processed → false ✓ (Worker A hasn't committed yet)
│ │ t=36s: Stripe charge begins — DUPLICATE CHARGE
│ PROCESSING │
│ │
└──────────────┘
·
· t=38s: Worker A commits, deletes message
· t=42s: Worker B commits, deletes message (receipt handle still valid)
· Two Stripe charges. One customer. One very angry morning.
Investigation: the sawtooth in CloudWatch
Once we knew what to look for, the CloudWatch metrics told the story. The
ApproximateNumberOfMessagesNotVisible metric (the count of messages currently being
processed) showed a distinctive sawtooth pattern: flat for about 30 seconds, then a spike as
messages re-emerged, then flat again. Every 30 seconds, like clockwork.
We pulled NumberOfMessagesSent vs NumberOfMessagesDeleted over the
four-hour incident window. Messages deleted exceeded messages sent by 847: one extra delete
per duplicated order, because Worker A and Worker B each deleted the same message after
processing it. SQS silently accepts deletes for already-deleted messages (it's idempotent at
the delete level, not at the consume level). No errors. No alarms. Just quiet double-processing
for four hours on a Monday morning.
Root cause: TOCTOU + clock mismatch
Two distinct failure modes compounding each other.
The first was a Time-of-Check / Time-of-Use race. The idempotency check
(SELECT processed WHERE order_id = ?) and the write
(INSERT ... processed = true) were not atomic. Two workers could both pass the
check in the gap before either committed. Under normal circumstances that window is
milliseconds. With a 30-second visibility timeout expiring mid-processing, the window was
guaranteed to be hit every time processing ran long.
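The race is easy to reproduce in miniature. A sketch with an in-memory "database": the Set stands in for PostgreSQL and the 50ms sleep for Stripe latency:

```typescript
// Two concurrent workers both pass a check-then-write idempotency guard.
async function demoRace(): Promise<number> {
  const processed = new Set<string>();
  let charges = 0;
  const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

  async function naiveWorker(orderId: string) {
    if (processed.has(orderId)) return; // time of CHECK
    await sleep(50);                    // "Stripe latency": the race window
    charges += 1;                       // duplicate side effect lands here
    processed.add(orderId);             // time of USE (the write)
  }

  // Both workers run the check before either reaches the write.
  await Promise.all([naiveWorker('847'), naiveWorker('847')]);
  return charges;
}

demoRace().then(c => console.log(c)); // 2 — the customer is charged twice
```

In production the second "worker" is whichever instance receives the redelivered message; the visibility timeout expiring mid-processing is what reliably lines the two up inside the race window.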
The second was that we never measured our latency budget. The visibility timeout was set once in a Terraform file 18 months before the incident. Nobody re-examined it as the order pipeline grew. More validation logic, a third-party fraud check, an email delivery confirmation step. The timeout was a number on a line nobody touched.
Architecture fix
We needed independent, layered fixes; any one alone was insufficient. The timeout fix wouldn't prevent a race on worker restarts, and the lock fix wouldn't prevent the timeout from triggering unnecessary redeliveries.

Fix 1 was a heartbeat: while a worker is processing, a background interval extends the message's visibility timeout every 20 seconds, so the message stays invisible until the worker explicitly finishes or crashes. Fix 2 was a PostgreSQL advisory lock to serialize concurrent workers on the same order, and Fix 3 replaced the check-then-write guard with an atomic upsert.
async function processOrderSafely(message: SQS.Message, payload: OrderPayload) {
  const orderId = payload.orderId;

  // --- FIX 1: Heartbeat — keep message invisible while we work ---
  const heartbeat = setInterval(async () => {
    try {
      await sqs.changeMessageVisibility({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
        VisibilityTimeout: 60, // reset the remaining window to 60s on each beat
      }).promise();
    } catch (err) {
      // Message may have already been deleted — safe to ignore
      console.warn('Heartbeat extension failed (message already gone?)', err);
    }
  }, 20_000); // every 20s, well within the 60s window

  // pg_try_advisory_xact_lock is transaction-scoped: it must run on a single
  // connection inside BEGIN...COMMIT, not via pool.query (which may use a
  // different autocommitting connection per call and drop the lock instantly).
  const client = await db.connect();
  try {
    await client.query('BEGIN');

    // --- FIX 2: Atomic distributed lock via PostgreSQL advisory lock ---
    // pg_try_advisory_xact_lock takes a bigint; hash the orderId to one
    const lockKey = hashOrderId(orderId); // stable integer hash
    const lockResult = await client.query(
      'SELECT pg_try_advisory_xact_lock($1) AS acquired',
      [lockKey]
    );
    if (!lockResult.rows[0].acquired) {
      console.log(`Order ${orderId} locked by another worker — skipping`);
      await client.query('ROLLBACK');
      return; // another worker has the lock; let it finish and delete the message
    }

    // --- FIX 3: Atomic upsert — insert only if not already processed ---
    const insert = await client.query(`
      INSERT INTO orders (order_id, status, created_at)
      VALUES ($1, 'processing', NOW())
      ON CONFLICT (order_id) DO NOTHING
      RETURNING order_id
    `, [orderId]);
    if (insert.rowCount === 0) {
      console.log(`Order ${orderId} already exists — idempotent skip`);
      await client.query('COMMIT');
      // Genuine duplicate delivery: delete the message so it stops cycling
      await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }).promise();
      return;
    }

    // Safe to charge — lock held, row owned, no other worker can get here.
    // stripe-node takes the idempotency key in the request options argument.
    const charge = await stripe.charges.create(
      { amount: payload.amount, currency: 'usd', source: payload.token },
      { idempotencyKey: `order-${orderId}` } // Stripe-level idempotency as last resort
    );
    await client.query(
      'UPDATE orders SET status = $1, charge_id = $2 WHERE order_id = $3',
      ['completed', charge.id, orderId]
    );
    await client.query('COMMIT'); // advisory xact lock releases here

    await sendConfirmationEmail(payload.email, orderId);
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }).promise();
  } catch (err) {
    await client.query('ROLLBACK').catch(() => {});
    throw err;
  } finally {
    clearInterval(heartbeat); // always stop the heartbeat
    client.release();
  }
}
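The hashOrderId helper above has to map an arbitrary order ID onto a stable integer for PostgreSQL's advisory-lock keyspace. One possible sketch using 32-bit FNV-1a (any stable hash works here; a collision only causes spurious lock contention between two different orders, never a correctness bug):

```typescript
// Stable 32-bit FNV-1a hash of a string, returned as a signed 32-bit
// integer — comfortably within pg_try_advisory_xact_lock's bigint key range.
function hashOrderId(orderId: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < orderId.length; i++) {
    hash ^= orderId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash | 0; // force into signed 32-bit range
}

console.log(hashOrderId('order-847') === hashOrderId('order-847')); // true: stable
console.log(hashOrderId('order-847') !== hashOrderId('order-848')); // true
```

Stability across processes is the requirement: every worker must derive the same lock key from the same order ID, so anything seeded or randomized is out.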
Together, the three layers give us effectively-once processing even under concurrent workers and long processing times: SQS still delivers at least once, but a duplicate delivery is detected and discarded before any side effect runs.
AFTER: Heartbeat + Distributed Lock + Atomic Upsert
────────────────────────────────────────────────────────────────────
SQS Queue
┌──────────────────────────────────────────────────────────────┐
│ [Order #847] initial visibility_timeout = 30s │
└──────────────────────────────────────────────────────────────┘
│
│ t=0s: Worker A pulls message
▼
┌──────────────────────────────────────────────┐
│ Worker A │
│ │
│ t=0s: pg_try_advisory_xact_lock → true ✓ │
│ t=0s: INSERT ... ON CONFLICT DO NOTHING │
│ → 1 row inserted ✓ │
│ │
│ ┌── Heartbeat (every 20s) ──────────────┐ │
│ │ t=20s: extend visibility → +60s ✓ │ │
│ │ t=40s: extend visibility → +60s ✓ │ │
│ └───────────────────────────────────────┘ │
│ │
│ t=42s: Stripe charge complete │
│ t=44s: DB update, email sent │
│ t=44s: sqs.deleteMessage() ✓ │
│ t=44s: clearInterval(heartbeat) │
└──────────────────────────────────────────────┘
If Worker B somehow pulls the same message (e.g., on restart):
┌──────────────────────────────────────────────┐
│ Worker B │
│ pg_try_advisory_xact_lock → false ✗ │
│ → returns immediately, no Stripe call │
└──────────────────────────────────────────────┘
Result: Zero duplicate charges. Message processed exactly once.
Results after the fix
We also added a CloudWatch alarm on the ratio of NumberOfMessagesDeleted to
NumberOfMessagesSent. Anything over 1.05 (5% tolerance for legitimate retries)
pages the on-call engineer immediately. In retrospect, that alarm should have been the very
first thing we set up when we built the pipeline.
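The alarm condition itself is trivial to encode. A sketch of the check behind it (the 1.05 threshold is the one described above; the function name is illustrative):

```typescript
// Deleted/sent ratio check behind the CloudWatch alarm (sketch).
// A ratio above 1.05 means >5% of messages were deleted more than once —
// i.e., the same message was processed by more than one worker.
function shouldPage(deleted: number, sent: number, threshold = 1.05): boolean {
  if (sent === 0) return false; // idle queue: nothing to compare
  return deleted / sent > threshold;
}

console.log(shouldPage(1050, 1000)); // false: exactly at the 5% tolerance
console.log(shouldPage(1847, 1000)); // true: the incident-morning ratio
```

In CloudWatch terms this is a metric-math alarm on the two queue metrics over the same period; the point is that the signal was sitting in metrics we already had, unalarmed, for the entire incident.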
Lessons learned
- SQS visibility timeout is a hard deadline, not a suggestion. Treat it as a maximum processing budget. If you can't guarantee completion within it, use heartbeats. Rule of thumb: set initial timeout to 2x your p50 latency; extend via heartbeat for anything longer.
- Idempotency checks must be atomic. A SELECT followed by an INSERT is a TOCTOU race. Use INSERT ... ON CONFLICT DO NOTHING and check rowCount. If it's 0, another worker already claimed the work.
- At-least-once delivery requires true idempotency, not a guard clause. "We have idempotency keys" is only true if the write that sets the key and the side effect it guards are in the same atomic operation, or protected by a distributed lock.
- Config values drift away from reality silently. Our 30-second timeout was set when the pipeline took 10 seconds. Nobody re-examined it as we added fraud checks, email confirmations, and retry logic. Put latency budget reviews into your architecture change process, not just code review.
- Stripe-level idempotency is a last resort. We added Stripe idempotency keys as a third layer. Stripe deduplicates identical keys within 24 hours. Don't rely on your payment provider to catch what your architecture should have prevented.