How AWS SQS Visibility Timeout Caused the Same Order to Be Processed 847 Times
March 14, 2026 · Architecture · 10 min read


Monday morning, 7:42 AM. The on-call engineer gets paged. "Multiple customers reporting duplicate charges." I pull up the dashboard. Order count is 3x what it should be. Stripe shows the same order IDs appearing multiple times. In under four hours, 847 orders had been processed twice. Some customers billed three times. We had a queue. We had idempotency keys. We thought we were safe.


Production failure

Our e-commerce checkout pipeline looked textbook clean. The checkout service pushed order messages to an SQS queue. A fleet of Node.js workers polled the queue, processed each order (database writes, Stripe charge, confirmation email), then deleted the message. At-least-once delivery was handled with an idempotency key: a unique order_id stored in PostgreSQL. Before touching Stripe, every worker ran this check.

order-processor.ts (original)
async function processOrder(orderId: string, payload: OrderPayload) {
  // Idempotency check — skip if already processed
  const existing = await db.query(
    'SELECT processed FROM orders WHERE order_id = $1',
    [orderId]
  );
  if (existing.rows[0]?.processed) {
    console.log(`Order ${orderId} already processed, skipping`);
    return;
  }

  // Process: charge Stripe, write DB, send email (~35–45s total)
  const charge = await stripe.charges.create({ amount: payload.amount, currency: 'usd', source: payload.token });
  await db.query('INSERT INTO orders (order_id, charge_id, processed) VALUES ($1, $2, true)', [orderId, charge.id]);
  await sendConfirmationEmail(payload.email, orderId);

  // Delete the SQS message — but are we still within the visibility window?
  await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: payload.receiptHandle }).promise();
}

This looked safe. It wasn't. The bug wasn't in the code. It was in the interaction between our processing latency and a single SQS configuration value nobody had thought to question in about 18 months.


False assumptions

We'd made three compounding assumptions. Each was reasonable in isolation; together they were lethal.

  • A pre-flight SELECT would catch duplicates. It would, if only one worker ever ran the check at a time.
  • Workers were fast enough. In development, order processing took 8 to 12 seconds. Nobody profiled it under production load with real Stripe latency and a busy PostgreSQL cluster.
  • SQS visibility timeout was a safety net, not a deadline. We set it to 30 seconds and never looked at it again. It is, in fact, a hard deadline with a sharp edge.

The SQS documentation is explicit: if a consumer doesn't delete a message before the visibility timeout expires, SQS makes the message visible again for redelivery. I'd read that sentence more than once. I just never connected it to our actual processing latency.


Reproducing the bottleneck

Before diving into CloudWatch, we timed actual order processing in production over 200 samples:

  • 8–12s — dev / staging latency
  • 35–45s — prod latency (p95)
  • 30s — SQS visibility timeout
  • 847 — duplicate orders (4 hrs)

Production was 3 to 4x slower than staging. Stripe added 8 to 15 seconds. PostgreSQL under real load added another 10 to 12 seconds for write-heavy transactions. SendGrid's confirmation email added another 5 to 8. Total routinely 40+ seconds, well past our 30-second timeout.
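The arithmetic is worth making explicit. A minimal sketch of the check we should have run long before the incident — sum the worst-case stage latencies and compare against the visibility timeout (the stage names and numbers here mirror our measurements, but the helper itself is hypothetical):

```typescript
// Hypothetical latency-budget check. Stage timings are [min, max] in seconds,
// taken from real production samples rather than dev-machine numbers.
type StageTimings = Record<string, [number, number]>;

const prodStages: StageTimings = {
  stripeCharge: [8, 15],
  postgresWrites: [10, 12],
  confirmationEmail: [5, 8],
};

// Does the worst-case pipeline latency blow past the queue's visibility timeout?
function budgetExceeded(stages: StageTimings, visibilityTimeoutSec: number): boolean {
  const worstCase = Object.values(stages).reduce((sum, [, max]) => sum + max, 0);
  return worstCase > visibilityTimeoutSec;
}
```

With these numbers, `budgetExceeded(prodStages, 30)` comes back true: 15 + 12 + 8 = 35 seconds of worst-case latency against a 30-second deadline. Any CI check that ran this comparison would have flagged the pipeline months before a customer did.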

BEFORE: Visibility Timeout Race Condition
─────────────────────────────────────────────────────────────────────

  SQS Queue
  ┌─────────────────────────────────────────────────────────┐
  │  [Order #847]  visibility_timeout = 30s                 │
  └─────────────────────────────────────────────────────────┘
         │
         │ t=0s: Worker A pulls message, message hidden
         ▼
  ┌──────────────┐
  │   Worker A   │  t=0s:  SELECT processed → false ✓
  │              │  t=5s:  Stripe charge begins
  │              │  t=20s: Stripe responds
  │  PROCESSING  │  t=28s: DB write begins
  │              │
  └──────────────┘
         ·
         · t=30s: VISIBILITY TIMEOUT EXPIRES
         ·        Message becomes visible again!
         ·
         ▼
  ┌──────────────┐
  │   Worker B   │  t=31s: Pulls same message
  │              │  t=31s: SELECT processed → false ✓ (Worker A hasn't committed yet)
  │              │  t=36s: Stripe charge begins — DUPLICATE CHARGE
  │  PROCESSING  │
  │              │
  └──────────────┘
         ·
         · t=38s: Worker A commits, deletes message
         · t=42s: Worker B commits, deletes message (receipt handle still valid)
         ·        Two Stripe charges. One customer. One very angry morning.

Investigation: the sawtooth in CloudWatch

Once we knew what to look for, the CloudWatch metrics told the story. The ApproximateNumberOfMessagesNotVisible metric (messages currently in flight with a consumer) showed a distinctive sawtooth pattern: flat for about 30 seconds, then a spike as messages re-emerged, then flat again. Every 30 seconds, like clockwork.
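That pattern is mechanical enough to detect in code. A rough sketch of the spike-spacing check we later scripted against exported metric samples (the function and its threshold are illustrative, not a real CloudWatch API):

```typescript
// Given NotVisible samples at a fixed 1-second resolution, return the spacing
// between sudden jumps. Spacings clustering around the visibility timeout
// are the redelivery sawtooth.
function spikeIntervals(samples: number[], jumpThreshold: number): number[] {
  const spikeTimes: number[] = [];
  for (let t = 1; t < samples.length; t++) {
    if (samples[t] - samples[t - 1] >= jumpThreshold) spikeTimes.push(t);
  }
  // Convert spike timestamps into gaps between consecutive spikes
  return spikeTimes.slice(1).map((t, i) => t - spikeTimes[i]);
}
```

Feed it a flat series with jumps at t=30, 60, 90 and it returns [30, 30] — the same 30-second period as the queue's visibility timeout, which is exactly what our real metrics showed.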

We pulled NumberOfMessagesSent vs NumberOfMessagesDeleted over the four-hour incident window. Messages deleted exceeded messages sent by 847: one extra delete per duplicated order, because Worker A and Worker B each deleted the same message after processing it. SQS silently accepts deletes for already-deleted messages (it's idempotent at the delete level, not at the consume level). No errors. No alarms. Just quiet double-processing for four hours on a Monday morning.

The scariest production bugs aren't the ones that throw errors. They're the ones that succeed quietly, twice.

Root cause: TOCTOU + an unmeasured latency budget

Two distinct failure modes compounding each other.

The first was a Time-of-Check / Time-of-Use race. The idempotency check (SELECT processed WHERE order_id = ?) and the write (INSERT ... processed = true) were not atomic. Two workers could both pass the check in the gap before either committed. Under normal circumstances that window is milliseconds. With a 30-second visibility timeout expiring mid-processing, the window was guaranteed to be hit every time processing ran long.
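The race is easy to reproduce without SQS or Postgres at all. A toy sketch with an in-memory "orders table": the racy version awaits between check and write (the TOCTOU gap), the safe version claims the order synchronously before any side effect, analogous to what INSERT ... ON CONFLICT DO NOTHING gives you in one statement. All names here are illustrative.

```typescript
// Shared state standing in for the orders table and the Stripe charge count
const processedOrders = new Set<string>();
let chargeCount = 0;

const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

// Racy: check, await (the gap), then write — the original pattern
async function racyWorker(orderId: string): Promise<void> {
  if (processedOrders.has(orderId)) return; // time-of-check
  await sleep(10);                          // Stripe/DB latency — the gap
  chargeCount++;                            // side effect ("the charge")
  processedOrders.add(orderId);             // time-of-use
}

// Safe: check-and-claim with no awaited gap in between, so only the first
// concurrent caller ever reaches the side effect
async function safeWorker(orderId: string, claimed: Set<string>, charge: () => void): Promise<void> {
  if (claimed.has(orderId)) return;
  claimed.add(orderId);                     // claim atomically (single tick)
  await sleep(10);
  charge();
}
```

Run two `racyWorker('ord-1')` calls concurrently and both pass the check before either writes: two charges. Two `safeWorker` calls for the same order produce exactly one.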

The second was that we never measured our latency budget. The visibility timeout was set once in a Terraform file 18 months before the incident. Nobody re-examined it as the order pipeline grew. More validation logic, a third-party fraud check, an email delivery confirmation step. The timeout was a number on a line nobody touched.


Architecture fix

We needed two independent fixes. Either one alone was insufficient. The timeout fix wouldn't prevent a race on worker restarts. The lock fix wouldn't prevent the timeout from triggering unnecessary redeliveries.

Fix 1 was a heartbeat. While a worker is processing, a background interval extends the message's visibility timeout every 20 seconds. The message stays invisible until the worker explicitly finishes or crashes.

order-processor.ts (with heartbeat + distributed lock)
async function processOrderSafely(message: SQS.Message, payload: OrderPayload) {
  const orderId = payload.orderId;

  // --- FIX 1: Heartbeat — keep message invisible while we work ---
  const heartbeat = setInterval(async () => {
    try {
      await sqs.changeMessageVisibility({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
        VisibilityTimeout: 60, // reset the clock to 60s from now
      }).promise();
    } catch (err) {
      // Message may have already been deleted — safe to ignore
      console.warn('Heartbeat extension failed (message already gone?)', err);
    }
  }, 20_000); // every 20s, well within the 60s window

  // Advisory *transaction* locks only live inside an explicit transaction —
  // outside one, each statement auto-commits and the lock vanishes instantly.
  // So we pin one client and wrap the critical section in BEGIN/COMMIT.
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // --- FIX 2: Atomic distributed lock via PostgreSQL advisory lock ---
    // pg_try_advisory_xact_lock takes a bigint; hash the orderId to one
    const lockKey = hashOrderId(orderId); // stable integer hash
    const lockResult = await client.query(
      'SELECT pg_try_advisory_xact_lock($1) AS acquired',
      [lockKey]
    );

    if (!lockResult.rows[0].acquired) {
      console.log(`Order ${orderId} locked by another worker — skipping`);
      await client.query('ROLLBACK');
      return; // another worker has the lock; let it finish
    }

    // --- FIX 3: Atomic upsert — insert only if not already processed ---
    const insert = await client.query(`
      INSERT INTO orders (order_id, status, created_at)
      VALUES ($1, 'processing', NOW())
      ON CONFLICT (order_id) DO NOTHING
      RETURNING order_id
    `, [orderId]);

    if (insert.rowCount === 0) {
      console.log(`Order ${orderId} already exists — idempotent skip`);
      await client.query('ROLLBACK');
      return; // row was already there; genuine duplicate, safe to skip
    }

    // Safe to charge — lock held, row owned, no other worker can get here.
    // Note: Stripe takes the idempotency key as a request option, not a
    // charge parameter.
    const charge = await stripe.charges.create(
      { amount: payload.amount, currency: 'usd', source: payload.token },
      { idempotencyKey: `order-${orderId}` } // Stripe-level idempotency as last resort
    );

    await client.query(
      'UPDATE orders SET status = $1, charge_id = $2 WHERE order_id = $3',
      ['completed', charge.id, orderId]
    );
    await client.query('COMMIT'); // advisory lock releases here

    await sendConfirmationEmail(payload.email, orderId);
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }).promise();

  } catch (err) {
    await client.query('ROLLBACK').catch(() => {}); // no-op if already committed
    throw err; // let the message redeliver once visibility expires
  } finally {
    clearInterval(heartbeat); // always stop the heartbeat
    client.release();
  }
}

The three-layer defense gives effectively-once processing even under concurrent workers and long processing times — SQS still only promises at-least-once delivery, but a duplicate delivery can no longer reach Stripe.

AFTER: Heartbeat + Distributed Lock + Atomic Upsert
────────────────────────────────────────────────────────────────────

  SQS Queue
  ┌──────────────────────────────────────────────────────────────┐
  │  [Order #847]  initial visibility_timeout = 30s              │
  └──────────────────────────────────────────────────────────────┘
         │
         │ t=0s:  Worker A pulls message
         ▼
  ┌──────────────────────────────────────────────┐
  │               Worker A                       │
  │                                              │
  │  t=0s:  pg_try_advisory_xact_lock → true ✓  │
  │  t=0s:  INSERT ... ON CONFLICT DO NOTHING    │
  │         → 1 row inserted ✓                  │
  │                                              │
  │  ┌── Heartbeat (every 20s) ──────────────┐  │
  │  │  t=20s: extend visibility → +60s ✓   │  │
  │  │  t=40s: extend visibility → +60s ✓   │  │
  │  └───────────────────────────────────────┘  │
  │                                              │
  │  t=42s: Stripe charge complete              │
  │  t=44s: DB update, email sent               │
  │  t=44s: sqs.deleteMessage() ✓              │
  │  t=44s: clearInterval(heartbeat)            │
  └──────────────────────────────────────────────┘

  If Worker B somehow pulls the same message (e.g., on restart):
  ┌──────────────────────────────────────────────┐
  │               Worker B                       │
  │  pg_try_advisory_xact_lock → false ✗        │
  │  → returns immediately, no Stripe call       │
  └──────────────────────────────────────────────┘

  Result: Zero duplicate charges. Message processed exactly once.

Results after the fix

  • 0 — duplicate charges (30 days post-fix)
  • 847 → 0 — incident replication rate
  • 30s → 90s — initial visibility timeout (buffer)
  • Unbounded — effective timeout via heartbeat (extended every 20s)

We also added a CloudWatch alarm on the ratio of NumberOfMessagesDeleted to NumberOfMessagesSent. Anything over 1.05 (5% tolerance for legitimate retries) pages the on-call engineer immediately. Which in retrospect should have been the very first thing we set up when we built the pipeline.
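The alarm condition itself is a one-liner. A hypothetical sketch of the decision logic, using the incident's own numbers as the example (the real alarm lives in CloudWatch metric math, not application code):

```typescript
// Page when deletes outpace sends by more than the retry tolerance.
// 1.05 = 5% headroom for legitimate redeliveries and retries.
function shouldPage(messagesDeleted: number, messagesSent: number, tolerance = 1.05): boolean {
  if (messagesSent === 0) return false; // idle queue — nothing to compare
  return messagesDeleted / messagesSent > tolerance;
}
```

Against the incident window — roughly 10,000 sends and 847 extra deletes — `shouldPage(10_847, 10_000)` fires (ratio 1.0847), while a healthy window like `shouldPage(10_200, 10_000)` stays quiet.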


Lessons learned

  • SQS visibility timeout is a hard deadline, not a suggestion. Treat it as a maximum processing budget. If you can't guarantee completion within it, use heartbeats. Rule of thumb: set initial timeout to 2x your p50 latency; extend via heartbeat for anything longer.
  • Idempotency checks must be atomic. A SELECT followed by an INSERT is a TOCTOU race. Use INSERT ... ON CONFLICT DO NOTHING and check rowCount. If it's 0, another worker already claimed the work.
  • At-least-once delivery requires true idempotency, not a guard clause. "We have idempotency keys" is only true if the write that sets the key and the side effect it guards are in the same atomic operation, or protected by a distributed lock.
  • Config values drift away from reality silently. Our 30-second timeout was set when the pipeline took 10 seconds. Nobody re-examined it as we added fraud checks, email confirmations, and retry logic. Put latency budget reviews into your architecture change process, not just code review.
  • Stripe-level idempotency is a last resort. We added idempotency_key to Stripe calls as a third layer. Stripe deduplicates identical keys within 24 hours. Don't rely on your payment provider to catch what your architecture should have prevented.
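The timeout rule of thumb above can be codified so it survives the next pipeline change. A sketch (the helper name is ours; the 43,200-second ceiling is SQS's documented maximum visibility timeout of 12 hours):

```typescript
// Initial visibility timeout = 2x p50 processing latency, clamped to SQS's
// allowed range. Heartbeats cover anything beyond this, so the initial value
// only needs to absorb typical runs, not the tail.
function initialVisibilityTimeout(p50Seconds: number): number {
  const recommended = Math.ceil(p50Seconds * 2);
  return Math.min(Math.max(recommended, 1), 43_200); // SQS max: 12 hours
}
```

For our pipeline's ~40-second p50 this yields 80 seconds — close to the 90 seconds we settled on, with the extra 10 as buffer.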
SQS gives you at-least-once delivery by design. Your job is to make your consumers idempotent enough to survive it, and fast enough not to trigger it.