June 27, 2026 · Architecture · 10 min read

Agents in production, part 2: the 150-person scale-up

A 150-person company built a three-step document processing pipeline: step 1 extracts key fields with an LLM, step 2 classifies the document type, and step 3 generates a structured summary. After a routine deploy, 12% of documents came back with wrong classifications. By the time anyone noticed, 4,000 documents had been processed. The audit to find which step introduced the error took three days.

The pipeline was built on the patterns from Part 1 — async queue, circuit breaker, output validation. None of those saved them. The failure mode was different: not a single LLM call timing out, but a multi-step chain where one step's bad output fed silently into the next. No step-level tracing. No checkpoint state. No way to replay just the broken step.

This is part 2 of the series. It covers the architecture problems that appear when you move from isolated LLM calls to multi-step agent workflows, and the specific patterns that give you visibility and control at each step in the chain.

What the scale-up built on top of the startup patterns

By the time a company reaches 150 people, the single LLM call patterns are already in place. LLM calls are queued and run asynchronously. There is a circuit breaker. Outputs are validated against a schema before they are written to the database. The startup fire drills are in the past.

What changes at this scale is the workflow complexity. Instead of one LLM call per job, you have three, five, or eight — each dependent on the previous output. The naive implementation chains them sequentially in a single worker function:

// The naive multi-step chain — works in staging, breaks in production
async function processDocument(job: Job): Promise<void> {
  const { documentId, content } = job.data;

  // Step 1: extract
  const extracted = await llm.complete(buildExtractionPrompt(content));
  const fields = ExtractionSchema.parse(JSON.parse(extracted));

  // Step 2: classify
  const classified = await llm.complete(buildClassificationPrompt(fields));
  const docType = ClassificationSchema.parse(JSON.parse(classified));

  // Step 3: summarize
  const summary = await llm.complete(buildSummaryPrompt(fields, docType));
  const result = SummarySchema.parse(JSON.parse(summary));

  await db.saveResult({ documentId, fields, docType, result });
}

This looks reasonable. The schemas are there. Each step awaits. But three things are missing that do not matter at low scale and become critical at production volume.

What breaks in production at this scale

No step-level visibility. You have a job ID and a final status. You do not have a record of what each individual step received as input, what it returned, how long it took, or which model call it made. When classification goes wrong, you cannot tell whether extraction produced malformed fields that confused the classifier, or whether the classification prompt itself drifted after a model update. Both look identical in your logs: a wrong output in the database.

Retry starts from the beginning. When step 3 fails — a 429 rate limit, a validation error, a transient timeout — BullMQ retries the entire job from step 1. You re-run the extraction call and the classification call, spending tokens and time on steps that already succeeded. At 10,000 documents per day, a 5% retry rate means 500 full-chain reruns per day, paying for three LLM calls when one failed.

Cross-step contamination is invisible. When a schema validation is too loose — a field allows null where it should require a value — step 1 passes, step 2 receives a partial input and fills in a default, step 3 succeeds against that default. The final output looks valid. Nothing errors. You discover the problem three weeks later in a data quality review, with no way to trace which documents were processed with the corrupted intermediate state.

// What the failure looks like in logs — no step context
[ERROR] job:doc-processing:7f3a2b1 failed attempt 3/5
  Error: ZodError at SummarySchema.parse
  job.data.documentId: "doc-89234"
  duration: 47,832ms

// What you actually need to know
[STEP] trace:8d91c job:7f3a2b1 step:extraction PASS (2,341ms) model:gpt-4o
[STEP] trace:8d91c job:7f3a2b1 step:classification PASS (1,887ms) model:gpt-4o
[STEP] trace:8d91c job:7f3a2b1 step:summarization FAIL (8,491ms)
  input.docType: null     ← contamination from step 2
  error: required field missing
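
The retry cost described above can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: `retryCallsPerDay` and its parameter names are hypothetical, every step is assumed to make exactly one LLM call, and every retried job is assumed to fail once at the same step.

```typescript
// Hypothetical helper: counts LLM calls spent on retries under two strategies.
// Assumes one LLM call per step and a single retry per affected job.
interface RetryCostInput {
  jobsPerDay: number;
  retryRate: number;        // fraction of jobs that need one retry, e.g. 0.05
  stepsPerJob: number;
  failingStepIndex: number; // zero-based index of the step that failed
}

function retryCallsPerDay(input: RetryCostInput): { fullChain: number; checkpointed: number } {
  const retriedJobs = Math.round(input.jobsPerDay * input.retryRate);
  // Full-chain retry re-runs every step, including the ones that already passed.
  const fullChain = retriedJobs * input.stepsPerJob;
  // A resumed retry runs only the failed step and the steps after it.
  const checkpointed = retriedJobs * (input.stepsPerJob - input.failingStepIndex);
  return { fullChain, checkpointed };
}

const cost = retryCallsPerDay({
  jobsPerDay: 10_000,
  retryRate: 0.05,
  stepsPerJob: 3,
  failingStepIndex: 2, // the failure happens at step 3
});
console.log(cost); // { fullChain: 1500, checkpointed: 500 }
```

With the article's numbers, 500 retried jobs cost 1,500 LLM calls when the whole chain reruns, against 500 if only the failed step were repeated.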

The corrected architecture: step-aware pipelines

The fix is to make the step the first-class unit of the pipeline, not the job. Each step gets its own identity, its own state record, its own retry scope. Four patterns close the gap:

1. A step interface with explicit inputs and outputs. Every step in the chain is a function with a typed input, typed output, and a string key. The runner controls execution order. Steps do not call each other directly.

interface PipelineStep<TIn, TOut> {
  key: string;
  execute(input: TIn, ctx: StepContext): Promise<TOut>;
  validate(output: unknown): TOut; // throws on invalid
}

interface StepContext {
  traceId: string;
  jobId: string;
  stepIndex: number;
}

// Each step is a self-contained unit
const extractionStep: PipelineStep<RawDocument, ExtractedFields> = {
  key: 'extraction',
  async execute(input, ctx) {
    const raw = await llm.complete(buildExtractionPrompt(input.content), {
      metadata: { traceId: ctx.traceId, step: this.key }, // StepContext has no key field; use the step's own key
    });
    return this.validate(JSON.parse(raw));
  },
  validate(output) {
    return ExtractionSchema.parse(output); // strict schema — no nulls allowed
  },
};

2. Checkpoint-based execution. The step runner persists each step's result before moving to the next. On retry, it loads the checkpoint and skips completed steps. A job that fails at step 3 resumes from step 3, not step 1.

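// Shape of the record persisted for each completed step
interface StepCheckpoint {
  input: unknown;
  output: unknown;
  durationMs: number;
  completedAt: string;
}

class StepRunner {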
  async run<T>(
    steps: PipelineStep<unknown, unknown>[],
    initialInput: T,
    jobId: string
  ): Promise<unknown> {
    const traceId = generateTraceId();
    let currentInput: unknown = initialInput;

    for (let i = 0; i < steps.length; i++) {
      const step = steps[i];
      const ctx: StepContext = { traceId, jobId, stepIndex: i };

      // Resume from checkpoint if step already completed
      const checkpoint = await this.loadCheckpoint(jobId, step.key);
      if (checkpoint) {
        currentInput = checkpoint.output;
        continue;
      }

      const spanStart = Date.now();
      const output = await step.execute(currentInput as never, ctx);

      // Persist before advancing — if the next step throws, we can resume here
      await this.saveCheckpoint(jobId, step.key, {
        input: currentInput,
        output,
        durationMs: Date.now() - spanStart,
        completedAt: new Date().toISOString(),
      });

      currentInput = output;
    }

    return currentInput;
  }

  private async saveCheckpoint(
    jobId: string,
    stepKey: string,
    data: StepCheckpoint
  ): Promise<void> {
    await redis.set(
      `checkpoint:${jobId}:${stepKey}`,
      JSON.stringify(data),
      'EX', 86400 // 24h TTL
    );
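  }

  // Returns the stored checkpoint, or null if the step has not completed yet
  private async loadCheckpoint(
    jobId: string,
    stepKey: string
  ): Promise<StepCheckpoint | null> {
    const raw = await redis.get(`checkpoint:${jobId}:${stepKey}`);
    return raw ? (JSON.parse(raw) as StepCheckpoint) : null;
  }
}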
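
To see the resume behavior in isolation, here is a minimal sketch that swaps Redis for an in-memory `Map` and uses stub steps instead of LLM calls. The names `runWithCheckpoints`, `demoSteps`, and `executions` are hypothetical and exist only for this illustration; the summarization stub is rigged to fail once, simulating a rate limit.

```typescript
// In-memory stand-in for the Redis checkpoint store (assumption for this sketch).
type Step = { key: string; execute: (input: unknown) => Promise<unknown> };

const checkpoints = new Map<string, unknown>();
const executions: string[] = []; // records which steps actually ran

async function runWithCheckpoints(steps: Step[], input: unknown, jobId: string): Promise<unknown> {
  let current = input;
  for (const step of steps) {
    const cacheKey = `checkpoint:${jobId}:${step.key}`;
    if (checkpoints.has(cacheKey)) {
      current = checkpoints.get(cacheKey); // completed on an earlier attempt: skip
      continue;
    }
    executions.push(step.key);
    current = await step.execute(current);
    checkpoints.set(cacheKey, current); // persist before advancing
  }
  return current;
}

let summarizeAttempts = 0;
const demoSteps: Step[] = [
  { key: 'extraction', execute: async (doc) => ({ text: String(doc) }) },
  { key: 'classification', execute: async (fields) => ({ fields, docType: 'invoice' }) },
  {
    key: 'summarization',
    execute: async (classified) => {
      summarizeAttempts += 1;
      if (summarizeAttempts === 1) throw new Error('429 rate limited'); // fail once
      return { classified, summary: 'ok' };
    },
  },
];

async function demo(): Promise<string[]> {
  try {
    await runWithCheckpoints(demoSteps, 'raw document', 'job-1');
  } catch (_err) {
    // First attempt fails at summarization; a queue would schedule the retry here.
  }
  await runWithCheckpoints(demoSteps, 'raw document', 'job-1');
  return executions;
}

const demoRun = demo();
demoRun.then((ran) => console.log(ran));
// [ 'extraction', 'classification', 'summarization', 'summarization' ]
```

Extraction and classification each run once across both attempts; only summarization runs twice.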

3. Trace spans per step. Every step execution emits a structured log with the trace ID, step key, input hash, output hash, latency, and pass/fail status; the model name can be added to the same record when the step reports it. This gives you the audit trail to answer: which step failed, with what input, and how often.

// Tracing middleware — wrap any step to add observability
function withTracing<TIn, TOut>(
  step: PipelineStep<TIn, TOut>
): PipelineStep<TIn, TOut> {
  return {
    key: step.key,
    async execute(input: TIn, ctx: StepContext): Promise<TOut> {
      const spanId = generateSpanId();
      const start = Date.now();

      try {
        const output = await step.execute(input, ctx);

        logger.info('step.success', {
          traceId: ctx.traceId,
          spanId,
          jobId: ctx.jobId,
          step: step.key,
          durationMs: Date.now() - start,
          inputHash: hashObject(input),
          outputHash: hashObject(output),
        });

        return output;
      } catch (err) {
        logger.error('step.failure', {
          traceId: ctx.traceId,
          spanId,
          jobId: ctx.jobId,
          step: step.key,
          durationMs: Date.now() - start,
          error: err instanceof Error ? err.message : String(err),
        });
        throw err;
      }
    },
    validate: step.validate,
  };
}
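
The tracing wrapper relies on a `hashObject` helper that the article does not define. One possible sketch follows, assuming inputs and outputs are JSON-like values. It uses 32-bit FNV-1a over UTF-16 code units as a dependency-free stand-in; a production version would more likely use a cryptographic hash such as SHA-256. `stableStringify` is a hypothetical helper introduced here so that key order does not affect the result.

```typescript
// Serialize with object keys sorted so that key order does not change the hash.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // JSON.stringify returns undefined for undefined, functions, and symbols
    return JSON.stringify(value) ?? 'undefined';
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  const record = value as Record<string, unknown>;
  const entries = Object.keys(record)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(record[k])}`);
  return `{${entries.join(',')}}`;
}

// Non-cryptographic 32-bit FNV-1a hash, returned as 8 hex characters.
function hashObject(value: unknown): string {
  const text = stableStringify(value);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32 bits
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

// Same fields in a different order produce the same hash.
console.log(hashObject({ a: 1, b: 2 }) === hashObject({ b: 2, a: 1 })); // true
```

A hash is enough to tell whether two runs saw the same input, but it cannot be searched by field value, so it complements the stored checkpoint payload rather than replacing it.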

4. Between-step schema tightening. The most common source of cross-step contamination is a schema that allows values the downstream step cannot handle. Every field that the next step requires as non-null must be non-null in the previous step's output schema. Not optional, not defaultable — required with a validation error if missing. This surfaces the failure at the step boundary rather than silently propagating.
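
The difference between a loose and a strict boundary can be sketched with hand-rolled validators standing in for the Zod schemas. This is an illustration under assumptions: `parseLoose`, `parseStrict`, and the `title` field are hypothetical, and `documentCategory` is borrowed from the incident described later in the article.

```typescript
interface ExtractedFields {
  title: string;
  documentCategory: string;
}

// Loose: accepts a missing category, so the gap travels on to classification.
function parseLoose(output: Record<string, unknown>): { title: string; documentCategory: string | null } {
  const title = output['title'];
  const category = output['documentCategory'];
  if (typeof title !== 'string') throw new Error('extraction: title is required');
  return { title, documentCategory: typeof category === 'string' ? category : null };
}

// Strict: every field the classifier needs is required at the extraction boundary.
function parseStrict(output: Record<string, unknown>): ExtractedFields {
  const title = output['title'];
  const category = output['documentCategory'];
  if (typeof title !== 'string') throw new Error('extraction: title is required');
  if (typeof category !== 'string' || category.length === 0) {
    throw new Error('extraction: documentCategory is required');
  }
  return { title, documentCategory: category };
}

const llmOutput = { title: 'Q3 invoice' }; // the model omitted documentCategory

console.log(parseLoose(llmOutput).documentCategory); // null

let boundaryError = '';
try {
  parseStrict(llmOutput);
} catch (err) {
  boundaryError = err instanceof Error ? err.message : String(err);
}
console.log(boundaryError); // extraction: documentCategory is required
```

The loose parser hands a null to the next step without complaint. The strict parser fails at the extraction boundary, where the error names the step and the missing field.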

// Step pipeline architecture

Job Queue (BullMQ)
    │
    ▼
StepRunner.run([step1, step2, step3], input, jobId)
    │
    ├─► withTracing(extractionStep)
    │       ↓ execute()  →  validate()
    │       ↓ saveCheckpoint(jobId, 'extraction', output)   ←── Redis (24h TTL)
    │
    ├─► withTracing(classificationStep)
    │       ↓ loadCheckpoint → skip if already done
    │       ↓ execute()  →  validate()
    │       ↓ saveCheckpoint(jobId, 'classification', output)
    │
    └─► withTracing(summarizationStep)
            ↓ execute()  →  validate()
            ↓ saveCheckpoint(jobId, 'summarization', output)
            ↓
        db.saveResult()

On retry:  resumes from last successful checkpoint, not step 1
On failure: structured log with traceId + step key + input hash + error

What the audit looked like with this architecture

Three days to audit 4,000 documents. With step-level tracing, the same investigation takes twenty minutes.

You query the step records for all jobs where the classification step received an input with a null documentCategory field. That is the contaminated field. (This assumes the records retain field-level input, as the checkpoint payloads do; the input hashes in the trace logs cannot be filtered by field value.) You filter to the deploy window. You get 4,312 matching jobs. You find the extraction step's output schema changed in the deploy: the documentCategory field became optional. The classifier received null, defaulted to "general", and produced valid-looking output.
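
The audit query can be sketched as a filter over persisted step records. This is a rough illustration, not the team's actual tooling: the `StepRecord` shape, `findContaminatedJobs`, and the sample timestamps are all hypothetical, and the sketch assumes each record keeps the step's input rather than only a hash of it.

```typescript
// Hypothetical shape of a persisted step record.
interface StepRecord {
  jobId: string;
  step: string;
  completedAt: string; // ISO 8601 timestamp
  input: Record<string, unknown>;
}

function findContaminatedJobs(
  records: StepRecord[],
  step: string,
  field: string,
  windowStart: string,
  windowEnd: string
): string[] {
  const start = Date.parse(windowStart);
  const end = Date.parse(windowEnd);
  const jobIds = new Set<string>();
  for (const record of records) {
    const at = Date.parse(record.completedAt);
    const inWindow = at >= start && at <= end;
    const value = record.input[field];
    // "== null" matches both an explicit null and a field that is missing entirely
    if (record.step === step && inWindow && value == null) {
      jobIds.add(record.jobId);
    }
  }
  return Array.from(jobIds);
}

// Illustrative records only; dates and job IDs are made up for the example.
const sampleRecords: StepRecord[] = [
  { jobId: 'job-1', step: 'classification', completedAt: '2026-06-20T10:00:00Z', input: { documentCategory: 'invoice' } },
  { jobId: 'job-2', step: 'classification', completedAt: '2026-06-20T11:00:00Z', input: { documentCategory: null } },
  { jobId: 'job-3', step: 'classification', completedAt: '2026-06-18T09:00:00Z', input: { documentCategory: null } },
  { jobId: 'job-4', step: 'extraction', completedAt: '2026-06-20T12:00:00Z', input: { content: 'raw text' } },
];

const affected = findContaminatedJobs(
  sampleRecords,
  'classification',
  'documentCategory',
  '2026-06-20T00:00:00Z',
  '2026-06-21T00:00:00Z'
);
console.log(affected); // [ 'job-2' ]
```

Only job-2 matches: job-1 has a category, job-3 falls outside the window, and job-4 is a different step.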

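With checkpoints, you do not re-process everything from scratch. You tighten the extraction schema, invalidate the extraction checkpoints for the affected window, and replay the extraction step for those 4,312 jobs. Where re-extraction yields the same fields as before, classification and summarization keep their checkpointed results; where it fills in a documentCategory that was previously null, the downstream checkpoints that consumed the null have to be invalidated and re-run as well. The remediation costs between one and three LLM calls per affected document, and none for documents outside the window.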
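
Checkpoint invalidation for a replay might look like the sketch below. It assumes the `checkpoint:${jobId}:${stepKey}` key format from `saveCheckpoint` and uses an in-memory `Map` in place of Redis. `invalidateFrom` and its `includeDownstream` flag are hypothetical; whether downstream checkpoints need to be dropped too depends on whether those steps consumed the contaminated field.

```typescript
const stepOrder = ['extraction', 'classification', 'summarization'];

// Removes checkpoints for the given jobs and returns how many were deleted.
function invalidateFrom(
  store: Map<string, unknown>,
  jobIds: string[],
  fromStep: string,
  includeDownstream: boolean
): number {
  const startIndex = stepOrder.indexOf(fromStep);
  if (startIndex === -1) throw new Error(`unknown step: ${fromStep}`);
  // Either the named step alone, or that step plus everything after it.
  const targets = includeDownstream ? stepOrder.slice(startIndex) : [fromStep];
  let removed = 0;
  for (const jobId of jobIds) {
    for (const stepKey of targets) {
      if (store.delete(`checkpoint:${jobId}:${stepKey}`)) removed += 1;
    }
  }
  return removed;
}

// Seed two jobs with all three checkpoints each.
const store = new Map<string, unknown>();
for (const jobId of ['job-1', 'job-2']) {
  for (const stepKey of stepOrder) {
    store.set(`checkpoint:${jobId}:${stepKey}`, { output: 'cached' });
  }
}

const removedExtractionOnly = invalidateFrom(store, ['job-1'], 'extraction', false);
const removedWithDownstream = invalidateFrom(store, ['job-2'], 'extraction', true);
console.log(removedExtractionOnly, removedWithDownstream, store.size); // 1 3 2
```

After invalidation, re-enqueueing the affected jobs lets the runner re-execute exactly the steps whose checkpoints are gone and skip the rest.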

Where this falls short at the next scale

These patterns handle a 150-person company with a few agent pipelines owned by one team. They do not handle agents that span team boundaries — where the extraction step is owned by the data team and the classification step is owned by the product team, each running on different infrastructure with different SLAs. They do not handle budget allocation: which team's LLM spend does a shared pipeline charge to? And they do not handle governance: who approves when an agent gains access to a new data source?

Part 3 covers the coordination and governance problems that appear when you have thirty teams running agents in parallel, none of them aware of what the others are doing, and a cost bill that triples in sixty days.