Why I switched my agent's inner loop from Claude to Gemini 3.5 Flash
The agent worked. That was the problem. It worked well enough that my team kept adding steps to it — security check, style check, logic review, test coverage analysis. Each step was a separate model call. Each call took about 3 seconds. By the time we hit 12 steps, every code review request was taking 36 seconds. The fix was not to reduce steps. It was to stop using a frontier reasoning model for tasks that do not need deep reasoning.
How the loop got slow
The agent was a code review pipeline. A developer opens a PR, the agent runs, and within a minute they get structured feedback across four categories: security vulnerabilities, style violations, logic issues, and missing test cases. The original design ran these sequentially:
import anthropic
import time

client = anthropic.Anthropic()

def review_step(code: str, focus: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Review this code for {focus}:\n\n{code}"
        }]
    )
    return response.content[0].text

def run_review(code: str) -> dict:
    start = time.time()
    security = review_step(code, "security vulnerabilities")
    style = review_step(code, "style and readability issues")
    logic = review_step(code, "logic errors and edge cases")
    tests = review_step(code, "missing test coverage")
    elapsed = time.time() - start
    print(f"Review completed in {elapsed:.1f}s")  # ~12s sequential
    return {
        "security": security,
        "style": style,
        "logic": logic,
        "tests": tests,
    }
Sequential was the original sin. We parallelised it with asyncio.gather, which brought it down to about 4 seconds for four steps. But the product kept growing. By the time we had twelve sub-checks, even running them in parallel, the slowest call in the batch was setting the floor — and that floor kept rising as we added more nuanced prompts.
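To make the "slowest call sets the floor" point concrete, here is a minimal sketch that replaces the model calls with asyncio.sleep. The function names and the latencies are made up for illustration (scaled down from seconds so the script finishes quickly), and nothing here touches a real API.

```python
import asyncio
import time

# Hypothetical per-step latencies in seconds, scaled down for a quick demo.
LATENCIES = {"security": 0.03, "style": 0.02, "logic": 0.04, "tests": 0.03}

async def fake_step(name: str, latency_s: float) -> str:
    """Stand-in for one model call: sleeps instead of calling an API."""
    await asyncio.sleep(latency_s)
    return name

async def run_sequential(latencies: dict[str, float]) -> float:
    """Await each step in turn; elapsed time is roughly the sum."""
    start = time.perf_counter()
    for name, latency_s in latencies.items():
        await fake_step(name, latency_s)
    return time.perf_counter() - start

async def run_parallel(latencies: dict[str, float]) -> float:
    """Await all steps together; elapsed time is roughly the maximum."""
    start = time.perf_counter()
    await asyncio.gather(
        *(fake_step(name, latency_s) for name, latency_s in latencies.items())
    )
    return time.perf_counter() - start

if __name__ == "__main__":
    seq = asyncio.run(run_sequential(LATENCIES))
    par = asyncio.run(run_parallel(LATENCIES))
    print(f"sequential: {seq:.3f}s, parallel: {par:.3f}s")
```

The sequential total tracks the sum of the step latencies, while the gathered total tracks the largest one.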
The real issue was model selection. We were using a frontier reasoning model for every step, including ones that do not require frontier reasoning. Checking whether a function name follows snake_case convention does not need the same model as writing a formal proof. We were paying in latency for capability we were not using.
What Gemini 3.5 Flash actually offers
Gemini 3.5 Flash outputs tokens 4x faster than other frontier models. That is not a marketing number — it shows up in practice on the kinds of tasks agents run: tool calls, sub-task completions, structured outputs. The model also scores 76.2% on Terminal-Bench 2.1 (versus Gemini 3.1 Pro's 70.3%) and 83.6% on MCP Atlas (versus Pro's 78.2%). These benchmarks measure what agents actually do: completing multi-step tasks in real environments, handling tool calls across complex sequences.
Dynamic thinking is on by default. The model decides how much internal reasoning each request needs without you configuring it. For a step that is checking style rules, it stays fast. For a step where the code logic is genuinely ambiguous, it can reason more carefully. You do not have to tune this per step.
The rewrite: parallel agents with Gemini 3.5 Flash
The new version of the code review agent uses Gemini 3.5 Flash for every inner loop step and runs them all in parallel. The orchestrator, the piece that synthesises results into the final output, can stay on whatever model you prefer for synthesis.
import google.generativeai as genai
import asyncio
import json
import time
from dataclasses import dataclass

genai.configure(api_key="YOUR_API_KEY")

# Dynamic thinking is on by default, so no extra config is needed
flash = genai.GenerativeModel("gemini-3.5-flash")

@dataclass
class ReviewResult:
    category: str
    findings: str
    severity: str  # "high" | "medium" | "low" | "none"

async def run_review_step(code: str, category: str, instructions: str) -> ReviewResult:
    """Single agent step. Runs concurrently with other steps."""
    prompt = f"""You are reviewing code for: {category}
Instructions: {instructions}
Code:
```
{code}
```
Respond with JSON only:
{{
"findings": "specific issues found, or 'none' if clean",
"severity": "high | medium | low | none"
}}"""
    # generate_content is a blocking call, so run it in a worker thread
    response = await asyncio.to_thread(flash.generate_content, prompt)
    # Assumes the reply is bare JSON; json.loads raises ValueError otherwise
    data = json.loads(response.text)
    return ReviewResult(
        category=category,
        findings=data["findings"],
        severity=data["severity"],
    )

REVIEW_STEPS = [
    ("Security", "Look for SQL injection, XSS, hardcoded secrets, insecure deserialization, and missing auth checks"),
    ("Input validation", "Check that all external inputs are validated and sanitised before use"),
    ("Error handling", "Find unhandled exceptions, bare except clauses, and errors that swallow stack traces"),
    ("Style", "Check for naming consistency, function length, and readability"),
    ("Logic", "Identify off-by-one errors, incorrect boolean logic, and missed edge cases"),
    ("Performance", "Spot N+1 queries, unnecessary loops inside loops, and blocking I/O in async functions"),
    ("Test coverage", "List code paths that have no corresponding test"),
    ("Type safety", "Find missing type annotations and any uses of Any that should be more specific"),
]

async def run_full_review(code: str) -> list[ReviewResult]:
    start = time.time()
    # All steps run concurrently, so total time is roughly the slowest single step
    results = await asyncio.gather(*[
        run_review_step(code, category, instructions)
        for category, instructions in REVIEW_STEPS
    ])
    elapsed = time.time() - start
    print(f"Review completed in {elapsed:.1f}s")  # ~8s vs 36s before
    return list(results)

def format_report(results: list[ReviewResult]) -> str:
    # Group findings by severity; "none" results with non-empty findings are listed as low
    high = [r for r in results if r.severity == "high"]
    medium = [r for r in results if r.severity == "medium"]
    low = [r for r in results if r.severity in ("low", "none") and r.findings != "none"]
    lines = []
    if high:
        lines.append("HIGH PRIORITY")
        for r in high:
            lines.append(f"  [{r.category}] {r.findings}")
    if medium:
        lines.append("MEDIUM PRIORITY")
        for r in medium:
            lines.append(f"  [{r.category}] {r.findings}")
    if low:
        lines.append("LOW PRIORITY")
        for r in low:
            lines.append(f"  [{r.category}] {r.findings}")
    if not high and not medium:
        lines.append("No high or medium issues found.")
    return "\n".join(lines)

# Run it
if __name__ == "__main__":
    with open("pr_diff.py") as f:
        code = f.read()
    results = asyncio.run(run_full_review(code))
    print(format_report(results))
Before and after
The difference in practice was significant enough that we stopped treating it as an experiment after the first week:
- Before (Claude, sequential, 4 steps): ~12 seconds
- Before (Claude, parallel, 4 steps): ~4 seconds
- Before (Claude, parallel, 12 steps): ~36 seconds — the step count had crept up over months
- After (Gemini 3.5 Flash, parallel, 8 steps): ~8 seconds
We went from 12 steps in 36 seconds to 8 more targeted steps in 8 seconds. The quality was not worse — if anything, the more focused prompts per step gave cleaner output because the model was not trying to hold multiple concerns at once. The structured JSON responses were more consistent, which made the formatter more reliable.
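Since the formatter depends on every step returning well-formed JSON, it can help to parse defensively. The following is a rough sketch, not what our pipeline ships: parse_step_output, ParsedFinding, and the fallback policy (unparseable output is surfaced as a low-severity finding instead of raising) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass

VALID_SEVERITIES = {"high", "medium", "low", "none"}
FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable

@dataclass
class ParsedFinding:
    findings: str
    severity: str

def parse_step_output(raw: str) -> ParsedFinding:
    """Parse one step's reply, tolerating a markdown fence around the JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Hypothetical policy: keep the raw text visible rather than failing the review
        return ParsedFinding(findings=raw.strip(), severity="low")
    severity = str(data.get("severity", "")).strip().lower()
    if severity not in VALID_SEVERITIES:
        severity = "low"
    findings = str(data.get("findings", "")).strip() or "none"
    return ParsedFinding(findings=findings, severity=severity)

if __name__ == "__main__":
    print(parse_step_output('{"findings": "none", "severity": "none"}'))
```

Normalising the severity to a known set means the report grouping never silently drops a result because of casing or an unexpected label.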
The strategy: use the right model for each layer
This is not a "Claude is slow" finding. It is a model-routing finding. The architecture that works well for agentic pipelines is layered:
- Inner loop steps (tool calls, sub-tasks, structured checks): Gemini 3.5 Flash. These steps do not require deep reasoning. They require speed, reliable structured output, and enough intelligence to catch real issues. Flash handles all of this.
- Orchestration and synthesis (the step that aggregates results and writes the final output): Your choice. If the synthesis is mechanical — format these findings into a report — Flash works here too. If the synthesis requires nuanced judgment, use whichever model you trust for that.
- One-off complex reasoning (formal architecture review, security threat model): Use a reasoning-heavy model. These are single calls, not loops, so latency compounds less.
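The three layers above can be written down as a small routing function. This is only a sketch: Layer, pick_model, and the needs_judgment flag are hypothetical names, and using claude-opus-4-5 as the reasoning-heavy choice is an assumption for illustration, since the synthesis and deep-reasoning layers are left to whichever model you trust.

```python
from enum import Enum

class Layer(Enum):
    INNER_LOOP = "inner_loop"          # tool calls, sub-tasks, structured checks
    SYNTHESIS = "synthesis"            # aggregating results into the final output
    DEEP_REASONING = "deep_reasoning"  # one-off architecture or threat-model reviews

FAST_MODEL = "gemini-3.5-flash"
REASONING_MODEL = "claude-opus-4-5"  # assumption: substitute the model you trust

def pick_model(layer: Layer, needs_judgment: bool = False) -> str:
    """Return the model ID to use for a call in the given layer."""
    if layer is Layer.INNER_LOOP:
        return FAST_MODEL
    if layer is Layer.SYNTHESIS:
        # Mechanical formatting stays on the fast model; nuanced synthesis does not
        return REASONING_MODEL if needs_judgment else FAST_MODEL
    return REASONING_MODEL

if __name__ == "__main__":
    for layer in Layer:
        print(layer.value, "->", pick_model(layer))
```

Keeping the decision in one function makes it easy to audit which steps are paying for a reasoning-heavy model.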
The mistake we made was not recognising when we had crossed from "one-off call" to "loop with compounding latency." Once you are running the same model ten or more times per user action, the per-call latency becomes the user experience.
A note on quality
The concern I hear most when people consider switching to a faster model is accuracy. Here the benchmarks are useful context: Gemini 3.5 Flash scores higher than Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2% vs 70.3%) and MCP Atlas (83.6% vs 78.2%). On coding and tool-use tasks specifically, the Flash model is not a downgrade — it is a lateral move or better.
Where 3.5 Flash trades off compared to slower frontier models is on hard abstract reasoning tasks — dense formal math, ARC-AGI-2 style puzzles. Those tasks rarely appear in production coding agents. If your inner loop step is "does this function handle the None case correctly," Flash is plenty. If it is "prove that this distributed lock is deadlock-free," that is a different tool.
Dynamic thinking being on by default helps here. For the easy style checks, the model stays fast. For a step where the logic is genuinely subtle, it will reason more carefully. You get some of the benefit of a reasoning model without the flat cost of always using one.
How to add it to an existing pipeline
If you already have an agent running and want to test this without a full rewrite, the minimal change is to swap the model on your inner-loop steps and run them in parallel if you are not already:
import google.generativeai as genai
import asyncio

genai.configure(api_key="YOUR_GEMINI_API_KEY")
flash = genai.GenerativeModel("gemini-3.5-flash")

async def agent_step(prompt: str) -> str:
    """Drop-in replacement for a single LLM call in your existing loop."""
    response = await asyncio.to_thread(flash.generate_content, prompt)
    return response.text

async def run_steps_in_parallel(prompts: list[str]) -> list[str]:
    """Run all steps concurrently. Returns results in same order."""
    return await asyncio.gather(*[agent_step(p) for p in prompts])

# Before: sequential calls, each ~3s, total = n * 3s
# After: concurrent calls, total = slowest single call (~2s)
prompts = [
    "Check this code for SQL injection: ...",
    "Check this code for missing null checks: ...",
    "Check this code for style issues: ...",
]
results = asyncio.run(run_steps_in_parallel(prompts))
# results[0] = security findings
# results[1] = null check findings
# results[2] = style findings
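One caveat with a plain asyncio.gather is that a single failing or hanging step takes the whole batch down with it. A minimal sketch of isolating failures, assuming you would rather report a failed step than lose the review: run_steps_isolated, the Step alias, the stand-in steps, and the timeout value are all hypothetical, and the steps sleep or raise instead of calling a model.

```python
import asyncio
from typing import Awaitable, Callable

Step = Callable[[], Awaitable[str]]

async def run_steps_isolated(steps: dict[str, Step], timeout_s: float) -> dict[str, str]:
    """Run all steps concurrently; a failure or timeout only affects its own entry."""
    names = list(steps)
    outcomes = await asyncio.gather(
        *(asyncio.wait_for(steps[name](), timeout=timeout_s) for name in names),
        return_exceptions=True,  # collect exceptions instead of raising the first one
    )
    report: dict[str, str] = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            report[name] = f"step failed: {type(outcome).__name__}"
        else:
            report[name] = outcome
    return report

# Stand-in steps for the demo; real steps would call the model.
async def ok_step() -> str:
    return "no issues"

async def broken_step() -> str:
    raise ValueError("malformed model output")

async def slow_step() -> str:
    await asyncio.sleep(1.0)
    return "too late"

if __name__ == "__main__":
    demo = {"ok": ok_step, "broken": broken_step, "slow": slow_step}
    print(asyncio.run(run_steps_isolated(demo, timeout_s=0.05)))
```

The per-step timeout also puts an upper bound on how long the slowest call can hold up the batch.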
The model ID is gemini-3.5-flash. It is live in the Gemini API now. The context window is 1,048,576 input tokens with 65,536 output tokens — large enough that you can include substantial context per step without worrying about truncation.
If you are building any kind of multi-step agent and you have not measured per-step latency, measure it first. The number is almost always higher than developers expect once you account for all the steps that got added after launch. Then look at which steps actually require deep reasoning and which do not. The ones that do not are good candidates for this swap.
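For the measuring part, a small timing wrapper is usually enough. This is a sketch under assumptions: timed, measure, and slowest_first are hypothetical helper names, and the demo steps sleep instead of calling a model, so the printed numbers say nothing about any real model.

```python
import asyncio
import time
from typing import Awaitable, Callable

Step = Callable[[], Awaitable[str]]

async def timed(name: str, call: Step, timings: dict[str, float]) -> str:
    """Await one step and record its wall-clock latency, even if it raises."""
    start = time.perf_counter()
    try:
        return await call()
    finally:
        timings[name] = time.perf_counter() - start

async def measure(steps: dict[str, Step]) -> dict[str, float]:
    """Run all steps concurrently and return per-step latency in seconds."""
    timings: dict[str, float] = {}
    await asyncio.gather(*(timed(name, call, timings) for name, call in steps.items()))
    return timings

def slowest_first(timings: dict[str, float]) -> list[tuple[str, float]]:
    """Sort steps by latency, slowest first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)

async def fake_call(latency_s: float) -> str:
    """Stand-in for a model call."""
    await asyncio.sleep(latency_s)
    return "ok"

if __name__ == "__main__":
    steps = {
        "style": lambda: fake_call(0.01),
        "logic": lambda: fake_call(0.05),
    }
    for name, seconds in slowest_first(asyncio.run(measure(steps))):
        print(f"{name}: {seconds * 1000:.0f} ms")
```

Sorting slowest-first puts the step that sets the floor for a parallel batch at the top of the list.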