May 15, 2026 · Gemini · 6 min read

Gemini 3.5 Flash beats last year's Pro on coding. Here is what that means

Google announced Gemini 3.5 at I/O on May 19, 2026 with a headline that sounds backwards: the Flash model beats last year's Pro on coding and agentic tasks. Gemini 3.5 Flash scores 76.2% on Terminal-Bench 2.1 versus Gemini 3.1 Pro's 70.3%. On MCP Atlas — the benchmark for multi-step tool use — Flash scores 83.6% against Pro's 78.2%. This is not a marginal improvement. It changes the model selection calculus for anyone building AI-powered tools.

What Google actually released

Right now, only Gemini 3.5 Flash is publicly available. Gemini 3.5 Pro is being used internally at Google and is expected to roll out in June 2026. So when you hear "Gemini 3.5," you are talking about Flash — which is already available in the Gemini API, Google AI Studio, and the Gemini app.

The model ID is gemini-3.5-flash. Context window: 1,048,576 input tokens with 65,536 output tokens. Dynamic thinking is on by default — the model decides how much internal reasoning each request needs without you having to configure it.

The agentic pivot

The 3.5 family represents a deliberate shift in what Google is optimizing for. The tagline is "frontier intelligence with action." Prior Gemini generations were evaluated primarily on knowledge and reasoning benchmarks. The 3.5 benchmarks are weighted toward what agents actually do: terminal interaction, tool calling, multi-step planning, and coding under realistic conditions.

This shows up in the numbers. On GDPval-AA, an agentic benchmark measuring real-world utility across tasks, 3.5 Flash scores 1656 Elo versus 3.1 Pro's 1314, a gap of 342 points. That reflects a model trained to complete tasks, not just answer questions.

The speed improvement compounds this. 3.5 Flash outputs tokens 4x faster than other frontier models. In agentic workflows where a single user request triggers dozens of model calls, per-call latency accumulates across the loop, so a 4x throughput improvement is substantial.
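To make the compounding concrete, here is a back-of-the-envelope sketch. The call count, tokens per call, and baseline throughput are illustrative assumptions, not measurements; only the 4x ratio comes from the claim above. It also ignores network round trips and input processing, so treat it as a rough estimate of output generation time only.

```python
def loop_generation_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Time spent generating output when the calls run one after another."""
    return calls * tokens_per_call / tokens_per_second

# Hypothetical workload: 30 sequential model calls, 800 output tokens each.
BASELINE_TPS = 50.0          # assumed baseline throughput in tokens/second
FAST_TPS = BASELINE_TPS * 4  # the 4x output-speed claim

baseline = loop_generation_seconds(30, 800, BASELINE_TPS)
fast = loop_generation_seconds(30, 800, FAST_TPS)
print(f"baseline: {baseline:.0f}s, 4x faster: {fast:.0f}s, saved: {baseline - fast:.0f}s")
# prints "baseline: 480s, 4x faster: 120s, saved: 360s"
```

Under these assumptions, a loop that spends eight minutes generating output drops to two.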

The benchmark reality check

3.5 Flash wins 11 of 15 published benchmarks against 3.1 Pro. The wins are concentrated in coding, tool use, and agentic tasks — which maps directly to what most developers are building. But the losses matter too.

3.5 Flash trails 3.1 Pro on Humanity's Last Exam (40.2% vs 44.4%) and ARC-AGI-2 (72.1% vs 77.1%). These measure raw abstract reasoning and parametric knowledge depth. If your workload requires a genuinely hard single-turn question answered from first principles (formal mathematics, complex scientific reasoning, dense academic text), 3.1 Pro still holds the edge. Alternatively, wait for 3.5 Pro.

The practical read: for software engineering tasks, multi-step pipelines, tool use, and agents — switch to 3.5 Flash now. For tasks requiring dense reasoning depth, stay on 3.1 Pro until 3.5 Pro ships.
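The gaps behind that recommendation can be tabulated from the five scores quoted in this article. The sketch below only does the subtraction; it covers the benchmarks named here, not all 15 published ones, and the helper name `flash_minus_pro` is made up for illustration.

```python
# Scores quoted in this article: (Gemini 3.5 Flash, Gemini 3.1 Pro)
SCORES = {
    "Terminal-Bench 2.1 (%)": (76.2, 70.3),
    "MCP Atlas (%)": (83.6, 78.2),
    "GDPval-AA (Elo)": (1656, 1314),
    "Humanity's Last Exam (%)": (40.2, 44.4),
    "ARC-AGI-2 (%)": (72.1, 77.1),
}

def flash_minus_pro(scores):
    """Return {benchmark: Flash score minus Pro score}, rounded to one decimal."""
    return {name: round(flash - pro, 1) for name, (flash, pro) in scores.items()}

deltas = flash_minus_pro(SCORES)
for name, delta in deltas.items():
    leader = "3.5 Flash" if delta > 0 else "3.1 Pro"
    print(f"{name:26} {delta:+8.1f}  ({leader} leads)")
```

Flash leads by 5.9 and 5.4 points on the two agentic percentage benchmarks and by 342 Elo on GDPval-AA, while Pro leads by 4.2 and 5.0 points on the two reasoning benchmarks.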

Using 3.5 Flash in an agentic loop

The model is designed for parallel sub-agent execution. Here is a pattern that works well — spinning up concurrent agents on a shared task and aggregating results:

python

import google.generativeai as genai
import asyncio

genai.configure(api_key="YOUR_API_KEY")

# Dynamic thinking is on by default — no config needed
model = genai.GenerativeModel("gemini-3.5-flash")

async def run_sub_agent(task: str, context: str) -> str:
    """Single agent call in the loop."""
    response = await asyncio.to_thread(
        model.generate_content,
        f"""Context: {context}

Task: {task}

Complete this task. Be specific and actionable."""
    )
    return response.text

async def parallel_agent_pipeline(codebase_summary: str):
    """
    3.5 Flash is built for this: parallel sub-agent execution
    4x faster output means loops complete significantly faster
    """
    tasks = [
        "Identify the top 3 security vulnerabilities in this codebase",
        "Find all database queries that are missing indexes",
        "List all API endpoints with no input validation",
        "Identify any hardcoded secrets or credentials",
    ]

    # Run all sub-agents concurrently
    results = await asyncio.gather(*[
        run_sub_agent(task, codebase_summary)
        for task in tasks
    ])

    # Synthesize findings
    synthesis_prompt = f"""You reviewed a codebase across 4 dimensions.
Here are the findings from each agent:

{chr(10).join(f"Agent {i+1}: {r}" for i, r in enumerate(results))}

Synthesize these into a prioritized action list. Top 5 issues to fix first."""

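    # Run the blocking SDK call in a worker thread so the event loop stays free
    final = await asyncio.to_thread(model.generate_content, synthesis_prompt)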
    return final.text

# Run it
summary = "FastAPI service, PostgreSQL, Redis cache, 50k RPM..."
report = asyncio.run(parallel_agent_pipeline(summary))
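Two details of the pipeline above are worth seeing in isolation: `asyncio.gather` returns results in the order the tasks were passed (which the "Agent 1, Agent 2, ..." labels in the synthesis prompt rely on), and an unbounded gather may send more simultaneous requests than an API quota allows. The following is a minimal offline sketch, assuming a stand-in `fake_agent` that sleeps instead of calling the model and an arbitrary concurrency limit of 2; neither name nor limit comes from the Gemini SDK.

```python
import asyncio

async def fake_agent(task: str, delay: float) -> str:
    """Stand-in for run_sub_agent: sleeps instead of calling the API."""
    await asyncio.sleep(delay)
    return f"done: {task}"

async def gather_bounded(jobs, limit: int):
    """Run (task, delay) jobs concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task, delay):
        async with semaphore:
            return await fake_agent(task, delay)

    # gather returns results in the order the awaitables were passed,
    # not the order in which they finished.
    return await asyncio.gather(*(guarded(t, d) for t, d in jobs))

jobs = [("security", 0.03), ("indexes", 0.01), ("validation", 0.02)]
results = asyncio.run(gather_bounded(jobs, limit=2))
print(results)
# prints "['done: security', 'done: indexes', 'done: validation']"
```

The slowest job is listed first and still comes back first, so results can be matched to tasks by position.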

Tool use with MCP

The MCP Atlas benchmark score (83.6%) reflects real capability: 3.5 Flash handles tool calling reliably across complex, multi-step sequences. Here is how to wire up function calling with the Gemini API:

python

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Define tools the agent can call
tools = [
    genai.protos.Tool(function_declarations=[
        genai.protos.FunctionDeclaration(
            name="run_sql_query",
            description="Execute a read-only SQL query on the production database",
            parameters=genai.protos.Schema(
                type=genai.protos.Type.OBJECT,
                properties={
                    "query": genai.protos.Schema(
                        type=genai.protos.Type.STRING,
                        description="The SQL SELECT query to run"
                    )
                },
                required=["query"]
            )
        ),
        genai.protos.FunctionDeclaration(
            name="get_api_logs",
            description="Fetch recent API error logs for a given endpoint",
            parameters=genai.protos.Schema(
                type=genai.protos.Type.OBJECT,
                properties={
                    "endpoint": genai.protos.Schema(type=genai.protos.Type.STRING),
                    "hours": genai.protos.Schema(type=genai.protos.Type.INTEGER)
                },
                required=["endpoint"]
            )
        )
    ])
]

model = genai.GenerativeModel("gemini-3.5-flash", tools=tools)
chat = model.start_chat()

# The agent will call tools as needed to answer the question
response = chat.send_message(
    "Our /api/payments endpoint has had elevated 500 errors for 2 hours. "
    "Diagnose the root cause and suggest a fix."
)

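# Handle tool calls in a loop. The SDK signals a tool request through
# function_call parts in the response, not through a dedicated finish_reason.
def pending_calls(resp):
    return [p.function_call for p in resp.candidates[0].content.parts if p.function_call]

while calls := pending_calls(response):
    replies = []
    for tool_call in calls:
        # Execute the actual function
        if tool_call.name == "run_sql_query":
            result = execute_sql(tool_call.args["query"])  # your impl
        elif tool_call.name == "get_api_logs":
            # Numeric arguments may arrive as floats, so cast before use
            hours = int(tool_call.args.get("hours", 1))
            result = fetch_logs(tool_call.args["endpoint"], hours)  # your impl
        else:
            result = f"Unknown tool: {tool_call.name}"

        replies.append(genai.protos.Part(function_response=genai.protos.FunctionResponse(
            name=tool_call.name,
            response={"result": result}
        )))

    # Feed all results back to the model in a single turn
    response = chat.send_message(genai.protos.Content(parts=replies))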

print(response.text)  # Final diagnosis
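As the number of tools grows, an if/elif chain becomes hard to maintain. One option is a small registry that maps tool names to Python callables and reports failures back as data instead of raising. This is a sketch under stated assumptions: `TOOL_REGISTRY`, `dispatch_tool`, and the two stub implementations are hypothetical and not part of the Gemini SDK; real versions would query your database and log store.

```python
from typing import Any, Callable, Dict

# Hypothetical local implementations; replace with real database and log access.
def run_sql_query(query: str) -> list:
    return [{"echo": query}]

def get_api_logs(endpoint: str, hours: int = 1) -> list:
    return [f"{endpoint}: no errors in last {hours}h"]

TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "run_sql_query": run_sql_query,
    "get_api_logs": get_api_logs,
}

def dispatch_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Run a registered tool and wrap the outcome in a JSON-friendly dict."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": handler(**args)}
    except Exception as exc:  # report the failure to the model instead of crashing the loop
        return {"error": f"{type(exc).__name__}: {exc}"}

print(dispatch_tool("get_api_logs", {"endpoint": "/api/payments", "hours": 2}))
print(dispatch_tool("drop_table", {}))
```

The returned dict can be passed as the `response` of a function response, so the model sees an error message and can retry or pick a different tool.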

Pricing and when to still use 3.1 Pro

Gemini 3.5 Flash is priced at $1.50 per million input tokens and $9.00 per million output tokens, with cached input at $0.15. This is in line with frontier model pricing from other providers, but the 4x speed advantage means you get more throughput per dollar on time-sensitive workloads.
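For budgeting, the quoted prices translate into a simple per-request estimate. The sketch below assumes the cached-input price is also per million tokens and that cached tokens are billed at that rate instead of the regular input rate; it does not model how thinking tokens are billed. The workload numbers are hypothetical.

```python
# Prices quoted above, in USD per million tokens.
INPUT_PRICE = 1.50
CACHED_INPUT_PRICE = 0.15  # assumed to be per million tokens as well
OUTPUT_PRICE = 9.00

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request; cached_tokens is the cached share of the input."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_tokens
    total = (fresh * INPUT_PRICE
             + cached_tokens * CACHED_INPUT_PRICE
             + output_tokens * OUTPUT_PRICE)
    return total / 1_000_000

# Hypothetical agent run: 40 calls, each with a 20k-token prompt
# (15k of it cached) and 1k tokens of output.
per_call = request_cost(20_000, 1_000, cached_tokens=15_000)
print(f"per call: ${per_call:.5f}, per run: ${40 * per_call:.2f}")
```

With these assumed numbers a call costs a little under two cents and the 40-call run about $0.75, with output tokens accounting for roughly half of the bill.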

The decision tree is simple right now:

Use 3.5 Flash if: you are building agents, coding tools, or any multi-step pipeline. It outscores 3.1 Pro on the benchmarks for these tasks and produces output significantly faster.

Stay on 3.1 Pro if: your workload is single-turn complex reasoning — hard math, scientific analysis, or tasks where the 3.1 Pro advantage on HLE (44.4% vs 40.2%) and ARC-AGI-2 (77.1% vs 72.1%) actually matters to your output quality.

Wait for 3.5 Pro if: you want the best of both, meaning frontier reasoning depth with the new agentic architecture. Google has confirmed it is being used internally, and it is expected to roll out in June 2026.
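If model choice is made in code, for example in a router that sits in front of several pipelines, the three rules above can be written down directly. This is a sketch only: the `Workload` categories and the `recommend_model` helper are invented for illustration, and the function returns human-readable labels because the only API model ID given in this article is gemini-3.5-flash.

```python
from enum import Enum

class Workload(Enum):
    AGENT = "agent"
    CODING = "coding"
    MULTI_STEP_PIPELINE = "multi_step_pipeline"
    SINGLE_TURN_REASONING = "single_turn_reasoning"

def recommend_model(workload: Workload, can_wait_for_pro: bool = False) -> str:
    """Map a workload category to this article's recommendation (a label, not a model ID)."""
    if workload is Workload.SINGLE_TURN_REASONING:
        # Hard math or scientific analysis: 3.1 Pro leads on HLE and ARC-AGI-2
        return "Gemini 3.5 Pro (once released)" if can_wait_for_pro else "Gemini 3.1 Pro"
    # Agents, coding tools, and multi-step pipelines
    return "Gemini 3.5 Flash"

for workload in Workload:
    print(f"{workload.value}: {recommend_model(workload)}")
```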

The shift that matters

The headline benchmark story is that a Flash model beat last year's Pro. But the more significant shift is what Google is optimizing for. ARC-AGI-2 and Humanity's Last Exam measure the model's parametric knowledge and abstract reasoning in isolation. Terminal-Bench, MCP Atlas, and GDPval measure whether the model can complete real work in a real environment.

Google is betting that the second category is what developers care about most. Based on the 3.5 Flash numbers, that bet looks correct for most of what people are building today. The era of choosing models based on MMLU scores is giving way to choosing based on agentic task completion. Gemini 3.5 is built for that world.