May 16, 2026 · Gemini · 6 min read

What becomes possible when you stop chunking your codebase

I was three hours into setting up a chunking pipeline for a 300K-token codebase when I realized I was solving the wrong problem. The context limit I was working around — Claude's 200K maximum — was a real constraint, but the pipeline I was building to get around it was going to lose the very thing I needed: cross-file reasoning. Bugs that live in the gap between files, where a function returns something the caller never checks, don't survive chunking. I scrapped the pipeline and switched to Gemini. Here's what that looked like.

The context size difference is not a rounding error

Claude's context window is 200,000 tokens. Gemini's is 1,048,576 — just over one million. That gap matters more than the number suggests, because crossing a model's context limit forces you to make a choice: either truncate your input (and hope the relevant part is in what you kept), or chunk it (and lose the model's ability to reason across chunk boundaries).

For most tasks, chunking is fine. Summarize a single file: easy. Explain one function: doesn't matter. But the hardest engineering questions are precisely the ones that don't live inside a single file. A null-dereference bug where the null enters from an API response in api/client.py and explodes six function calls later in services/order.py. A performance regression traced to how three separate modules share a database connection pool. These problems require holding the whole codebase in view at once.
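To make the first of those examples concrete, here is a minimal sketch of the bug class. The `User` type, the in-memory lookup table, and `receipt_address` are hypothetical stand-ins; only the two file names and the idea of a lookup that can return nothing come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    email: str


# Hypothetical stand-in for api/client.py: unknown users come back as None.
_USERS = {1: User(id=1, email="a@example.com")}


def get_user_by_id(user_id: int) -> Optional[User]:
    return _USERS.get(user_id)


# Hypothetical stand-in for services/order.py: far from the API boundary,
# nothing reminds the author that `user` may be None.
def receipt_address(user_id: int) -> str:
    user = get_user_by_id(user_id)
    return user.email  # raises AttributeError when the user does not exist


def receipt_address_checked(user_id: int) -> Optional[str]:
    user = get_user_by_id(user_id)
    return user.email if user is not None else None
```

Each function looks reasonable on its own. The defect only shows up when the return contract of one is read next to the call site of the other, which is exactly what a chunk boundary can separate.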

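With 1,048,576 tokens, Gemini can hold roughly:

800,000 words of plain text (at an average of 1.3 tokens per word); for source code, where a single line usually runs to several tokens, expect on the order of 100,000 lines
Several weeks of deduplicated application logs (the six-week window in use case 2 came to about 600K tokens)
A 300-page PDF with room to spare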

The practical effect is that you stop thinking about what to include and start thinking about what to ask.
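Before sending anything, it helps to check whether the input fits at all. The sketch below is a rough estimate only: the 4-characters-per-token ratio and the reserve for the prompt and reply are assumptions, the helper names are hypothetical, and the two limits are the figures quoted in this post.

```python
from pathlib import Path

# Limits as quoted in this post.
CLAUDE_LIMIT = 200_000
GEMINI_LIMIT = 1_048_576

# Assumption: about 4 characters per token. Real tokenizers vary with
# language and content, so treat the result as a ballpark figure.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_dir_tokens(root: str, extensions: tuple = (".py",)) -> int:
    """Approximate token count of all matching files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def fits(tokens: int, limit: int, reserve: int = 8_192) -> bool:
    """True if the input leaves `reserve` tokens for the question and the reply."""
    return tokens + reserve <= limit
```

With these helpers, `fits(estimate_dir_tokens("./src"), GEMINI_LIMIT)` gives a quick yes or no. For an exact figure, the SDK used in the examples below also exposes `model.count_tokens(...)`, which should be preferred when the estimate lands near a limit.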

Use case 1: whole-codebase bug hunting

The first thing I fed Gemini was the full codebase I'd been trying to chunk — a Python monorepo with about 280K tokens of actual source. The question was specific: find every place where we call an internal API that can return null for a user field, but don't check for it before using the value.

Chunking this would have required me to know which files were relevant up front. That's circular: if I knew where the bugs were, I wouldn't need to search for them. With the full codebase in context, I could ask the question without any pre-filtering.

python

import google.generativeai as genai
from pathlib import Path
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.5-flash")

def load_codebase(root: str, extensions: tuple = (".py",)) -> str:
    """Concatenate every matching file under root, each under a path header."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in extensions and path.is_file():
            try:
                source = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                # Skip files that cannot be read (permissions, broken symlinks).
                continue
            parts.append(f"# === {path} ===\n{source}")
    return "\n\n".join(parts)


codebase = load_codebase("./src")

prompt = """
You are reviewing a Python codebase for null-safety bugs.

The internal user API (`get_user_by_id`) can return None if the user does not
exist. Find every call site where the return value is used without a None check
before accessing a field or calling a method on it.

For each issue, return:
- File path and line number
- The problematic code snippet
- Why it will crash

Codebase:
""" + codebase

response = model.generate_content(prompt)
print(response.text)

It found 11 call sites across 8 files, three of which were in code paths I'd never looked at because they were only hit during account deletion flows. The chunked approach would have required me to guess which files to include, and I almost certainly would have missed the deletion paths.
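One detail worth noting: the prompt asks for line numbers, but the concatenated source carries none, so the model has to count lines itself. A possible refinement, sketched here as an assumption and not as what the run above used, is to number each line before concatenating. The helper name `with_line_numbers` is hypothetical.

```python
def with_line_numbers(path: str, source: str) -> str:
    """Return the file under a path header, each line prefixed with its 1-based number."""
    numbered = [
        f"{lineno:>5}| {line}"
        for lineno, line in enumerate(source.splitlines(), start=1)
    ]
    return f"# === {path} ===\n" + "\n".join(numbered)
```

Calling this in place of the plain f-string in `load_codebase` would give the model something to quote directly. The trade-off is a few extra tokens per line, which adds up on a large codebase.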

Use case 2: six months of logs, one question

Log analysis is where context size wins almost unconditionally. A typical approach is to grep for specific error strings, pull a 24-hour window, or run a log aggregation query in your observability platform. These all require you to know what you're looking for before you look.

I had six months of application logs: about 540MB compressed, which decompresses to roughly 2.8GB of text, far too large even for Gemini's context. But I also had a specific time window I cared about: the six weeks before a performance regression was first reported. That window came to about 400MB of raw log lines, and after stripping timestamps and deduplicating identical repeated entries, the token count dropped to around 600K, comfortably inside Gemini's limit.

python

import google.generativeai as genai
import gzip
import os
from pathlib import Path

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.5-flash")

def load_logs(log_dir: str, max_chars: int = 3_000_000) -> str:
    """Load and concatenate log files, stopping at a character limit."""
    lines = []
    total = 0
    for log_file in sorted(Path(log_dir).glob("*.log.gz")):
        with gzip.open(log_file, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                if total + len(line) > max_chars:
                    # Stop entirely; a bare break would only exit the inner
                    # loop and keep reading from the next file.
                    return "\n".join(lines)
                lines.append(line.rstrip())
                total += len(line)
    return "\n".join(lines)


logs = load_logs("./logs/2026-q1")

prompt = f"""
These are six weeks of application server logs.

Find all timeout patterns: requests that timed out, retries triggered by
timeouts, downstream service calls that exceeded their deadline. For each
pattern, identify:
- Which endpoint or service was involved
- The frequency and time distribution (did it cluster around specific hours?)
- Whether the timeout appears to have cascaded to other services

Logs:
{logs}
"""

response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(max_output_tokens=4096),
)
print(response.text)

The response identified a pattern I had not seen: database connection timeouts on the orders service were clustering between 02:00 and 03:30 UTC every night, and each one triggered a retry storm on the notification service that was extending the actual user-visible timeout from 30 seconds to nearly 4 minutes. No grep query I would have written would have connected those two services across a six-week window.
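The loader above reads the files as they are; the preprocessing mentioned earlier (stripping timestamps and collapsing repeated entries) is not shown. Here is one way it might look. The log format is an assumption (each line starting with an ISO-8601 timestamp), and so is the choice to keep the date and hour as a coarse bucket so that time-of-day questions stay answerable. The name `compact_logs` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed format: lines start with a timestamp such as
# "2026-02-03T02:14:55.123Z". Adjust the pattern for other formats.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2})[T ](\d{2}):\d{2}:\d{2}(?:\.\d+)?Z?\s*")


def compact_logs(lines: Iterable[str]) -> List[str]:
    """Collapse lines whose message is identical within the same hour."""
    counts = Counter()
    for line in lines:
        line = line.rstrip()
        match = TS.match(line)
        if match:
            # Keep only the date and hour; drop minutes, seconds, fractions.
            bucket = f"{match.group(1)} {match.group(2)}h"
            message = line[match.end():]
        else:
            bucket, message = "", line
        counts[(bucket, message)] += 1
    # Counter keeps first-seen order, so the output stays chronological.
    result = []
    for (bucket, message), n in counts.items():
        suffix = f" (x{n})" if n > 1 else ""
        result.append(f"{bucket} {message}{suffix}".strip())
    return result
```

A nightly burst of identical timeout lines then shrinks to one line per hour with a repeat count, which is where most of the token savings would come from on noisy logs.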

Use case 3: entire PDF as a search target

The standard approach for large PDF Q&A is a RAG pipeline: chunk the PDF, embed the chunks, store them in a vector database, embed the query, retrieve the top-k chunks, and send those to the model. It works, but it has a real failure mode. When a question requires synthesizing information from two parts of the document that ended up in different chunks, retrieval may return only one of them, and the model answers from partial context with no visible error signal.
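A toy illustration of that failure mode, using naive fixed-size chunking and plain substring matching in place of embeddings. The spec text, the chunk size, and both helper names are made up for the example; a real pipeline would chunk and retrieve differently, but the boundary problem is the same.

```python
from typing import List


def chunk(text: str, size: int) -> List[str]:
    """Naive fixed-size chunking with no overlap."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunks_with_all(chunks: List[str], terms: List[str]) -> List[int]:
    """Indices of chunks that contain every term."""
    return [i for i, c in enumerate(chunks) if all(t in c for t in terms)]


# Hypothetical spec text: two related facts separated by unrelated material.
spec = (
    "Section 2: the connect timeout is 5 seconds. "
    + "Unrelated material. " * 20
    + "Section 9: batch clients override the connect timeout to 30 seconds."
)
terms = ["5 seconds", "30 seconds"]

print(chunks_with_all(chunk(spec, 120), terms))  # prints []: no chunk has both facts
print(chunks_with_all([spec], terms))            # prints [0]: the whole document does
```

Retrieval that returns either chunk alone would support a confident answer of "5 seconds" or "30 seconds", and neither answer is complete.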

For a 300-page technical specification I was working with, I wanted to ask things like "find all places where the spec mentions a timeout value and list them with context." That kind of question needs the full document in view because the answer is spread across it.

python

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.5-flash")

# Upload the PDF using the File API — Gemini handles extraction
pdf_file = genai.upload_file(
    path="./specs/payments-api-spec-v3.pdf",
    mime_type="application/pdf",
    display_name="payments-api-spec-v3",
)

questions = [
    "List every timeout value mentioned in the spec, with the section and context.",
    "What are all the error codes the API can return, and what triggers each one?",
    "Find any place where the spec says behavior is 'implementation-defined' or 'undefined'.",
]

for question in questions:
    print(f"\n--- {question} ---\n")
    response = model.generate_content([pdf_file, question])
    print(response.text)

The File API handles PDF extraction on Google's side, so you don't have to extract the text yourself or inline the binary in the request body. For a 300-page document, the extraction lands at roughly 180K-250K tokens depending on how text-dense the pages are, well within Gemini's limit, and no embedding pipeline is needed.

When to use Gemini over Claude for this

This is not a claim that Gemini is a better model in general. For most tasks I still reach for Claude. But context size is a genuine technical constraint, and when your input exceeds 200K tokens, Claude leaves you with only the two options from earlier: truncate or chunk.

The cases where Gemini's 1M context window changes what's possible:

Cross-file bug hunting — when the bug spans multiple files and you don't know which ones
Log analysis over long time windows — when the pattern you're looking for might only be visible across weeks of data
Full-document Q&A — when questions require synthesizing information from non-adjacent sections
Schema + migration history analysis — when you need to understand the full evolution of a database schema across many migrations

The cost and latency tradeoffs are real. A request that fills most of Gemini's 1M-token window is much slower, and bills far more input tokens, than a 50K-token request to Claude. For interactive use, that latency is noticeable. For batch analysis tasks where you're running a question against a large corpus, the latency matters far less and the context window advantage is decisive.

The mental shift that helped me: chunking is a workaround for a context limit, not an analysis strategy. If your question needs the whole document, send the whole document.