Staff Prep 15: The Python GIL Explained — What It Is, When It Hurts, What Changed

April 4, 2026 · 9 min read · Part 15 / 18

The GIL (Global Interpreter Lock) is one of the most misunderstood parts of Python. Developers blame it for every threading problem, but in practice it only matters for a specific class of workload. Most Python backend servers are I/O-bound, and the GIL barely affects them. When it does matter, though, you need to know exactly what to do.

What the GIL actually is

The GIL is a mutex (mutual exclusion lock) inside the CPython interpreter. It ensures that only one Python thread executes Python bytecode at any given time, even on multi-core machines. It exists because CPython's memory management (reference counting) is not thread-safe without it.
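A quick way to see why this matters: even a single `+=` compiles to several bytecode instructions, and without the GIL a thread switch mid-sequence could corrupt the interpreter's internal state. A small illustration using the standard `dis` module (the function here is just a throwaway example):

```python
import dis

def bump(counter: int) -> int:
    counter += 1  # one line of Python, several bytecode instructions
    return counter

# Disassemble to see the load / add / store sequence the interpreter executes;
# the GIL guarantees only one thread is stepping through such sequences at a time
dis.dis(bump)
```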

The GIL is not held continuously. Code that releases it can run in true parallel: NumPy, many other C extensions, and the standard library's I/O operations all release the GIL while they work.

python
import threading
import time

def count_to_million():
    count = 0
    while count < 1_000_000:
        count += 1

# CPU-bound threads: GIL causes them to time-share, not truly parallel
# On an 8-core machine, 2 CPU-bound threads run at the speed of 1
start = time.perf_counter()
t1 = threading.Thread(target=count_to_million)
t2 = threading.Thread(target=count_to_million)
t1.start(); t2.start()
t1.join(); t2.join()
elapsed = time.perf_counter() - start
print(f"CPU-bound: {elapsed:.2f}s")  # ~2x a single thread: the GIL prevents real parallelism

# I/O-bound threads: GIL released during I/O waits
import urllib.request

def fetch_url(url):
    urllib.request.urlopen(url).read()  # GIL released during network I/O

start = time.perf_counter()
threads = [threading.Thread(target=fetch_url, args=("http://example.com",)) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - start
print(f"I/O-bound: {elapsed:.2f}s")  # roughly one request's latency: all 10 I/O waits overlapped

When the GIL matters (and when it does not)

GIL does NOT matter for:

  • Asyncio-based web servers (single thread, no contention)
  • I/O-bound threading (network calls, database queries, file I/O)
  • NumPy, pandas, and C extensions (release the GIL during computation)
  • Most FastAPI backends — your bottleneck is I/O, not CPU

GIL DOES matter for:

  • Pure Python CPU computation in threads (data processing loops, parsers)
  • Image processing in pure Python
  • Machine learning inference in pure Python (use PyTorch/TF — they release the GIL)
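The difference is easy to measure. A minimal benchmark sketch (names and the workload size are illustrative; timings vary by machine, but on a standard GIL build the threaded run is no faster than the sequential one):

```python
import threading
import time

def busy(n: int) -> int:
    # Pure-Python CPU loop: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000

# Sequential: two runs back to back
start = time.perf_counter()
busy(N); busy(N)
sequential = time.perf_counter() - start

# Threaded: two runs in parallel threads, which the GIL time-shares
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Swap `busy` for a NumPy operation or a network call and the threaded version pulls ahead, because those release the GIL.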

The correct tools for CPU parallelism

python
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor
import asyncio

# Option 1: multiprocessing.Pool — separate processes, no GIL
def cpu_intensive_task(data: list) -> int:
    return sum(x * x for x in data)

# Guard with `if __name__ == "__main__":` when the spawn start method is used (Windows, macOS)
with Pool(processes=4) as pool:  # 4 processes = 4 real CPU cores
    results = pool.map(cpu_intensive_task, [range(10**6)] * 3)

# Option 2: ProcessPoolExecutor with asyncio (for FastAPI integration)
executor = ProcessPoolExecutor(max_workers=4)

async def run_cpu_task(data: list) -> int:
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside coroutines
    result = await loop.run_in_executor(executor, cpu_intensive_task, list(data))
    return result

# Option 3: Celery task (assumes a `celery` app instance defined elsewhere)
@celery.task
def process_image(image_path: str):
    # Runs in a separate Celery worker process — no GIL contention
    from PIL import Image
    img = Image.open(image_path)
    # ... heavy processing
    return result

Python 3.13: free-threaded mode (no-gil build)

Python 3.13 introduced an experimental "free-threaded" build (PEP 703) that compiles CPython without the GIL. It is arguably the most significant change to Python's concurrency model since threads were added.

bash
# Check if running a free-threaded build
python3.13 -c "import sys; print(sys._is_gil_enabled())"
# True: standard GIL build
# False: free-threaded build (--disable-gil compile flag)

# Install free-threaded Python 3.13 (macOS): the python.org installer offers a
# "Free-threaded Python" option, or use a tool such as uv: uv python install 3.13t
python
import threading
import sys

print(f"GIL enabled: {sys._is_gil_enabled()}")

def count(n):
    total = 0
    while total < n:
        total += 1
    return total

# With free-threaded Python 3.13:
# Two CPU-bound threads truly run in parallel on 2 CPU cores
# Expected speedup: ~2x for 2 threads (vs ~1x with GIL)

# Caveats of free-threaded Python:
# - Many C extensions are not thread-safe without the GIL
# - You need to add your own locks for shared mutable state
# - Single-threaded code runs measurably slower (roughly 40% overhead reported for 3.13, expected to shrink in later releases)
# - Still experimental as of 3.13 — not production-ready for most cases
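The second caveat is the big one. The GIL never made compound operations like `counter += 1` atomic, but free-threading makes such races far more likely to bite. A sketch of the explicit locking you must do yourself (correct on any build, GIL or not):

```python
import threading

counter = 0
lock = threading.Lock()

def add(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:  # counter += 1 is read-modify-write, not atomic
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, deterministic only because of the lock
```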

Practical GIL workarounds for web backends

python
from fastapi import FastAPI, Depends
from concurrent.futures import ProcessPoolExecutor
import asyncio

app = FastAPI()
cpu_executor = ProcessPoolExecutor(max_workers=4)

# Pattern: offload CPU work to process pool, keep I/O in async
@app.post("/process-report")
async def process_report(report_id: int, db=Depends(get_db)):  # get_db dependency defined elsewhere
    # 1. Fetch data from DB (async I/O — stays in event loop)
    raw_data = await db.fetch_report_data(report_id)

    # 2. Heavy computation — offload to process pool
    loop = asyncio.get_running_loop()
    processed = await loop.run_in_executor(
        cpu_executor,
        compute_report_stats,  # pure function, no I/O
        raw_data
    )

    # 3. Write result to DB (async I/O — back in event loop)
    await db.save_report(report_id, processed)
    return {"status": "done"}

# Note: data passed to run_in_executor must be picklable
# (primitives, dicts, lists — not SQLAlchemy model instances)
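The same offload pattern, stripped of FastAPI so it runs standalone (the function and data shapes here are illustrative, not from the original):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def compute_stats(rows: list) -> dict:
    # Pure CPU work: safe to hand to a worker process
    values = [r["value"] for r in rows]
    return {"count": len(values), "total": sum(values)}

async def main() -> dict:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Arguments must be picklable: plain dicts here, not ORM objects
        return await loop.run_in_executor(
            pool, compute_stats, [{"value": i} for i in range(5)]
        )

if __name__ == "__main__":
    print(asyncio.run(main()))  # {'count': 5, 'total': 10}
```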

Gunicorn workers vs threads vs processes

bash
# Gunicorn with Uvicorn workers: BEST for async FastAPI
# Multiple processes, no GIL contention between them
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

# Gunicorn gthread workers: OK for sync Flask/Django
# Threads share one process; the GIL limits CPU parallelism but not I/O concurrency
gunicorn app:app -w 4 --threads 4 -k gthread

# Rule of thumb for worker count:
# Async FastAPI: workers = CPU cores (each runs its own event loop)
# Sync Django: workers = CPU cores * 2 + 1 (waiting on I/O = GIL released often)
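Those rules of thumb as a quick calculation (heuristics only, not hard limits; tune under real load):

```python
import os

cores = os.cpu_count() or 1

# Async FastAPI: one worker (and one event loop) per core
async_workers = cores

# Sync Django/Flask: threads spend most time blocked on I/O with the GIL
# released, so oversubscribe using the classic (2 * cores) + 1 heuristic
sync_workers = 2 * cores + 1

print(f"-w {async_workers} for async, -w {sync_workers} for sync")
```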

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. You have 4 Python threads running database queries concurrently. Does the GIL slow them down? Why or why not?
  2. You have 4 Python threads each running a CPU-intensive loop to 10 million. You are on an 8-core machine. How long does it take compared to 1 thread? Why?
  3. What is the correct tool for true CPU parallelism in Python? When would you choose ProcessPoolExecutor vs a Celery worker?
  4. Python 3.13 free-threaded mode removes the GIL. What new problems does this introduce that developers must handle themselves?
  5. Your FastAPI app needs to generate PDF reports (CPU-bound, ~2 seconds each). Walk through the exact architecture you would use to handle 50 concurrent report requests without blocking the event loop.

Next up — Part 16: asyncio Deep Dive. Event loop internals, coroutines vs tasks, gather vs wait, and common deadlock patterns.
