Staff Prep 15: The Python GIL Explained — What It Is, When It Hurts, What Changed

April 4, 2026 · 9 min read · Part 15 / 18

The GIL (Global Interpreter Lock) is one of the most misunderstood parts of Python. Developers blame it for every threading problem, but in practice it only matters for a specific class of workload. Most Python backend servers are I/O-bound, and the GIL barely affects them. When it does matter, though, you need to know exactly what to do.

What the GIL actually is

The GIL is a mutex (mutual exclusion lock) inside the CPython interpreter. It ensures that only one Python thread executes Python bytecode at any given time, even on multi-core machines. It exists because CPython's memory management (reference counting) is not thread-safe without it.
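A quick way to see why this matters: even a single `+=` compiles to several bytecode instructions, and without the GIL a thread switch mid-sequence could corrupt the interpreter's internal state. A small illustration using the standard `dis` module (the function here is just a throwaway example):

```python
import dis

def bump(counter: int) -> int:
    counter += 1  # one line of Python, several bytecode instructions
    return counter

# Disassemble to see the load / add / store sequence the interpreter executes;
# the GIL guarantees only one thread is stepping through such sequences at a time
dis.dis(bump)
```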

The GIL is not held continuously. Code that releases it can run in true parallel: NumPy, many other C extensions, and the standard library's I/O operations all release the GIL while they work.

python
import threading
import time

def count_to_million():
    count = 0
    while count < 1_000_000:
        count += 1

# CPU-bound threads: GIL causes them to time-share, not truly parallel
# On an 8-core machine, 2 CPU-bound threads run at the speed of 1
start = time.perf_counter()
t1 = threading.Thread(target=count_to_million)
t2 = threading.Thread(target=count_to_million)
t1.start(); t2.start()
t1.join(); t2.join()
elapsed = time.perf_counter() - start
print(f"CPU-bound: {elapsed:.2f}s")  # ~2x a single thread: the GIL prevents real parallelism

# I/O-bound threads: GIL released during I/O waits
import urllib.request

def fetch_url(url):
    urllib.request.urlopen(url).read()  # GIL released during network I/O

start = time.perf_counter()
threads = [threading.Thread(target=fetch_url, args=("http://example.com",)) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - start
print(f"I/O-bound: {elapsed:.2f}s")  # roughly one request's latency: all 10 I/O waits overlapped

When the GIL matters (and when it does not)

GIL does NOT matter for:

  • Asyncio-based web servers (single thread, no contention)
  • I/O-bound threading (network calls, database queries, file I/O)
  • NumPy, pandas, and C extensions (release the GIL during computation)
  • Most FastAPI backends — your bottleneck is I/O, not CPU

GIL DOES matter for:

  • Pure Python CPU computation in threads (data processing loops, parsers)
  • Image processing in pure Python
  • Machine learning inference in pure Python (use PyTorch/TF — they release the GIL)
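The difference is easy to measure. A minimal benchmark sketch (names and the workload size are illustrative; timings vary by machine, but on a standard GIL build the threaded run is no faster than the sequential one):

```python
import threading
import time

def busy(n: int) -> int:
    # Pure-Python CPU loop: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000

# Sequential: two runs back to back
start = time.perf_counter()
busy(N); busy(N)
sequential = time.perf_counter() - start

# Threaded: two runs in parallel threads, which the GIL time-shares
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Swap `busy` for a NumPy operation or a network call and the threaded version pulls ahead, because those release the GIL.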

The correct tools for CPU parallelism

python
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor
import asyncio

# Option 1: multiprocessing.Pool — separate processes, no GIL
def cpu_intensive_task(data: list) -> int:
    return sum(x * x for x in data)

# Guard with `if __name__ == "__main__":` when the spawn start method is used (Windows, macOS)
with Pool(processes=4) as pool:  # 4 processes = 4 real CPU cores
    results = pool.map(cpu_intensive_task, [range(10**6)] * 3)

# Option 2: ProcessPoolExecutor with asyncio (for FastAPI integration)
executor = ProcessPoolExecutor(max_workers=4)

async def run_cpu_task(data: list) -> int:
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside coroutines
    result = await loop.run_in_executor(executor, cpu_intensive_task, list(data))
    return result

# Option 3: Celery task (assumes a `celery` app instance defined elsewhere)
@celery.task
def process_image(image_path: str):
    # Runs in a separate Celery worker process — no GIL contention
    from PIL import Image
    img = Image.open(image_path)
    # ... heavy processing
    return result

Python 3.13: free-threaded mode (no-gil build)

Python 3.13 introduced an experimental "free-threaded" build (PEP 703) that compiles CPython without the GIL. It is arguably the most significant change to Python's concurrency model since threads were added.

bash
# Check if running a free-threaded build
python3.13 -c "import sys; print(sys._is_gil_enabled())"
# True: standard GIL build
# False: free-threaded build (--disable-gil compile flag)

# Install free-threaded Python 3.13 (macOS): the python.org installer offers a
# "Free-threaded Python" option, or use a tool such as uv: uv python install 3.13t
python
import threading
import sys

print(f"GIL enabled: {sys._is_gil_enabled()}")

def count(n):
    total = 0
    while total < n:
        total += 1
    return total

# With free-threaded Python 3.13:
# Two CPU-bound threads truly run in parallel on 2 CPU cores
# Expected speedup: ~2x for 2 threads (vs ~1x with GIL)

# Caveats of free-threaded Python:
# - Many C extensions are not thread-safe without the GIL
# - You need to add your own locks for shared mutable state
# - Single-threaded code runs measurably slower (roughly 40% overhead reported for 3.13, expected to shrink in later releases)
# - Still experimental as of 3.13 — not production-ready for most cases
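The second caveat is the big one. The GIL never made compound operations like `counter += 1` atomic, but free-threading makes such races far more likely to bite. A sketch of the explicit locking you must do yourself (correct on any build, GIL or not):

```python
import threading

counter = 0
lock = threading.Lock()

def add(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:  # counter += 1 is read-modify-write, not atomic
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, deterministic only because of the lock
```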

Practical GIL workarounds for web backends

python
from fastapi import FastAPI, Depends
from concurrent.futures import ProcessPoolExecutor
import asyncio

app = FastAPI()
cpu_executor = ProcessPoolExecutor(max_workers=4)

# Pattern: offload CPU work to process pool, keep I/O in async
@app.post("/process-report")
async def process_report(report_id: int, db=Depends(get_db)):  # get_db dependency defined elsewhere
    # 1. Fetch data from DB (async I/O — stays in event loop)
    raw_data = await db.fetch_report_data(report_id)

    # 2. Heavy computation — offload to process pool
    loop = asyncio.get_running_loop()
    processed = await loop.run_in_executor(
        cpu_executor,
        compute_report_stats,  # pure function, no I/O
        raw_data
    )

    # 3. Write result to DB (async I/O — back in event loop)
    await db.save_report(report_id, processed)
    return {"status": "done"}

# Note: data passed to run_in_executor must be picklable
# (primitives, dicts, lists — not SQLAlchemy model instances)
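The same offload pattern, stripped of FastAPI so it runs standalone (the function and data shapes here are illustrative, not from the original):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def compute_stats(rows: list) -> dict:
    # Pure CPU work: safe to hand to a worker process
    values = [r["value"] for r in rows]
    return {"count": len(values), "total": sum(values)}

async def main() -> dict:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Arguments must be picklable: plain dicts here, not ORM objects
        return await loop.run_in_executor(
            pool, compute_stats, [{"value": i} for i in range(5)]
        )

if __name__ == "__main__":
    print(asyncio.run(main()))  # {'count': 5, 'total': 10}
```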

Gunicorn workers vs threads vs processes

bash
# Gunicorn with Uvicorn workers: BEST for async FastAPI
# Multiple processes, no GIL contention between them
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

# Gunicorn gthread workers: OK for sync Flask/Django
# Threads share one process; the GIL limits CPU parallelism but not I/O concurrency
gunicorn app:app -w 4 --threads 4 -k gthread

# Rule of thumb for worker count:
# Async FastAPI: workers = CPU cores (each runs its own event loop)
# Sync Django: workers = CPU cores * 2 + 1 (waiting on I/O = GIL released often)
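Those rules of thumb as a quick calculation (heuristics only, not hard limits; tune under real load):

```python
import os

cores = os.cpu_count() or 1

# Async FastAPI: one worker (and one event loop) per core
async_workers = cores

# Sync Django/Flask: threads spend most time blocked on I/O with the GIL
# released, so oversubscribe using the classic (2 * cores) + 1 heuristic
sync_workers = 2 * cores + 1

print(f"-w {async_workers} for async, -w {sync_workers} for sync")
```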

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. You have 4 Python threads running database queries concurrently. Does the GIL slow them down? Why or why not?
  2. You have 4 Python threads each running a CPU-intensive loop to 10 million. You are on an 8-core machine. How long does it take compared to 1 thread? Why?
  3. What is the correct tool for true CPU parallelism in Python? When would you choose ProcessPoolExecutor vs a Celery worker?
  4. Python 3.13 free-threaded mode removes the GIL. What new problems does this introduce that developers must handle themselves?
  5. Your FastAPI app needs to generate PDF reports (CPU-bound, ~2 seconds each). Walk through the exact architecture you would use to handle 50 concurrent report requests without blocking the event loop.

Next up — Part 16: asyncio Deep Dive. Event loop internals, coroutines vs tasks, gather vs wait, and common deadlock patterns.
