The Curse of the GIL: Why Python Is Actually Slow

A Single-Lane Bridge in a Multi-Core World

I keep running into the same wall. The bot is humming, the strategy is correct, the math checks out — and yet, when I add a second worker thread to chase a hot opportunity, things actually get worse. Not faster. Not the same. Worse.

For a while I blame my code. Then I blame the network. Then I blame myself for picking Python in the first place. The truth, it turns out, is older than I am as a developer: a small piece of plumbing called the Global Interpreter Lock, or GIL, sits in the middle of every CPython program ever written, deciding which of my threads gets to breathe at any given microsecond.

On a modern machine with eight or sixteen cores sitting idle, that single lock feels like discovering a brand-new sixteen-lane interstate has been narrowed down to one lane right where I need to merge. Everyone takes turns. Nobody really speeds up.

In an arbitrage context — where the gap between spotting an opportunity and acting on it is measured in milliseconds — that single-lane bridge is not a quirk. It is the core economic problem.

What the GIL Actually Is

Real Python's GIL guide defines it cleanly: "Python's Global Interpreter Lock, or GIL, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter." Strip away the jargon and it means this: no matter how many threads I create, only one of them is actually executing Python bytecode at any given moment. The others sit in the dugout waiting for their at-bat.

The analogy I keep coming back to is a food truck with one window. You can hire ten cooks, but if there is exactly one window where customers can pay and pick up their order, you do not get ten times the throughput. You get one window's worth of throughput, plus the overhead of ten cooks bumping into each other behind the counter.

The GIL exists because CPython manages memory through reference counting. Every Python object carries a little counter that tracks how many variables point to it. When the count drops to zero, the object is freed. If two threads tried to bump that counter at the same moment, the count could end up wrong — and "wrong" here means the interpreter either leaks memory or frees something that is still in use, which in C-land typically translates to a crash. The GIL prevents that race by forbidding the race in the first place. One thread, one counter update, one safe interpreter.
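The counter itself is easy to watch from inside Python. Here is a minimal sketch using the standard library's sys.getrefcount. The helper name refcount_deltas is mine, and the absolute counts vary between CPython versions, so only the changes are reported:

```python
import sys


def refcount_deltas():
    """Return how the reference count of one object changes as references
    to it are added and removed. Only the differences are meaningful: the
    absolute value includes temporary references made by the call itself."""
    payload = object()
    holders = []

    baseline = sys.getrefcount(payload)

    holders.append(payload)          # the list now holds a second reference
    after_add = sys.getrefcount(payload)

    holders.clear()                  # dropping it lowers the count again
    after_drop = sys.getrefcount(payload)

    return after_add - baseline, after_drop - baseline


if __name__ == "__main__":
    added, dropped = refcount_deltas()
    print(f"after adding a reference: {added:+d}, after dropping it: {dropped:+d}")
```

In the classic build each of those increments and decrements is a plain, unsynchronized write, which is exactly the kind of update the GIL keeps two threads from making at once.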

It is a design that makes single-threaded Python fast and simple. It is also a design that, in 2026, is the single most expensive line item in my latency budget.

How a 1992 Decision Outlived Its Era

Guido van Rossum added the GIL in 1992, just one year after the language's first release. In a quote preserved by The Server Side, he describes the original intent plainly: "We'll provide something that looks like threads, and as long as you only have a single CPU on your computer — which most computers at the time did — it feels just like threads."

In 1992, that is an entirely reasonable bet. The Pentium has not even shipped. Multi-core consumer chips are a decade away. Threading is mostly about being polite to the operating system while waiting on a disk read, not about parallel computation. The GIL is not a compromise — it is the obvious answer.

By the mid-2000s the world looks nothing like 1992. Dual-core laptops at every Best Buy. Servers shipping with dozens of cores. Languages without a global lock — Java, C#, Go — quietly eat Python's lunch in any workload that benefits from real parallelism. But by then Python's ecosystem is enormous. NumPy, SciPy, pandas, every C extension ever written all assume the GIL exists and use it as an implicit safety net. Removing it would break the world.

So the temporary 1992 lock becomes permanent. Not because anyone is happy with it, but because the cost of removing it is higher than the cost of living with it.

In 2007, Guido formalizes the conditions under which he will accept a GIL-removal patch. According to The Server Side, removal will be welcomed "only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease." That sentence kills every patch attempt for the next sixteen years. Every removal attempt slows something else down. Every attempt gets rejected.

The Benchmark That Made Me Stop Trusting Threads

Real Python publishes a benchmark in their GIL guide that cured me of any romantic ideas about Python multithreading. The setup is intentionally simple: a CPU-bound countdown loop with fifty million iterations.

Single thread: 6.20 seconds
Two threads: 6.92 seconds
Two processes (multiprocessing): 4.06 seconds

Look at that middle line. Adding a second thread does not just fail to help. It actively makes the program slower than running it on one thread. The cost of the threads bumping into each other on the GIL exceeds any work they can do in parallel — because they cannot actually do any work in parallel.
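The shape of that experiment is easy to reproduce. What follows is a sketch in the same spirit, not Real Python's script: the function names and the much smaller iteration count are my own choices, and absolute timings will differ by machine and Python version.

```python
import threading
import time


def countdown(n: int) -> None:
    """Pure-Python CPU-bound loop. It never waits on I/O, so it never
    voluntarily gives up the GIL."""
    while n > 0:
        n -= 1


def time_single(n: int) -> float:
    """Seconds taken to count down n on the calling thread."""
    start = time.perf_counter()
    countdown(n)
    return time.perf_counter() - start


def time_two_threads(n: int) -> float:
    """Seconds taken when the same total work is split across two threads."""
    half = n // 2
    threads = [threading.Thread(target=countdown, args=(half,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    N = 2_000_000  # hypothetical size, far below the fifty million used above
    print(f"single thread: {time_single(N):.3f}s")
    print(f"two threads:   {time_two_threads(N):.3f}s")
```

On a GIL build I would expect the two numbers to land close together, with the threaded run often slightly behind. On a free-threaded build the picture should change.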

Multiprocessing, which spawns separate OS processes (each with its own GIL, its own memory, its own everything), is the only configuration that delivers a real speedup. And multiprocessing has its own taxes: process startup time, inter-process communication overhead, memory duplication, serialization costs every time data crosses a process boundary.

For a long-running batch job that is fine. For a hot-loop arbitrage bot that needs to react inside a single block window, paying multi-millisecond serialization costs to talk between processes is not a serious option.
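The serialization part of that tax can be estimated without spawning anything, because multiprocessing pickles objects before handing them to another process. This is a rough sketch with a made-up payload of pool reserves. It measures pickling only, so the real inter-process cost (pipes, scheduling, wake-ups) comes on top:

```python
import pickle
import time


def pickle_round_trip(payload):
    """Serialize and deserialize payload once, returning the restored
    object, the serialized size in bytes, and the elapsed seconds."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    return restored, len(blob), elapsed


if __name__ == "__main__":
    # Hypothetical state snapshot: 10,000 pools, two reserves each.
    snapshot = {f"pool-{i}": (1_000_000 + i, 2_000_000 - i) for i in range(10_000)}
    _, size, seconds = pickle_round_trip(snapshot)
    print(f"{size} bytes, {seconds * 1000:.2f} ms per round trip")
```

Running it against a realistic snapshot of my own state is a quick way to see whether the boundary crossing fits inside the latency budget at all.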

Where the GIL Hides — and Where It Doesn't

There is a wrinkle that confuses every newcomer, including me a few months ago. Not all Python work suffers equally. The GIL is released cooperatively when a thread is waiting on something outside Python — a network response, a disk read, an OS call. That is why "Python is slow" is a half-truth.

If my workload is I/O-bound — say, fanning out a hundred concurrent web requests to scrape pool prices — Python threads work pretty well. Each thread spends almost all its time waiting for bytes to arrive over the wire. While it waits, it releases the GIL, and another thread gets to run. The bottleneck is the network, not the interpreter.

If my workload is CPU-bound — calculating the optimal trade size across a graph of liquidity pools, simulating an AMM swap, hashing, decoding — every thread wants the interpreter at the same time, and the GIL becomes the choke point.
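The I/O-bound half of that contrast is the easier one to demonstrate. This sketch uses time.sleep as a stand-in for a network wait. The fake_fetch function and its 50 ms delay are invented for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(pool_id: int, delay: float = 0.05) -> int:
    """Stand-in for a network call. time.sleep blocks outside the
    interpreter, so the GIL is released while the thread waits."""
    time.sleep(delay)
    return pool_id


def fetch_all(pool_ids, max_workers: int = 8):
    """Fetch every id on a thread pool; return the results and the
    wall-clock seconds the whole batch took."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_fetch, pool_ids))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = fetch_all(range(8))
    print(f"8 fetches of 50 ms each finished in {elapsed:.3f}s")
```

Eight waits overlap instead of queueing, so the batch takes roughly as long as one fetch rather than eight. Replace the sleep with a pure-Python loop and that overlap disappears on a GIL build.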

Most real bots are a messy mix of both. The simulator that prices a candidate trade is CPU-bound. The fetcher that pulls fresh state from blockchain endpoints is I/O-bound. The scorer that ranks opportunities is CPU-bound. The submitter that fires the transaction is I/O-bound. If I throw all of it into a thread pool, the CPU-bound stages serialize behind the GIL while the I/O-bound stages do their thing — and the overall pipeline runs at the speed of the slowest serialized stage.

This is why a profile of a Python bot under load almost never matches intuition. The slow part is not the part that does the work. It is the part where ten threads are politely standing in line to do the work one at a time.

The Numbers That Matter When Latency Is Money

The MEV literature is unusually honest about language performance, because the answer falls out of the data with very little ambiguity. Solid Quant ran a seven-operation MEV benchmark comparing JavaScript, Python, and Rust on the same workload. A few rows from that table tell the whole story.

For blockchain endpoint construction, Python clocks in at roughly 1,100 microseconds against Rust's 8 microseconds — Python is roughly 137 times slower for that single operation. For a batch multicall request of 3,774 calls, Python runs in about 1,600 milliseconds against Rust's 170 milliseconds, roughly a 9-times gap. Solid Quant's author also notes that Python's results were unstable enough to make benchmarking awkward, which tracks with my own experience: Python latency is not just slow on average, it is spiky.

A different practitioner migrated a high-frequency trading system from Python to Rust and reported tick-to-trade latency dropping from roughly 12 milliseconds in Python (with spikes up to 80 milliseconds) to roughly 40 microseconds in Rust. That is a multi-hundred-times difference, and the author's quote nails the issue: "When my loop takes 5 microseconds, it always takes 5 microseconds." Predictability matters as much as raw speed. Python's garbage collector and GIL between them produce occasional pauses that are perfectly acceptable in a web app but ruinous in a trading loop.

A CoinMonks post on a C++ MEV detection engine frames the threshold sharply: MEV opportunities have to be captured "within 200ms," and "the difference between 15ms and 150ms is the difference between capturing an opportunity and missing it." The author cites the Dwellir blog on MEV arbitrage infrastructure for that threshold. When my Python pipeline occasionally pauses for a stop-the-world garbage collection cycle that takes longer than the entire opportunity window, the math stops being abstract.
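Whether collector pauses actually eat into that window is measurable. CPython calls every function in gc.callbacks at the start and end of each collection, so a small recorder can time them. This is only a sketch. The class name is mine, and nothing like it appears in the posts cited above:

```python
import gc
import time


class GcPauseRecorder:
    """Record how long each garbage collection takes, in seconds."""

    def __init__(self):
        self.pauses = []
        self._started_at = None

    def __call__(self, phase, info):
        # CPython passes phase="start" before a collection and "stop" after.
        if phase == "start":
            self._started_at = time.perf_counter()
        elif phase == "stop" and self._started_at is not None:
            self.pauses.append(time.perf_counter() - self._started_at)
            self._started_at = None

    def install(self):
        gc.callbacks.append(self)

    def uninstall(self):
        gc.callbacks.remove(self)


if __name__ == "__main__":
    recorder = GcPauseRecorder()
    recorder.install()
    # Build some reference cycles so the collector has work to do.
    for _ in range(50_000):
        a, b = [], []
        a.append(b)
        b.append(a)
    gc.collect()
    recorder.uninstall()
    longest_ms = max(recorder.pauses) * 1000
    print(f"{len(recorder.pauses)} collections, longest {longest_ms:.3f} ms")
```

Leaving a recorder like this installed during a load test gives a distribution of pause lengths to compare against the opportunity window, instead of a guess.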

None of these numbers are about Python being a bad language. They are about a language whose interpreter was designed to make a single thread fast on a single CPU, now running on hardware and workloads it was never built for. The GIL is not a bug. It is a design that has outlived the machines it was built for.

The Three Real Reasons the GIL Has Survived

Why hasn't this been fixed already? Wikipedia summarizes the three reasons the GIL exists in the first place: speed for single-threaded programs, easy integration with C libraries that are not thread-safe, and simplicity of implementation.

That second reason is the killer. A huge chunk of Python's value comes from its C extensions — NumPy, SciPy, pandas, scientific stacks, machine learning runtimes, low-level networking libraries. Most of those C extensions were written assuming the GIL was always held when their code was called. Remove the GIL and a lot of them have undefined behavior under concurrent access. They might work. They might silently corrupt data. They might segfault under load.

This is not a theoretical concern. PEP 703, the official proposal for making the GIL optional, quotes a DeepMind report in its motivation section saying that "even with fewer than 10 threads the GIL becomes the bottleneck," which pushed the team to move "large parts of our Python codebase into C++." When one of the most sophisticated AI labs in the world tells you their workaround is to rewrite Python in C++, that is a strong signal about where the actual ceiling sits.

The third reason — simplicity — is the one that has compounded over thirty years. The CPython interpreter is a complex piece of software. Adding fine-grained locking everywhere the GIL used to protect things is not a single weekend's work. It is years of careful, surgical changes across hundreds of files, plus a rewrite of the memory allocator, plus a rewrite of the reference counting machinery, plus an entire compatibility story for the C extensions that were not written with concurrency in mind.

What's Actually Changing — Slowly

The news, finally, is genuinely good. After decades of failed attempts, PEP 703 was accepted in 2023, and the Python steering council has committed to a multi-phase removal. The JetBrains PyCharm blog timeline maps out the milestones: Sam Gross reignites the discussion in 2021; PEP 703 lands in 2023; Python 3.13 ships in October 2024 with an experimental free-threaded build; PEP 779 promotes free-threading from "experimental" to officially supported in June 2025; Python 3.14 ships in October 2025.

The Steering Council's acceptance of PEP 703 divides removal into three phases. Phase 1, in Python 3.13, makes free-threading available as an experimental build behind a compile-time flag. Phase 2, which PEP 779 opens with Python 3.14, makes the free-threaded build officially supported but still optional, with the GIL build remaining the default. Phase 3, projected for around Python 3.17 and 3.18 in 2028 to 2030, finally flips the default: GIL off, with an environment variable to turn it back on if your code base needs the old behavior.
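Because the build and the runtime state are separate things, it helps to check both before trusting any benchmark. In this small sketch the two helper names are mine. The Py_GIL_DISABLED config variable and sys._is_gil_enabled() are what CPython 3.13 and later expose, as far as I can tell from the documentation:

```python
import sys
import sysconfig


def build_supports_free_threading() -> bool:
    """True when this interpreter was compiled as a free-threaded build."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_is_active() -> bool:
    """True when the GIL is enabled in the running process.

    sys._is_gil_enabled() exists from Python 3.13 onward. On older
    versions the GIL is always on, so the fallback is True.
    """
    checker = getattr(sys, "_is_gil_enabled", None)
    return True if checker is None else bool(checker())


if __name__ == "__main__":
    print(f"free-threaded build: {build_supports_free_threading()}")
    print(f"GIL active right now: {gil_is_active()}")
```

A free-threaded build can still report the GIL as active, for example when it has been switched back on for compatibility, which is why the two questions are asked separately.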

For those who can already use it, the speedups are striking. JetBrains' own prime-counting benchmark, with one million primes split across 16 chunks, runs in roughly 1.19 seconds on Python 3.13.5 with the GIL on a single thread, then gets slower, at 1.22 seconds, when given four threads. The same benchmark on the free-threaded Python 3.13.5t build drops to 0.47 seconds with four threads, which on these figures is roughly 2.5 times faster than either run on the GIL build.

A Medium write-up of FastAPI under both modes shows a CPU-bound endpoint handling roughly 4 requests per second with the GIL on, against roughly 32 requests per second with the GIL off — about an 8-times jump. A separate EPAM benchmark reports a CPU-bound multithreaded job dropping from 8.66 seconds to 1.39 seconds — roughly a 6-times improvement — between GIL and no-GIL builds of Python 3.13.

Those are not microbenchmarks gone wild. They are exactly what the GIL has been costing all along, finally being reclaimed.

The Catch No One Likes to Talk About

Free-threaded Python is not a free lunch. The PSF Language Summit recap that reports free-threading's move to officially supported status also lays out the costs honestly.

First, single-threaded performance regresses. Python 3.13's free-threaded build was roughly 40 percent slower than the GIL build for single-threaded workloads. Python 3.14 brought that penalty down to under 10 percent on most platforms — a real improvement, but still a tax. The official pyperformance numbers in PEP 703 put the long-term overhead at roughly 5 to 8 percent across single-threaded and multi-threaded scenarios on Intel Skylake and AMD Zen 3.

Second, memory usage rises. The PSF blog cites a roughly 20 percent memory increase from biased reference counting and the new object metadata required to track per-thread ownership.

Third, and most importantly, the ecosystem is not ready. The PSF Language Summit recap reports that of the top 360 PyPI projects with C extension modules, only roughly one in six currently supports free-threading. NumPy, pandas, and SciPy are among those that do. Plenty of others are not there yet. If my pipeline depends on a library that has not been ported, running Python 3.14t means either accepting unsafe behavior or rewriting around the missing dependency.

Core developer Thomas Wouters offers a measured take in the same recap: "There are cases where you need to think about free-threading, but for the most part it's not that big of a deal." That is encouraging for greenfield projects. It is less encouraging for an MEV bot that already depends on a half-dozen libraries with native code under the hood.

What This Means for the Bot in Front of Me

Figuring out where the GIL hurts me requires a more honest profile of the bot than I had been doing. The model in my head was "Python is slow." The actual picture is more like a relay race where some legs run on a six-lane track and others run on a single-lane sidewalk. The single-lane sidewalk is the part where Python bytecode is doing real CPU work — pricing, ranking, simulating — and that is where the GIL is silently capping me.

The options break down into four buckets, each with its own trade-offs.

Stay on threads where it works. I/O-bound stages — fetching, polling, submitting — already release the GIL during their waits. There is no benefit to migrating those to anything else. The threads I have are doing exactly what threads are supposed to do.

Move CPU-bound stages to multiprocessing. Real parallelism without rewriting in another language. The price is the inter-process communication tax: any data that crosses a process boundary has to be serialized, sent, and deserialized. For pipelines where the work per task is large compared to the data size, this is a clean win. For pipelines where the work is small and the data is huge, it can be a wash or worse.

Push the hot path into compiled code. Cython, Numba, native extensions, or just calling out to NumPy operations that already release the GIL internally. This is how scientific Python has always coped with the GIL — push the inner loop down into C or Fortran where the GIL does not exist, and let the Python layer orchestrate. For MEV math that is genuinely numeric, this often beats both threads and processes.

Accept that some hot paths belong in another language. Not every part of the bot needs to be in Python. The high-throughput, low-latency core — the part where every microsecond matters — can live in Rust or C++ and expose a thin interface that Python drives. This is the same pattern DeepMind described and the same one every serious HFT shop has converged on. The cost is a polyglot codebase. The benefit is leaving the GIL out of the critical path entirely.

None of these are exciting. All of them are real.
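The second bucket is the one that needs the least new tooling. Below is a sketch of a CPU-bound stage written against the generic concurrent.futures Executor interface, so the same function can run on threads in a test and on processes in production. All names here are hypothetical, and simulate_swap is a deliberately simplified, fee-free constant-product formula standing in for a real pricing step:

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def simulate_swap(candidate):
    """Hypothetical CPU-bound stage: output of a fee-free constant-product
    swap, given (reserve_in, reserve_out, amount_in)."""
    reserve_in, reserve_out, amount_in = candidate
    return reserve_out * amount_in / (reserve_in + amount_in)


def price_candidates(candidates, executor: Executor, chunksize: int = 64):
    """Price every candidate on the given executor, preserving order.

    chunksize batches items per task, which cuts pickling round trips on
    a process pool. Thread pools ignore it."""
    return list(executor.map(simulate_swap, candidates, chunksize=chunksize))


def price_in_threads(candidates, workers: int = 4):
    """Same stage on threads. Handy for tests, but it brings no parallel
    speedup for CPU-bound work on a GIL build."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return price_candidates(candidates, pool)


def price_in_processes(candidates, workers: int = 4):
    """Run the stage on separate processes, each with its own GIL."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return price_candidates(candidates, pool)
```

Calling price_in_processes from under an `if __name__ == "__main__":` guard matters on platforms that start workers by launching a fresh interpreter. Whether the process version wins depends on how much work each candidate carries relative to the cost of pickling it.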

Why I Am Not Switching Languages Tomorrow

There is a genuine temptation, especially after staring at a multi-hundred-times latency gap between Python and Rust, to just throw out the Python and start over. I am not doing that, and the reason is honest: the parts of my bot that are slow because of the GIL are not the parts that are blocking me from shipping. The parts that are blocking me from shipping are the parts where I am still figuring out what the bot should do. Those parts benefit enormously from a language with fast iteration loops, rich libraries, and a forgiving interactive environment. That language is Python.

When the bot's behavior is settled and the bottleneck genuinely becomes "Python cannot react fast enough," rewriting the hot path in something compiled is a reasonable engineering call. Doing it before then would be optimizing the wrong stage of the project. "Make it work, make it right, make it fast" is annoying advice precisely because it is correct. The GIL will still be there waiting when I am ready for it.

The one thing I am refusing to do, after spending a week of evenings chasing a phantom slowdown, is pretend the GIL does not exist. It exists. It sits in the middle of every Python thread I spawn. And until I either remove it (Python 3.14t, with caveats) or route around it (multiprocessing, native code, polyglot architecture), it is going to cap how fast my bot can react no matter how clever the algorithm is.

What This Changed for Me

Learning the GIL story changes two specific things about how I think about the bot.

First, I stop reaching for threading as the default answer to "this is too slow." For CPU-bound work, more threads in Python is not just unhelpful, it is often actively counterproductive. Real Python's benchmark (single thread 6.20s, two threads 6.92s) is the kind of result I have to internalize before the habit of suggesting more threads goes away.

Second, I stop treating Python's average latency as the number that matters. Python's worst-case latency is the one that determines whether I capture an opportunity or watch someone else take it. A pipeline that averages 30 milliseconds but spikes to 200 milliseconds three times a minute is functionally a pipeline that misses every opportunity that lands during a spike. The GIL and the garbage collector together make those spikes more frequent and less predictable than I want to admit.
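The difference between the two views shows up as soon as the samples are summarized both ways. This sketch uses invented numbers chosen to resemble the pipeline described above. The function name and the 150 ms budget are assumptions for illustration:

```python
import statistics


def latency_summary(samples_ms, budget_ms):
    """Summarize latencies: the mean, the 99th percentile, the worst case,
    and how many samples blew through the budget."""
    ordered = sorted(samples_ms)
    # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
    p99 = statistics.quantiles(ordered, n=100, method="inclusive")[-1]
    return {
        "mean": statistics.fmean(ordered),
        "p99": p99,
        "max": ordered[-1],
        "over_budget": sum(1 for s in ordered if s > budget_ms),
    }


if __name__ == "__main__":
    # Invented samples: 97 calm iterations at 25 ms and 3 spikes at 200 ms.
    samples = [25.0] * 97 + [200.0] * 3
    print(latency_summary(samples, budget_ms=150.0))
```

With these made-up samples the mean comes out at 30.25 ms while the 99th percentile and the maximum both sit at the 200 ms spike level. A dashboard that tracks only the mean would report a healthy pipeline.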

Neither realization makes me want to abandon Python. Both make me much more careful about where in the pipeline I let Python do the heavy lifting and where I push the work somewhere it can run without being serialized behind a thirty-year-old lock.

Key Takeaways

The GIL is a single mutex that lets only one thread execute Python bytecode at a time, originally added in 1992 to make reference counting safe on single-CPU machines.
CPU-bound Python multithreading often runs slower than single-threaded code — Real Python's benchmark shows two threads taking 6.92s versus a single thread at 6.20s for the same work.
For latency-sensitive workloads like MEV, the gap to compiled languages is large and well-documented — practitioners report Python tick-to-trade latencies in the millisecond range against microsecond-range Rust equivalents.
Free-threaded Python is real and shipping, with PEP 703 accepted, Python 3.13t available, and the GIL slated to become opt-in by Phase 3 of the roadmap (Python 3.17 to 3.18, 2028 to 2030).
The ecosystem is the bottleneck for adoption — only roughly one in six top PyPI extension packages currently supports free-threading, per the PSF Language Summit recap, so the practical answer for now is multiprocessing, native code in the hot path, or polyglot architecture.

Disclaimer

This article is for informational and educational purposes only and does not constitute financial, investment, legal, or professional advice. Content is produced independently and supported by advertising revenue. While we strive for accuracy, this article may contain unintentional errors or outdated information. Readers should independently verify all facts and data before making decisions. Company names and trademarks are referenced for analysis purposes under fair use principles. Always consult qualified professionals before making financial or legal decisions.