Circuit Breakers: The Safety Valve That Keeps a Trading Bot From Setting Itself on Fire

When the Bot Is Faster Than I Am

There's a specific kind of dread that hits when you watch a bot you wrote do something stupid in a loop. Each iteration takes a few hundred milliseconds. By the time you've registered the pattern, alt-tabbed to the terminal, and typed the kill command, it has already done the stupid thing a few dozen more times. If each iteration costs even pennies, you're suddenly out a real amount of money — not because the strategy was bad, but because nobody told the bot to stop.

This is the problem a circuit breaker solves. The name is borrowed straight out of the breaker panel in your basement: a switch that flips itself when too much current flows, before the wiring catches fire. In software, it's the same idea — a wrapper around a risky operation that stops calling that operation once failures pile up faster than they should. I'm wiring one into my MEV bot now, and the deeper I go, the more I realize the easy part is writing the state machine. The hard part is deciding which failures should trip it.

The Pattern Has a Lineage, and It's Older Than Crypto

Before I wrote a single line of breaker code, I went looking for the canonical reference. The pattern was systematized by Michael T. Nygard in his 2007 book Release It!, and it was popularized on the wider web by Martin Fowler, whose bliki entry is still the cleanest one-page explanation I've found.

Fowler describes it like this: "You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all." That sentence is doing a lot of work. The protected call doesn't run. The caller gets an error fast. The downstream service — whatever was failing — gets a break.

That last part is the non-obvious win. A circuit breaker isn't only protecting me. It's also protecting the thing I'm hammering. If an RPC endpoint is staggering under load, the worst thing my bot can do is keep firing requests at it like a Black Friday shopper at a Walmart entrance. Backing off lets the service recover. Recovery is faster when you stop pushing.

Three States, Borrowed Straight From the Breaker Panel

Every serious implementation of this pattern uses the same three-state machine. The state vocabulary below is the canonical one from a well-documented architecture guide on the pattern.

CLOSED — current flowing, requests passing

In the CLOSED state, requests go through to the protected operation as normal. A failure counter ticks up each time something goes wrong, and that counter typically resets on a periodic time interval so intermittent blips don't slowly accumulate toward the trip threshold. This is the default, the boring state, the one where everything is fine.

The analogy back to the breaker panel works perfectly: the switch is closed, current flows. Nobody thinks about the breaker when nothing's wrong. That's the point.

OPEN — switch tripped, requests fail fast

When the failure counter exceeds a threshold within the time window, the breaker flips to OPEN. In this state the request from the application fails immediately and an error is returned without the protected call being made. A timer starts running.

The key word is immediately. The whole point of OPEN state is that you stop spending time and resources on a thing that's not working. If your RPC provider is timing out, you don't want to wait the full timeout on every call — you want to know within microseconds that you can't proceed, so you can decide what to do next. Fail fast is a feature, not a bug.

HALF-OPEN — testing the waters

When the OPEN timer expires, the breaker transitions to HALF-OPEN. A limited number of test requests are allowed through. If they succeed, the breaker assumes the underlying problem is fixed and returns to CLOSED. If even one fails, it slams back to OPEN and restarts the timer.

This intermediate state exists for a specific reason: a recovering service can usually handle a trickle of requests well before it can handle its full normal load. Slam it with the full firehose the moment its OPEN timer expires and you knock it right back over.

Think of HALF-OPEN as the moment a freeway reopens after an accident: the highway patrol lets a trickle of cars through first to see if traffic flows, before opening all lanes. Skipping HALF-OPEN and jumping straight from OPEN to CLOSED is how you cause a second pileup right after the first one clears.
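
To make the three states concrete for myself, here is a minimal single-threaded sketch in Python. It is not any library's implementation. The names CircuitBreaker, CircuitOpenError, and half_open_probes are mine, the default numbers are placeholders, and it counts consecutive failures (resetting on any success) rather than resetting on a timer, which is a simplification of the windowed counter described above.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without running the protected function."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_probes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_probes = half_open_probes
        self.clock = clock  # injectable so tests do not have to sleep
        self.state = State.CLOSED
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Fail fast: the protected call is never made.
                raise CircuitOpenError("circuit open")
            # Timer expired: start letting probe requests through.
            self.state = State.HALF_OPEN
            self.probe_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            if self.state is State.HALF_OPEN:
                self._trip()  # a single failed probe reopens the circuit
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self._trip()
            raise
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_probes:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # any success breaks the failure streak
        return result
```

Usage is just breaker.call(fetch_quote, pair), where fetch_quote stands in for whatever risky function you are wrapping. The caller has to be ready for CircuitOpenError as well as the function's own exceptions. A real multi-threaded bot would also need a lock around the state changes, which I left out to keep the state machine readable.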

The Subtle Trap: Not Every Failure Is the Same Kind of Failure

This is where I lost a couple of days, and it's the lesson I most want to write down for myself.

Fowler buries the most important sentence in the whole pattern as an aside: "not all errors should trip the circuit, some should reflect normal failures and be dealt with as part of regular logic." The same principle shows up in every architecture guide on the pattern: a request might fail because a remote service crashed and needs minutes to recover, or because an overloaded service timed out, or because the input was wrong. These are categorically different failures. Treating them the same is how you build a breaker that flips constantly for the wrong reasons.

In my case, the lightbulb moment came from a class of error that kept blowing the breaker even though nothing was actually broken: transaction build failures. A TX build can fail for completely local reasons — the parameters I'm passing are wrong, the instruction is malformed, an account I expected to be initialized isn't, or my own logic produced a swap path that doesn't exist. None of these are infrastructure failures. The RPC is up. The network is fine. The exchange-equivalent on chain — the DEX program — is happily serving everyone else. The failure is me, and the right fix is to debug my code.

If I count those failures toward the breaker, here's what happens: I push a buggy build, the breaker counts a string of TX build failures as evidence that "the system" is broken, trips OPEN, and now the bot is sitting in time-out for a problem that has no recovery path because the problem is in my code. The breaker is supposed to give the downstream service a break to recover. There's no downstream service to recover from a logic bug in my function.

So I excluded TX build failures from the breaker entirely. They get logged loudly, they alert me, but they don't move the failure counter. The breaker counts only failures that look like "the world outside my process is misbehaving" — RPC timeouts, connection resets, 503s from external services, things that benefit from backing off.

The heuristic I ended up with: if the right response is "wait and retry," count it. If the right response is "fix my code," don't.
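
In code, that heuristic can be as small as one predicate sitting between the exception and the failure counter. This is a sketch under assumptions: TxBuildError and HttpStatusError are hypothetical exception types standing in for whatever your own client raises, and the set of status codes is my guess at what "gateway errors" should cover.

```python
class TxBuildError(Exception):
    """Hypothetical: the bot could not build a transaction locally."""


class HttpStatusError(Exception):
    """Hypothetical: an HTTP client error that carries the status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


# Assumed set of "the server is struggling" responses.
BACKOFF_STATUSES = {502, 503, 504}


def counts_toward_breaker(exc):
    """Return True only if waiting and retrying could plausibly help."""
    if isinstance(exc, TxBuildError):
        return False  # my bug: log and alert, but never trip
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True  # resets, refusals, timeouts
    if isinstance(exc, HttpStatusError):
        return exc.status in BACKOFF_STATUSES
    return False  # unknown errors get logged loudly, not counted
```

Defaulting unknown errors to "don't count" is a deliberate choice here: a new kind of local bug should page me, not put the bot in time-out.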

Concrete Numbers, And Why They Vary So Much

When I was sketching out parameters, I went looking for what real production libraries default to. The spread is wider than I expected.

A popular Python library ships with a failure_threshold of 5 consecutive failures and a recovery_timeout of 30 seconds. That's the kind of default a hobby script wants — small enough to react quickly, generous enough not to flap.

A widely deployed Java circuit breaker library, by contrast, ships much more conservative defaults: a 50% failure rate threshold, a 60-second wait in OPEN, 10 permitted calls in HALF-OPEN, a sliding window of 100 calls, and a minimum of 100 calls before the rate is even computed. Per its official documentation, those numbers are tuned for high-volume backend services where a 50% failure rate is a meaningful signal and a single 503 on a quiet morning isn't.

Fowler's own pedagogical example in Ruby uses an invocation_timeout of 0.01 seconds, a failure_threshold of 5, and a reset_timeout of 0.1 seconds — numbers chosen to make a unit test run quickly, not for production.

The pattern is the same across every implementation, but the dials are completely different depending on traffic shape. A bot making a few requests per second wants tight, sensitive thresholds. A service handling thousands of requests per second wants statistical thresholds ("50% over the last 100 calls") because individual failures are noise. Picking the right defaults requires knowing what "normal" looks like for your specific traffic, which means you can't just copy a number off a blog post — you have to measure first.
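
The difference between "N failures in a row" and "X% over the last N calls" is easier to see in code. Here is a rough sketch of the statistical kind, with a hypothetical FailureRateWindow class. The parameter names are mine, not any library's, and the defaults just echo the numbers quoted above.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the failure rate over the most recent `window_size` calls."""

    def __init__(self, window_size=100, minimum_calls=100, rate_threshold=0.5):
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window_size)
        self.minimum_calls = minimum_calls
        self.rate_threshold = rate_threshold

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Too few samples: a single failure would otherwise be a 100% rate.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.rate_threshold
```

The minimum_calls guard is the part that matters for a low-traffic bot: without it, one 503 on a quiet morning is a 100% failure rate.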

What Trading Bots Add On Top

The classic circuit breaker pattern was designed for calls between networked services: protect availability. Trading bots want the same pattern but for a different goal: protect capital. So the trip conditions look different.

A trading-bot-style breaker watches things the original pattern never considered:

Drawdown. A common 3-stage pattern, described by a well-known trading risk management blog, reduces position size by 50% at 10% drawdown, halts new positions at 3% daily loss or 5 consecutive losses, and force-liquidates everything at 20% drawdown. These numbers are illustrative — every trader picks their own — but the structure (graduated response: reduce, then pause, then halt) is the right shape.
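Consecutive losses. Five in a row is a common trigger, partly because it's a clean number and partly because a run of five independent losses is rare enough to be a real signal when the strategy's win rate is reasonably high. At a 60% win rate it happens about 1% of the time (0.4^5 ≈ 0.01). At a 30% win rate it happens about 17% of the time (0.7^5 ≈ 0.17), so the right count depends on the strategy's win rate, not only on whether its expected value is positive.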
Daily loss caps. Capping the day at, say, 3% of capital prevents a single bad day from compounding into a catastrophic week.
Position size limits. Capping any one position at 1% of total capital is a standard recommendation for the same reason a casino caps table stakes — concentration kills.
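
The graduated structure in the list above (reduce, then pause, then halt) maps onto a small decision function. This sketch reuses the illustrative numbers from the drawdown item and is not a recommendation. RiskAction, drawdown, and risk_action are names I made up, and all inputs are fractions of capital expressed as positive numbers.

```python
from enum import Enum


class RiskAction(Enum):
    NORMAL = "normal"
    REDUCE_SIZE = "reduce position size by half"
    HALT_NEW_POSITIONS = "halt new positions"
    LIQUIDATE = "force-liquidate everything"


def drawdown(peak_equity, current_equity):
    """Fractional decline from the equity peak, e.g. 0.2 for a 20% drawdown."""
    return (peak_equity - current_equity) / peak_equity


def risk_action(dd, daily_loss, consecutive_losses):
    # Check the most severe condition first so it always wins.
    if dd >= 0.20:
        return RiskAction.LIQUIDATE
    if daily_loss >= 0.03 or consecutive_losses >= 5:
        return RiskAction.HALT_NEW_POSITIONS
    if dd >= 0.10:
        return RiskAction.REDUCE_SIZE
    return RiskAction.NORMAL
```

Ordering the checks from most to least severe means a bot that is both 12% down and past its daily cap halts new positions instead of merely trading smaller.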

The math here matters more than people give it credit for. A 50% drawdown requires a 100% gain to recover. A 75% drawdown requires a 300% gain. The asymmetry is brutal: every percentage point you lose costs you more than a percentage point to win back, and the curve gets steeper the deeper you go. That asymmetry is the entire reason capital-protecting breakers exist. The cost of a false trip (you sit out a few good trades) is roughly linear. The cost of not tripping when you should is worse than linear: the gain needed to recover from a drawdown d is d / (1 - d), which grows without bound as d approaches 100%.
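
The recovery arithmetic is short enough to check directly. If you lose a fraction d of your capital, you are left with 1 - d, and getting back to 1 takes a gain of d / (1 - d). The function name required_gain is mine.

```python
def required_gain(drawdown):
    """Fractional gain needed to get back to the pre-drawdown equity."""
    if not 0.0 <= drawdown < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return drawdown / (1.0 - drawdown)


for d in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"{d:.0%} drawdown needs a {required_gain(d):.0%} gain")
# The 50% row prints a 100% gain and the 75% row prints a 300% gain.
```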

Manual Reset vs. Automatic Recovery

The original microservices pattern auto-recovers: HALF-OPEN tests the waters, success transitions to CLOSED, and the bot keeps running with no human in the loop. That's the right behavior for a service that's just trying to stay available.

For trading bots, a number of practitioners prefer manual reset after a capital-related trip. The argument from one risk-management writeup puts it bluntly: "When a circuit breaker triggers, the bot pauses and requires manual reset, ensuring the trader reviews what went wrong before resuming." The reasoning is that if your bot just lost 3% of your capital in a day, the right next step is for a human to look at why before flipping the switch back on. Auto-resuming a strategy that just bled out is how you turn a 3% day into a 6% day.

My take, after sitting with this for a while: split the breakers by category. Infrastructure-style breakers (RPC down, network errors) auto-recover via HALF-OPEN, because the failure mode there is transient and a human review adds nothing. Capital-style breakers (drawdown, consecutive losses) require a manual reset, because the failure mode there demands a human asking "is the strategy still valid?" before resuming. One bot, two breaker philosophies, depending on what they're protecting.
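
The capital-style half of that split is almost simpler than the infrastructure half, because there is no timer and no HALF-OPEN state at all. A rough sketch, with a hypothetical CapitalBreaker class and placeholder limits. P&L is passed in as a fraction of capital, negative for a loss.

```python
class CapitalBreaker:
    """Trips on a daily loss cap or a losing streak; only a human resets it."""

    def __init__(self, max_daily_loss=0.03, max_consecutive_losses=5):
        self.max_daily_loss = max_daily_loss
        self.max_consecutive_losses = max_consecutive_losses
        self.daily_pnl = 0.0
        self.consecutive_losses = 0
        self.trip_reason = None  # None means "not tripped"

    @property
    def tripped(self):
        return self.trip_reason is not None

    def record_trade(self, pnl_fraction):
        self.daily_pnl += pnl_fraction
        if pnl_fraction < 0:
            self.consecutive_losses += 1
        else:
            self.consecutive_losses = 0
        if self.tripped:
            return  # keep the first reason for the post-mortem
        if self.daily_pnl <= -self.max_daily_loss:
            self.trip_reason = "daily loss cap"
        elif self.consecutive_losses >= self.max_consecutive_losses:
            self.trip_reason = "consecutive losses"

    def may_open_position(self):
        return not self.tripped

    def manual_reset(self, reviewed_by):
        # Assumption: a reset grants a fresh budget, so both counters clear.
        if not reviewed_by:
            raise ValueError("a reset needs a named reviewer")
        self.trip_reason = None
        self.daily_pnl = 0.0
        self.consecutive_losses = 0
```

Requiring a reviewed_by argument is a small nudge, not a security control. It makes the reset something I have to type on purpose.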

Kill Switch vs. Circuit Breaker — They're Not the Same Thing

A related concept that gets confused with circuit breakers is the kill switch. They're both stop-the-bot mechanisms, but they're different tools.

A circuit breaker is a graduated, automatic response to measured problems: failure rate is too high, drawdown crossed a threshold, consecutive losses piled up. It usually has a recovery path. It runs continuously in the background.

A kill switch is an emergency stop. It's manual, it's all-or-nothing, and the recovery path is "a human starts the bot back up after looking at what happened." You hit the kill switch when you see something the breaker can't see — a security concern, a market event with no historical analog, a sudden suspicion that your code has a bug.

A serious bot has both. The circuit breaker is the seat belt that engages on its own in a crash. The kill switch is the brake pedal you stomp when you see the deer in the road. Different problems, different tools.

The Macro Analogy: NYSE Has a Circuit Breaker Too

One thing I find clarifying: this pattern isn't unique to software at all. The New York Stock Exchange runs a circuit breaker on the entire market. The thresholds are public and are measured against the previous day's close: a 7% drop in the S&P 500 triggers a 15-minute halt (Level 1), a 13% drop triggers another 15-minute halt (Level 2), and a 20% drop halts trading for the rest of the day (Level 3).

The motivation is the same as the software version: when a system is in distress, hammering it harder makes things worse. Algorithmic feedback loops were a major contributor to the 2010 Flash Crash. Without breakers, automated participants amplify each other's errors faster than humans can react. The fix the market reached for has the same shape as the one Nygard described for software in 2007: stop everything, give the system a moment, then let it resume gradually.

When I think about my own bot's breaker, I find it useful to remember it's the same pattern operating at a smaller scale. The S&P 500 has a 20% halt threshold. My bot has a daily loss cap. Same shape, six orders of magnitude apart in dollar terms.

Adaptive Breakers — The Next Wave

For completeness, there's an emerging variant worth mentioning. Hardcoded thresholds — failure count, time-out duration — are deterministic but often suboptimal: too tight (false trips during normal volatility) or too loose (real problems get through). The newer direction in the discipline is adaptive breakers that adjust thresholds based on real-time traffic patterns, anomalies, and historical failure rates.

The idea is that an adaptive breaker tunes itself based on what it has been seeing — a 50% failure rate is normal at 3am on a holiday, but a real problem at 2pm on a Tuesday. I'm not building one of these. For a hobby-scale bot, hardcoded thresholds are fine, and the complexity of an adaptive layer would just give me more places to have bugs. But it's good to know the direction the discipline is moving.

What I'm Wiring Into the Bot

Where I've landed for now, in plain English (no specific values, because every bot's right values are different):

Infrastructure breaker, auto-recover. Counts only failures that look like the outside world is misbehaving: connection errors, RPC timeouts, gateway errors. Excludes TX build failures, swap simulation rejections, and other locally-caused errors. Trips when enough of those failures land inside a sliding window, sleeps for a tunable cooldown, then runs HALF-OPEN tests with a small number of probe requests before closing again.
Capital breaker, manual reset. Watches drawdown and consecutive trade losses. Doesn't count fee-only losses or expected losses below a noise floor. When it trips, the bot stops opening new positions and pages me. I look at the logs and decide whether to reset.
Kill switch, always available. A single command flips a flag the main loop checks every iteration. No state machine, no recovery, just stop. For when I see something the breakers don't.
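
Of the three, the kill switch is the one that fits in a dozen lines. One way to do "a single command flips a flag" is a sentinel file that the loop checks before every iteration, so the command is just creating that file. The file name and the run_loop helper are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical location; anything the bot can stat cheaply will do.
DEFAULT_KILL_FILE = Path(tempfile.gettempdir()) / "mev-bot.STOP"


def run_loop(step, kill_file=DEFAULT_KILL_FILE, max_iterations=1_000_000):
    """Run `step` repeatedly; stop as soon as the kill file exists.

    Returns the number of iterations that actually ran.
    """
    completed = 0
    for _ in range(max_iterations):
        if kill_file.exists():  # checked every iteration, no state machine
            break
        step()
        completed += 1
    return completed
```

Restarting means a human deletes the file and starts the process again, which is exactly the recovery path a kill switch is supposed to have.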

The boring revelation is that the wiring is the easy 20% and the classification is the hard 80%. "What counts as a failure for this breaker?" is a question I keep finding new edge cases for. A simulation that returns zero profit isn't a failure. A simulation that errors with InsufficientFunds is a failure of my path-finding logic, not infrastructure. A simulation that errors with ConnectionRefused is exactly the kind of thing the breaker is supposed to catch. Sorting these correctly takes way more thought than picking a threshold number.
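
Writing the three simulation cases down as a function forced me to admit there is a fourth: errors I haven't classified yet. This is a sketch with hypothetical names (SimResult, Outcome, classify) and it assumes the simulator reports errors as short string codes like the two mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    NOT_A_FAILURE = "no action"
    LOCAL_BUG = "alert me, do not count"
    INFRASTRUCTURE = "count toward the breaker"
    UNCLASSIFIED = "log loudly and review"


@dataclass
class SimResult:
    profit: float = 0.0
    error: Optional[str] = None


# Assumed error codes; extend these sets as new edge cases turn up.
LOCAL_ERRORS = {"InsufficientFunds"}
INFRA_ERRORS = {"ConnectionRefused"}


def classify(result):
    if result.error is None:
        # Zero or negative profit is a business outcome, not a failure.
        return Outcome.NOT_A_FAILURE
    if result.error in INFRA_ERRORS:
        return Outcome.INFRASTRUCTURE
    if result.error in LOCAL_ERRORS:
        return Outcome.LOCAL_BUG
    return Outcome.UNCLASSIFIED
```

Only Outcome.INFRASTRUCTURE should ever reach the breaker's failure counter. Everything in UNCLASSIFIED is a to-do item for me, and each one I sort moves into one of the two sets.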

What This Looks Like From the Inside

I keep coming back to the freeway analogy. A circuit breaker isn't a wall — it's a flagger. When the road ahead is on fire, the flagger holds you in place. When the fire crew clears it, the flagger lets a few cars through to make sure the surface is intact. When those cars make it through, the flagger waves everyone forward.

The pattern works because it makes the act of stopping a normal, automatic thing instead of a heroic intervention. By the time I'm yelling at my screen, I've already lost. The whole point of the breaker is to handle the boring, predictable failure modes — RPC blips, exchange downtime, my own bad streaks — without me being awake. What's left for me is the genuinely unusual stuff that no automated rule can catch.

That's the bargain: write the breaker correctly, and you buy yourself the right to sleep.

Key Takeaways

The circuit breaker pattern was systematized by Nygard (2007) and popularized by Fowler. It's a three-state machine: CLOSED (requests pass), OPEN (requests fail fast), HALF-OPEN (limited probe requests test recovery).
Not every failure should trip the breaker. Infrastructure failures (timeouts, connection errors) should count. Logical failures (bad input, malformed transactions) should not — they need code fixes, not cooldowns.
Defaults vary wildly by traffic shape. Hobby-scale libraries default to a handful of consecutive failures and a short cooldown; high-volume production libraries default to statistical thresholds over hundreds of calls. Both are right for their use case.
Trading bots extend the pattern to protect capital, not availability. Drawdown thresholds, daily loss caps, and consecutive-loss counts are all just specialized failure counters wired to the same state machine.
Manual reset for capital-related trips, automatic recovery for infrastructure trips. Different failure modes warrant different recovery philosophies. One bot, two breaker behaviors.
