Rate Limiting Explained: The Bouncer Your API Deserves
Rate limiting controls how many requests a user can make within a given time period. When the limit is exceeded, further requests are rejected until the window resets or capacity is replenished. In this post, we break down the three most common algorithms, their trade-offs, and how companies like GitHub, Stripe, and Cloudflare use them in production.
Why Rate Limiting Exists
A single misbehaving client, whether a bad actor, a buggy script, or a bot, can send thousands of requests per second. Without any guardrails, that single client can consume the majority of your server's capacity, degrading or completely blocking service for every other user.
Rate limiting addresses this by enforcing a hard cap on how many requests a client can make within a defined time period. Requests that exceed the cap are rejected with an HTTP 429 Too Many Requests response. The server stays healthy, resources stay shared fairly, and abuse becomes significantly harder.
Beyond availability, rate limiting is also used for security (limiting login attempts to prevent brute-force attacks), cost control (preventing runaway clients from driving up infrastructure bills), fair use enforcement (ensuring no single user monopolizes a shared resource), and API monetization (gating higher request volumes behind paid tiers).
The Three Core Algorithms
There are three widely used approaches to rate limiting. Each solves the same fundamental problem but makes different trade-offs around complexity, memory usage, fairness, and burst tolerance.
1. Fixed Window
How it works
A fixed window rate limiter divides time into discrete, non-overlapping intervals, for example one window per hour. Each window has an associated counter. Every incoming request increments that counter. If the counter exceeds the defined limit, the request is rejected. When the window ends, the counter resets to zero and the cycle begins again.
Limit: 5 requests per hour
Window boundaries: every hour on the hour
[12:00 PM - 1:00 PM] counter = 0
Request at 12:01 counter = 1 -> ALLOWED
Request at 12:15 counter = 2 -> ALLOWED
Request at 12:30 counter = 3 -> ALLOWED
Request at 12:45 counter = 4 -> ALLOWED
Request at 12:59 counter = 5 -> ALLOWED
Request at 12:59 counter = 6 -> BLOCKED (limit exceeded)
[1:00 PM - 2:00 PM] counter resets to 0
Request at 1:01 counter = 1 -> ALLOWED
...and so on
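The walkthrough above can be expressed as a minimal in-memory sketch. This is illustrative rather than production-ready (no locking, no shared store), and the class and parameter names are my own:

```python
import time

class FixedWindowLimiter:
    """Minimal in-memory fixed window rate limiter (illustrative sketch)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock          # injectable for testing
        self.current_window = None
        self.count = 0

    def allow(self):
        # Identify the current window by flooring time to the window boundary.
        window = int(self.clock() // self.window_seconds)
        if window != self.current_window:
            # New window: the counter resets all at once.
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 5 per hour, the sixth call in the same window returns False, and the first call after the boundary returns True again, exactly as in the timeline above.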
The boundary burst problem
The most significant weakness of a fixed window is what happens at the seam between two windows. Because the counter resets hard at the boundary, a client can fire a full burst of requests near the end of one window and immediately fire another full burst at the start of the next, effectively doubling the allowed request rate for a brief period.
Limit: 5 requests per hour
Window A ends at 1:00 PM:
Request at 12:59:00 -> ALLOWED (counter: 1)
Request at 12:59:20 -> ALLOWED (counter: 2)
Request at 12:59:40 -> ALLOWED (counter: 3)
Request at 12:59:55 -> ALLOWED (counter: 4)
Request at 12:59:59 -> ALLOWED (counter: 5)
Window B starts at 1:00 PM:
Request at 1:00:01 -> ALLOWED (counter: 1)
Request at 1:00:10 -> ALLOWED (counter: 2)
Request at 1:00:20 -> ALLOWED (counter: 3)
Request at 1:00:30 -> ALLOWED (counter: 4)
Request at 1:00:40 -> ALLOWED (counter: 5)
Result: 10 requests allowed in under 2 minutes, despite a limit of 5 per hour.
This is a known and accepted trade-off for many use cases, but it is important to understand before choosing this algorithm.
Pros and Cons
- What works well: Fixed windows are simple to implement and reason about. They have a very low memory footprint and work naturally with key-value stores like Redis. Reset times are fixed and predictable, so users always know when their limit will lift.
- What to watch out for: The boundary burst vulnerability means a client can effectively access up to 2x the stated limit across a window boundary. The counter resets all at once rather than gradually, which offers no traffic smoothing. Timezone handling for daily windows is also notoriously error-prone.
Real-world example: GitHub API
GitHub's REST API uses a fixed window rate limiter. Authenticated users receive 5,000 requests per hour, with the window resetting at the top of each clock hour. GitHub surfaces this information in every API response via standard headers:
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1372700873 (Unix timestamp of next window reset)
2. Sliding Window
How it works
A sliding window rate limiter resolves the boundary burst problem by making the observation window move continuously with time. Rather than counting requests since the last fixed reset point, it counts requests within the last N seconds, measured backward from the current moment.
Limit: 5 requests per hour (last 3600 seconds)
Current time: 1:30 PM
Sliding window covers: [12:30 PM --> 1:30 PM]
Requests within that window: 3
Remaining capacity: 2
At 1:31 PM, the window shifts:
Sliding window covers: [12:31 PM --> 1:31 PM]
Any requests made before 12:31 PM drop off and free up capacity.
As the window slides forward, old requests naturally fall out of scope and capacity is freed incrementally rather than all at once. This produces a much smoother distribution of allowed traffic over time.
Exact vs. approximate implementation
The straightforward implementation stores a timestamp for every request. On each new request, it scans all stored timestamps, discards those outside the window, counts the rest, and decides whether to allow or block.
This is accurate but memory-intensive. Storing and scanning millions of timestamps for a high-traffic service is not practical.
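The exact, timestamp-per-request approach can be sketched as follows. This is an illustrative toy (names are my own) that makes the memory cost visible: one stored timestamp per allowed request.

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact sliding window: stores one timestamp per allowed request."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock           # injectable for testing
        self.timestamps = deque()    # grows with traffic -- the memory problem

    def allow(self):
        now = self.clock()
        # Evict timestamps that have slid out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```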
Most production systems use an approximated sliding window instead. The approximation blends the request counts from the current and previous fixed windows, weighted by how much of the previous window overlaps with the current sliding window:
approximation = (prev_window_count * prev_window_weight) + curr_window_count
where:
prev_window_weight = 1 - (elapsed_time_in_current_window / window_duration)
Example:
Window duration: 60 minutes
Previous window count: 80 requests
Elapsed in current: 30 minutes (50% of the window has passed)
prev_window_weight = 1 - 0.5 = 0.5
approximation = (80 * 0.5) + curr_window_count
= 40 + curr_window_count
This requires storing only two counters rather than millions of timestamps, while still providing a close approximation of true sliding window behavior.
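The two-counter approximation can be sketched like this. It is a simplified illustration of the weighted-blend formula above (class and variable names are my own, and it ignores concurrency):

```python
import time

class SlidingWindowCounter:
    """Approximated sliding window: two counters instead of a timestamp log."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock        # injectable for testing
        self.curr_window = None
        self.curr_count = 0
        self.prev_count = 0

    def allow(self):
        now = self.clock()
        window = int(now // self.window_seconds)
        if self.curr_window is None:
            self.curr_window = window
        if window > self.curr_window:
            # Roll forward: the old current window becomes the previous one.
            # If more than one full window has passed, the previous count is 0.
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        # Weight the previous window by how much of it still overlaps.
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        prev_weight = 1.0 - elapsed_fraction
        estimate = self.prev_count * prev_weight + self.curr_count
        if estimate < self.limit:
            self.curr_count += 1
            return True
        return False
```

Note how a client that exhausts the limit at the end of one window gets almost no capacity at the start of the next, because the previous window's count still carries nearly full weight: the boundary burst is gone.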
Pros and Cons
- What works well: Sliding windows have no boundary burst vulnerability. Capacity is replenished gradually, which produces smooth and fair traffic distribution. Clients with steady, consistent usage patterns are treated equitably. The approximated version is both memory-efficient and highly accurate.
- What to watch out for: The implementation is more complex than a fixed window. Because the window moves continuously, it is harder to tell users exactly when their capacity will reset, which can make debugging or communicating limits more difficult. The exact (non-approximated) version is impractical at scale due to its memory requirements.
Real-world example: Cloudflare
Cloudflare's configurable rate limiting uses an approximated sliding window. At Cloudflare's scale, handling trillions of requests, the efficiency of the approximation is critical. The approximation introduces a small margin of error but eliminates the boundary burst problem while keeping memory usage constant regardless of traffic volume.
3. Token Bucket
How it works
The token bucket algorithm is conceptually different from window-based approaches. Instead of counting requests against a time window, it models a bucket that is continuously refilled with tokens at a fixed rate. Each incoming request consumes one token. If the bucket has tokens available, the request is allowed. If the bucket is empty, the request is rejected until tokens are replenished.
Bucket capacity: 10 tokens (max burst size)
Refill rate: 2 tokens per second
t=0s tokens=10 Request arrives -> ALLOWED tokens=9
t=0s tokens=9 Request arrives -> ALLOWED tokens=8
t=0s tokens=8 Request arrives -> ALLOWED tokens=7
...7 more requests...
t=0s tokens=0 Request arrives -> BLOCKED (bucket empty)
t=0.5s tokens=1 (1 token refilled at 2/sec rate)
t=0.5s tokens=1 Request arrives -> ALLOWED tokens=0
t=1s tokens=2 (bucket refilling at 2 tokens/sec)
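A common way to implement this is lazy refill: rather than running a background timer, compute how many tokens accrued since the last request. A minimal sketch along those lines (names and structure are my own):

```python
import time

class TokenBucket:
    """Token bucket: capacity bounds bursts, refill_rate bounds the average rate."""

    def __init__(self, capacity, refill_rate, clock=time.time):
        self.capacity = capacity
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable for testing
        self.tokens = float(capacity)     # bucket starts full
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        # Lazily refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 10 and a refill rate of 2 tokens per second, this reproduces the timeline above: a burst of 10 is allowed at t=0, the 11th request is blocked, and half a second later one refilled token admits one more request.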
Two independent controls
The token bucket exposes two parameters that can be tuned independently, something window-based limiters cannot offer.
Bucket capacity determines the maximum burst size. A larger bucket means the system can absorb a sudden spike of requests before blocking kicks in.
Refill rate determines the long-term sustained throughput. A faster refill allows a higher average request rate over time.
This separation allows you to express nuanced constraints such as "Allow bursts of up to 500 requests, but sustain no more than 100 requests per second over time." A fixed window cannot express this distinction because its burst tolerance and average rate are the same number.
Pros and Cons
- What works well: Token buckets gracefully handle burst traffic by design. The two independent controls give you precise tuning over both burst size and average rate, which no window-based approach can match. Clients that space out their requests naturally accumulate tokens and are rewarded with burst capacity when they need it.
- What to watch out for: Token buckets are harder to communicate to users. Unlike a fixed window where the reset time is always known, clients using a token bucket cannot easily predict when the next token will be available or how many they currently hold. This makes surfacing useful error messages more challenging.
Real-world examples
Stripe uses a token bucket for its API. Each user is allocated a bucket with a capacity of 500 tokens and a refill rate of 100 tokens per second. This allows a client to fire a burst of up to 500 requests, useful during a checkout spike or a large batch operation, but the sustained rate is capped at 100 requests per second over time.
OpenAI (free tier) uses a token bucket with a capacity of 200 and a refill rate of approximately 1 token every 432 seconds. This limits the free tier to 200 requests per day, but tokens are replenished continuously throughout the day rather than all at once at midnight.
Side-by-Side Summary
Fixed Window is the simplest to implement, with very low memory usage and predictable reset times. Its main weakness is the boundary burst vulnerability, where a client can access up to 2x the stated limit across a window boundary. Best suited for simple APIs and predictable usage patterns.
Sliding Window eliminates the boundary burst problem and provides smooth, gradual capacity replenishment. The approximated version is efficient enough for high-traffic production systems. The trade-off is implementation complexity and less predictable reset times for users.
Token Bucket is the most expressive of the three, with separate controls for burst size and sustained rate. It handles bursty clients gracefully but is the hardest to communicate clearly to end users. Best suited for APIs where clients are expected to have uneven, spiky traffic.
Try the Visualizer
Reading about these algorithms is one thing. Watching them respond to the same stream of requests in real time makes the trade-offs immediately obvious.
The Rate Limit Visualizer lets you configure each algorithm with your own parameters and simulate a request stream. You can directly observe how the fixed window's counter resets all at once, how the sliding window replenishes capacity gradually, and how the token bucket handles a burst that would immediately breach a fixed window.
Try simulating the boundary burst scenario against a fixed window and then the same burst against a sliding window or token bucket. The difference is stark.
Implementation Considerations
Once you have chosen an algorithm, several practical considerations apply regardless of which approach you use.
Use a persistent external store
Rate limit counters must not live in application memory. In-memory state is lost when a server restarts, when a new instance is spun up, or when requests are distributed across multiple servers by a load balancer. Redis is the standard choice: it is fast, supports atomic increment operations, and has native key expiration that handles window cleanup automatically.
Fail open, not closed
If your rate limiter cannot reach its data store, for example if Redis becomes unavailable, the safe default is to allow requests through rather than block them. A rate limiter that fails closed takes your service offline along with it. A service that temporarily cannot enforce rate limits is far preferable to a service that is completely inaccessible.
Choose an appropriate key
Rate limiting is always applied per some identifier. User ID is most appropriate for authenticated APIs. API key is the standard for developer-facing APIs. IP address works for unauthenticated endpoints but is unreliable due to NAT, shared IPs, and VPNs. Device fingerprint or session ID is useful for unauthenticated web and mobile clients.
Surface meaningful error responses
When a request is rejected, return an HTTP 429 Too Many Requests response and include headers that tell the client exactly what happened and when they can retry:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600
Retry-After: 47
Without these headers, clients have no way to implement sensible back-off behavior and may retry immediately, compounding the problem the rate limiter is trying to solve.
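From the client's side, honoring these headers is straightforward. A small sketch of the back-off decision (the function name and the fallback default are my own; it handles the seconds form of Retry-After, not the HTTP-date form):

```python
import time

def retry_delay(status_code, headers, now=None):
    """Seconds a well-behaved client should wait before retrying, or 0."""
    if status_code != 429:
        return 0.0
    if "Retry-After" in headers:
        # Seconds form of Retry-After; the header may also be an HTTP date.
        return float(headers["Retry-After"])
    if "X-RateLimit-Reset" in headers:
        # Fall back to the Unix timestamp of the next reset.
        now = time.time() if now is None else now
        return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
    return 1.0  # no hint from the server: pick a conservative default
```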
Choosing the Right Algorithm
If simplicity and predictable reset timing are the priority, use a Fixed Window. Accept the boundary burst vulnerability as a known trade-off.
If fairness and smooth traffic distribution matter more, use an approximated Sliding Window. Accept that users will have less visibility into exactly when their capacity resets.
If clients have legitimately bursty usage patterns, use a Token Bucket. Accept that communicating remaining capacity to users will require more thought.
There is no universally correct answer. Many systems layer multiple algorithms, for example a token bucket at the per-user level for burst tolerance combined with a fixed window at the service level for cost control.
Wrapping Up
Rate limiting is one of those infrastructure decisions that is invisible when it is working correctly and immediately painful when it is absent. The three algorithms covered here each solve the core problem of controlling request rates, but with meaningfully different properties.
Fixed windows are easy to implement and easy to explain, but their hard reset boundary is a structural weakness. Sliding windows are fairer and smoother at the cost of some implementation complexity. Token buckets are the most expressive, offering independent control over burst tolerance and sustained throughput, but require more thought to surface clearly to end users.
Understanding these trade-offs puts you in a much stronger position to choose the right tool for the problem at hand, and to understand why the systems you work with behave the way they do.
Acknowledgments
This post was written alongside a purpose-built Rate Limit Visualizer to make these algorithms tangible rather than purely theoretical. The visualizer lets you configure and simulate each algorithm interactively, and it is worth spending a few minutes with it to develop a concrete intuition for how these systems behave under realistic traffic patterns.
The framing of this post was informed by the excellent write-up at smudge.ai/blog/ratelimit-algorithms, which approaches the same topic with its own interactive visualizations and is worth reading alongside this one. Additional technical references include Cloudflare's engineering blog post on counting at scale, Stripe's engineering blog on rate limiters, and the Upstash documentation on their sliding window implementation.
Have corrections, additions, or questions? Feel free to reach out.