API Rate Limiting

A comprehensive guide to protecting systems and sustaining performance.

API rate limiting protects systems from overload, promotes fair usage, and sustains performance during traffic spikes. For developer platforms, it builds trust by delivering predictable behavior—preventing one rogue client from disrupting everyone else.

This guide covers the essentials: what rate limiting entails, popular strategies with their pros and cons, implementation patterns in modern systems, and a polished Node.js example. You'll walk away equipped to design limits that balance protection with usability.

What Is API Rate Limiting?

API rate limiting caps the number of requests a client can send within a specific timeframe, often tied to identifiers like API keys, user IDs, IP addresses, or access tokens.

It achieves several key objectives:

- Protects backend services from overload and abuse.
- Promotes fair usage so no single client monopolizes capacity.
- Keeps performance predictable during traffic spikes.

Public APIs treat rate limits as a core contract with developers—clearly documented and consistently enforced.

Common Use Cases

Apply rate limiting to vulnerable or resource-intensive endpoints, for example:

- Authentication and login routes, to slow brute-force attempts.
- Search and query endpoints that fan out to databases.
- Bulk exports, report generation, and other expensive operations.
- Write-heavy endpoints such as uploads or messaging.

Rate Limiting Strategies

Choose based on your traffic patterns, scale, and fairness needs. Each tracks requests against time-based counters or queues.

Fixed Window

How it works: Tally requests in rigid intervals (e.g., 100 requests per 60-second window). At the boundary (e.g., :59 to :00), the counter resets to zero.

Pros: Dead simple to code and debug. Low memory and CPU overhead.

Cons: Burst risk: clients can send a full quota just before the reset and another just after it, allowing roughly double the intended rate around the boundary. Coarse fairness: a request at :59 is treated the same as one at :01, so enforcement near window edges is imprecise.

Best for: Internal tools or low-stakes APIs with steady traffic.
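
A minimal in-memory sketch of a fixed-window counter (the Map-based store and the isAllowedFixedWindow name are illustrative, not part of any particular library):

// Fixed window: one counter per client, reset at each window boundary.
const WINDOW_MS = 60 * 1000;
const LIMIT = 100;
const counters = new Map(); // clientId -> { windowStart, count }

function isAllowedFixedWindow(clientId, now = Date.now()) {
  const windowStart = Math.floor(now / WINDOW_MS) * WINDOW_MS;
  let entry = counters.get(clientId);
  if (!entry || entry.windowStart !== windowStart) {
    entry = { windowStart, count: 0 }; // new window: counter resets to zero
    counters.set(clientId, entry);
  }
  if (entry.count >= LIMIT) return false;
  entry.count += 1;
  return true;
}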

Sliding Window

How it works: Track timestamps of recent requests in a rolling window (e.g., last 60 seconds). Reject if the count exceeds the limit.

Pros: Even distribution with no hard window boundaries. Precise enforcement based on exact request timings.

Cons: Needs ordered storage of timestamps (e.g., Redis sorted sets), raising complexity and cost. Slower lookups for high-volume APIs.

Best for: Variable public traffic where fairness matters.
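
As a rough illustration, here is an in-memory sliding-window check keyed by client ID (a production version would typically use Redis sorted sets, as noted above):

// Sliding window: keep timestamps of recent requests, drop ones outside the window.
const WINDOW_MS = 60 * 1000;
const LIMIT = 100;
const requestLog = new Map(); // clientId -> array of request timestamps (ms)

function isAllowedSlidingWindow(clientId, now = Date.now()) {
  const timestamps = requestLog.get(clientId) || [];
  // Keep only requests that fall inside the rolling window.
  const recent = timestamps.filter((t) => now - t < WINDOW_MS);
  if (recent.length >= LIMIT) {
    requestLog.set(clientId, recent);
    return false;
  }
  recent.push(now);
  requestLog.set(clientId, recent);
  return true;
}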

Token Bucket

How it works: Maintain a per-client bucket of "tokens" that refills at a steady rate (e.g., 100 tokens per minute). Each request consumes one token; if the bucket is empty, the request waits or is rejected. Short bursts are absorbed by tokens that have accumulated up to the bucket's capacity.

Pros: Handles short bursts gracefully (e.g., page loads). Smooths the long-term average rate. Industry standard (used by AWS, Stripe).

Cons: Stateful: bucket state must be persisted across restarts and shared across instances (e.g., via Redis). Tuning the refill rate and bucket size is tricky.

Best for: Developer APIs needing burst tolerance.

Leaky Bucket

How it works: Queue requests and process at a fixed rate (like a bucket leaking at constant speed). Overflow drops requests.

Pros: Strong backend smoothing that never overwhelms downstream services. Predictable throughput.

Cons: No burst tolerance; clients face queuing or dropped requests during peaks. Queue management adds latency.

Best for: Strict SLAs or microservices with fixed capacity.
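
A simplified leaky-bucket sketch, modeled as a bounded queue drained at a fixed interval (illustrative only; a real implementation needs persistence and backpressure handling):

// Leaky bucket: requests join a bounded queue and are processed at a constant rate.
const QUEUE_CAPACITY = 100;
const DRAIN_INTERVAL_MS = 600; // ~100 requests per minute
const queue = [];

function enqueueRequest(handler) {
  if (queue.length >= QUEUE_CAPACITY) {
    return false; // bucket is full: overflow is dropped
  }
  queue.push(handler);
  return true;
}

// Leak at a constant rate, regardless of how fast requests arrive.
setInterval(() => {
  const handler = queue.shift();
  if (handler) handler();
}, DRAIN_INTERVAL_MS);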

Comparing Strategies

| Strategy       | Burst Handling | Complexity | Accuracy | Storage Needs              |
| -------------- | -------------- | ---------- | -------- | -------------------------- |
| Fixed Window   | Poor           | Low        | Medium   | Minimal                    |
| Sliding Window | Good           | Medium     | High     | High (timestamps)          |
| Token Bucket   | Very Good      | Medium     | High     | Moderate (per-client state) |
| Leaky Bucket   | Poor           | Medium     | High     | High (queue)               |

Node.js Example: Token Bucket with Express

This snippet uses proportional token refills for accuracy. It stores state in memory, which is fine for prototypes and single-instance servers; swap in Redis (or similar shared storage) to scale across instances.

const express = require('express');

const app = express();
app.use(express.json());
const PORT = 3000;

// Config: 100 tokens every 60s (1.67/sec)
const MAX_TOKENS = 100;
const REFILL_RATE = MAX_TOKENS / (60 * 1000); // tokens per ms
const buckets = new Map(); // clientId -> {tokens, lastRefill}

function rateLimiter(req, res, next) {
  // Identify the client by API key if present, otherwise fall back to IP
  const clientId = req.get('X-API-Key') || req.ip;
  const now = Date.now();
  
  let bucket = buckets.get(clientId);
  if (!bucket) {
    bucket = { tokens: MAX_TOKENS, lastRefill: now };
    buckets.set(clientId, bucket);
  }

  // Proportional refill
  const elapsedMs = now - bucket.lastRefill;
  const tokensToAdd = elapsedMs * REFILL_RATE;
  
  bucket.tokens = Math.min(MAX_TOKENS, bucket.tokens + tokensToAdd);
  bucket.lastRefill = now;

  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    // Seconds until the bucket refills to capacity from its current level
    const resetTime = Math.ceil((MAX_TOKENS - bucket.tokens) / REFILL_RATE / 1000);
    
    // Standard Headers
    res.set({
      'X-RateLimit-Limit': MAX_TOKENS,
      'X-RateLimit-Remaining': Math.floor(bucket.tokens),
      'X-RateLimit-Reset': resetTime,
    });
    
    next();
  } else {
    // Not enough tokens: tell the client how long to wait for the next one
    const retryAfterSec = Math.ceil((1 - bucket.tokens) / REFILL_RATE / 1000);
    res.set('Retry-After', String(retryAfterSec));
    res.status(429).json({
      error: 'Rate limit exceeded',
      message: 'Too many requests. Try again soon.',
    });
  }
}

// Apply globally to /api
app.use('/api', rateLimiter);

app.get('/api/data', (req, res) => {
  res.json({ message: 'Success! Data here.' });
});

app.listen(PORT, () => {
  console.log(`Server on http://localhost:${PORT}`);
});
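
To exercise the limiter, fire a few requests and watch the headers change (this assumes Node 18+ for the built-in fetch and the server above running locally):

// Quick manual test against the example server.
async function probe() {
  for (let i = 0; i < 3; i++) {
    const res = await fetch('http://localhost:3000/api/data');
    console.log(
      res.status,
      'remaining:', res.headers.get('x-ratelimit-remaining'),
      'reset:', res.headers.get('x-ratelimit-reset')
    );
  }
}

probe();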

Key Improvements in Code

- Proportional refill: tokens accrue continuously based on elapsed time instead of resetting on a fixed schedule.
- Per-client buckets keyed by API key, with the client IP as a fallback.
- Standard X-RateLimit-* headers on allowed responses and a Retry-After header on 429s.

Best Practices

Tailor to patterns: Use Token Bucket for bursts, Leaky Bucket for steady streams.

Standard Headers & UX: Always expose X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset. On 429 errors, include a Retry-After header.

Never silent-throttle: Always return a 429 status code with guidance.

Monitor & adapt: Track 429s and P99 latency. Perform A/B testing on limits to find the sweet spot.
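
As a rough sketch of that monitoring point, a small Express middleware can count 429 responses per route (building on the app from the example above); in practice you would forward these counts to your metrics system rather than keep them in memory:

// Minimal sketch: count 429s per route in memory so spikes are visible.
const rateLimitHits = new Map(); // route -> number of 429 responses

function rateLimitMetrics(req, res, next) {
  res.on('finish', () => {
    if (res.statusCode === 429) {
      const count = (rateLimitHits.get(req.path) || 0) + 1;
      rateLimitHits.set(req.path, count);
    }
  });
  next();
}

// Register before the rate limiter so every response is observed.
app.use(rateLimitMetrics);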