When Chat Guard hit 1,000 servers, we thought that was a milestone. When it hit 7,500 servers with over 15 million users across all of them, the team realized we were just getting started - and that the original architecture needed a complete rethink.
The Naive Architecture
The first version of Chat Guard was a single Node.js process. One index.js file, one database connection, one event loop handling everything. It worked brilliantly for 50 servers. At 500, it started showing cracks. At 1,000, it was held together by optimism and setInterval.
Here's what the early architecture looked like:
Single Node.js Process
├── Discord.js Client (one shard)
├── MongoDB Connection (single)
├── Command Handler
├── Event Listeners (message, join, leave)
└── Moderation Logic
The problems were predictable: memory leaks from event listeners that were never removed, database connection timeouts under load, and the occasional complete crash when Discord's gateway hiccuped.
Sharding: The First Real Architecture Decision
Discord forces sharding once your bot exceeds 2,500 servers. But the smart move is to shard earlier. Each shard maintains its own WebSocket connection to Discord's gateway, handling a subset of your total servers.
We implemented sharding at around 800 servers - before Discord required it - and it was one of the best architectural decisions made early on. Each shard runs as an independent process, isolated from the others. If one crashes, the rest keep running.
Shard Manager
├── Shard 0 (servers 0-999)
├── Shard 1 (servers 1000-1999)
├── Shard 2 (servers 2000-2999)
└── ...
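The contiguous ranges in the diagram are a simplification. Discord actually routes each guild to a shard with a documented formula: `shard_id = (guild_id >> 22) % num_shards`. Since guild IDs are 64-bit snowflakes, a Node implementation needs BigInt:

```javascript
// Discord's documented shard routing formula:
//   shard_id = (guild_id >> 22) % num_shards
// Guild IDs are 64-bit snowflakes, too large for a plain Number,
// so the shift is done with BigInt.
function shardForGuild(guildId, numShards) {
  return Number((BigInt(guildId) >> 22n) % BigInt(numShards));
}
```

A useful property of this scheme: all events for a given guild always land on the same shard, so per-guild state never has to be coordinated across processes.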
The Database Layer
MongoDB was the right choice for this use case, and I'll defend that position. Discord bot data is inherently document-shaped: each server has its own configuration, its own log entries, its own moderation history. The schema varies per server because admins customize their setups differently.
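To make "document-shaped" concrete, here is a sketch of what a per-server document might look like (field names are illustrative, not Chat Guard's actual schema):

```json
{
  "guildId": "197038439483310086",
  "prefix": "!",
  "features": ["antispam", "wordfilter"],
  "wordfilter": {
    "blockedTerms": ["..."],
    "action": "delete"
  },
  "modLogChannelId": "255515049286643712"
}
```

One server might have a deeply nested `wordfilter` config while another has none at all, which is exactly the kind of per-document variance a rigid relational schema fights against.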
But MongoDB at scale requires discipline:
- Index everything you query. This sounds obvious until you're debugging a query that takes 8 seconds because you forgot to index a compound field.
- Use connection pooling aggressively. Each shard needs its own pool, not a shared connection.
- Batch writes. Don't write to the database on every message event. Buffer, batch, flush. I learned this the hard way when our write throughput hit MongoDB's limits.
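The batching pattern above can be sketched as a small write buffer. This is an illustrative version, not Chat Guard's actual implementation: `flushFn` stands in for a MongoDB bulk operation such as `collection.insertMany`, and the size/interval thresholds are made-up defaults.

```javascript
// Buffer log entries in memory and flush them in bulk, either when the
// buffer fills up or on a timer - instead of one insert per message event.
class WriteBuffer {
  constructor(flushFn, { maxSize = 500, intervalMs = 2000 } = {}) {
    this.flushFn = flushFn; // e.g. batch => collection.insertMany(batch)
    this.maxSize = maxSize;
    this.buffer = [];
    this.timer = setInterval(() => this.flush(), intervalMs);
  }

  push(doc) {
    this.buffer.push(doc);
    if (this.buffer.length >= this.maxSize) this.flush();
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.flushFn(batch); // one bulk write instead of batch.length single writes
  }

  stop() {
    clearInterval(this.timer);
    this.flush(); // drain whatever is left on shutdown
  }
}
```

The flush-on-shutdown call in `stop()` matters: without it, a deploy or crash loses everything still sitting in the buffer.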
I wrote more about the database patterns specifically in MongoDB Patterns I Learned from 7,500 Discord Servers.
Rate Limiting: Respecting Discord's API
Discord's rate limits are strict, and for good reason. Every API call - banning a user, sending a message, modifying a channel - has a rate limit. Exceed one, and your requests start failing with 429 responses. Exceed them repeatedly, and your bot risks being temporarily banned from the API.
The team built a queue system using an in-memory priority queue with exponential backoff. Every API call goes through the queue. The queue respects per-route rate limits, per-guild rate limits, and global rate limits. It sounds like overengineering until you realize the alternative is your bot randomly failing to moderate because it hit a rate limit mid-operation.
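Here is a deliberately simplified sketch of the backoff half of that design. It serializes calls through a single queue and retries on a simulated 429; a production version would additionally keep one queue per rate-limit bucket and read the retry delay from Discord's response headers. The delay constants are illustrative, not Discord's.

```javascript
// Exponential backoff: 250ms, 500ms, 1s, ... capped at 10s (illustrative values).
function backoffDelay(attempt, baseDelayMs = 250) {
  return Math.min(baseDelayMs * 2 ** attempt, 10_000);
}

// All API calls run strictly one at a time through a promise chain.
// A real implementation would hold one such queue per rate-limit route.
class ApiQueue {
  constructor({ maxRetries = 5 } = {}) {
    this.maxRetries = maxRetries;
    this.chain = Promise.resolve();
  }

  enqueue(call) {
    const run = async () => {
      for (let attempt = 0; ; attempt++) {
        try {
          return await call();
        } catch (err) {
          // Retry only on rate-limit errors, up to maxRetries
          if (err.status !== 429 || attempt >= this.maxRetries) throw err;
          await new Promise(r => setTimeout(r, backoffDelay(attempt)));
        }
      }
    };
    const result = this.chain.then(run, run);
    this.chain = result.catch(() => {}); // keep the chain alive after failures
    return result;
  }
}
```

Routing every call through a queue like this also gives you one place to add metrics: queue depth is a leading indicator that you are approaching a limit.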
Event Processing Pipeline
At 7,500+ servers, Chat Guard processes thousands of events per second. Every message, every join, every leave, every reaction - they all flow through the event pipeline.
The pipeline is straightforward:
- Receive event from Discord gateway
- Filter - is this server using this feature?
- Validate - does the bot have the required permissions?
- Process - execute the moderation logic
- Log - write to the audit log (batched)
The critical insight was step 2. Most events can be discarded immediately because the server doesn't have the relevant feature enabled. This early-exit pattern reduced processing load by roughly 70%.
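The early-exit step can be sketched in a few lines. The config lookup here is a plain Map standing in for a cache in front of the database; event shape and feature names are hypothetical.

```javascript
// guildId -> { features: Set<string> } - in production this would be a
// cache layered over the per-server config collection.
const serverConfig = new Map();

// Returns true only if the event made it past the cheap feature-flag check
// (step 2) and into the expensive validate/process/log stages (steps 3-5).
function handleEvent(event, process) {
  const config = serverConfig.get(event.guildId);
  if (!config || !config.features.has(event.feature)) return false; // early exit
  process(event);
  return true;
}
```

The point is ordering: the feature-flag check costs one map lookup, while permission validation and moderation logic cost API calls and database reads. Putting the cheapest check first is what makes discarding ~70% of events nearly free.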
What I'd Do Differently
If the team were building Chat Guard from scratch today:
- TypeScript from day one. The refactoring pain of adding types to a large JavaScript codebase is real.
- Redis for caching. We relied too heavily on in-memory caches that reset on restart.
- Better observability. We added structured logging too late. For the first year, debugging production issues meant reading through console.log statements.
The Numbers
- 7,500+ Discord servers
- 15M+ end users reached
- 99.7% uptime over the last year
- <100ms average command response time
These numbers didn't happen by accident. They happened because every scaling problem forced a better architectural decision.
Building a Discord bot? Start with sharding early and batch your database writes. Future you will appreciate it.