By Michael Torres · Feb 5, 2026

Discord Bot Monitoring: Why "Online" Doesn't Mean "Working" (And the Night My Music Bot Proved It)

I run a Discord server for our gaming community. About 300 members, pretty active, people hanging out in voice channels most evenings. We have a music bot. Not Groovy or Rythm (RIP), a self-hosted one I set up on a VPS using discord.js and Lavalink. It worked great. People would queue up songs during game nights, we'd have little listening parties, the whole vibe.

One Friday night I'm in voice chat with about 15 people and someone types /play. Nothing happens. They try again. Nothing. Someone else tries a different command. Nothing. We all look at the member list and there it is: the bot, bright green dot, showing "Online."

It took me 20 minutes of digging to figure out what happened. The Lavalink audio node had crashed three days earlier. The bot's gateway connection to Discord was still alive, so Discord kept showing it as online. The bot process was running. It was receiving events. It just couldn't do the one thing people actually used it for: play music. For three days.

Nobody reported it because the bot looked fine. The green dot lied to all of us.

The Green Dot Problem

This is the single most misleading thing about Discord bots, and I want to be really clear about it because I see server owners get burned by it constantly.

Discord's online/offline indicator is based on one thing: the WebSocket connection between your bot and Discord's gateway. That's it. If that connection is alive, the bot shows as online. Discord doesn't check if your bot can actually process commands. It doesn't verify that your database connection works. It doesn't test whether your API integrations are responsive. It just checks the WebSocket heartbeat.

So your bot can be in any of these states and still show a green dot:

  • Command handler has crashed but the process is still running
  • Database connection has died, so any command that reads or writes data fails silently
  • External API (music service, game stats API, whatever) is down, breaking the features that depend on it
  • Background tasks (scheduled messages, role assignments, auto-moderation) have stopped running
  • The bot is stuck in an error loop, catching exceptions and doing nothing with them
  • Memory usage is so high the bot is barely responsive, timing out on every interaction

Every one of these is a real scenario I've either experienced myself or helped other server owners debug. The bot is "online." The bot is broken.

Why This Matters More Than You Think

When I first started running bots, I figured the worst case was people not being able to play music for a bit. Annoying but not a big deal. Then I started running moderation bots, and the stakes changed completely.

Think about what happens when your moderation bot silently stops working:

  • Auto-mod stops catching spam and slurs. Your server gets trashed while the bot sits there with its green dot doing nothing.
  • Raid protection stops working. A raid hits your server at 2 AM, the bot that's supposed to auto-ban mass joins is broken, and you wake up to hundreds of spam messages.
  • Role assignment commands stop working. New members can't get their roles, can't access channels, and leave because they think the server is dead.
  • Logging stops. Something bad happens in your server, you go to check the audit logs your bot was supposed to keep, and there's nothing there for the last week.

Music bot being down is annoying. Moderation bot being down can wreck your community. The green dot doesn't distinguish between the two.

My Journey From "It's Probably Fine" to Actual Monitoring

After the music bot incident, I went through a few phases of trying to solve this problem. I'll walk through them because I think most people go through the same progression.

Phase 1: The manual check

I started typing /ping to my bot every morning. If it responded, I figured it was working. This lasted about a week before I forgot, and then it was back to square one. Manual checking doesn't scale. You have a life, you forget, you assume everything is fine because you checked yesterday.

Phase 2: The watchdog bot

I wrote a second bot whose only job was to send a command to my main bot every five minutes and check if it got a response. Clever, right? Except now I had two bots that could break. And when my VPS had an issue, both bots went down together. I was monitoring a bot with a bot that ran on the same server. Not my finest moment.

Phase 3: The health check endpoint

This is where things actually started working. I added a tiny HTTP server to my bot that exposed a /health endpoint. Then I pointed UptyBots at that endpoint. External monitoring, running on completely separate infrastructure, checking my bot every minute. When the endpoint stopped responding or returned an error, I got an alert on Telegram.

This is what I should have done from the start, and it's what I'm going to walk you through doing.

Building a Health Check That Actually Checks Health

The health endpoint is only useful if it checks the things that actually matter. A lot of tutorials will tell you to create an endpoint that returns {"status": "ok"} and call it a day. That's barely better than the green dot: all it proves is that your bot process is running and can serve HTTP. Great. But can it actually do its job?

Here's what my health endpoint checks:

  • Gateway connection. Is the bot connected to Discord's gateway? If the WebSocket dropped and hasn't reconnected, the bot can't receive events.
  • Last event timestamp. When did the bot last process a Discord event? If it's been more than 5 minutes, something is wrong even if the connection is technically alive.
  • Database connectivity. Can the bot reach its database? I run a quick SELECT 1 query. If it fails, every command that touches the DB will fail too.
  • Lavalink status. Is the audio node reachable? This is what would have caught my original problem three days earlier.
  • Memory usage. Is the bot using more than 80% of available memory? If so, it's probably about to start having problems.
  • Command processing. Has the bot successfully processed at least one command in the last 10 minutes? If zero commands have been processed during active hours, something might be wrong with the command handler.

If all checks pass, the endpoint returns HTTP 200 with a JSON body showing the details. If any check fails, it returns HTTP 500. Simple. Binary. UptyBots hits this endpoint, gets either 200 or 500, and alerts me accordingly.
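
A minimal sketch of how the checks above could collapse into that binary 200/500 answer. It is not the exact code from any particular bot. The `state` object and its field names are hypothetical: the assumption is that the bot updates them elsewhere (for example, stamping `lastEventAt` in its event handlers and reading memory from `process.memoryUsage().rss`).

```javascript
// Thresholds come from the checklist above; every field name on `state` is hypothetical.
const MAX_EVENT_AGE_MS = 5 * 60 * 1000;    // last processed Discord event
const MAX_COMMAND_AGE_MS = 10 * 60 * 1000; // last successfully processed command
const MAX_MEMORY_RATIO = 0.8;              // 80% of available memory

/**
 * Turn a snapshot of bot state into an HTTP status code and a JSON-ready body.
 * Missing or malformed fields count as failures, so the endpoint fails closed.
 */
function evaluateHealth(state, now = Date.now()) {
  const checks = {
    gateway: state.gatewayConnected === true,
    lastEvent: now - state.lastEventAt <= MAX_EVENT_AGE_MS,
    database: state.databaseOk === true,
    lavalink: state.lavalinkOk === true,
    memory: state.memoryUsedBytes / state.memoryLimitBytes <= MAX_MEMORY_RATIO,
    // Only enforce the command check during active hours, when silence is suspicious.
    commands: !state.activeHours || now - state.lastCommandAt <= MAX_COMMAND_AGE_MS,
  };
  const failed = Object.keys(checks).filter((name) => !checks[name]);
  const ok = failed.length === 0;
  return {
    statusCode: ok ? 200 : 500,
    body: { status: ok ? "ok" : "failing", failed, checks },
  };
}
```

Including the list of failed checks in the body means the alert already says which component broke, before anyone opens a terminal.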

For discord.js, you can spin up an Express server alongside your bot in about 20 lines of code. For discord.py, aiohttp or Flask works. JDA users can use Javalin or Spark. The implementation is trivial. The value is enormous.
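
For a sense of scale, here is a rough sketch of such an endpoint. It uses Node's built-in `http` module instead of Express so it has no dependencies; the shape is the same either way. `getReport` is a hypothetical callback that returns `{ ok, checks }` from whatever health state the bot keeps.

```javascript
const http = require("node:http");

/**
 * Build a request handler that answers GET /health.
 * `getReport` is a hypothetical callback returning { ok: boolean, checks: object }.
 */
function makeHealthHandler(getReport) {
  return (req, res) => {
    const path = (req.url || "").split("?")[0]; // ignore any query string
    if (req.method !== "GET" || path !== "/health") {
      res.writeHead(404, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ error: "not found" }));
      return;
    }
    let report;
    try {
      report = getReport();
    } catch (err) {
      // A health check that throws is itself a failure worth reporting.
      report = { ok: false, checks: {}, error: String(err) };
    }
    res.writeHead(report.ok ? 200 : 500, { "Content-Type": "application/json" });
    res.end(JSON.stringify(report));
  };
}

/** Create the server without starting it; the caller decides the port. */
function createHealthServer(getReport) {
  return http.createServer(makeHealthHandler(getReport));
}
```

In a bot this would typically be started once, next to the login call, with something like `createHealthServer(getReport).listen(3000)`, where the port is only an example.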

What to Monitor and How

Based on running bots for a few years now, here's the monitoring setup I recommend for anyone who takes their Discord bot seriously:

Layer 1: Process health (is the bot running?)

Use a process manager like systemd, PM2, or Docker with restart policies. This handles the "bot process crashed" scenario by automatically restarting it. But it doesn't handle the "bot process is running but broken" scenario. That's why you need the next layers.

Layer 2: HTTP health endpoint (is the bot working?)

This is the big one. Your health endpoint, monitored externally by UptyBots. Check it every 1-3 minutes. This catches all the silent failure modes that the green dot misses. It catches dead database connections, crashed audio nodes, stuck event loops, and everything else that makes a bot useless without killing the process.

Layer 3: Server monitoring (is the host healthy?)

Monitor the VPS or machine your bot runs on. A ping check or TCP port check on SSH (port 22) tells you if the machine itself is reachable. If the machine goes down, your bot goes with it, and the health endpoint becomes unreachable. But you want to distinguish between "the machine is down" and "the machine is fine but the bot is broken" because the response is different.
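
To make that distinction concrete, here is a small sketch that combines the two probe results into a single diagnosis. The function name and the category labels are made up for illustration; the inputs are assumed to come from the host check and the health endpoint check respectively.

```javascript
/**
 * Classify an incident from two independent probes.
 * hostReachable: boolean result of the ping or TCP check against the machine.
 * healthStatus:  HTTP status returned by /health, or null if there was no response.
 */
function classifyIncident(hostReachable, healthStatus) {
  if (healthStatus === 200) return "healthy";
  if (!hostReachable) return "host-down";               // hosting problem: check the provider
  if (healthStatus === null) return "bot-unreachable";  // machine is up, bot or its HTTP server is not
  return "bot-degraded";                                // bot answers, but a check is failing
}
```

A "host-down" result points at the provider's status page, while "bot-degraded" points at the bot's own logs.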

Layer 4: Dependency monitoring (are the services your bot needs alive?)

If your bot depends on external APIs (game stats, weather, translation, AI), monitor those separately. When your bot's "check weather" command stops working, you want to know immediately whether it's your bot's fault or the weather API's fault. Separate monitors for separate services.

Setting Up Alerts That Don't Drive You Crazy

I made the mistake early on of setting every alert to maximum urgency. Bot down? Telegram notification. Slow response? Telegram notification. Memory above 50%? Telegram notification. Within a week I was ignoring alerts because there were too many of them. Alert fatigue is real and it's dangerous because you start ignoring the important ones too.

Here's how I have my alerts configured now:

  • Health endpoint returns 500 (two consecutive checks): Telegram + Discord webhook to admin channel. This is the "something is actually broken" alert.
  • Health endpoint unreachable (three consecutive checks): Telegram + email. The bot or its server is completely down. Higher threshold to avoid false positives from brief network issues.
  • Server ping fails (three consecutive checks): Email. The host machine is unreachable. This is a hosting provider issue, not a bot issue.

The key is requiring consecutive failures before alerting. A single failed check can be a network blip, a momentary DNS hiccup, anything. Two or three consecutive failures from an external monitoring location? That's a real problem.
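
The consecutive-failure rule is simple enough to sketch in a few lines. This only illustrates the idea and makes no claim about how UptyBots or any other monitoring service implements it; the class name is made up.

```javascript
/** Fires an alert only after `threshold` failed checks in a row. */
class ConsecutiveFailureAlert {
  constructor(threshold) {
    this.threshold = threshold;
    this.failures = 0;
    this.alerting = false;
  }

  /** Record one check result. Returns "alert", "recovered", or null. */
  record(ok) {
    if (ok) {
      const wasAlerting = this.alerting;
      this.failures = 0;
      this.alerting = false;
      return wasAlerting ? "recovered" : null;
    }
    this.failures += 1;
    if (!this.alerting && this.failures >= this.threshold) {
      this.alerting = true; // alert once, not on every further failure
      return "alert";
    }
    return null;
  }
}
```

A single failed check followed by a success resets the counter, so a network blip never reaches your phone.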

Real Failures I've Caught With Monitoring

Since setting up proper monitoring, I've caught problems that would have gone unnoticed for hours or days:

  • Database connection pool exhaustion. My bot's database hit its connection limit after the bot had been running for two weeks straight. Commands that needed the DB started timing out. The health endpoint caught the dead DB connection within 60 seconds. I restarted the bot, switched to connection pooling with proper cleanup, and it never happened again. Without monitoring, this would have been "hey Michael, the leveling system hasn't been working for a week."
  • Discord API rate limit spiral. After a Discord outage, my bot's reconnection logic fired too aggressively and got rate limited. It reconnected to the gateway but was being throttled on every API call. Commands worked intermittently. The health endpoint showed the rate limit status and flagged it. I adjusted the reconnection backoff timing.
  • Memory leak from cached messages. I had a message logger that cached recent messages for edit/delete tracking. It never cleared the cache. After about 5 days, memory usage would hit the ceiling. My health endpoint tracked memory usage and alerted me when it crossed 80%. I added cache eviction with a 24-hour TTL. Fixed.
  • Node.js event loop blocked by CPU-intensive operation. Someone requested a leaderboard with 10,000 entries, and the sorting/formatting blocked the event loop for 8 seconds. During those 8 seconds, no other commands processed. The health endpoint timed out on its next check because the HTTP server couldn't respond while the event loop was blocked. I moved the heavy computation to a worker thread.
  • Automatic restart at 3 AM saved a raid response. My bot crashed at 3:12 AM. Systemd restarted it automatically. Monitoring confirmed it came back up within 2 minutes. At 3:45 AM, a small raid hit the server. The auto-mod caught it because the bot was already back online. If the bot had stayed down until I woke up at 8 AM, the raid damage would have been much worse.
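
The message-cache leak above is the kind of bug that a small TTL cache prevents. Here is a minimal sketch of the idea, with a made-up class name and an injectable clock so it can be tested. It is not the code from the bot described here.

```javascript
/** A Map-backed cache whose entries expire `ttlMs` after they are written. */
class TtlCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  /** Return the cached value, or undefined if it is missing or expired. */
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  /** Remove every expired entry and return how many were removed. */
  evictExpired() {
    const cutoff = this.now();
    let removed = 0;
    for (const [key, entry] of this.entries) {
      if (entry.expiresAt <= cutoff) {
        this.entries.delete(key);
        removed += 1;
      }
    }
    return removed;
  }

  get size() {
    return this.entries.size;
  }
}
```

For a 24-hour TTL you would construct it with `24 * 60 * 60 * 1000` and call `evictExpired()` on a timer. Without that periodic sweep, entries that are never read again would still pile up.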

Notification Channels: Pick More Than One

UptyBots supports email, Telegram, and webhooks. I use all three for different purposes:

  • Discord webhook (via UptyBots's webhook notification channel). Posts to our admin-only channel. The other mods can see it and potentially respond even if I'm not around.
  • Telegram. My personal alert channel. I always have my phone. This is the one that wakes me up at 3 AM if needed (silent hours disabled for this chat).
  • Email. Backup for everything. If Telegram is down (it happens), email still gets through.

Using only one notification channel is a single point of failure. If your only alert method is a Discord webhook and Discord itself is having issues (which affects your bot too), you'll never get the alert. Multiple channels. Always.

Best Practices From Running Bots for Three Years

  • The health endpoint is non-negotiable. If you're running a bot that people depend on, add an HTTP health endpoint. It's 20 lines of code. There is no simpler way to get real visibility into your bot's health.
  • Monitor from outside your infrastructure. An external service like UptyBots catches problems that internal monitoring misses. If your whole VPS goes down, internal monitoring goes with it. External monitoring doesn't.
  • Use a process manager with auto-restart. Systemd, PM2, Docker. Your bot should restart automatically after a crash. Monitoring tells you it happened. Auto-restart fixes it. Together, most crashes resolve themselves with minimal impact.
  • Track uptime history. After a month of monitoring data, patterns emerge. Maybe your bot crashes every Sunday morning because of a cron job on the same server. Maybe it gets slow every evening because that's when your community is most active and memory usage peaks. Data reveals things you'd never notice otherwise.
  • Don't ignore slow responses. A bot that takes 5 seconds to respond to every command feels broken even if it technically works. Monitor response time, not just availability. If your health endpoint starts taking 3 seconds instead of 50 milliseconds, something is degrading.
  • Document your runbook. Write down what to do when the bot goes down. SSH command to restart, how to check logs, how to roll back to a previous version. When it's 3 AM and your phone is buzzing, you don't want to be thinking. You want to be following steps.
  • Test your alerts. Stop your bot on purpose once a month. Verify you get notifications on every channel. I once had a Telegram alert that stopped working because I regenerated my Telegram bot token and forgot to update the monitoring config. Found out during a real outage. Don't be me.
  • Monitor dependencies separately. If your bot uses a database, an audio node, an external API, give each one its own monitor. When something breaks, you want to know immediately which component failed, not spend 20 minutes checking each one manually.
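
On the point about slow responses: one way to notice degradation without reacting to a single slow request is to look at the median of the last few response times. A rough sketch follows; the class name, window size, and threshold are illustrative assumptions.

```javascript
/** Keeps the last `size` response times and flags sustained slowness. */
class LatencyWindow {
  constructor(size, slowThresholdMs) {
    this.size = size;
    this.slowThresholdMs = slowThresholdMs;
    this.samples = [];
  }

  add(ms) {
    this.samples.push(ms);
    if (this.samples.length > this.size) this.samples.shift(); // drop the oldest sample
  }

  /** Median of the current samples, or null if there are none yet. */
  median() {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  /** True only when the window is full and the median exceeds the threshold. */
  isDegraded() {
    return this.samples.length === this.size && this.median() > this.slowThresholdMs;
  }
}
```

Using the median means one 3-second outlier among 50-millisecond responses does not trip the flag, but a run of slow responses does.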

Frequently Asked Questions

Why does Discord show my bot as online when it's broken?

Discord's presence indicator is based purely on the gateway WebSocket connection. As long as the bot maintains that connection (sends heartbeats), Discord shows it as online. Application-level failures (crashed command handler, dead database, broken APIs) don't affect the WebSocket connection, so the green dot persists even when the bot can't actually do anything useful.

How do I add a health endpoint to my bot?

For discord.js, add Express as a dependency and create a simple HTTP server that runs alongside your bot. For discord.py, use aiohttp or Flask. For JDA, use Javalin or Spark. The endpoint should check gateway connection status, database connectivity, and any critical dependencies before returning HTTP 200. Return HTTP 500 if any check fails. Most implementations are under 30 lines of code.

How often should I check my bot?

Every 1-3 minutes for active community bots. More frequent checks mean faster detection but also more HTTP requests to your bot. For most setups, checking every minute is the sweet spot. UptyBots supports intervals down to 1 minute on paid plans.

Can I monitor a self-hosted bot?

Yes, as long as your health endpoint is reachable from the public internet. If you're hosting from home, you'll need port forwarding for the health check port (pick something non-standard like 8080 or 3000). If you're on a VPS, the port is usually reachable already, as long as your firewall (and any firewall your provider puts in front of the machine) allows incoming connections on it.

Should I also monitor from inside my Discord server?

External monitoring (like UptyBots) is the primary method because it's independent of your bot's infrastructure. You can add a secondary watchdog bot as an internal check, but don't rely on it as your only monitoring. If both bots run on the same server, they'll both go down together.

Stop Trusting the Green Dot

That green dot next to your bot's name in the member list tells you exactly one thing: the WebSocket connection to Discord is alive. It tells you nothing about whether your bot can process commands, reach its database, play music, moderate your server, or do literally anything useful.

I wasted three days of broken music playback because I trusted the green dot. I almost lost control of a raid because I assumed "online" meant "working." Don't make the same mistakes. Add a health endpoint. Point UptyBots at it. Set up alerts on Telegram and Discord. The whole setup takes maybe 30 minutes, and the first time it catches a silent failure that would have gone unnoticed for hours, you'll wonder why you didn't do it sooner.

Start monitoring your bot today: See our tutorials.
