False Positives vs Real Downtime: A 30-Second Triage Guide for On-Call Engineers
Last Tuesday at 2:47 AM, my phone buzzed. "HTTP check failed: checkout.example.com returned 503." I sat up, rubbed my eyes, opened the laptop, VPN'd in, pulled up the dashboard. Green. Everything green. I hit the endpoint manually from three different terminals. 200 OK, 200 OK, 200 OK. Response times normal. Logs clean. The alert was a false positive. A single dropped packet somewhere between our monitoring node in Frankfurt and the target server. I closed the laptop, lay back down, and did not fall asleep for another 40 minutes.
That story is boring because every on-call engineer has the same one. Different time, different service, same wasted adrenaline. False positives are not just an annoyance. They are a slow poison that kills your monitoring's credibility. After enough 3 AM false alarms, your team starts treating every alert as probably nothing. And then the real outage hits, the alert fires, and the on-call engineer glances at their phone, mutters "probably another false positive," and goes back to sleep. That is when the $50,000 incident becomes a $200,000 incident.
This post is about how to tell the difference between fake and real in 30 seconds flat. And how to configure your monitoring so you have to make that judgment call less often.
What a False Positive Actually Is
A false positive means monitoring reported a failure but users were fine. The check failed for a reason unrelated to actual service availability. The causes fall into a few predictable buckets:
- Network blip. A single packet dropped between the monitoring node and your server. The internet is a mess of routers, switches, and peering agreements. Packets get lost. It happens.
- Single-region connectivity issue. The monitoring node in Singapore lost connectivity for 3 seconds. Your server never noticed because nothing was actually wrong with it.
- Timeout on a slow response. Your server took 6 seconds to respond instead of the usual 2. The check's timeout was 5 seconds. Result: "down." Reality: slow, but functional.
- Rate limiting or WAF block. Your firewall or CDN temporarily blocked the monitoring IP. Cloudflare's bot detection is a common culprit here.
- DNS resolver hiccup. The DNS resolver used by the monitoring node failed to resolve your domain for one check. Happens more often than you would think.
- Monitoring infrastructure issue. The monitoring service itself had a brief problem. Yes, monitors go down too.
- Load balancer timing. The health check hit a backend server at the exact moment it was being drained or restarted. Bad luck, not a real outage.
- Brief deploy restart. Your CI/CD pipeline restarted the service for 4 seconds during a deploy. Users on existing connections were fine. The monitoring check was not.
What Real Downtime Looks Like
Real downtime means actual users cannot use your service. Not one dropped check. Not a slow response. Actual, persistent, user-facing breakage. The signatures are different:
- Multiple consecutive failures. Not one check. Three, four, five in a row.
- Multiple locations failing simultaneously. Frankfurt is down. Tokyo is down. Virginia is down. That is not a network blip.
- Consistent error codes. 503 after 503 after 503. Or connection refused. Or DNS NXDOMAIN.
- Duration. False positives resolve in seconds. Real outages persist for minutes or hours.
- Correlation across check types. HTTP check fails AND ping fails AND port check fails. Multiple independent signals all pointing the same direction.
The core signal is consistency. A real outage repeats. A false positive does not.
The 30-Second Triage
Here is the decision tree I use when I get paged. It takes 30 seconds. Sometimes less.
Step 1: Check how many locations failed. If only one monitoring location reported the failure and others show green, it is almost certainly a false positive. Network issue at that location. Move on. If multiple locations failed, proceed to step 2.
Step 2: Check consecutive failures. Was it one failed check or multiple in a row? A single failed check from a single location is noise. Two or more consecutive failures from two or more locations is signal. If consecutive, proceed to step 3.
Step 3: Hit it yourself. Open a terminal. curl -I https://yoursite.com shows the status code; add -w '%{time_total}\n' to print the total response time as well. If you get a clean 200 with normal latency, the issue may have already resolved. If you get an error, it is real. Act.
Step 4: Check your logs. Quick scan of your application logs and error tracker (Sentry, Datadog, whatever you use). If there is a spike in 500 errors or connection timeouts, it is real. If the logs are quiet, the issue was likely external to your infrastructure.
That is it. Four steps, 30 seconds. Most false positives get filtered out at step 1 or step 2. You only need to touch your laptop for steps 3 and 4.
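The decision tree above can be sketched as a small function. This is only an illustration of the four steps, not part of any monitoring tool: the AlertContext fields, the triage name, and the returned labels are all hypothetical, and the thresholds simply mirror the steps as described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertContext:
    """Facts available when the page arrives (all field names are hypothetical)."""
    failed_locations: int                    # monitoring locations that reported failure
    consecutive_failures: int                # failed checks in a row
    manual_status: Optional[int] = None      # status code from your own request, if made
    log_error_spike: Optional[bool] = None   # spike of 500s or timeouts in logs, if checked


def triage(ctx: AlertContext) -> str:
    # Step 1: a single failing location is almost certainly a local network issue.
    if ctx.failed_locations < 2:
        return "likely false positive"
    # Step 2: one failed check is noise; two or more in a row is signal.
    if ctx.consecutive_failures < 2:
        return "likely false positive"
    # Step 3: hit it yourself. An error on your own request means it is real.
    if ctx.manual_status is None:
        return "investigate: run a manual request"
    if ctx.manual_status >= 400:
        return "real outage"
    # Step 4: a clean manual response with noisy logs still points to a real problem.
    if ctx.log_error_spike is None:
        return "investigate: check logs"
    if ctx.log_error_spike:
        return "real outage"
    return "likely resolved or external"
```

The first two branches need nothing but the alert itself, which is why most false positives can be dismissed from the phone.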
Why Single-Check Alerting Destroys Trust
I have seen monitoring setups where every single failed check fires a PagerDuty alert. No retries. No multi-location confirmation. No threshold. One failed HTTP check from one location = immediate page.
These teams get paged 3-5 times per week for nothing. Within a month, the on-call rotation becomes a punishment. Within three months, people start leaving their phones on silent. Within six months, a real outage goes unnoticed for 45 minutes because the alert landed in a graveyard of ignored notifications.
The Boy Who Cried Wolf is not just a fable. It is a description of bad monitoring configuration.
One dropped packet should never wake a human. The fix is to require multiple confirming signals before treating a failure as real. That means multi-location confirmation and retry logic. Both are non-negotiable for any monitoring setup that humans have to respond to.
Multi-Location Confirmation: Your First Line of Defense
When a check fails from one location, it could be anything. A routing issue. A flaky ISP peering link. A transient DNS problem at that specific resolver. When a check fails from three locations on three different continents simultaneously, your server is probably down.
UptyBots uses global monitoring nodes to confirm failures before triggering alerts. Here is how it works in practice:
- Each check runs from multiple geographic locations.
- Results from all locations are compared before any alert fires.
- If only one location reports failure, it is logged but does not trigger a notification.
- If multiple locations report failure, the system confirms the outage and alerts.
- The confirmation threshold is configurable per monitor.
This single feature eliminates the majority of false positives. In my experience, multi-location confirmation alone reduces false alerts by 80-90%. Most of the remaining 10-20% get caught by retry logic.
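The confirmation rule itself is simple enough to sketch in a few lines. This is not UptyBots's implementation, just a minimal illustration of a configurable threshold; the function name, the location names, and the default of two failing locations are assumptions.

```python
from typing import Mapping


def confirmed_outage(results: Mapping[str, bool], min_failed_locations: int = 2) -> bool:
    """Return True when enough locations agree that the target is failing.

    `results` maps a location name to True if the check passed from there.
    """
    failed = [location for location, ok in results.items() if not ok]
    return len(failed) >= min_failed_locations
```

With results such as {"frankfurt": False, "tokyo": True, "virginia": True}, the function returns False: the failure is logged, but nobody is paged.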
Retry Logic: Your Second Line of Defense
Even with multi-location monitoring, transient failures can occasionally hit multiple locations at once. A brief DNS propagation glitch, a CDN edge that flapped for 5 seconds, a load balancer that hiccuped during a config reload. Retry logic handles these.
- First failure -- retry after a short delay (30-60 seconds)
- Second failure -- retry once more
- Third failure -- confirmed. Send the alert.
Real outages that persist for more than a couple of minutes still trigger alerts within minutes. Transient glitches that resolve in seconds never generate noise. The math works.
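Here is a minimal sketch of that retry sequence, assuming the check is a callable that returns True when the target is healthy. The confirm_failure name and its parameters are hypothetical; the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import time
from typing import Callable


def confirm_failure(
    check: Callable[[], bool],
    retries: int = 2,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the initial check and every retry fail."""
    if check():
        return False          # healthy on the first attempt, nothing to confirm
    for _ in range(retries):
        sleep(delay_seconds)  # short pause before checking again
        if check():
            return False      # recovered: transient glitch, no alert
    return True               # every attempt failed: send the alert
```

With the defaults above, a persistent outage is confirmed after two retries, roughly 60 seconds after the first failed check.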
I run my personal infrastructure with 2 retries and multi-location confirmation. I get about one false positive per quarter. I used to get five per week before configuring retries. The difference is the difference between trusting your monitoring and ignoring it.
Context Matters More Than Status Codes
A raw HTTP status code tells you almost nothing in isolation. Context is everything.
- 200 OK with an error page in the body. Your app caught the exception and returned a friendly error page with a 200 status. Monitoring says up. Users see "Something went wrong."
- 500 once in a million requests. Statistically normal. Your ORM hit a deadlock, retried, succeeded. Not an outage.
- 503 during a planned deploy. Expected. Your deploy script returned 503 for 3 seconds while the new container started. Not an emergency.
- 200 OK in 8 seconds. Technically successful. Functionally unusable. Users bounced 6 seconds ago.
- 200 OK from CDN cache. CDN served a cached page. Your origin server is on fire. Monitoring is happy. Users who need dynamic content are not.
This is why content validation matters. Check the response body, not just the status code. A good monitoring check for a login endpoint does not just verify "did I get a 200?" It verifies "does the response contain the expected JSON structure with a valid token field?" A good check for a product page verifies "does the response contain the product title and price?" That is the difference between monitoring a server and monitoring a service.
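As a rough sketch of what such a check might look like for the login example, the function below validates status, latency, and body together. The function name and the 5-second latency limit are assumptions, and the only structure it checks is a non-empty string in a "token" field.

```python
import json


def login_response_ok(status: int, body: str, elapsed_seconds: float,
                      max_seconds: float = 5.0) -> bool:
    """Treat the endpoint as healthy only if status, latency, and body all look right."""
    if status != 200:
        return False
    if elapsed_seconds > max_seconds:
        return False  # a 200 that arrives too late is functionally down
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # for example, an HTML error page served with a 200
    if not isinstance(payload, dict):
        return False
    token = payload.get("token")
    return isinstance(token, str) and token != ""
```

A friendly HTML error page returned with a 200 fails this check at the JSON parsing step, which is exactly the silent failure a status-only check misses.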
The Alert Fatigue Spiral
Alert fatigue follows a predictable pattern. I have watched it happen to at least a dozen teams:
- Team sets up monitoring with aggressive default settings.
- False positives start arriving. 2-3 per week.
- Team investigates each one. Finds nothing. Wastes time.
- After a month, the team stops investigating immediately. "I will check it in the morning."
- After two months, the alert channel becomes background noise. Nobody reads it.
- A real outage fires the same alert to the same channel. Nobody responds for 40 minutes.
- Post-mortem reveals the alert was there the whole time. Nobody looked.
The fix is not "try harder to pay attention." The fix is fewer, higher-quality alerts. At least 80% of the alerts that fire should require human action. If your actionable rate is below that, your monitoring is too noisy and it is actively making your reliability worse.
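The actionable rate is just a ratio, so it is easy to compute from an alert history. A small sketch, with hypothetical function names and the 80% target from above as the default:

```python
def actionable_rate(actionable: int, total: int) -> float:
    """Fraction of fired alerts that required human action."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= actionable <= total:
        raise ValueError("actionable must be between 0 and total")
    return actionable / total


def too_noisy(actionable: int, total: int, target: float = 0.80) -> bool:
    """True when the actionable rate falls below the target."""
    return actionable_rate(actionable, total) < target
```

For example, a month with 12 alerts of which 3 needed action gives a rate of 0.25, far below the target.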
How to Reduce False Positives Without Missing Outages
Here is the configuration checklist I use for every monitoring setup:
- Use multi-location monitoring. Require failures from 2+ regions before alerting. This is the single highest-impact change you can make.
- Enable retries. 2-3 retries with 30-60 second delays. Costs you 1-3 minutes of detection time. Saves you hundreds of false alerts per year.
- Monitor APIs and websites separately. Different endpoints have different failure modes and different tolerance for delay.
- Set realistic timeout thresholds. If your API normally responds in 500ms, a 5-second timeout is reasonable. A 30-second timeout will never catch slowness. A 1-second timeout will generate false positives on every minor latency spike.
- Track response time trends. Alert on sustained slowness, not individual slow responses. One 4-second response is noise. Ten consecutive 4-second responses is a problem.
- Use content validation. Verify the response body contains expected content. Catches silent failures that return 200 but serve error pages.
- Whitelist monitoring IPs in your WAF. If your firewall or CDN is blocking monitoring checks, you are generating false positives for no reason. Fix it.
- Review your alert history monthly. Which alerts were actionable? Which were noise? Tune the noisy ones or delete them.
- Document expected maintenance windows. If you deploy every Tuesday at 2 PM and the deploy causes a 5-second blip, suppress alerts during that window.
- Test your alerts. Send a test notification to every channel quarterly. Make sure the right people still receive them.
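The "sustained slowness" item in the checklist above is the least obvious one to implement, so here is a minimal sketch. It flags a problem only when enough consecutive responses are slow. The function name is hypothetical, and the defaults (4 seconds, 10 in a row) come from the example in that item rather than from any recommended setting.

```python
from typing import Iterable


def sustained_slowness(response_times: Iterable[float],
                       slow_seconds: float = 4.0,
                       min_consecutive: int = 10) -> bool:
    """Return True if at least `min_consecutive` slow responses occur in a row."""
    streak = 0
    for seconds in response_times:
        # A fast response resets the streak; a slow one extends it.
        streak = streak + 1 if seconds >= slow_seconds else 0
        if streak >= min_consecutive:
            return True
    return False
```

A single fast response in the middle of a slow run resets the count, which is what keeps one-off spikes from paging anyone.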
The Trade-Off: Speed vs. Accuracy
There is a fundamental tension in monitoring configuration. More aggressive alerting catches issues faster but produces more false positives. More conservative alerting is quieter but slower to detect real problems.
Where you set that dial depends on what you are monitoring:
- Revenue-critical endpoints (checkout, payment API): Bias toward speed. Tolerate some false positives. A 2-minute delay in detecting a checkout outage costs real money.
- Non-critical pages (blog, docs, marketing pages): Bias toward accuracy. A 5-minute detection delay for the blog is fine. Getting paged at 3 AM because the blog returned a slow response is not.
- Internal tools: Business hours only. Conservative thresholds. Nobody needs to be woken up because the internal wiki was slow for 10 seconds on a Saturday.
- New or unstable services: Start conservative. Tighten as you learn what normal looks like. You need baseline data before you can set meaningful thresholds.
UptyBots lets you configure all of this per monitor. Your checkout endpoint gets aggressive thresholds and immediate multi-location confirmation. Your blog gets relaxed thresholds and higher retry counts. Different services, different configurations, same dashboard.
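To make the per-service idea concrete, here is a sketch of what such profiles could look like as plain data. This is not the UptyBots configuration format: the MonitorProfile fields, the specific values, and the 9-to-18 weekday definition of business hours are all assumptions chosen to match the categories above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorProfile:
    retries: int
    retry_delay_seconds: int
    min_failed_locations: int
    page_outside_business_hours: bool


# Hypothetical profiles: fast detection for checkout, quiet settings elsewhere.
PROFILES = {
    "checkout": MonitorProfile(1, 30, 2, page_outside_business_hours=True),
    "blog": MonitorProfile(4, 60, 2, page_outside_business_hours=True),
    "internal-wiki": MonitorProfile(3, 60, 2, page_outside_business_hours=False),
}


def should_page(profile: MonitorProfile, hour: int, is_weekday: bool) -> bool:
    """Decide whether a confirmed failure should page a human right now."""
    in_business_hours = is_weekday and 9 <= hour < 18
    return profile.page_outside_business_hours or in_business_hours
```

Under these assumptions a confirmed wiki failure at 3 AM on a Saturday waits until Monday morning, while a confirmed checkout failure pages immediately.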
A Personal Note on On-Call Culture
I have been on call, on and off, for about 15 years. The teams I have been on that handled it well had one thing in common: they treated every false positive as a bug. Not "oh well, that happens." A bug. Something to investigate, root-cause, and fix so it does not happen again.
When you treat false positives as bugs, they decrease over time. Your monitoring configuration gets better with each incident review. After six months, you have a monitoring setup where every alert means something. Your on-call engineers trust the system. When the phone rings at 3 AM, they know it is real, and they respond fast.
That trust is worth more than any feature on a monitoring tool's pricing page.
Frequently Asked Questions
How do I know if my alerts have too many false positives?
Track your actionable alert rate. If fewer than 80% of your alerts result in real action (investigation, fix, or escalation), you have too many false positives. Most well-tuned monitoring setups achieve 90%+ actionable rates.
What is the right retry count?
For most services, 2-3 retries with 30-60 second delays. Mission-critical services might use 1 retry (faster detection, accept slightly more noise). Low-priority services might use 3-4 retries (less noise, accept slightly slower detection).
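The trade-off is easy to put numbers on. The sketch below computes the extra time between the first failed check and the alert, assuming each retry waits a fixed delay and ignoring the time the checks themselves take. The function name is hypothetical.

```python
def added_detection_delay(retries: int, delay_seconds: float) -> float:
    """Seconds added between the first failed check and the alert."""
    if retries < 0 or delay_seconds < 0:
        raise ValueError("retries and delay_seconds must be non-negative")
    return retries * delay_seconds
```

One retry at 30 seconds adds 30 seconds, two retries at 30 seconds add a minute, and four retries at 60 seconds add four minutes.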
Can I have different retry settings for different monitors?
Yes. This is how it should be configured. Your checkout API and your company blog should not have the same alerting sensitivity. UptyBots supports per-monitor configuration for exactly this reason.
How does multi-location monitoring affect costs?
Multi-location monitoring uses more checks per interval. UptyBots includes multi-location checks in paid plans, with the number of locations varying by tier. The cost increase is trivial compared to the reduction in false positives.
What if my service has periodic maintenance?
Configure maintenance windows to suppress alerts during expected downtime. UptyBots supports per-monitor maintenance scheduling. No more alerts for planned deploy restarts.
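Conceptually, a maintenance window is a time-range check applied before an alert goes out. A minimal sketch, assuming weekly windows that do not cross midnight and timestamps already in the same timezone as the window definition (the class and function names are hypothetical, not a UptyBots API):

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Sequence


@dataclass(frozen=True)
class MaintenanceWindow:
    weekday: int   # Monday is 0, matching datetime.weekday()
    start: time    # inclusive
    end: time      # exclusive; the window must not cross midnight


def is_suppressed(now: datetime, windows: Sequence[MaintenanceWindow]) -> bool:
    """Return True if `now` falls inside any configured maintenance window."""
    return any(
        window.weekday == now.weekday() and window.start <= now.time() < window.end
        for window in windows
    )
```

A Tuesday 2 PM deploy could be covered by MaintenanceWindow(weekday=1, start=time(14, 0), end=time(14, 10)).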
Conclusion
The difference between monitoring that works and monitoring that is ignored comes down to signal-to-noise ratio. Every false positive erodes trust. Every missed real alert costs money. The goal is simple: when the alert fires, it should be real. When there is a real problem, the alert should fire.
Multi-location confirmation, retry logic, content validation, and per-monitor thresholds are not nice-to-haves. They are the difference between an on-call rotation your engineers tolerate and one they trust. UptyBots gives you all four out of the box.
Configure it right. Treat false positives as bugs. Tune continuously. Your 3 AM self will thank you.