Detecting Intermittent Downtime That Users Notice but Monitoring Misses
Your monitoring dashboard shows 100% uptime. Green across the board. But your support inbox tells a different story: customers are reporting errors, slow pages, and failed transactions. This disconnect is one of the most frustrating problems in operations -- intermittent downtime that slips through the cracks of standard monitoring. These are not full outages that take your entire site offline for hours. They are brief, sporadic failures that last seconds to minutes, affect only some users, and disappear before your next monitoring check runs. This guide explains exactly why these phantom outages happen, the technical mechanisms behind them, and how to configure your monitoring to catch them reliably.
What Is Intermittent Downtime?
Intermittent downtime refers to brief, recurring periods when your website or application is unavailable or degraded for some users. Unlike a full outage where everything is clearly broken, intermittent issues are partial, temporary, and maddeningly inconsistent. They might affect:
- Only users in certain geographic regions
- Only certain pages or API endpoints
- Only requests that hit a specific server in your load balancer pool
- Only users whose DNS cache has expired and is being refreshed
- Only requests during brief periods of high concurrency
- Only connections routed through a specific CDN edge node
The defining characteristic is that the issue is real -- users genuinely experience failures -- but it does not show up consistently in monitoring because the monitoring check happens to miss the narrow window when the problem occurs.
Why Standard Monitoring Misses Intermittent Issues
To understand why these issues are missed, you need to understand what standard monitoring actually does. A typical HTTP check works like this: every N minutes, a monitoring service sends a single HTTP request to your URL and checks whether it gets a successful response (HTTP 200). If it does, the check passes. If it does not, the check fails.
The fundamental limitation is sampling frequency. If your check runs every 5 minutes, you are testing your site 288 times per day out of the millions of requests real users make. A failure that lasts 30 seconds has only a 10% chance of being caught by a 5-minute check. A failure that lasts 10 seconds has roughly a 3% chance. Here is how check frequency affects detection probability for different outage durations:
| Outage Duration | 5-Min Check Interval | 3-Min Check Interval | 1-Min Check Interval |
|---|---|---|---|
| 10 seconds | 3% detection rate | 6% detection rate | 17% detection rate |
| 30 seconds | 10% detection rate | 17% detection rate | 50% detection rate |
| 1 minute | 20% detection rate | 33% detection rate | 100% detection rate |
| 2 minutes | 40% detection rate | 67% detection rate | 100% detection rate |
| 5 minutes | 100% detection rate | 100% detection rate | 100% detection rate |
This is why UptyBots offers 1-minute check intervals as the default for critical endpoints. At 1-minute intervals, any outage lasting 60 seconds or more is guaranteed to be detected. Shorter outages are caught with high probability, especially when combined with multi-location checks.
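The detection rates in the table follow from a simple sampling model: a fixed-interval check catches an outage only if at least one check lands inside the outage window, so for an outage shorter than the interval the probability is roughly duration divided by interval. A minimal sketch (the function name is illustrative):

```python
def detection_probability(outage_seconds: float, interval_seconds: float) -> float:
    """Probability that at least one fixed-interval check lands inside a
    randomly timed outage of the given duration (single-check model)."""
    return min(1.0, outage_seconds / interval_seconds)

# Reproduce one row of the table: a 30-second outage vs. common intervals.
for interval in (300, 180, 60):
    p = detection_probability(30, interval)
    print(f"{interval // 60}-min interval: {p:.0%}")
```

This model assumes outage start times are uniformly random relative to the check schedule, which is why real detection rates improve further when multiple locations check on independent schedules.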
The Seven Root Causes of Intermittent Downtime
Intermittent failures are not random. They have specific, diagnosable root causes. Understanding these causes helps you both detect and prevent them.
1. Load Balancer Health Check Gaps
When a server in your load balancer pool starts failing, there is a window between when the server begins returning errors and when the load balancer detects it and removes it from the pool. During this window, some percentage of user requests are routed to the failing server while others hit healthy servers. The result: some users see errors while others see a perfectly working site.
How to detect it: Monitor individual backend servers, not just the load balancer endpoint. If you only monitor the public URL, you are relying on the load balancer to route your monitoring check to the failing server -- which is a coin flip.
2. DNS Propagation and Caching Issues
DNS resolution is not instant or universal. When you make DNS changes (new IP, new CDN, failover), the propagation happens gradually. Different users have different DNS caches with different TTL (Time To Live) values. During propagation, some users resolve to the old IP while others resolve to the new one. If the old IP is no longer serving your site, those users experience downtime while your monitoring (which may have already updated its DNS cache) shows everything fine. For a deep dive into this, read our guide on why your website appears down only in certain countries.
How to detect it: Use multi-location monitoring. Different monitoring locations use different DNS resolvers, so they will detect the inconsistency. UptyBots checks from multiple geographic locations to catch exactly this type of issue.
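You can compare what different public resolvers return for your domain with a hand-rolled DNS A-record query. This is a minimal sketch of the RFC 1035 wire format, not production code: it assumes answer names come back as 2-byte compression pointers (true for virtually all resolvers), and a real setup would more likely use a library such as dnspython:

```python
import random
import socket
import struct

def build_query(name: str) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    header = struct.pack(">HHHHHH", random.randint(0, 0xFFFF), 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def resolve_a(name: str, resolver_ip: str, timeout: float = 3.0) -> list[str]:
    """Ask one specific resolver for a name's A records."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(query, (resolver_ip, 53))
        data, _ = s.recvfrom(512)
    ancount = struct.unpack(">H", data[6:8])[0]
    pos = len(query)  # the response echoes the question section first
    ips = []
    for _ in range(ancount):
        pos += 2  # answer name (assumed to be a 2-byte compression pointer)
        rtype, _rclass, _ttl, rdlen = struct.unpack(">HHIH", data[pos:pos + 10])
        pos += 10
        if rtype == 1 and rdlen == 4:  # A record: 4-byte IPv4 address
            ips.append(".".join(str(b) for b in data[pos:pos + 4]))
        pos += rdlen
    return sorted(ips)

if __name__ == "__main__":
    # A mismatch between resolvers suggests propagation is still in flight.
    for resolver in ("8.8.8.8", "1.1.1.1"):
        print(resolver, resolve_a("example.com", resolver))
```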
3. Connection Pool Exhaustion
Your application server maintains a pool of database connections, HTTP client connections, or thread pool workers. Under normal load, the pool handles requests fine. But during traffic spikes, the pool gets exhausted. New requests queue up, and if the queue fills, they start failing. The spike passes, the pool recovers, and everything looks normal again -- until the next spike.
How to detect it: Monitor response time, not just availability. Connection pool exhaustion typically shows up as a sudden spike in response time (from 200ms to 5,000ms+) before requests start failing entirely. UptyBots tracks response times for every check and alerts you when they exceed your configured threshold.
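The exhaustion mechanism itself is easy to see with a toy pool built on a bounded semaphore, assuming (as most real pools do) that a request waits briefly for a free slot and then fails:

```python
import threading

class ConnectionPool:
    """Toy fixed-size pool: acquire waits up to `wait` seconds for a free
    slot, then gives up, mimicking a real pool's checkout timeout."""

    def __init__(self, size: int):
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, wait: float = 0.1) -> bool:
        return self._slots.acquire(timeout=wait)

    def release(self) -> None:
        self._slots.release()

pool = ConnectionPool(size=2)
print(pool.acquire(wait=0))  # True  -- slot 1 taken
print(pool.acquire(wait=0))  # True  -- slot 2 taken
print(pool.acquire(wait=0))  # False -- pool exhausted, request fails
pool.release()
print(pool.acquire(wait=0))  # True  -- spike passed, pool recovered
```

Notice that the failure disappears as soon as one slot is released, which is exactly why the problem never shows up when you look later.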
4. CDN Edge Node Failures
Content Delivery Networks route users to the nearest edge node. If a specific edge node has issues (overloaded, misconfigured, or experiencing hardware problems), users routed through that node see errors while users routed through other nodes are fine. Your monitoring might be routed through a healthy node and never see the problem.
How to detect it: Multi-location monitoring is essential here. If your monitoring check from Frankfurt is fine but the check from London fails, you have found a CDN edge issue. This is nearly impossible to detect with single-location monitoring.
5. Race Conditions and Concurrency Bugs
Race conditions are some of the hardest bugs to catch because they only trigger under specific timing conditions: two requests arrive at exactly the same moment, both try to update the same database record, and both read the old value before either writes the new one. The result is data corruption, failed transactions, or 500 errors that happen once every thousand requests and are nearly impossible to reproduce in testing.
How to detect it: Monitor specific transactional endpoints (checkout, payment processing, account updates) with higher frequency. Use API monitoring to check not just HTTP status codes but also response body content -- a 200 response with an error message in the body is still a failure.
6. Memory Leaks and Garbage Collection Pauses
Memory leaks cause gradual degradation. Your application starts fine, but over hours or days, memory usage climbs. Eventually, the application starts swapping to disk (causing extreme slowness) or the garbage collector runs aggressively (causing periodic freezes of 1-5 seconds). These freezes appear as intermittent timeouts to users. When the application is restarted (manually or by a process manager), everything is fast again -- until the leak fills memory once more.
How to detect it: Track response time trends over time. A gradual increase in response times that resets after restarts is the classic signature of a memory leak. UptyBots stores historical response time data so you can identify these patterns.
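One way to quantify the trend is a least-squares slope over recent response times. This sketch uses hypothetical sample data; a sustained positive slope that drops back to zero after a restart is the pattern to look for:

```python
def trend_ms_per_sample(latencies_ms) -> float:
    """Least-squares slope of response time across successive checks."""
    n = len(latencies_ms)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(latencies_ms) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, latencies_ms))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Response times creeping upward over successive checks (hypothetical data).
samples = [210, 230, 260, 310, 390, 520]
slope = trend_ms_per_sample(samples)
print(f"trend: {slope:+.1f} ms per check")
if slope > 5:
    print("warning: sustained latency growth -- possible memory leak")
```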
7. SSL/TLS Handshake Failures
SSL handshake failures can be intermittent when caused by certificate chain issues. Your server might serve the leaf certificate correctly but occasionally fail to send intermediate certificates. Some clients cache the intermediate certificates from previous connections and work fine, while clients making their first connection fail with SSL errors. This creates a pattern where existing users have no problems while new visitors cannot access your site.
How to detect it: Use dedicated SSL monitoring that checks the full certificate chain, not just the expiration date. UptyBots validates the entire SSL chain and alerts you to configuration issues before they affect users.
The Multi-Location Strategy
Single-location monitoring has a blind spot: it can only detect issues visible from one network perspective. The internet is not uniform. The path from your monitoring server to your website is different from the path your users take. Multi-location monitoring eliminates this blind spot by checking your site from multiple geographic locations simultaneously.
Here is what multi-location monitoring catches that single-location misses:
- Regional network outages: An ISP issue in Europe does not affect monitoring from the US, but it affects all your European users.
- CDN edge problems: Different locations route through different CDN edges, exposing node-specific failures.
- DNS inconsistencies: Different DNS resolvers in different locations may return different results during propagation or misconfiguration.
- Geo-blocking accidents: A firewall rule that accidentally blocks an entire country will be invisible from locations in unblocked countries.
- Routing issues: BGP routing changes can make your site unreachable from specific network segments while remaining accessible from others.
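Interpreting a round of multi-location results is mostly a classification problem. A minimal sketch, with made-up location names:

```python
def diagnose(results: dict[str, bool]) -> str:
    """Classify one round of per-location checks (True = check passed)."""
    passed = [loc for loc, ok in results.items() if ok]
    if len(passed) == len(results):
        return "healthy"
    if not passed:
        return "global outage"
    failing = sorted(set(results) - set(passed))
    return f"partial outage: failing from {', '.join(failing)}"

print(diagnose({"frankfurt": True, "london": True, "virginia": True}))
print(diagnose({"frankfurt": True, "london": False, "virginia": True}))
```

The "partial outage" verdict is the one single-location monitoring can never produce: it is the direct signature of a regional, CDN, DNS, or routing issue.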
Configuring UptyBots to Catch Intermittent Issues
Here is a step-by-step guide to setting up your monitoring for maximum intermittent downtime detection:
Step 1: Set Check Frequency to 1 Minute for Critical Endpoints
For your homepage, checkout page, login page, and primary API endpoints, use the minimum check interval of 1 minute. This is the single most impactful setting change you can make. Per the detection table above, a one-minute outage is missed 80% of the time at a 5-minute interval but is always caught at a 1-minute interval.
Step 2: Enable Multi-Location Checks
Enable checks from multiple geographic locations. Each location runs its own independent check, effectively multiplying your detection capability. If you have 3 locations checking every minute, you are running 3 checks per minute -- tripling your chance of catching a brief outage.
Step 3: Monitor Response Time, Not Just Status Codes
Set a response time threshold for your HTTP monitors. A reasonable starting point: alert if response time exceeds 3x your normal average. If your site normally responds in 300ms, set an alert at 900ms. Response time spikes are the early warning signal for many intermittent issues -- they show up before full failures.
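The threshold rule can be sketched against a rolling baseline. This assumes you keep a short window of recent samples and use their median as "normal":

```python
from statistics import median

def latency_alert(sample_ms: float, history_ms, factor: float = 3.0) -> bool:
    """Alert when a sample exceeds `factor` times the median of recent samples."""
    return sample_ms > factor * median(history_ms)

history = [280, 310, 295, 305, 300]   # normal responses around 300 ms
print(latency_alert(600, history))    # False: slow, but under the 900 ms line
print(latency_alert(2500, history))   # True: the early-warning spike
```

Using a median rather than a mean keeps one outlier sample from inflating the baseline and hiding the next spike.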
Step 4: Use API Monitoring for Transaction-Critical Paths
For checkout flows, payment processing, and other critical transactions, set up API monitors that validate the response body, not just the status code. A checkout endpoint that returns HTTP 200 with a JSON body containing an error message is still a failure. API monitoring lets you check for specific content in the response.
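The soft-failure rule looks roughly like this. The field names ("status", "error") are hypothetical; match them to your API's actual response contract:

```python
import json

def is_soft_failure(status: int, body: str) -> bool:
    """Treat a 200 whose JSON body reports an error as a failed check."""
    if status != 200:
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # a 200 with an unparseable body is also suspect
    return payload.get("status") == "error" or "error" in payload

print(is_soft_failure(200, '{"status": "ok", "order_id": 123}'))               # False
print(is_soft_failure(200, '{"status": "error", "message": "card declined"}')) # True
```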
Step 5: Monitor SSL and Domain Separately
SSL and domain issues cause a specific type of intermittent problem: they work fine until they do not, and then they fail catastrophically. Monitor SSL certificate expiration with 30-day, 14-day, and 7-day advance alerts. Monitor domain expiration the same way. These are not intermittent in the traditional sense, but they create sudden, total outages that are entirely preventable.
Step 6: Configure Confirmation Checks
To reduce alert fatigue from false positives, configure confirmation checks. When a check fails, UptyBots immediately runs a follow-up check from a different location before alerting you. This filters out transient network blips while still catching real issues. It is the balance between sensitivity and noise -- you want to detect real intermittent downtime without drowning in false positives.
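The confirmation logic reduces to "alert only if independent locations agree." A sketch with injected check functions (the simulated checkers stand in for real HTTP probes):

```python
def confirmed_down(check_fns, required_failures: int = 2) -> bool:
    """Run location checks in turn; alert only when `required_failures`
    independent locations agree the endpoint is down."""
    failures = 0
    for check in check_fns:
        if check():       # check() returns True when the endpoint is up
            return False  # any passing location cancels the alert
        failures += 1
        if failures >= required_failures:
            return True
    return False

# Simulated checkers: a transient blip in one location vs. a real outage.
blip = [lambda: False, lambda: True]    # first location fails, second passes
real = [lambda: False, lambda: False]   # both locations fail
print(confirmed_down(blip))  # False -> no alert, just a transient blip
print(confirmed_down(real))  # True  -> alert: confirmed from two locations
```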
Deployment-Related Intermittent Downtime
One of the most common and overlooked sources of intermittent downtime is your own deployment process. During a deployment, there is typically a window where old and new versions of your application are running simultaneously. If the deployment involves database migrations, configuration changes, or API contract changes, requests during this window can fail unpredictably.
Common deployment-related intermittent issues:
- Rolling restarts: As servers restart one by one, the remaining servers handle increased load, potentially causing slowdowns or failures.
- Database migrations: Schema changes that lock tables cause queries to queue and timeout during the migration.
- Cache invalidation: A deployment that clears the cache causes a thundering herd of requests to the database until the cache warms up.
- Configuration drift: New servers pick up the new configuration immediately while existing connections use the old configuration until they reconnect.
UptyBots helps you track whether your deployments cause intermittent issues. Check your monitoring dashboard after each deployment to see if response times spiked or any checks failed during the deployment window. Read more about monitoring during deployments to learn how to avoid panic alerts during planned maintenance.
Building an Intermittent Downtime Investigation Playbook
When users report issues that your monitoring does not show, follow this investigation checklist:
- Verify the report: Get the exact time, URL, error message, and location from the reporting user. Screenshots help.
- Check monitoring history: Look at the time range around the reported incident in UptyBots. Even if no alerts fired, check for response time spikes or borderline failures.
- Check from multiple locations: If you are not already using multi-location monitoring, this incident is your sign to enable it.
- Check server logs: Look for 5xx errors, connection timeouts, or out-of-memory events in the reported time window.
- Check load balancer logs: Look for backend server health check failures or connection draining events.
- Check DNS: Verify that DNS resolution is consistent across multiple resolvers (Google 8.8.8.8, Cloudflare 1.1.1.1, your ISP).
- Check CDN logs: Look for edge-specific errors, cache miss spikes, or origin connection failures.
- Check deployment history: Was there a deployment around the time of the reported issue? Even deployments hours earlier can cause delayed effects (memory leaks, cache expiration).
- Increase monitoring frequency: Temporarily set the affected endpoint to 1-minute checks from all available locations. If the issue is intermittent, you need maximum sampling to catch it.
- Set up content checks: If the issue is a soft failure (200 status code with error content), add an API monitor that validates the response body.
The Cost of Undetected Intermittent Downtime
Intermittent issues might seem minor because each individual incident is brief. But their cumulative impact is significant. If your site has three 30-second outages per day that affect 5% of users each time, that is 1.5 minutes of user-facing downtime daily, 45 minutes per month, and 9 hours per year. At scale, this translates to thousands of failed transactions, degraded search rankings, and slow erosion of customer trust. The insidious part is that nobody notices until the damage is already done. Read more about how even minutes of outage affect revenue and why users report issues before monitoring alerts fire.
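The scaling arithmetic above is easy to reproduce for your own outage pattern:

```python
def cumulative_downtime(outages_per_day: float, seconds_each: float):
    """Scale brief daily outages to (minutes/day, minutes/month, hours/year)."""
    daily_min = outages_per_day * seconds_each / 60
    return daily_min, daily_min * 30, daily_min * 365 / 60

daily, monthly, yearly_h = cumulative_downtime(3, 30)
print(f"{daily:.1f} min/day, {monthly:.0f} min/month, {yearly_h:.1f} h/year")
# -> 1.5 min/day, 45 min/month, 9.1 h/year
```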
Summary: Key Takeaways
- Intermittent downtime is real and costly, even if your monitoring dashboard shows 100% uptime.
- Standard monitoring misses brief outages because of sampling frequency limitations -- switch to 1-minute checks.
- The seven root causes (load balancer gaps, DNS issues, connection pool exhaustion, CDN edge failures, race conditions, memory leaks, SSL chain problems) each have specific detection strategies.
- Multi-location monitoring is essential for catching issues that only affect specific regions or network paths.
- Monitor response time, not just availability -- performance degradation is the early warning signal.
- Deployment processes are a common source of intermittent issues -- monitor closely during and after deployments.
- When users report issues that monitoring misses, investigate systematically using the playbook above.
See setup tutorials or get started with UptyBots monitoring today.