By David Kim · Nov 14, 2025

Why Your Monitoring Misses 30-Second Outages: Sampling Theory Applied to Uptime Checks

What happens to a 30-second outage when your monitoring checks every 5 minutes? Most of the time, nothing. The outage starts, affects real users making real requests, and ends. Thirty seconds later, your monitoring probe fires its scheduled check, gets a healthy response, and records "UP." Your dashboard shows 100% uptime. Your support inbox shows three angry tickets.

This is not a bug in your monitoring tool. It is a fundamental property of sampled systems, and it is the same reason a photograph can miss a bird flying through the frame. If you only look once every 300 seconds, events shorter than 300 seconds can exist entirely between your observations. The question is not whether your monitoring is broken. The question is whether your sampling rate is high enough to detect the failures that actually matter to your users.

This post explains the math behind detection probability, the network-layer mechanisms that create short-lived outages, and how to configure monitoring that actually catches them.

The Sampling Problem: Detection Probability as a Function of Check Interval

Let us start with the math, because it is surprisingly simple and it quantifies the problem precisely.

Assume an outage of duration D seconds that starts at a random time. Your monitoring check runs every I seconds. The check is a point sample: it takes one measurement at one instant. What is the probability that at least one check falls within the outage window?

If D >= I, the probability is 100%. The outage is longer than the check interval, so at least one check is guaranteed to fire during the outage.

If D < I, the probability is D / I. A 30-second outage with a 300-second (5-minute) check interval has a 30/300 = 10% chance of being detected. A 10-second outage with the same interval has a 10/300 = 3.3% chance.
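The two cases above collapse into a single expression, min(1, D / I). The sketch below is only an illustration: the function names are made up for this post, and the simulation assumes perfectly periodic checks whose phase is uniformly random relative to the start of the outage.

```python
import random


def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance that at least one periodic point check lands inside the outage."""
    return min(1.0, outage_s / interval_s)


def simulate(outage_s: float, interval_s: float,
             trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate under the same assumptions as the formula."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Time between the start of the outage and the next scheduled check.
        wait_s = rng.uniform(0.0, interval_s)
        if wait_s < outage_s:
            hits += 1
    return hits / trials


print(detection_probability(30, 300))   # 0.1
print(round(simulate(30, 300), 2))      # close to 0.1
```

The simulated value should land within a fraction of a percentage point of the closed-form result, which is a quick sanity check that D / I is the right model for point samples.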

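This is the same intuition behind the Nyquist-Shannon sampling theorem: in signal processing, you must sample at more than twice the highest frequency you want to capture, or fast changes slip between samples. The analogy is loose, though. For point checks, the guarantee condition is simply I <= D: checks every 60 seconds are guaranteed to land inside any outage of 60 seconds or longer. Halving the interval to D/2 is a sensible safety margin, because it puts at least two checks inside the outage: checks every 30 seconds for a 60-second outage, checks every 15 seconds for a 30-second outage. That margin matters when an alert requires a second, confirming failure or when probe timing jitters.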

Here are the detection probabilities across different check intervals and outage durations:

Outage duration    5-minute interval    3-minute interval    1-minute interval
10 seconds         3%                   6%                   17%
30 seconds         10%                  17%                  50%
1 minute           20%                  33%                  100%
2 minutes          40%                  67%                  100%
5 minutes          100%                 100%                 100%

Multi-location monitoring improves these odds. If you check from 3 independent locations, each with a 1-minute interval, you effectively have 3 checks per minute. If the three schedules are unsynchronized, a 30-second outage has roughly a 1 - (1 - 0.5)^3 = 87.5% chance of being caught by at least one location. If the checks happen to be evenly staggered 20 seconds apart, any outage of 20 seconds or longer is caught. This is why UptyBots defaults to 1-minute intervals and supports multi-location checks: the combination pushes detection probability toward near-certainty for outages lasting 20 seconds or more.
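How much the extra locations help depends on how their schedules line up. Here is a small sketch of two idealized cases: locations whose check phases are independent of each other, and locations whose checks are evenly staggered across the interval. Both function names are illustrative, and real probe schedules will fall somewhere between the two models.

```python
def independent_locations(outage_s: float, interval_s: float, locations: int) -> float:
    """Detection chance when each location's check phase is independent."""
    p_single = min(1.0, outage_s / interval_s)
    return 1.0 - (1.0 - p_single) ** locations


def staggered_locations(outage_s: float, interval_s: float, locations: int) -> float:
    """Detection chance when checks are spread evenly across the interval."""
    effective_interval_s = interval_s / locations
    return min(1.0, outage_s / effective_interval_s)


print(independent_locations(30, 60, 3))            # 0.875
print(round(independent_locations(20, 60, 3), 3))  # 0.704
print(staggered_locations(20, 60, 3))              # 1.0
```

The gap between 0.704 and 1.0 for a 20-second outage shows why the spacing of checks matters as much as the number of locations.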

The Network Mechanisms Behind Ghost Downtimes

Intermittent outages are not random glitches. They have specific, identifiable causes rooted in how TCP/IP, DNS, CDNs, and load balancers work. Each mechanism creates a characteristic failure pattern.

TCP Connection Timeouts and Half-Open Connections

When a server becomes unresponsive (but does not crash), TCP connections enter a problematic state. The client sends data, the server does not acknowledge it, and TCP's retransmission mechanism kicks in. The default TCP retransmission behavior (on Linux, controlled by tcp_retries2, default value 15) retries with exponential backoff: 200ms, 400ms, 800ms, 1.6s, 3.2s, and so on. The total time before TCP gives up on a connection is typically 13-30 minutes, depending on the OS.
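To see where a figure in the range of minutes comes from, you can add up the backoff schedule. This sketch assumes Linux behavior: an initial retransmission timeout of 200ms (the real value depends on the measured round-trip time), a doubling backoff capped at 120 seconds, and the default tcp_retries2 of 15. The constant names are illustrative, not kernel identifiers.

```python
RTO_INITIAL_MS = 200     # assumed starting timeout; in practice derived from RTT
RTO_CAP_MS = 120_000     # assumed backoff ceiling of 120 seconds (Linux)
TCP_RETRIES2 = 15        # Linux default for net.ipv4.tcp_retries2


def backoff_schedule_ms(retries: int = TCP_RETRIES2) -> list:
    """Waits before each retransmission, plus the final wait after the last one."""
    return [min(RTO_INITIAL_MS * 2 ** n, RTO_CAP_MS) for n in range(retries + 1)]


schedule = backoff_schedule_ms()
total_ms = sum(schedule)

print(schedule[:5])            # [200, 400, 800, 1600, 3200]
print(total_ms / 1000)         # 924.6
print(round(total_ms / 60_000, 1))  # 15.4
```

Under these assumptions the connection lingers for about 15 minutes before the kernel reports an error, and longer when the starting timeout is higher than 200ms.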

During this period, the connection is technically "alive" from the client's perspective (no error has been returned), but no data is flowing. New connections may fail immediately (if the server's listen queue is full and the kernel is configured to reset overflowing connections, it responds with RST; Linux does this only when tcp_abort_on_overflow is enabled) or hang while the handshake is dropped and retried (if the server is not processing the backlog). The result: some existing users see frozen pages while new visitors get connection timeouts. Your monitoring probe, which opens a fresh connection, might get an RST immediately (which it detects as an error) or might happen to connect during a brief window when the server's listen queue has space (which it reports as healthy).

The HTTP-level timeout (how long your monitoring probe waits for a response) is typically much shorter than TCP's built-in timeout. UptyBots uses configurable timeouts that are tuned to detect unresponsive servers within seconds, not minutes.
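A probe with its own short timeout can be sketched with nothing but the Python standard library. This is not how UptyBots implements its checks; the probe function, its return shape, and the 5-second default are assumptions for illustration. Note that urllib applies the timeout to each blocking socket operation, so the total elapsed time can exceed it slightly.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Fetch the URL once and report status, elapsed time, and any error."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status, error = exc.code, str(exc)
    except OSError as exc:
        # Covers refused connections, DNS failures, and socket timeouts.
        error = str(exc)
    elapsed_s = time.monotonic() - started
    up = status is not None and 200 <= status < 400
    return {"up": up, "status": status, "elapsed_s": elapsed_s, "error": error}


# Example call (hypothetical URL):
# print(probe("https://example.com/health", timeout_s=5.0))
```

A server that accepts the connection but never answers is reported as down after roughly timeout_s seconds, instead of waiting for TCP to give up.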

DNS TTL-Driven Partial Outages

DNS records have a TTL (Time To Live) value, typically between 60 and 3600 seconds. When you change a DNS record (new IP address, failover to a backup server, CDN migration), the change propagates at the speed of TTL expiration, not instantly.

Here is the timeline of a DNS-driven partial outage:

  1. You change your A record from IP 1.2.3.4 (old server) to 5.6.7.8 (new server). TTL is 300 seconds (5 minutes).
  2. At T+0: DNS resolvers that just cached the record will serve 1.2.3.4 for the next 5 minutes. Users whose resolvers cached recently still go to the old server.
  3. At T+60: Some resolvers' caches have expired. They re-query and get 5.6.7.8. Their users now go to the new server.
  4. At T+300: Most resolvers have updated. But some resolvers honor "minimum TTL" policies (some ISP resolvers enforce a minimum TTL of 300s or even 3600s regardless of what you set).
  5. At T+3600+: Stubborn resolvers finally update. But Google DNS (8.8.8.8) and Cloudflare DNS (1.1.1.1) typically respect TTL accurately.
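The timeline above can be turned into a rough estimate of how many resolvers still serve the old address. The model below is a simplification and an assumption of this sketch: every resolver honors the TTL exactly, and cache entries were created at uniformly random moments before the change, so they expire evenly over one TTL. Resolvers that enforce a minimum TTL would stretch the tail beyond what this shows.

```python
def stale_fraction(seconds_since_change: float, ttl_s: float) -> float:
    """Estimated share of TTL-honoring resolver caches still holding the old record."""
    return max(0.0, 1.0 - seconds_since_change / ttl_s)


for t in (0, 60, 150, 300):
    share = stale_fraction(t, ttl_s=300)
    print(f"T+{t}: about {share:.0%} of caches still point to the old IP")
```

With a 300-second TTL, the estimate falls from 100% at T+0 to 80% at T+60 and reaches zero at T+300, which matches the shape of the timeline.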

If the old server at 1.2.3.4 is being decommissioned or is unhealthy, users whose resolvers still point to 1.2.3.4 experience downtime. Meanwhile, users on resolvers that have updated see the new server working perfectly. Your monitoring probe uses a specific resolver; if that resolver has already updated, monitoring shows 100% uptime while a chunk of your user base cannot reach your site.

Multi-location monitoring mitigates this because different monitoring locations use different DNS resolvers. If the probe in Frankfurt resolves to the old IP while the probe in Virginia resolves to the new IP, the discrepancy is visible in the dashboard. For a deep dive, see our guide on why your website appears down only in certain countries.

Load Balancer Health Check Lag

Load balancers check backend server health on their own schedule, typically every 10-30 seconds. When a backend server starts failing, there is a detection window: the time between when the server begins returning errors and when the load balancer marks it as unhealthy and stops routing traffic to it.

Most load balancers require multiple consecutive failed health checks before removing a server. If the health check interval is 10 seconds and the failure threshold is 3, it takes between 20 and 30 seconds for a failing server to be removed from the pool, depending on where in the check cycle the failure begins (plus any health check timeout). During that window, roughly 1/N of all requests (where N is the number of backend servers) hit the failing server and return errors. The other (N-1)/N requests succeed.

With 4 backend servers, 25% of requests fail for up to 30 seconds. That is a partial outage affecting one quarter of your traffic. Even if your single-location monitoring probe happens to send a request during that window, it has only a 25% chance of hitting the failing server. If it hits a healthy server, the check passes and you never know anything was wrong.
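The two effects multiply: the probe has to fire during the window, and then it has to be routed to the bad backend. A rough sketch, assuming traffic is spread evenly across backends and the window is no longer than the probe interval (the function names are illustrative):

```python
def removal_window_s(check_interval_s: float, failure_threshold: int) -> float:
    """Upper bound on how long a failing backend keeps receiving traffic."""
    return check_interval_s * failure_threshold


def probe_sees_failure(window_s: float, probe_interval_s: float, backends: int) -> float:
    """Chance that one probe lands in the window AND is routed to the failing backend."""
    p_in_window = min(1.0, window_s / probe_interval_s)
    p_bad_backend = 1.0 / backends
    return p_in_window * p_bad_backend


window = removal_window_s(check_interval_s=10, failure_threshold=3)
print(window)                                  # 30
print(probe_sees_failure(window, 60, 4))       # 0.125
print(probe_sees_failure(window, 300, 4))      # 0.025
```

Under these assumptions, a 1-minute probe against the public endpoint notices the incident about one time in eight, and a 5-minute probe about one time in forty.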

The fix: monitor individual backend servers, not just the load balancer's public endpoint. If each backend has its own health check in UptyBots, you detect the failing server immediately, even if the load balancer has not caught it yet.

CDN Edge Node Failures

CDNs route users to the geographically nearest edge node via anycast DNS or latency-based routing. Each edge node is an independent server (or cluster). If one edge node fails or serves stale/broken content, only users routed to that specific node are affected. Users routed to other nodes see a perfectly working site.

I once debugged an issue where a CDN edge node in Singapore was serving a cached 502 error page. Users in Southeast Asia saw an error page. Users in Europe and North America saw the site working normally. Our monitoring, running from a single location in Virginia, reported 100% uptime for the entire duration. It was not until a customer in Malaysia sent a screenshot that we even knew there was a problem.

Multi-location monitoring is the only way to catch this class of failure. If UptyBots checks from Singapore and gets an error while the check from Virginia succeeds, you know immediately that a specific CDN edge is the problem.
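The comparison logic is simple enough to sketch. Given one pass/fail result per probe location, the hypothetical helper below separates a global outage from a regional one. The location names and the result format are assumptions for illustration, not a UptyBots data structure.

```python
def classify(results: dict) -> str:
    """Summarize per-location check results (True means the check passed)."""
    failed = sorted(location for location, ok in results.items() if not ok)
    if not failed:
        return "up"
    if len(failed) == len(results):
        return "down everywhere"
    return "partial outage: " + ", ".join(failed)


print(classify({"virginia": True, "frankfurt": True, "singapore": False}))
# partial outage: singapore
```

A "partial outage" result naming a single region is the signature of an edge node or regional routing problem, not an origin failure.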

TCP Connection Pool Exhaustion

Application servers maintain connection pools for databases, Redis, external APIs, and internal microservices. Each pool has a maximum size (e.g., 20 PostgreSQL connections, 50 Redis connections). Under normal load, connections are reused: a request grabs a connection from the pool, uses it, and returns it. At peak load, all connections are in use. New requests wait in a queue. If the queue fills up, requests start failing with "connection pool exhausted" or "too many connections" errors.

The timing pattern is characteristic: load builds gradually, connection pool utilization climbs from 50% to 80% to 95%, then crosses 100% and requests start failing. The failure period lasts until load decreases (users give up, traffic subsides naturally) or until the pool recovers (slow queries finish and return connections). This entire cycle might last 20-60 seconds.

From a monitoring perspective, response time is the leading indicator. UptyBots tracks response time for every check. A normal response time of 200ms that suddenly spikes to 5000ms is a clear signal that the server is under pressure, even if the status code is still 200. By the time the status code changes to 503 or 504, the problem has been building for a while.

SSL/TLS Certificate Chain Issues

TLS certificate chain issues create one of the most confusing intermittent failures. Your server sends its leaf certificate, but fails to include intermediate certificates in the chain. Some clients work fine because they have cached the intermediate certificate from a previous connection (or the intermediate is in their local trust store). Other clients, making their first connection, cannot build a chain of trust and fail with a TLS handshake error.

The result: existing users with cached intermediates browse the site normally. New visitors or users who cleared their browser data get SSL errors. Mobile browsers are particularly susceptible because they often have smaller certificate caches than desktop browsers.

This type of failure is invisible to basic HTTP monitoring because the monitoring probe might also have cached the intermediate. Dedicated SSL monitoring that validates the full certificate chain, including intermediate certificate delivery, catches this before users report it. UptyBots's SSL checks validate the entire chain, not just expiration dates.

Memory Leaks and Garbage Collection Storms

Memory leaks create a sawtooth pattern: memory usage climbs gradually over hours or days, hits a critical threshold, and then either the application crashes (triggering a restart by a process manager) or the garbage collector runs a major collection that freezes the process for 0.5-5 seconds. During a GC pause, the server stops processing requests entirely. New TCP connections queue in the kernel's listen backlog. If the pause is long enough, the proxy times out and returns 502 or 504.

The GC pause pattern is particularly hard to catch with infrequent monitoring because the pauses are short (0.5-5 seconds) and happen at unpredictable intervals. But they happen regularly enough that users notice: "the site freezes for a second every few minutes." With 1-minute monitoring, you might catch one of these pauses every hour. With response time tracking, you can observe the gradual increase in response time that precedes the crash/restart cycle.

Configuring UptyBots to Catch Intermittent Issues

Step 1: Set 1-Minute Check Intervals for Critical Endpoints

Your homepage, login page, checkout page, and primary API endpoints should be checked every 60 seconds. The math is simple: a 1-minute interval guarantees detection of any outage lasting 60 seconds or longer. Combined with multi-location checks, you catch most outages lasting 20 seconds or more.

Do not check everything at 1-minute intervals. Static asset pages, blog posts, and low-traffic endpoints can use 3-minute or 5-minute intervals without significant risk. Prioritize the paths where failures directly cost you money or users.

Step 2: Enable Multi-Location Checks

Each monitoring location runs its own independent check on its own DNS resolver through its own network path. Three locations checking every minute means three independent samples per minute. This increases detection probability for short outages and catches geographically localized failures (CDN edge problems, regional network outages, DNS propagation inconsistencies) that single-location monitoring cannot see.

Step 3: Set Response Time Thresholds

Response time degradation is the canary in the coal mine. A server that normally responds in 200ms but is now taking 3000ms is about to start failing. Set a response time threshold at approximately 3x your normal average. If your baseline is 250ms, alert at 750ms. This gives you a warning period before full failure, often 30-60 seconds of advance notice that lets you investigate while the site is still technically up.
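The 3x rule is easy to express in code. This is a minimal sketch, assuming you keep a list of recent healthy response times in milliseconds; the function names and the choice of a plain average are illustrative.

```python
from statistics import mean


def latency_threshold_ms(baseline_samples_ms: list, multiplier: float = 3.0) -> float:
    """Alert threshold as a multiple of the average baseline response time."""
    return multiplier * mean(baseline_samples_ms)


def is_degraded(latest_ms: float, baseline_samples_ms: list, multiplier: float = 3.0) -> bool:
    """True when the latest response time exceeds the threshold."""
    return latest_ms > latency_threshold_ms(baseline_samples_ms, multiplier)


baseline = [240, 250, 260, 250]
print(latency_threshold_ms(baseline))   # 750.0
print(is_degraded(400, baseline))       # False
print(is_degraded(3000, baseline))      # True
```

If your baseline has occasional outliers, a median may be a steadier reference than the mean; that is a tuning choice beyond the rule described here.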

Step 4: Use API Monitoring for Transaction-Critical Paths

HTTP status code monitoring only tells you "the server responded with 200" or "the server responded with 500." It does not tell you whether the response content was correct. An API endpoint that returns {"status": "error", "message": "database unavailable"} with HTTP 200 is a failure that HTTP monitoring misses.

UptyBots's API monitoring lets you validate response body content. For checkout endpoints, verify that the response contains expected fields. For authentication endpoints, verify that the response format is correct. This catches "soft failures" where the server is technically responsive but functionally broken.
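To make the idea concrete, here is a sketch of a body check that treats the example above as a failure even though the status code is 200. The function name, the "status" and "message" keys, and the required field "order_id" are assumptions for illustration; substitute whatever your API actually returns.

```python
import json


def check_api_response(status_code: int, body: str, required_fields=("order_id",)):
    """Return (ok, reason) for one API response."""
    if status_code != 200:
        return False, f"HTTP {status_code}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    if payload.get("status") == "error":
        return False, payload.get("message", "error status in body")
    missing = [field for field in required_fields if field not in payload]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    return True, "ok"


soft_failure = '{"status": "error", "message": "database unavailable"}'
print(check_api_response(200, soft_failure))
# (False, 'database unavailable')
print(check_api_response(200, '{"status": "ok", "order_id": 42}'))
# (True, 'ok')
```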

Step 5: Monitor SSL Certificates and Domains Separately

SSL certificate expiration and domain name expiration are not intermittent in the traditional sense. They create absolute, total outages on a known date. But they are listed here because they are the most preventable form of downtime, and because the warning period is long (weeks to months) if you are monitoring them. Set alerts for 30 days, 14 days, and 7 days before expiration. There is no reason for an SSL certificate or domain to expire unexpectedly in 2026.
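The alert schedule itself is a few lines of date arithmetic. In this sketch the expiry timestamp is passed in directly; how you obtain it (from the certificate or from your registrar) is left out, and the function name is illustrative.

```python
from datetime import datetime, timezone

ALERT_DAYS = (30, 14, 7)


def due_alerts(expires_at: datetime, now: datetime, thresholds=ALERT_DAYS) -> list:
    """Thresholds (in days) that have already been crossed for this expiry date."""
    remaining_days = (expires_at - now).days
    return [days for days in thresholds if remaining_days <= days]


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(due_alerts(datetime(2026, 3, 1, tzinfo=timezone.utc), now))   # []
print(due_alerts(datetime(2026, 1, 21, tzinfo=timezone.utc), now))  # [30]
print(due_alerts(datetime(2026, 1, 6, tzinfo=timezone.utc), now))   # [30, 14, 7]
```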

Step 6: Configure Confirmation Checks

When a check fails, UptyBots can immediately run a confirmation check from a different location before alerting you. This filters out transient network blips (a lost packet between the monitoring node and your server, a momentary DNS resolver glitch) that would otherwise generate false positives. The confirmation check adds a few seconds of delay before alerting but dramatically reduces noise, preventing alert fatigue.
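The decision rule can be sketched as follows. This is an illustration of the general idea, not UptyBots's internal logic: each check is modeled as a function that returns True when the endpoint is healthy, and an alert fires only when the primary check and at least one confirming check both fail.

```python
from typing import Callable, Sequence

Check = Callable[[], bool]  # returns True when the endpoint looks healthy


def confirmed_down(primary: Check, confirmers: Sequence[Check]) -> bool:
    """True only if the primary check fails and a confirming location also fails."""
    if primary():
        return False
    # Confirmation checks run only after a primary failure.
    return any(not check() for check in confirmers)


# A blip seen from one location only does not alert:
print(confirmed_down(lambda: False, [lambda: True]))    # False
# A failure confirmed from a second location does:
print(confirmed_down(lambda: False, [lambda: False]))   # True
```

One consequence of this rule is that a failure visible from a single region will not alert on its own, so per-location results are still worth reviewing for the regional failures described earlier.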

Deployment-Related Intermittent Downtime

Deployments are one of the most common sources of brief, partial outages. Unless you have true zero-downtime deployment, there is a window where your application is in a transitional state. The specific failure mode depends on your deployment strategy:

  • Process restart. The simplest deployment: stop the old process, start the new one. The gap between stop and start is a complete outage. If the restart takes 5 seconds, you have a 5-second outage. The proxy returns 502 during this window because the upstream is not listening on its socket.
  • Rolling restart. Servers restart one by one. While each server restarts, the remaining servers handle its share of traffic. If you have 4 servers and one is restarting, each of the other 3 handles 133% of its normal load. If they are already at 80% capacity, they are now at about 107%, which may push them into connection pool exhaustion territory.
  • Database migration locks. Many forms of ALTER TABLE acquire an exclusive lock that blocks all reads and writes on the table. In PostgreSQL, even adding a column with a default value required a table rewrite (and therefore an exclusive lock) until version 11. If the migration takes 30 seconds on a large table, all queries against that table queue up for 30 seconds. The proxy times out waiting, and you see 504 errors.
  • Cache stampede. A deployment clears the cache. Every request that was previously served from cache now hits the database or application backend. If your cache hit rate was 95%, your backend load increases by 20x (5% of requests become 100%). The backend cannot handle this sudden spike, and requests fail until the cache warms back up.
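The arithmetic behind the rolling restart and cache stampede examples is worth running against your own numbers. A small sketch, with illustrative function names and the assumption that load spreads evenly across the servers that remain:

```python
def utilization_during_restart(servers: int, utilization: float, restarting: int = 1) -> float:
    """Per-server utilization while some servers are out of rotation."""
    return utilization * servers / (servers - restarting)


def backend_load_multiplier(cache_hit_rate: float) -> float:
    """How much backend traffic grows when a warm cache is emptied."""
    return 1.0 / (1.0 - cache_hit_rate)


print(round(utilization_during_restart(4, 0.80), 3))   # 1.067
print(round(backend_load_multiplier(0.95), 6))         # 20.0
print(round(backend_load_multiplier(0.99), 6))         # 100.0
```

The second function shows why very high hit rates are fragile: moving from 95% to 99% turns a 20x spike on cache loss into a 100x spike.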

UptyBots helps you correlate deployments with monitoring data. After each deployment, check the monitoring dashboard for response time spikes or failed checks during the deployment window. Over time, this data helps you quantify how much downtime your deployment process causes and whether improvements (blue-green deployments, gradual rollouts, online schema migrations) are reducing it. See our guide on monitoring during deployments for more.

Building an Investigation Playbook for Phantom Outages

When users report problems that your monitoring does not show, do not dismiss the reports. Follow this sequence:

  1. Collect specifics from the reporter. Exact time (to the minute), URL, error message or screenshot, their geographic location, and their ISP if possible. "It was broken earlier today" is not actionable. "At 14:23 UTC, https://app.example.com/dashboard returned a white page from my office in Berlin (Telekom ISP)" is actionable.
  2. Check monitoring history for the reported time window. Even if no alert fired, look at the raw check data. Was there a response time spike? Did any single-location check fail even once? UptyBots's historical data shows individual check results, not just aggregated uptime percentages.
  3. Check server-side logs for the reported time. Application error logs, reverse proxy access logs (look for 5xx responses), database slow query logs, and kernel logs (dmesg for OOM kills). Filter by the specific URL the user reported.
  4. Check DNS resolution from multiple resolvers. Run dig @8.8.8.8 yourdomain.com, dig @1.1.1.1 yourdomain.com, dig @9.9.9.9 yourdomain.com. If any returns a different IP and you are not deliberately using round-robin or geo-based DNS, you have a DNS consistency problem.
  5. Check load balancer and CDN logs. Look for backend health check failures, 5xx error rates from specific edge nodes, or connection draining events during the reported time.
  6. Increase monitoring frequency temporarily. Set the affected endpoint to 1-minute checks from all available locations. If the issue is recurring, higher-frequency monitoring will catch the next occurrence.
  7. Add response body validation. If the reported error was a soft failure (page loaded but showed an error message), add an API monitor that checks for specific content in the response. This catches failures that HTTP status code monitoring misses entirely.
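The DNS comparison in step 4 is a good candidate for a small script, since it is tedious to do by eye. This sketch shells out to dig with its +short flag and groups resolvers by the answers they return. It assumes dig is installed and on the PATH; the function names are illustrative.

```python
import subprocess

RESOLVERS = ("8.8.8.8", "1.1.1.1", "9.9.9.9")


def dig_short(domain: str, resolver: str, timeout_s: float = 5.0) -> frozenset:
    """Answer lines printed by `dig +short` for an A query against one resolver."""
    result = subprocess.run(
        ["dig", "+short", f"@{resolver}", domain, "A"],
        capture_output=True, text=True, timeout=timeout_s, check=False,
    )
    return frozenset(line.strip() for line in result.stdout.splitlines() if line.strip())


def group_by_answer(answers: dict) -> dict:
    """Map each distinct answer set to the resolvers that returned it."""
    groups = {}
    for resolver, ips in answers.items():
        groups.setdefault(ips, []).append(resolver)
    return groups


# Example with canned answers; replace with
# {r: dig_short("yourdomain.com", r) for r in RESOLVERS} for a live check.
answers = {
    "8.8.8.8": frozenset({"5.6.7.8"}),
    "1.1.1.1": frozenset({"5.6.7.8"}),
    "9.9.9.9": frozenset({"1.2.3.4"}),
}
groups = group_by_answer(answers)
print(len(groups) > 1)   # True: the resolvers disagree
```

More than one group means the resolvers disagree, which is the cue to compare the answers against your intended records.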

The Cumulative Cost of Undetected Intermittent Downtime

Each individual intermittent outage seems insignificant. A 30-second blip. A few failed requests. But compound these over time and the impact is real.

Assume your site has three 30-second partial outages per day, each affecting 10% of requests. That is 90 seconds of partial downtime daily. Over a month: 45 minutes of user-facing failures. Over a year: 9 hours. If your site serves 1000 requests per minute and 10% fail for 90 seconds per day, that is 150 failed requests per day, 4500 per month, 54,000 per year. At a 2% conversion rate and $50 average order, that is up to roughly $54,000 a year in lost orders (54,000 x 2% x $50), if every failed request is counted as a lost chance to convert.
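These figures are easy to recompute for your own traffic. The sketch below uses the numbers from this example, with 30-day months to match the rounding above. The revenue line is a ceiling, not a forecast: it assumes every failed request was an independent chance to convert, which is an assumption of this sketch.

```python
OUTAGES_PER_DAY = 3
OUTAGE_SECONDS = 30
AFFECTED_PERCENT = 10
REQUESTS_PER_MINUTE = 1000
CONVERSION_PERCENT = 2
AVERAGE_ORDER_USD = 50

downtime_s_per_day = OUTAGES_PER_DAY * OUTAGE_SECONDS                     # 90
requests_in_window = REQUESTS_PER_MINUTE * downtime_s_per_day // 60       # 1500
failed_per_day = requests_in_window * AFFECTED_PERCENT // 100             # 150
failed_per_month = failed_per_day * 30                                    # 4500
failed_per_year = failed_per_month * 12                                   # 54000

# Upper bound: treats each failed request as a lost chance to convert.
revenue_ceiling_usd = failed_per_year * CONVERSION_PERCENT * AVERAGE_ORDER_USD // 100

print(failed_per_day, failed_per_month, failed_per_year)   # 150 4500 54000
print(revenue_ceiling_usd)                                 # 54000
```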

Search engines notice too. If Googlebot hits your site during a 30-second outage and gets a 503 or 500, that URL's crawl priority drops. If it happens repeatedly (remember, Googlebot makes thousands of requests per day to popular sites), the cumulative effect on crawl budget and indexing is measurable. Read more about how minutes of outage affect revenue and why users report issues before monitoring alerts fire.

The insidious aspect is that nobody connects the dots. Support sees isolated tickets. Engineering sees green dashboards. Revenue dips are attributed to seasonality or marketing changes. Without monitoring that can actually detect these short outages, the root cause stays hidden.

Summary: Key Takeaways

  • Detection probability for an outage of duration D with check interval I is D/I (when D < I). A 30-second outage has only a 10% chance of being caught by a 5-minute check.
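  • Multi-location monitoring multiplies your effective sampling rate. Three locations at 1-minute intervals catch a 30-second outage about 87.5% of the time when their schedules are unsynchronized, and catch any outage of 20+ seconds when their checks are evenly staggered.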
  • Ghost downtimes have specific causes: TCP connection timeouts, DNS TTL propagation, load balancer health check lag, CDN edge failures, connection pool exhaustion, TLS chain issues, and GC pauses. Each has a distinct signature and a distinct detection strategy.
  • Response time monitoring is the leading indicator. Performance degradation often precedes full failure by 30-60 seconds.
  • API monitoring catches soft failures (200 status code with error content) that HTTP monitoring misses entirely.
  • Deployments are a primary source of brief outages. Monitor through every deployment and correlate the data.
  • When users report issues that monitoring misses, investigate systematically: collect specifics, check all log sources, verify DNS consistency, and increase monitoring frequency.

See setup tutorials or get started with UptyBots monitoring today.
