10 Most Common Monitoring Errors and How to Fix Them

When you set up uptime monitoring for the first time, it is common to encounter errors that seem confusing or alarming. Some of these errors indicate real problems with your website or server, while others are caused by misconfiguration, network conditions, or overly sensitive alert settings. Understanding the difference is critical to maintaining reliable monitoring without drowning in false alerts.

In this comprehensive guide, we break down the 10 most common monitoring errors, explain exactly what causes each one, and provide step-by-step instructions for fixing them. Whether you are a developer, system administrator, or business owner managing your first website, this guide will help you troubleshoot monitoring issues with confidence.

1. Unknown Host (DNS Resolution Failure)

What It Means

The monitoring system tried to resolve your domain name (like example.com) into an IP address, but DNS resolution failed. The domain name could not be translated into a server address, so no connection was even attempted.

Common Causes

  • Typo in the domain name: A simple misspelling like exmple.com instead of example.com will cause this error every time.
  • DNS records not configured: If you recently registered a domain or changed DNS providers, the A or CNAME records may not be set up yet.
  • DNS propagation delay: After changing DNS records, resolvers keep serving cached answers until the old records' TTL expires, so the change can take anywhere from a few minutes to 48 hours to be visible everywhere. During this window, some monitoring locations may see the old records (or no records at all).
  • DNS provider outage: If your DNS provider (like Cloudflare, Route53, or your registrar DNS) is experiencing an outage, domains hosted there will fail to resolve.
  • Expired domain: If your domain registration has expired and the registrar has suspended it, DNS resolution will fail. This is more common than most people think -- many businesses have lost their domains due to missed renewal notices.

How to Fix It

  1. Double-check the domain name in your monitoring configuration for typos.
  2. Verify DNS records using tools like dig example.com or nslookup example.com from a terminal.
  3. Check your domain registrar to confirm the domain is active and not expired.
  4. If you recently changed DNS settings, wait for propagation. Use dig example.com @8.8.8.8 to check Google DNS specifically.
  5. Consider setting up multi-location monitoring to detect regional DNS issues.
  6. Add domain expiration monitoring to catch renewal issues before they cause outages.
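
The lookup in step 2 can also be scripted. The following is a minimal Python sketch, not how any particular monitoring service works: check_dns is a hypothetical helper name, and the resolver parameter exists only so the function can be exercised without network access.

```python
import socket


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Return (True, [addresses]) on success or (False, error text) on failure."""
    try:
        # Port 443 is arbitrary here; only the name lookup matters.
        infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # gaierror is what the standard resolver raises when a name cannot be resolved.
        return False, f"DNS resolution failed: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

Calling check_dns("example.com") on a machine with working DNS returns the resolved addresses; a typo in the domain or an expired registration would surface as the (False, ...) branch.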

2. Connection Timeout

What It Means

The monitoring system resolved the domain successfully and attempted to connect to the server, but the server did not respond within the allowed time window. The connection attempt was started but never completed.

Common Causes

  • Server overload: The server is too busy processing existing requests to accept new connections. This happens during traffic spikes, DDoS attacks, or when a runaway process consumes all resources.
  • Firewall blocking: A firewall rule is silently dropping incoming packets from the monitoring IP address. Unlike "Connection Refused" (which actively rejects), a firewall drop causes the connection to hang until timeout.
  • Network routing issues: Packets from the monitoring location cannot reach the server due to routing problems at an intermediate network hop.
  • Server is hung rather than cleanly shut down: The physical or virtual server is unresponsive (kernel panic, hardware failure, hypervisor issue), so nothing answers the connection attempt and it runs until the timeout expires.
  • Wrong IP address: DNS resolves to an incorrect IP address where no server is listening, or an old IP address from a previous hosting provider.

How to Fix It

  1. Check if the server is reachable from your own network: telnet example.com 80 or curl -v --connect-timeout 10 https://example.com.
  2. Verify that your hosting provider or cloud platform is not experiencing an outage.
  3. Check firewall rules to ensure monitoring IPs are not being blocked. UptyBots publishes the IP addresses used by its monitoring bots so you can whitelist them.
  4. Review server resource usage (CPU, RAM, open connections) to check for overload.
  5. If timeouts occur only from certain locations, the problem is likely network-related rather than server-related.

3. Connection Refused

What It Means

The monitoring system reached the server, but the server actively refused the TCP connection. This is different from a timeout -- the server is alive and reachable, but it is explicitly rejecting connections on the requested port.

Common Causes

  • Service not running: The web server (Apache, Nginx), application server (PHP-FPM, Gunicorn), or other service is stopped or has crashed.
  • Wrong port: The monitoring check is configured for a port where no service is listening. For example, checking port 80 when the server only listens on port 443.
  • Service restart: During a brief period when a service is being restarted, connections will be refused. This is normal and should resolve within seconds.
  • Host-based firewall rejection: Unlike a silent drop (which causes timeout), some firewall configurations actively send a TCP RST (reset) packet to refuse the connection.

How to Fix It

  1. SSH into the server and check if the service is running: systemctl status nginx or systemctl status apache2.
  2. Verify which ports are listening: ss -tlnp or netstat -tlnp.
  3. If the service has crashed, check its logs for the crash reason, fix the issue, and restart.
  4. Confirm that the monitoring configuration targets the correct port.
  5. Set your monitoring to retry before alerting to avoid false positives during brief service restarts.
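
The difference between a timeout and a refusal shows up directly in the exception a TCP client receives. Here is a small Python sketch of that distinction; classify_connect and its connect parameter are illustrative names (the parameter only makes the function testable offline), and the 10-second default is an assumption, not a recommended value.

```python
import socket


def classify_connect(host, port, timeout=10.0, connect=socket.create_connection):
    """Attempt one TCP connection and name the outcome."""
    try:
        conn = connect((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # The host answered with a TCP RST: reachable, but nothing accepts on this port.
        return "refused"
    except socket.timeout:
        # No answer within the window: overload, silent firewall drop, or a hung host.
        return "timeout"
    except socket.gaierror:
        # The name never resolved, so no connection was attempted.
        return "unknown_host"
    except OSError as exc:
        # Anything else (for example, network unreachable) is reported verbatim.
        return f"error: {exc}"
    conn.close()
    return "open"
```

A "refused" result points toward the service or port checks in this section, while "timeout" points back to the firewall and resource checks in the previous one.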

4. Invalid Port / Port Closed

What It Means

The monitoring system attempted to connect to a specific TCP port, but that port is not open on the target server. This is particularly relevant for TCP/Port monitoring, where you are checking specific services like databases, mail servers, or custom application ports.

Common Causes

  • Service not installed or not started: The service that should listen on that port (e.g., MySQL on 3306, SMTP on 25) is not running.
  • Firewall rules blocking the port: The server firewall (iptables, ufw, firewalld) is configured to block incoming connections on that port.
  • Service bound to localhost only: The service is running but is configured to listen only on 127.0.0.1 (localhost), not on the external IP address. This is a common security configuration for databases.
  • Port changed: The service was reconfigured to use a different port (e.g., SSH moved from 22 to 2222 for security).
  • Cloud security group: On AWS, GCP, or Azure, the security group or network access rules may not allow traffic on that port.

How to Fix It

  1. Verify the service is running and listening on the expected port: ss -tlnp | grep :3306.
  2. Check firewall rules: sudo ufw status (Ubuntu) or sudo iptables -L -n.
  3. If using a cloud provider, review the security group or firewall rules in the cloud console.
  4. Confirm the service is listening on all interfaces (0.0.0.0) or on the specific external IP, not just localhost.
  5. Update the monitoring configuration if the port has intentionally changed.
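
Step 4 is easy to miss when reading listener output by eye. As a rough Python sketch, the function below scans text in the column layout that ss -tln typically prints and reports ports that are reachable only via loopback. The SAMPLE text is made up for illustration, and loopback_only_ports is a hypothetical helper; real output may differ between versions.

```python
def loopback_only_ports(ss_output):
    """Return the sorted ports whose only listeners are bound to loopback addresses."""
    loopback, external = set(), set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port ...
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        address, _, port = fields[3].rpartition(":")
        if not port.isdigit():
            continue
        address = address.strip("[]")  # IPv6 addresses are printed in brackets
        if address.startswith("127.") or address == "::1":
            loopback.add(int(port))
        else:
            external.add(int(port))
    return sorted(loopback - external)


# Illustrative sample only; not captured from a real server.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      151    127.0.0.1:3306     0.0.0.0:*
LISTEN 0      511    0.0.0.0:80         0.0.0.0:*
LISTEN 0      128    [::]:22            [::]:*
"""

print(loopback_only_ports(SAMPLE))  # [3306]
```

A port in the result is open locally but will look closed to any external monitor, which is often intentional for databases.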

5. SSL Certificate Errors

What It Means

The monitoring check detected a problem with the SSL/TLS certificate on your website. This can range from an expired certificate to a domain name mismatch or an untrusted certificate authority.

Common Causes

  • Expired certificate: The most common SSL error. Let's Encrypt certificates expire after 90 days, and publicly trusted commercial certificates are limited to roughly 13 months or less (multi-year certificates have not been issued since 2020). If auto-renewal fails silently, the certificate expires and all HTTPS visitors see a security warning.
  • Domain name mismatch: The certificate was issued for www.example.com but the site is accessed at example.com (or vice versa), or the certificate was issued for a completely different domain.
  • Incomplete certificate chain: The server sends the leaf certificate but not the intermediate certificates needed to verify it. Some browsers handle this gracefully by fetching missing intermediates, but many monitoring tools, API clients, and command-line tools do not, and they reject the connection.
  • Self-signed certificate: The certificate was not issued by a trusted certificate authority. This is common in development environments but should never appear in production.
  • Mixed content: The main page loads over HTTPS but includes resources (images, scripts, stylesheets) loaded over HTTP, causing browser warnings. This is a page-level problem rather than a fault in the certificate itself, but it often surfaces alongside SSL checks.

How to Fix It

  1. Check certificate details: openssl s_client -connect example.com:443 -servername example.com.
  2. If expired, renew immediately. For Let's Encrypt: certbot renew.
  3. Verify the certificate covers all domain variants (with and without www, plus any subdomains).
  4. Ensure the complete certificate chain is configured on the server. Use SSL testing tools to verify.
  5. Set up SSL monitoring in UptyBots to receive alerts before your certificate expires, giving you time to renew without causing an outage.
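
To illustrate what an expiry check computes, here is a minimal Python sketch using only the standard library. The function names are hypothetical and this is not a description of UptyBots internals. Note that fetch_not_after verifies the certificate while connecting, so an already expired, mismatched, or untrusted certificate raises ssl.SSLCertVerificationError instead of returning a date.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter timestamp (negative if expired)."""
    # not_after uses the format returned by getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)


def fetch_not_after(host, port=443, timeout=10.0):
    """Read notAfter from the certificate a server presents (requires network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, days_until_expiry(fetch_not_after("example.com")) gives the remaining lifetime in days, which can be compared against whatever warning window suits your renewal process.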

6. HTTP Response Error (4xx and 5xx Status Codes)

What It Means

The server responded to the monitoring request, but with an error status code instead of the expected 200 OK. Common responses include 403 Forbidden, 404 Not Found, 500 Internal Server Error, 502 Bad Gateway, and 504 Gateway Timeout.

Common Causes

  • Application errors: Bugs in your code, database failures, or misconfiguration causing 5xx errors. See our detailed guide on understanding 500, 502, and 504 error codes.
  • URL changed or removed: The monitored URL has been moved or deleted, resulting in 404 errors.
  • Access restrictions: IP-based access control, geo-blocking, or authentication requirements causing 403 errors when the monitoring bot tries to access the page.
  • Rate limiting: The server is throttling requests and returning 429 Too Many Requests.
  • Too many redirects: The page redirects several times (or loops), and the monitoring tool reaches its redirect limit and reports an error instead of following the chain to the final destination.

How to Fix It

  1. Use the HTTP Status Explainer to understand what the specific status code means.
  2. For 5xx errors, check server and application logs for the root cause.
  3. For 403 errors, whitelist the monitoring bot IP addresses in your firewall or access control.
  4. For 404 errors, update the monitored URL to match the current page location.
  5. For redirect issues, update the monitored URL to the final destination URL.
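
The triage above amounts to a lookup from status code to first step. A small Python sketch of that mapping follows; the category labels and triage_status itself are made up for illustration, and the wording for 429 and generic 4xx codes is an assumption, since the steps above do not cover them explicitly.

```python
def triage_status(code):
    """Map an HTTP status code to (category, suggested first step)."""
    if 200 <= code < 300:
        return "ok", "no action needed"
    if 300 <= code < 400:
        return "redirect", "point the monitor at the final destination URL"
    if code == 403:
        return "access_restricted", "allow the monitoring IPs in your firewall or access control"
    if code == 404:
        return "not_found", "update the monitored URL to the page's current location"
    if code == 429:
        # Assumption: the source names rate limiting as a cause but gives no fix.
        return "rate_limited", "check whether rate limiting is throttling the monitoring requests"
    if 400 <= code < 500:
        return "client_error", "review the request the monitor sends"
    if 500 <= code < 600:
        return "server_error", "check server and application logs"
    return "unknown", "look up the status code"
```

For instance, triage_status(502) yields the "server_error" category, matching step 2.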

7. DNS Resolution Delay (Slow DNS)

What It Means

DNS resolution succeeds, but it takes an unusually long time. While the page eventually loads, the DNS lookup phase adds significant latency to the overall response time, potentially triggering slow-response alerts.

Common Causes

  • Slow DNS provider: Some budget DNS providers have high latency or limited geographic coverage, meaning requests from distant locations take longer to resolve.
  • Low TTL values: Very low DNS TTL (Time To Live) values mean DNS records expire from caches quickly, forcing frequent re-resolution.
  • DNS chain with many lookups: Multiple CNAME records pointing to each other creates a chain of DNS lookups that each add latency.
  • DNS provider under DDoS attack: DNS providers are frequent targets of DDoS attacks, which can slow down resolution for all customers.
  • No DNS redundancy: If your domain only uses one DNS provider and that provider has issues, there is no fallback.

How to Fix It

  1. Measure DNS resolution time: dig example.com | grep "Query time".
  2. Use a reputable, high-performance DNS provider with global presence (Cloudflare DNS, AWS Route53, Google Cloud DNS).
  3. Set reasonable TTL values (300-3600 seconds) to allow caching without stale records.
  4. Minimize CNAME chains -- use A records where possible.
  5. Consider using multiple DNS providers for redundancy on critical domains.
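
If you collect dig output regularly, the "Query time" line from step 1 can be extracted programmatically. This Python sketch assumes the usual ";; Query time: N msec" footer; the 100 ms default in is_slow is an arbitrary placeholder, not a recommended limit.

```python
import re

# Matches the footer line dig prints after the answer section.
QUERY_TIME = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def parse_query_time(dig_output):
    """Return the query time in milliseconds, or None if the line is absent."""
    match = QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None


def is_slow(dig_output, threshold_ms=100):
    """True when a query time is present and exceeds threshold_ms (assumed default)."""
    elapsed = parse_query_time(dig_output)
    return elapsed is not None and elapsed > threshold_ms
```

Running the same measurement against several resolvers (for example with dig example.com @8.8.8.8) helps separate a slow authoritative provider from a slow local cache.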

8. Geo-Specific Failures (Regional Outages)

What It Means

Your website works perfectly from some locations but fails from others. The monitoring system detects downtime from certain geographic check locations while other locations report the site as healthy.

Common Causes

  • CDN node failure: If you use a CDN (Cloudflare, AWS CloudFront, Fastly), a specific edge node may be experiencing issues while others work fine.
  • Regional network outage: An ISP or backbone provider in a specific region may be having connectivity problems.
  • Geo-based routing: DNS-based load balancing sends users in different regions to different servers. If one regional server is down, only users routed to that server experience downtime.
  • IP-based blocking: A firewall or security tool has blocked IP ranges associated with certain countries or regions, including monitoring bot IPs.
  • Anycast routing issues: Services using anycast IP addressing may route traffic differently depending on origin location.

How to Fix It

  1. Check CDN dashboards for node-specific issues or incidents.
  2. Test from multiple locations using your monitoring tool. UptyBots checks from multiple locations to help you identify regional problems.
  3. Review geo-blocking rules and ensure monitoring IPs are not accidentally blocked.
  4. If using DNS-based load balancing, verify that all regional servers are healthy.
  5. Read our detailed guide on why your website appears down only in certain countries for a deeper analysis.
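
Telling a regional failure from a full outage comes down to comparing results across check locations. A minimal Python sketch of that comparison is below; the location names and the classify_locations helper are hypothetical.

```python
def classify_locations(results):
    """Classify per-location results, where results maps location -> True if the check passed.

    Returns (verdict, failed_locations).
    """
    if not results:
        raise ValueError("at least one location result is required")
    failed = sorted(name for name, passed in results.items() if not passed)
    if not failed:
        return "healthy", []
    if len(failed) == len(results):
        # Every vantage point agrees, so the problem is probably at the origin.
        return "global_outage", failed
    # Only some vantage points fail: suspect CDN nodes, routing, or geo-blocking.
    return "regional_failure", failed


print(classify_locations({"frankfurt": True, "singapore": False, "virginia": True}))
# ('regional_failure', ['singapore'])
```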

9. API Response Errors (Unexpected Data)

What It Means

The API endpoint returns an HTTP 200 OK status code, but the response body contains error data, unexpected values, or incomplete results. Simple HTTP monitoring would mark this as "up," but the API is effectively broken.

Common Causes

  • Application logic error: The API code has a bug that returns error messages inside a 200 response. For example, {"status": "error", "message": "database unavailable"} wrapped in an HTTP 200.
  • Partial failure: The API returns some data correctly but critical fields are missing, null, or contain default/fallback values.
  • Rate limit response: Some APIs return a 200 response with a rate limit message in the body instead of using the standard 429 status code.
  • Cached stale data: The API returns outdated cached data because the cache has not been properly invalidated after a backend update.
  • Upstream dependency failure: The API gracefully handles an internal failure by returning an empty or partial response instead of erroring out.

How to Fix It

  1. Set up API monitoring that validates response content, not just HTTP status codes. Learn more about proper API monitoring.
  2. Configure response body assertions to check for expected values or keywords.
  3. Monitor specific fields in JSON responses for correct data types and values.
  4. For complex workflows, use synthetic API monitoring to test multi-step operations.
  5. Review API error handling code to ensure meaningful HTTP status codes are returned instead of wrapping errors in 200 responses.
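
A response body assertion of the kind described in steps 2 and 3 might look like the following Python sketch. It assumes the API signals failure with a "status": "error" field, as in the example earlier in this section; real APIs use different conventions, and validate_api_body is a hypothetical helper.

```python
import json


def validate_api_body(body, required_fields):
    """Return a list of problems found in a JSON body; an empty list means it passed."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        return [f"body is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    # Assumed convention: errors are flagged with "status": "error" inside a 200 response.
    if data.get("status") == "error":
        problems.append(f"error reported in body: {data.get('message', 'no message')}")
    for field in required_fields:
        # A field that is absent or null counts as a partial failure.
        if data.get(field) is None:
            problems.append(f"missing or null field: {field}")
    return problems


print(validate_api_body('{"status": "error", "message": "database unavailable"}', []))
# ['error reported in body: database unavailable']
```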

10. False Positives and Noisy Alerts

What It Means

Your monitoring system reports downtime, but when you check your website, everything appears fine. The alert was a false alarm. While a single false positive is merely annoying, frequent false alerts lead to alert fatigue -- a dangerous state where you start ignoring all alerts, including real ones.

Common Causes

  • No retry logic: A single failed check triggers an alert immediately, without retrying to confirm the failure. Momentary network glitches, packet loss, or brief server hiccups can all cause a single check to fail.
  • Overly sensitive thresholds: Response time thresholds set too low will trigger alerts during normal traffic fluctuations.
  • Single monitoring location: If you only monitor from one location and that location has a network issue, you will receive a false downtime alert.
  • Monitoring IP blocked: Your firewall, CDN, or security service (like Cloudflare) is blocking or rate-limiting the monitoring bot.
  • Transient network issues: Brief routing changes, BGP convergence, or ISP maintenance can cause momentary connectivity blips that do not affect real users.

How to Fix It

  1. Enable retry logic in your monitoring configuration. UptyBots retries failed checks before sending alerts, significantly reducing false positives.
  2. Use multi-location monitoring so that an alert is only triggered when multiple locations confirm the failure.
  3. Set reasonable response time thresholds based on your site's normal performance, not arbitrary values.
  4. Whitelist monitoring bot IP addresses in your firewall and CDN settings.
  5. Read our guide on how to tell the difference between false positives and real downtime.
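
The retry logic in step 1 can be sketched in a few lines of Python. This is an illustration of the general idea under assumed defaults (two retries, five seconds apart), not the behavior of any specific monitoring product; check is any callable that returns True when the target is healthy.

```python
import time


def confirm_failure(check, retries=2, delay=5.0, sleep=time.sleep):
    """Return True only if the check fails on the first attempt and on every retry."""
    for attempt in range(retries + 1):
        if check():
            # Any success means the earlier failure was transient: do not alert.
            return False
        if attempt < retries:
            sleep(delay)
    return True
```

An alert would be sent only when confirm_failure returns True, so a single dropped packet or a brief service restart no longer pages anyone.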

Quick Reference: Monitoring Error Comparison Table

Error                 | Severity    | Most Likely Cause                        | First Step
Unknown Host          | Critical    | DNS misconfiguration or expired domain   | Check DNS records
Connection Timeout    | Critical    | Server overload or firewall drop         | Check server status and firewall
Connection Refused    | Critical    | Service not running                      | Check service status on server
Invalid Port          | High        | Service down or firewall blocking        | Verify service and port
SSL Certificate Error | High        | Expired or misconfigured certificate     | Check certificate expiration
HTTP Error (4xx/5xx)  | High        | Application error or access restriction  | Check server error logs
DNS Delay             | Medium      | Slow DNS provider                        | Measure DNS resolution time
Geo-Specific Failure  | Medium      | CDN node or regional network issue       | Test from multiple locations
API Response Error    | Medium-High | Application logic bug                    | Validate response content
False Positive        | Low         | Transient network issue                  | Enable retries and multi-location

Building a Reliable Monitoring Setup

The errors above are not just problems to fix -- they are signals that your monitoring strategy needs to be layered and thoughtful. Here is a checklist for a robust setup:

  • Monitor the right endpoints: Check your homepage, critical pages, API endpoints, and background services -- not just one URL.
  • Use multiple check types: Combine HTTP, TCP/Port, Ping, SSL, and domain monitoring. Each type catches different failure modes. Learn more about the difference between HTTP and TCP monitoring.
  • Enable retries: Configure at least one retry before sending an alert to filter out transient issues.
  • Use multi-location checks: Confirm downtime from multiple vantage points before alerting.
  • Set meaningful alert thresholds: Base thresholds on your actual performance baseline, not arbitrary numbers.
  • Configure the right notification channels: Use email for non-urgent issues, Telegram or webhook for critical alerts that need immediate attention.
  • Review alerts regularly: If you are getting frequent false alerts, adjust your configuration. Alert fatigue is a real risk that undermines the value of monitoring.
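
One way to base a response time threshold on a measured baseline, as the checklist suggests, is to take the mean of recent samples plus a multiple of their standard deviation. The Python sketch below uses that rule purely as an assumption; the multiplier k=3 is a common starting point rather than a prescribed value, and percentile-based thresholds are an equally valid choice.

```python
import statistics


def response_time_threshold(samples_ms, k=3.0):
    """Alert threshold in ms: baseline mean plus k population standard deviations."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two baseline samples")
    return statistics.mean(samples_ms) + k * statistics.pstdev(samples_ms)
```

Recomputing the threshold periodically from recent healthy measurements keeps it aligned with the site's actual performance as traffic patterns change.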

Estimate the Cost of These Errors

Every monitoring error -- whether a real outage or a false alarm that delays your response to real issues -- has a financial impact. Use the Downtime Cost Calculator to understand what downtime is costing your business, and the HTTP Status Explainer to quickly decode any error code you encounter.

See setup tutorials or get started with UptyBots monitoring today.
