By James Wilson · Oct 1, 2025

Uptime Monitoring FAQ: Every Question Your Junior Engineers Keep Asking

Every Monday morning, someone on my team asks me the same kind of question. "What does 502 mean?" "How often should we check the API?" "Why did I get an alert at 3 AM for something that was fine when I looked?" I have been answering these questions for fifteen years. At some point I started writing the answers down.

This is that document. It is the FAQ I hand to every new engineer on their first day. No theory. No marketing fluff. Just the answers to the questions that actually come up when you are responsible for keeping something running.

The Basics

What is uptime monitoring?

A service pings your stuff on a schedule. Your website, your API, your server, your database port. If the thing does not respond, or responds wrong, or responds too slowly, you get a notification. That is it. The entire point is to find out about problems before your customers do. Or at least at the same time they do, instead of finding out three hours later when your boss forwards you a tweet.

Why do I need it? My site seems fine.

Your site seems fine because you are looking at it from your office, on your ISP, during business hours. You have no idea what it looks like from Tokyo at 2 AM. You do not know that your SSL certificate auto-renewal failed last Tuesday. You do not know that your API has been returning 500 errors to mobile clients since the last deploy. Without monitoring, your users are your monitoring. That is a bad look. Downtime also hurts your SEO: search engines notice when your site is repeatedly unreachable, and your rankings can suffer for it. We covered the financial side of this in our piece on the real cost of website downtime.

What is the difference between uptime and availability?

People use them interchangeably. Technically, uptime is the raw time a service is running. Availability is the percentage. If your server was up for 719 out of 720 hours in a month, your availability is 99.86%. The distinction barely matters in conversation. What matters is the number.

What does "99.9% uptime" actually mean?

It means you are allowed up to about 8 hours and 45 minutes of downtime per year. That sounds like a lot, right? It is. Here is the full table, because this comes up in every SLA negotiation:

Uptime % | Downtime Per Year | Downtime Per Month | Downtime Per Week
99% | 3 days, 15 hours | 7 hours, 18 minutes | 1 hour, 41 minutes
99.5% | 1 day, 19 hours | 3 hours, 39 minutes | 50 minutes
99.9% | 8 hours, 45 minutes | 43 minutes | 10 minutes
99.95% | 4 hours, 22 minutes | 21 minutes | 5 minutes
99.99% | 52 minutes | 4 minutes | 1 minute
99.999% | 5 minutes | 26 seconds | 6 seconds

Anyone who promises you five nines (99.999%) is either running a very expensive operation or lying. Use our Downtime Cost Calculator to see what each minute of downtime costs your business in real money.
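If you want to check the table yourself, the arithmetic is short. This is a minimal sketch, not anything UptyBots ships. The function name and the period lengths are my own choices, and the month is assumed to be an average one (365.25 / 12 days) because that reproduces the monthly column.

```python
from datetime import timedelta

# Period lengths. The average-month value is an assumption chosen because
# it matches the monthly figures in the table.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
}


def downtime_budget(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an uptime percentage allows in a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period * (1 - uptime_percent / 100)


def whole_minutes(duration: timedelta) -> int:
    """Truncate a duration to whole minutes, the way the table does."""
    return int(duration.total_seconds() // 60)
```

For example, `whole_minutes(downtime_budget(99.9, PERIODS["year"]))` gives 525 minutes, which is the 8 hours, 45 minutes in the table.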

Types of Monitoring

What types of monitoring exist?

There are six types you should know about. Each one watches a different layer of your stack:

  • HTTP/HTTPS monitoring: Hits your URL, checks the status code, measures response time. Can also check if the page actually contains what it should. This is the one everyone starts with.
  • Ping (ICMP) monitoring: The most basic network check. Sends a ping, measures how long it takes to come back. If ping fails, the machine is down, the network path is broken, or a firewall along the way is dropping ICMP.
  • Port monitoring: Connects to a specific TCP port. Your database runs on 5432. Your Redis on 6379. Your mail server on 587. If the port is closed, the service is dead.
  • SSL certificate monitoring: Watches your certificate expiration date. When Let's Encrypt auto-renewal fails silently (and it will, eventually), this is how you find out before browsers start blocking your site.
  • Domain expiry monitoring: Tracks when your domain registration expires. Losing a domain because someone forgot to renew it is the kind of mistake that ends careers.
  • API monitoring: Sends actual requests to your API and checks the response body, not just the status code. Because a 200 OK with {"error": "database connection failed"} in the body is not actually OK.

UptyBots supports all six. You can mix and match to cover every layer of your infrastructure.

What is synthetic monitoring?

Regular monitoring asks: "Is the server responding?" Synthetic monitoring asks: "Can a user actually do the thing they came here to do?" It runs scripted sequences. Log in. Load the dashboard. Submit a form. Check the response. This catches the problems that simple checks miss. Your login page returns 200, sure, but the form inside it is throwing a JavaScript error and nobody can actually log in. UptyBots's API monitoring supports multi-step checks where each step validates the response body, headers, and status codes. That is synthetic monitoring in practice.

What is the difference between active and passive monitoring?

Active monitoring sends requests on a schedule, whether anyone is using the site or not. That is what UptyBots does. Passive monitoring (also called Real User Monitoring or RUM) collects data from actual user sessions. The problem with passive monitoring alone: if nobody visits your site at 3 AM, and your server crashes at 3 AM, you do not find out until 8 AM when someone tries to visit. Active monitoring catches it at 3:01 AM.

Why does multi-location monitoring matter?

Because the internet is not one thing. Your CDN might have a bad node in Frankfurt. Your DNS might not have propagated to Asia-Pacific resolvers yet. Cloudflare might be having a regional issue. If you only monitor from one location, you only see one slice of reality. Your site could be completely down for 30% of your users and you would never know. We wrote a full explanation of this in why your website appears down only in certain countries.

Configuration and Best Practices

How often should I check my site?

Depends on how much money the service makes and how fast you need to know when it breaks:

  • Every 1 minute: Payment flows, checkout pages, anything where downtime is directly measured in lost revenue
  • Every 2-3 minutes: Your main web app, SaaS product, customer-facing API
  • Every 5 minutes: Blogs, docs, marketing sites, internal tools
  • Every 15-60 minutes: SSL certificates, domain expiry. These do not change fast enough to justify frequent checks

UptyBots lets you set different frequencies per monitor. Check your checkout every minute. Check your blog every five. Nobody needs to know your blog went down for two minutes at midnight.

What should I monitor first?

Start with whatever makes money. Then work outward:

  1. Payment/checkout endpoints: This is where the revenue lives. Monitor it first, check it often
  2. Login/signup page: If users cannot get in, nothing else matters
  3. Homepage: First thing users and search engines see
  4. Your API: If your mobile app or third-party integrations depend on it
  5. SSL certificate: An expired cert blocks all HTTPS traffic. Modern browsers will not let users through
  6. Domain expiry: The nuclear option of outages. Your entire online presence vanishes

How do I avoid false positives?

False positives are the enemy. Get enough of them and your team starts ignoring all alerts, including the real ones. That is called alert fatigue and it is how outages go unnoticed for hours. Here is how to keep false positives low:

  • Enable confirmation checks. Make UptyBots re-check before firing an alert. One failed check is not an outage. Three in a row probably is
  • Monitor from multiple locations. If only one location sees a failure, it is likely a network blip, not a real outage
  • Set sane timeout values. A 2-second timeout on a page that legitimately takes 4 seconds to render will cry wolf every time
  • Validate response content, not just status codes. Content checks separate a genuinely broken page from a one-off glitch: a 200 with an error page is a real problem, while a single timeout often is not

We go deep on this in false positives vs. real downtime.
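The first two rules can be written down as one small decision function. This is a sketch under my own assumptions, not how UptyBots implements it: the function name, the shape of the input, and the defaults (three consecutive failures, at least two locations) are all hypothetical.

```python
from typing import Mapping, Sequence


def should_alert(
    history_by_location: Mapping[str, Sequence[bool]],
    confirmations: int = 3,
    min_locations: int = 2,
) -> bool:
    """Decide whether recent check results justify an alert.

    history_by_location maps a location name to its check results,
    oldest first, where True means the check passed.
    """
    if confirmations < 1:
        raise ValueError("confirmations must be at least 1")
    failing_locations = 0
    for results in history_by_location.values():
        recent = results[-confirmations:]
        # A location counts as failing only if its last N checks all failed.
        if len(recent) == confirmations and not any(recent):
            failing_locations += 1
    return failing_locations >= min_locations
```

One failed check from one location returns False. Three failures in a row from two locations returns True. Tune the defaults to how noisy your network is.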

What timeout value should I use?

10 to 30 seconds is the safe range for most websites. Too low and you get false alarms from slow-but-functional servers. Too high and you miss the fact that your site takes 15 seconds to load, which might as well be down. For APIs, 5 seconds is usually enough. For heavy pages with lots of assets, 15 to 30 seconds. If your page legitimately needs more than 30 seconds to respond, you have a different problem.

Should I monitor staging?

Probably not. Staging exists to be broken. If your staging environment needs to be reliable for QA teams, partner demos, or CI/CD pipelines, then maybe throw a basic check on it. But do not put it in the same alert channel as production. That is how people start ignoring alerts.

Alerts and Notifications

Which notification channels should I use?

The one you will actually look at. That is the answer. UptyBots supports three:

  • Email: Good for keeping records. Bad for urgency. Nobody checks email at 3 AM
  • Telegram: Push notification straight to your phone. You will see it within seconds. This is what I use for critical stuff
  • Webhooks: Pipe alerts into whatever system you already use. PagerDuty, Opsgenie, Discord, Slack, a custom dashboard, a script that plays an air horn

Best practice: use two channels for anything that matters. Email for the paper trail, Telegram for the "wake up right now" alert. We covered the setup process in how to set up notification integrations without going crazy.

What is alert fatigue?

It is the thing that gets people fired. You set up monitoring. You get alerts. Lots of alerts. Most of them are noise. After a few weeks, you stop reading them. Then a real outage happens, the alert fires, and you ignore it because you have been trained by weeks of false alarms to assume every alert is garbage. Three hours later, your VP is asking why nobody noticed the site was down.

Prevention is simple in theory, hard in practice:

  • Only alert on things that need a human to do something right now
  • Use confirmation checks so one-off glitches do not page you
  • Review your alerts monthly. If a monitor keeps crying wolf, fix it or mute it
  • Separate informational alerts from action-required alerts. Different channels, different urgency
  • Set up escalation so the next person gets notified if the first one does not respond

We dedicated an entire article to this: alert fatigue and how too many notifications can hurt your monitoring. Also worth reading: why downtime notifications are often ignored.

How do I test my notification setup?

Before you need it for real. Send test messages to every channel. UptyBots has a test button for each notification channel. Click it. Confirm you received the message on the right device, in the right app, with the right sound. The worst time to discover your Telegram bot token is wrong is during an actual incident. Check our guide on configuring notifications per monitor with test messages.

Should I get notified when things come back up?

Yes. Always. Recovery alerts tell you:

  • The fix worked. You can stop panicking
  • How long the outage lasted. You need this number for your postmortem
  • When to update your status page
  • Whether the thing actually recovered or if you are still debugging a problem that fixed itself

UptyBots sends both down and up notifications automatically. Do not turn off recovery alerts. Ever.
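The outage duration is just the gap between the down event and the up event. Here is a minimal sketch of pulling that number out of raw check results for a postmortem. The function names and the (timestamp, is_up) input format are assumptions for illustration, not an UptyBots export format.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Outage = Tuple[datetime, datetime]


def find_outages(checks: Iterable[Tuple[datetime, bool]]) -> List[Outage]:
    """Return (down_at, up_at) pairs from time-ordered (timestamp, is_up) checks.

    An outage that has not recovered by the last check is left out,
    because it has no recovery time yet.
    """
    outages: List[Outage] = []
    down_since = None
    for timestamp, is_up in checks:
        if not is_up and down_since is None:
            down_since = timestamp  # first failed check starts the outage
        elif is_up and down_since is not None:
            outages.append((down_since, timestamp))  # recovery closes it
            down_since = None
    return outages


def total_downtime(outages: Iterable[Outage]) -> timedelta:
    """Add up the durations of all closed outages."""
    return sum((up_at - down_at for down_at, up_at in outages), timedelta())
```

Without the recovery event there is no second timestamp, and you are left guessing how long you were down.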

Common Errors and What They Mean

What does HTTP 500 mean?

Something broke on the server. Could be a code bug, a database that is not responding, a disk that filled up, a config file with a typo. The server tried to do its job, failed, and threw up its hands. This is the error you will see most often in monitoring. Use our HTTP Status Explainer for the full list.

What does HTTP 502 (Bad Gateway) mean?

Your reverse proxy (Nginx, Apache, whatever sits in front of your app) is running fine. But the thing behind it is dead. Nginx tried to pass the request to PHP-FPM, Node.js, or Gunicorn, and got garbage back or nothing at all. The front door is open but nobody is home. This is the error that says "your application server crashed."

What does HTTP 503 (Service Unavailable) mean?

The server is alive but is refusing to work right now. Usually it is overloaded or in maintenance mode. Unlike 500, this is often intentional. Servers return 503 during deployments or when they are getting hammered with more traffic than they can handle. It means "try again later." Sometimes that is honest. Sometimes the server has been saying "try again later" for three hours.

What does HTTP 504 (Gateway Timeout) mean?

Same setup as 502, but instead of getting a bad response, the proxy got no response at all within its timeout. Your backend is alive enough to accept the connection but too slow or too stuck to actually answer. Usually points to a database query that is taking forever, a deadlock, or a service that is technically running but functionally useless.

What does HTTP 429 (Too Many Requests) mean?

Rate limiting kicked in. You sent too many requests in a short window and the server is telling you to back off. This comes up when monitoring third-party APIs. If your monitoring triggers 429s, reduce the check frequency for that endpoint. Checking a rate-limited API every 30 seconds is just asking for this.

What does a timeout mean when there is no HTTP code at all?

This is often worse than any error code. No response means the server is completely unreachable. It could be powered off, the network could be down, a firewall could be eating your packets, or the server is so overloaded it cannot even muster an error message. When you see timeouts in your monitoring, start with "is the server on?" and work up from there.

What does "SSL certificate expired" mean?

Your HTTPS certificate is past its expiration date. Every modern browser will throw a full-screen security warning that scares away most visitors. Let's Encrypt certificates last 90 days. Auto-renewal fails more often than people think. When it does, it fails silently. You find out when your traffic drops to zero. SSL monitoring catches this weeks before it happens.

What does "DNS resolution failed" mean?

Your domain name cannot be translated into an IP address. The server might be running perfectly, but nobody can find it. It is like having a store with no address. Common causes: expired domain, deleted DNS zone, wrong nameservers, or your DNS provider is having a bad day.
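The HTTP codes and the no-response case above fit in a small lookup table, which is handy if you pipe alerts through a webhook and want a first-guess hint attached. This is a sketch. The names and the hint wording are mine, condensed from the answers above, and the hints are starting points, not diagnoses.

```python
from typing import Optional

# One-line summaries of the explanations above, keyed by HTTP status code.
FAILURE_HINTS = {
    429: "rate limited: reduce the check frequency for this endpoint",
    500: "server error: check application logs, database, disk, config",
    502: "bad gateway: the proxy is up, the application behind it is not answering",
    503: "service unavailable: overloaded or in maintenance mode",
    504: "gateway timeout: the backend is too slow or stuck",
}


def explain_failure(status: Optional[int]) -> str:
    """Return a first-guess hint for a failed check.

    status is the HTTP status code, or None when the check timed out
    without receiving any HTTP response.
    """
    if status is None:
        return "no response: start with 'is the server on?', then network and firewall"
    return FAILURE_HINTS.get(status, f"HTTP {status}: look it up in a status code reference")
```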

Monitoring Specific Services

How do I monitor an API properly?

Checking if an API returns 200 is not enough. APIs lie. They return 200 with error messages in the body all the time. Proper API monitoring means checking:

  1. The status code is what you expect (usually 200, sometimes 201 or 204)
  2. The response body is valid JSON (or XML, if you are unlucky)
  3. Specific fields exist and have the right values
  4. Response time is acceptable, not just "eventually responds"
  5. Auth tokens are accepted, not returning 401 because a key expired

UptyBots's API monitoring covers all of this. Custom headers, request bodies, response validation rules. The works.
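If you want to see what those checks look like in code, here is a minimal sketch of the first four as one function. It is not UptyBots's implementation. The function name, the parameters, and the convention that a top-level "error" key means failure are assumptions, so adjust them to your API's actual contract.

```python
import json
from typing import Any, Dict, List, Optional


def validate_api_response(
    status: int,
    body: str,
    elapsed_seconds: float,
    expected_status: int = 200,
    required_fields: Optional[Dict[str, Any]] = None,
    max_seconds: float = 0.5,
) -> List[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems: List[str] = []
    if status != expected_status:
        problems.append(f"expected status {expected_status}, got {status}")
    if elapsed_seconds > max_seconds:
        problems.append(f"too slow: {elapsed_seconds:.2f}s > {max_seconds:.2f}s")
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(data, dict):
        problems.append("body is not a JSON object")
        return problems
    # Assumed convention: a top-level "error" key signals a failure.
    if "error" in data:
        problems.append(f"body reports an error: {data['error']}")
    for field, expected in (required_fields or {}).items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif data[field] != expected:
            problems.append(f"field {field}: expected {expected!r}, got {data[field]!r}")
    return problems
```

Feed it a 200 with `{"error": "database connection failed"}` in the body and it reports a problem, which is the whole point. An expired key shows up as a status mismatch (401 instead of 200).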

How do I monitor SSL certificates?

UptyBots checks your certificate expiration date, chain validity, and whether it is actually being served on the right domain. You configure how many days before expiration you want to be alerted. I recommend 30 days. That gives you enough time to fix things even if the first renewal attempt fails and you are on vacation when it happens.
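The expiry part of that check is simple date arithmetic. A minimal sketch using Python's standard library follows. The function names and the 30-day default are illustrative assumptions, and the input is the "notAfter" string in the format Python's ssl module reports.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return whole days left before a certificate expires.

    not_after uses the format reported by Python's ssl module,
    for example "Jan  5 09:34:43 2018 GMT". now must be timezone-aware.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).days


def needs_renewal_alert(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the certificate is within warn_days of expiring (or past it)."""
    return days_until_expiry(not_after, now) <= warn_days
```

In a live check, the string would come from the "notAfter" entry of `getpeercert()` on a verified TLS connection. This sketch covers only the expiry date, not chain validity or hostname matching.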

What ports should I monitor?

Any port that a service depends on. Here are the common ones:

  • 80 / 443: Web server. If you are reading this, you should be monitoring these
  • 22: SSH. If this dies, you cannot even get into the machine to fix it
  • 25 / 587 / 465: SMTP. Email sending
  • 3306 / 5432: MySQL / PostgreSQL. Your database
  • 6379: Redis. Cache and queue
  • 27015: Source engine game servers, if that is your world
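Under the hood, a TCP port check is one connection attempt with a timeout. A minimal sketch, with a function name and default timeout of my own choosing:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

An open port only proves something is listening. It does not prove the database behind it can answer a query, which is why port checks work best alongside the other monitor types. It also says nothing about UDP traffic.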

How do I monitor a game server?

Port monitoring to check the game port is open. Ping monitoring to track latency, because for gamers, 150ms ping might as well be down. For game platform APIs like Steam or Epic, use API monitoring with content validation. We covered this in monitoring game platform APIs and Steam game server monitoring.

Can I monitor pages that require login?

Yes, but not with a simple HTTP check. You need API monitoring. Send the auth credentials as headers, cookies, or in the request body. UptyBots's API monitoring handles custom headers and request bodies, so you can pass Bearer tokens, API keys, or session cookies with each check.

Reading Your Monitoring Data

What is response time and why should I care?

Response time is how long the server takes to answer your monitoring request. It matters because "up but slow" is functionally the same as "down" for most users. A site that takes 8 seconds to load loses most of its visitors. They leave. They go to your competitor. The site was technically up the whole time, but you still lost the customer. UptyBots records response time for every single check. Watch the trend line. If it is climbing, something is degrading and you need to investigate before it becomes an outage. More on this in why users report issues before monitoring alerts fire.

What is a good response time?

For web pages: under 1 second is great, 1 to 3 seconds is fine, over 3 seconds is a problem. For APIs: under 200ms is great, under 500ms is acceptable, over 1 second is too slow. These are guidelines, not laws. A complex dashboard page at 2 seconds is fine. A health check endpoint at 2 seconds is not.

How is uptime percentage calculated?

Simple math: (total monitored time minus downtime) divided by total monitored time, times 100. If you monitored for 30 days (720 hours) and had 2 hours of downtime, that is (720 - 2) / 720 × 100 = 99.72%. UptyBots calculates this automatically. You just read the number.
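The same formula as code, in case you need it in a report script. The function name is hypothetical. This is just the arithmetic above with input validation.

```python
def uptime_percent(total_hours: float, downtime_hours: float) -> float:
    """Return uptime as a percentage of the monitored time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return (total_hours - downtime_hours) / total_hours * 100
```

`round(uptime_percent(720, 2), 2)` gives 99.72, and one hour of downtime in the same month gives 99.86.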

How do I use monitoring data to actually improve things?

Monitoring data is not just for catching fires. It tells you things:

  • Patterns: Outages at the same time every day? That cron job that runs at midnight is crashing your app
  • Trends: Response times climbing 5% per week? You are running out of something. CPU, memory, database connections
  • Before/after: Deployed on Tuesday, response times jumped 40%. Something in that deploy is expensive
  • Weak points: Which endpoint has the worst uptime? That is where your engineering time should go
  • Budget justification: "We had 47 minutes of downtime last month that cost us an estimated $12,000" is a lot more convincing than "we should probably upgrade our servers"
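The before/after comparison is easy to automate. This sketch compares median response times on either side of a deploy. The function name is mine, and the choice of median over mean is an assumption, made so that one slow outlier does not skew the result.

```python
from statistics import median
from typing import Sequence


def percent_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Return the percentage change in median response time.

    before and after are response times in the same unit (for example
    milliseconds), sampled on either side of a deploy.
    """
    baseline = median(before)
    if baseline <= 0:
        raise ValueError("baseline response time must be positive")
    return (median(after) - baseline) / baseline * 100
```

With `before=[200, 210, 190]` and `after=[280, 290, 270]` the result is 40.0, the kind of jump that should send you back to the deploy diff.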

Troubleshooting

Monitoring says the site is down but I can access it. What gives?

This is almost always one of these:

  • Your site blocks requests that do not look like a browser. Some WAFs and CDNs are aggressive about this
  • A firewall is blocking the monitoring service's IP addresses specifically
  • Your site has geographic restrictions and the monitoring location is in a blocked region
  • The timeout is too low. Your page takes 12 seconds to load, and the monitor gives up after 5
  • Your site needs cookies or JavaScript to render anything, and the HTTP check does not execute JavaScript

Fix: whitelist the monitoring IPs in your firewall, increase the timeout, or use API-style checks that do not depend on browser rendering.

Monitoring says the site is up but users say it is down. What is going on?

This is the scarier scenario. Your monitor is checking the wrong thing, or not enough things:

  • You are monitoring the homepage. The checkout page is broken. Users do not care about the homepage
  • The outage is regional. Your monitor is in the US and your European users are the ones suffering
  • The server returns 200 but the page content is actually an error message. Your monitor checks the status code but not the body
  • The site is brutally slow. Users experience it as "down" but it technically responds within the timeout
  • A third-party resource (CDN, payment gateway, analytics script) is broken. Your server is fine but the page is not

Fix: add monitors for the specific pages that users complain about, enable content validation, set tighter response time thresholds, use multi-location monitoring. See detecting intermittent downtime that users notice but monitoring misses.

How do I tell a false alarm from a real outage?

Real outage signs:

  • Multiple consecutive checks fail
  • Multiple locations report failure
  • The error is a clear server problem (500, 502, 503, 504, timeout)
  • Users are also complaining

False alarm signs:

  • One check fails, the next succeeds
  • Only one monitoring location sees the problem
  • You check manually and everything is fine
  • Nobody else noticed anything

For the full rundown, see false positives vs. real downtime: how to tell the difference.

UptyBots Specifics

What monitor types does UptyBots support?

Six: HTTP/HTTPS, API, SSL certificate, Ping (ICMP), Port (TCP), and Domain expiry. Each one watches a different layer. Used together, they cover your entire stack from network to application to certificate management.

How do I set up my first monitor?

Sign up. Click "Add Monitor." Pick HTTP (that covers most websites). Enter your URL. Choose your check interval and notification channels. Done. Takes less than a minute. UptyBots starts checking immediately. For step-by-step instructions, see our setup tutorials.

What notification channels does UptyBots support?

Email, Telegram, and webhooks. Webhooks can send data to anything that accepts HTTP requests: Slack, Discord, PagerDuty, Opsgenie, Teams, or your own custom systems. You can assign different channels to different monitors.

Can I monitor APIs that need authentication?

Yes. Custom headers (including Authorization with Bearer tokens or API keys), custom request bodies, all HTTP methods (GET, POST, PUT, DELETE). You send auth credentials with the request and validate that the response contains what it should.

Does UptyBots check from multiple locations?

Yes. Checks run from multiple geographic locations. If your site is down in Europe but up in the US, you will know.

Can I use UptyBots during deployments?

Absolutely. Keep your monitors running during deploys. If the deployment breaks something, you find out in minutes instead of hours. If you use zero-downtime deployment, your monitors verify that the cutover went clean. We wrote a guide on monitoring during deployments to help you avoid alert noise during planned changes.

Glossary

Term | Definition
Uptime | The total time a service is operational and accessible
Downtime | The total time a service is unreachable or not functioning correctly
Availability | The percentage of time a service is operational (e.g., 99.9%)
SLA (Service Level Agreement) | A commitment from a provider guaranteeing a minimum uptime percentage
MTTR (Mean Time to Recovery) | The average time it takes to restore service after a failure
MTTD (Mean Time to Detection) | The average time between a failure occurring and it being detected
False positive | An alert triggered when the service is actually functioning normally
Alert fatigue | The tendency to ignore alerts after receiving too many, including false positives
Synthetic monitoring | Simulated user interactions used to test service functionality proactively
Latency | The time delay between sending a request and receiving a response
TTL (Time to Live) | How long a DNS record is cached before being re-queried
Health check | A dedicated endpoint that reports the application's internal health status
Incident | A detected issue that requires investigation and response
Escalation | The process of notifying additional people when an issue is not resolved quickly

See setup tutorials or get started with UptyBots monitoring today.