Building a Reliability-First Architecture: Health Checks, Failover, and Redundant Paths
Reliability is not something you bolt on after launch. It is a property that must be designed into the architecture from the very first commit. Teams that treat reliability as an afterthought end up firefighting endlessly, patching one outage after another, and never quite catching up. Teams that treat reliability as a core architectural concern build systems that gracefully handle failures, recover automatically, and rarely need human intervention. The difference is not about luck or budget — it is about the engineering choices made early in the project.
A reliability-first architecture is built on three pillars: health checks that constantly verify each component is working, failover mechanisms that automatically route around failures when they occur, and redundant paths that eliminate single points of failure. None of these alone is enough. Together, they form a system that survives the inevitable failures of individual components without affecting users.
This guide walks through how to design each pillar, common pitfalls to avoid, and how external monitoring with UptyBots ties everything together by providing the outside-in view that internal mechanisms alone cannot.
Why Reliability Has to Be Designed In
Most architectural decisions are easy to change in the first weeks of a project and impossible to change after a year. The number of database servers, the choice between SQL and NoSQL, whether services communicate via HTTP or messaging, where state is stored — all these decisions cascade through the entire system and become deeply baked into the codebase. If you make these decisions without considering reliability, you end up with a system that has fundamental fragility built into its core.
The classic example is a single database. Many startups begin with one database server because it is simple and cheap. As the application grows, more and more services depend on that database. By the time the team realizes they need redundancy, the cost of adding it is enormous — every service has to be modified to handle multiple database connections, code that assumed strong consistency has to be rewritten for eventual consistency, and migration has to happen without downtime. Starting with a multi-database architecture would have cost a few extra hours of setup; retrofitting it later costs months of engineering time.
The same principle applies to load balancers, message queues, caching layers, and every other piece of infrastructure. Decisions about reliability multiply over time. Make them early, make them right, and you avoid years of pain.
Pillar 1: Health Checks Are the Foundation
Health checks are the most fundamental reliability primitive. Without them, you have no way to know whether a service is actually working — and you have no way to make automated decisions about routing, restarting, or failover. Every component in a reliable architecture exposes health checks, and every layer above it consumes those checks to make decisions.
Levels of Health Checks
- Liveness checks. The simplest level: is the process running and responding at all? Usually a basic HTTP endpoint that returns 200 OK. Used by container orchestrators like Kubernetes to decide whether to restart a pod.
- Readiness checks. Is the service ready to serve traffic? This is stricter than liveness — the process might be running but still warming caches or establishing database connections. A failing readiness check tells load balancers to stop sending traffic to this instance.
- Deep health checks. Does the service actually work end-to-end? A deep health check exercises real functionality: hits the database, calls dependent services, verifies responses are correct. These are slower but catch problems that lighter checks miss.
- Synthetic transaction checks. The most thorough: simulate a complete user flow from login to checkout. These run from outside the system and verify that everything works together as a real user would experience it.
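The first three levels above can be sketched as plain functions. This is a minimal illustration, not a framework: the `Service` class, its warm-up flag, and the `FakeDB` client with a `ping()` method are all hypothetical stand-ins for whatever your stack actually uses.

```python
import time

class Service:
    def __init__(self, db, warmup_seconds=0):
        self.db = db                      # hypothetical DB client exposing ping()
        self.started = time.time()
        self.warmup_seconds = warmup_seconds

    def liveness(self):
        # Liveness: the process is up and able to answer at all.
        return 200, "OK"

    def readiness(self):
        # Readiness: stricter -- refuse traffic until warm-up has finished.
        if time.time() - self.started < self.warmup_seconds:
            return 503, "warming up"
        return 200, "ready"

    def deep_health(self):
        # Deep check: exercise a real dependency end-to-end.
        try:
            self.db.ping()
        except Exception as exc:
            return 503, f"db unreachable: {exc}"
        return 200, "healthy"

class FakeDB:
    def ping(self):
        pass  # a real client would round-trip to the database here

svc = Service(FakeDB())
```

An orchestrator would poll `liveness()` to decide on restarts, a load balancer would poll `readiness()` to decide on routing, and `deep_health()` would run less frequently because it touches real dependencies.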
What Makes a Good Health Check
- Fast. A health check that takes 30 seconds is useless during an outage. Aim for sub-second response times.
- Accurate. The check should reflect the actual state of the service. A check that returns 200 OK while the database is unreachable is worse than no check at all.
- Dependency-aware. Decide carefully which dependencies to check. Failing your own health check because a non-critical dependency is down causes cascading failures. Failing only on critical dependency failures keeps things stable.
- Non-recursive. Avoid health checks that themselves call other health checks. Recursive checks create amplification effects during failures.
- Lightweight. Health checks run constantly, often dozens of times per minute. They should not consume significant resources.
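The dependency-aware property deserves a concrete shape. Below is one hedged sketch of a check that only fails on critical dependencies, while still reporting non-critical ones as degraded; the dependency names and the `(name, is_critical, probe)` convention are illustrative, not a standard API.

```python
def check_health(dependencies):
    """dependencies: list of (name, is_critical, probe) tuples,
    where probe() raises on failure."""
    degraded = []
    healthy = True
    for name, is_critical, probe in dependencies:
        try:
            probe()
        except Exception:
            degraded.append(name)
            if is_critical:
                healthy = False  # only critical failures fail the check
    status = 200 if healthy else 503
    return status, {"degraded": degraded}

def ok():
    pass

def down():
    raise ConnectionError("unreachable")

# A non-critical analytics outage is reported but does not fail the check.
status, body = check_health([
    ("database", True, ok),
    ("analytics", False, down),
])
```

This keeps the check honest without letting a flaky analytics sink take your whole service out of rotation.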
Pillar 2: Failover Strategies
Failover is the mechanism by which your system continues operating when individual components fail. The goal is to detect failures quickly, route traffic to healthy components, and do so transparently to users. Different parts of the system require different failover strategies.
Application Layer Failover
- Load balancers with health-checked node pools. Multiple instances of your application run in parallel. A load balancer distributes traffic to healthy nodes and removes unhealthy ones based on health checks. If a node fails, traffic continues flowing to the survivors.
- Active/active multi-region. Application instances run in multiple regions simultaneously. A regional outage takes down one set of instances, while the instances in the remaining regions keep serving traffic.
- Circuit breakers. When a downstream service starts failing, circuit breakers stop sending requests to it temporarily. This prevents cascading failures and gives the failing service time to recover.
- Graceful degradation. When a non-critical feature fails, disable it and serve a reduced experience instead of returning an error. Users get a partial service rather than nothing.
Database Failover
- Primary-replica replication. One primary database handles writes; replicas handle reads. If the primary fails, a replica is promoted to primary. Most managed database services support automated failover; self-hosted databases typically need additional tooling for it.
- Multi-master replication. Multiple databases accept writes simultaneously. More complex but eliminates the single point of failure of a primary.
- Backup and point-in-time recovery. Regular backups stored in a separate location. Critical for disaster recovery scenarios that automated failover cannot handle.
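To make the primary-replica idea concrete, here is a deliberately simplified sketch of client-side failover: if the primary stops responding, the first healthy replica is promoted. The `Node` objects and `is_alive()` probe are placeholders; in practice this logic lives in your database driver or an orchestrator, not in application code.

```python
class FailoverPool:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def writer(self):
        if self.primary.is_alive():
            return self.primary
        # Primary is down: promote the first healthy replica.
        for i, replica in enumerate(self.replicas):
            if replica.is_alive():
                self.primary = self.replicas.pop(i)
                return self.primary
        raise RuntimeError("no healthy database node")

class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def is_alive(self):
        return self.alive  # a real probe would ping the server

pool = FailoverPool(Node("db-1", alive=False), [Node("db-2")])
```

Note what this sketch glosses over: real promotion must also ensure the replica has caught up on replication and that the old primary cannot accept writes after recovery (the "split brain" problem), which is why tested tooling matters here.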
DNS Failover
- Low TTLs. Set DNS TTL low enough (60-300 seconds) that DNS changes propagate quickly during failover.
- Health-checked DNS. Services like AWS Route 53 and Cloudflare Load Balancing offer DNS-level health checks with automatic failover. Traffic is automatically redirected away from unhealthy origins.
- Multiple A records. DNS can return multiple IPs for the same name. Clients typically try the returned addresses in turn, providing client-side failover even without sophisticated load balancing.
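The multiple-A-records behavior can be seen directly in how a careful client connects: resolve the name, then try each returned address until one succeeds. This sketch uses only the Python standard library; the hostname and port in any real use would be your own.

```python
import socket

def connect_any(host, port, timeout=2.0):
    """Try every address DNS returns for host until one connects."""
    last_error = None
    # getaddrinfo returns one entry per A/AAAA record for the name.
    for family, socktype, proto, _canon, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # this address failed; fall through to the next
    raise last_error or OSError(f"no addresses for {host}")
```

This is the same pattern browsers use (often called "happy eyeballs" in its parallel form): a dead address costs one connection attempt, not an outage.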
Pillar 3: Redundant Paths and Systems
Redundancy means there are multiple ways for any operation to succeed. If one path fails, another takes over. Redundancy is what eliminates single points of failure — the components whose failure causes the entire system to collapse.
Where to Add Redundancy
- Multiple application instances. Never run only one instance of any service. At minimum, run two so a failure of one does not take down the whole service.
- Multiple database replicas. One primary plus at least one replica. Configure automatic failover. Test the failover regularly to make sure it works.
- Multiple availability zones. Spread infrastructure across at least two AZs in your cloud provider. AZ outages happen and are usually limited to one zone at a time.
- Multiple regions for critical services. For services where availability is paramount, run in multiple regions with global traffic routing.
- Mirror storage and caches. Backup storage in a different region. Cache replicas to handle cache server failures gracefully.
- Multiple DNS providers. If your DNS provider has an outage, you go offline. Using two DNS providers in parallel eliminates this single point of failure (though it adds complexity).
- Backup payment processors. If you depend on a payment processor, have a fallback. Stripe and PayPal both go down occasionally; not being able to take payments during peak shopping is devastating.
- Multiple email providers. Critical for transactional email. If your email provider has an outage during your launch day, customer signups silently break.
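The fallback pattern behind backup payment processors and email providers is the same: try providers in order and stop at the first success. A minimal sketch, with illustrative provider names:

```python
def send_with_fallback(message, providers):
    """providers: ordered list of (name, send) pairs, where send(message)
    raises on failure. Returns the name of the provider that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:
            errors.append((name, str(exc)))  # record and try the next one
    raise RuntimeError(f"all providers failed: {errors}")

def primary_down(message):
    raise TimeoutError("primary provider unreachable")

sent_via = send_with_fallback("welcome!", [
    ("primary-email", primary_down),
    ("backup-email", lambda m: None),
])
```

One caution: for side-effecting operations like payments, make the call idempotent (e.g. with a client-generated request ID) before adding retries, so a timeout on the primary cannot turn into a double charge via the backup.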
Avoiding Hidden Single Points of Failure
The most dangerous single points of failure are the ones you do not realize exist. Common examples:
- Shared infrastructure. Your "redundant" servers might all share a single network switch, power circuit, or hypervisor. When that shared component fails, everything fails together.
- DNS providers. All your services depend on DNS. A DNS provider outage takes down everything.
- Certificate authorities. If your CA has renewal problems, your TLS certificates eventually expire, and users see browser security warnings instead of your site.
- NTP servers. Many services break when system time drifts. NTP outages cause cascading failures in surprising ways.
- Authentication services. If everything depends on a single identity provider, that provider's outage takes down all services that require login.
External Monitoring: The Outside-In View
Internal health checks tell you what your system thinks about itself. External monitoring tells you what the world sees. The difference matters because they often disagree.
A service can pass all its internal health checks while users cannot reach it because of a network problem, DNS issue, or firewall misconfiguration. Internal monitoring would never catch these — only external monitoring sees what users see. This is why external monitoring is the essential complement to internal health checks, not a replacement.
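To make the outside-in idea concrete, here is a sketch of the kind of check an external monitor performs: fetch a public URL over the real network path and judge status and latency, independent of anything the service reports about itself. This is an illustration using the Python standard library, not UptyBots' implementation; the thresholds are placeholders.

```python
import time
import urllib.request

def external_check(url, expect_status=200, max_seconds=2.0):
    """Probe a URL from outside and report what a user would experience."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            elapsed = time.monotonic() - start
            ok = resp.status == expect_status and elapsed <= max_seconds
            return {"up": ok, "status": resp.status,
                    "seconds": round(elapsed, 3)}
    except Exception as exc:
        # DNS failures, refused connections, TLS errors, and timeouts
        # all land here -- exactly the failures internal checks miss.
        return {"up": False, "error": str(exc)}
```

Because this probe traverses DNS, TLS, firewalls, and the public network, it fails in all the situations where an internal health check would still happily return 200 OK.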
UptyBots provides exactly this outside-in view. Our monitoring nodes connect to your services from the public internet, just like real users, and verify that they actually work. We track HTTP response codes, response times, SSL certificate validity, port availability, and more — all from outside your infrastructure.
Combined with internal health checks and a redundant architecture, external monitoring gives you confidence that your reliability mechanisms are actually working. When a failover happens, external monitoring confirms that traffic continued flowing to healthy nodes. When a health check changes state, external monitoring shows whether real users were affected. This complete picture is what separates teams who think their architecture is reliable from teams who know it is.
Putting It All Together
A reliability-first architecture combines all four elements:
- Health checks at every layer. Liveness, readiness, and deep checks for each service.
- Automated failover for stateful and stateless components. Load balancers, multi-AZ databases, DNS failover.
- Redundancy that eliminates single points of failure. Multiple instances, multiple zones, multiple regions where critical.
- External monitoring as the final verification layer. UptyBots watching the system from outside, alerting on real user-facing failures.
The result is a system that survives the inevitable failures of individual components, recovers from outages in seconds rather than hours, and provides the kind of uptime that real businesses depend on. Reliability is not magic — it is the result of thoughtful engineering choices made consistently over time.
Frequently Asked Questions
How much does a reliability-first architecture cost?
More than a single-server deployment, but less than the cost of frequent outages. The investment scales with the criticality of your service. A simple blog can get away with minimal redundancy; a payment processor needs the full stack. Most growing businesses find that the cost of basic reliability (load balancer, multi-AZ, monitoring) is well below the cost of the outages it prevents.
Can I add reliability to an existing system?
Yes, but it is harder than building it in from the start. The order to follow is: external monitoring first (so you can measure baseline reliability), then health checks at the application layer, then failover for the most critical components, then redundancy across the rest of the system. Move incrementally and verify each step actually improves reliability before moving on.
What about chaos engineering?
Chaos engineering — deliberately introducing failures to test resilience — is the next step after building reliability mechanisms. It verifies that your failover and redundancy actually work under real failure conditions, not just in theory. Start small (terminate a single instance and verify recovery) and grow from there.
How does UptyBots fit in?
UptyBots provides the external monitoring layer that completes a reliability-first architecture. Configure HTTP, API, port, SSL, and domain monitors to watch every public-facing component of your system. Receive instant alerts when something fails. Track historical uptime to verify your reliability mechanisms are working. The free tier covers most small to medium architectures, and paid plans scale for the largest deployments.
Conclusion
Reliability is not a feature you ship — it is a property of your architecture. Build it in from the start with health checks, failover, redundancy, and external monitoring, and your system will gracefully handle the failures that affect every other production system. Skip these steps and you will spend years firefighting outages that proper architecture would have prevented. The choice is made when you write your first lines of code.
Start improving your uptime today: See our tutorials or choose a plan.