High Availability

High availability (HA) keeps network services operational with minimal downtime through redundancy, failover mechanisms, and fault tolerance. It is critical for business continuity wherever network outages directly affect revenue and operations.

Core HA Principles

Eliminate single points of failure - Every critical component needs backup
Detect failures quickly - Use monitoring protocols like BFD (Bidirectional Forwarding Detection)
Recover automatically - Manual intervention increases downtime
Maintain service during transitions - Users shouldn’t notice failover events

First Hop Redundancy Protocols (FHRP)

Default gateway redundancy removes the first-hop router as a Layer 3 single point of failure:

HSRP (Hot Standby Router Protocol) - Cisco proprietary, uses virtual IP/MAC
- Active/Standby model with priority-based election
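- Default priority 100; higher priority wins the election, but preempt must be enabled for a higher-priority router to take over from a working active router
- Hellos every 3 seconds, hold time 10 seconds (defaults)
VRRP (Virtual Router Redundancy Protocol) - Industry standard (IETF RFC 3768 for VRRPv2, RFC 5798 for VRRPv3)
- Master/Backup terminology, uses virtual MAC 0000.5e00.01xx (xx = virtual router ID in hex)
- Configurable priority range 1-254 (default 100); the IP address owner uses priority 255 and automatically becomes master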
GLBP (Gateway Load Balancing Protocol) - Cisco proprietary with load balancing
- Single virtual IP with multiple virtual MACs for traffic distribution
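The priority and preempt rules above can be sketched as a small election function. This is a simplified model rather than a protocol implementation: the `Router` class, its fields, and `elect_active` are hypothetical names, and the model assumes the usual HSRP tie-break of highest interface IP when priorities are equal.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional


@dataclass(frozen=True)
class Router:
    """Hypothetical model of one FHRP group member."""
    name: str
    priority: int           # HSRP default is 100
    ip: str                 # interface IP, used only as a tie-breaker
    preempt: bool = False   # must be enabled to reclaim the active role


def elect_active(candidates: list[Router],
                 current: Optional[Router] = None) -> Router:
    """Return the router that holds the active role after an election."""
    # Highest priority wins; highest interface IP breaks ties.
    best = max(candidates, key=lambda r: (r.priority, IPv4Address(r.ip)))
    if current is None or current not in candidates:
        # No surviving active router: the best candidate takes over.
        return best
    if best != current and best.preempt and best.priority > current.priority:
        # Preemption lets a higher-priority router reclaim the role.
        return best
    # Without preempt, the existing active router keeps the role.
    return current
```

In this model, a recovered router with priority 110 only displaces an active router with priority 100 if `preempt=True`. Without it, the lower-priority router stays active until it fails.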

Protocol	Standard	Load Balancing	Virtual MAC	Default Priority
HSRP	Cisco	No	0000.0c07.acxx	100
VRRP	IETF (RFC 3768)	No	0000.5e00.01xx	100
GLBP	Cisco	Yes	0007.b400.xxyy	100
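The virtual MAC patterns in the table are derived from the group number, which a short sketch can make concrete. The function names are hypothetical, and the code assumes HSRP version 1 (groups 0-255) and the IPv4 VRRP prefix. GLBP is left out because its address also encodes a forwarder number.

```python
def hsrp_v1_mac(group: int) -> str:
    """Virtual MAC for an HSRP version 1 group (0000.0c07.acxx)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group must be in the range 0-255")
    # The last byte is the group number in hexadecimal.
    return f"0000.0c07.ac{group:02x}"


def vrrp_mac(vrid: int) -> str:
    """Virtual MAC for an IPv4 VRRP virtual router (0000.5e00.01xx)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRRP virtual router ID must be in the range 1-255")
    # The last byte is the virtual router ID in hexadecimal.
    return f"0000.5e00.01{vrid:02x}"
```

For example, HSRP group 10 maps to 0000.0c07.ac0a, and VRRP virtual router 1 maps to 0000.5e00.0101.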

Link Redundancy Technologies

EtherChannel/Port Channel - Bundles multiple physical links into logical interface
- LACP (802.3ad) industry standard vs PAgP (Cisco proprietary)
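- Load balances per flow using a hash of the configured fields (for example src-dst-ip, src-dst-mac, or src-dst-port), so a single flow always uses one member link
- All member links must have the same speed, duplex, and VLAN (access/trunk) configuration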
Spanning Tree Protocol (STP) - Prevents Layer 2 loops while maintaining redundancy
- Blocks redundant paths until primary fails
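- RSTP (802.1w) typically converges within a few seconds, versus 30-50 seconds for legacy 802.1D STP with default timers
- PVST+ runs one STP instance per VLAN; MST (802.1s) maps groups of VLANs to a smaller number of instances, so different VLANs can use different forwarding paths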
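EtherChannel load balancing is per flow, not per packet, and a minimal sketch shows why. It assumes a src-dst-ip method modeled as an XOR of the two addresses reduced modulo the number of member links. Real switches use platform-specific hash functions, so `pick_member_link` is only an illustrative, hypothetical helper.

```python
from ipaddress import IPv4Address


def pick_member_link(src_ip: str, dst_ip: str, link_count: int) -> int:
    """Return the index of the member link a flow would use (simplified)."""
    if link_count < 1:
        raise ValueError("a port channel needs at least one member link")
    # XOR the addresses so the result is the same in both directions.
    flow_hash = int(IPv4Address(src_ip)) ^ int(IPv4Address(dst_ip))
    # Reduce the hash to one of the available member links.
    return flow_hash % link_count
```

Because the hash depends only on the address pair, every packet of a flow stays on one link and arrives in order. The trade-off is that one heavy flow cannot use more than a single member's bandwidth.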

Router Redundancy Methods

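Floating Static Routes - Configure backup routes with a higher administrative distance (AD)
- The primary static route keeps the default AD of 1; the backup uses a higher AD (for example 5 or more) so it is installed only when the primary is withdrawn. If the primary is learned from a routing protocol, the backup AD must exceed that protocol's AD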
Dynamic Routing Convergence - OSPF, EIGRP automatically reroute around failures
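- OSPF LSA flooding triggers SPF recalculation on every router in the affected area
- A precomputed EIGRP feasible successor is installed immediately on failure, without a query, enabling sub-second convergence
BGP Multihoming - Multiple ISP connections with policy-based path selection
- Use AS path prepending to influence inbound traffic and local preference to choose the outbound exit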
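The floating static route behavior described above comes down to "install the usable route with the lowest administrative distance". The sketch below models only that rule. `StaticRoute`, its `reachable` flag, and `select_route` are hypothetical names, and the addresses are documentation prefixes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticRoute:
    """Hypothetical model of one configured static route."""
    prefix: str
    next_hop: str
    admin_distance: int = 1   # default AD for a static route
    reachable: bool = True    # assumed flag: next hop/interface is up


def select_route(routes: list[StaticRoute],
                 prefix: str) -> Optional[StaticRoute]:
    """Return the route that would be installed for a prefix, if any."""
    # Only routes whose next hop is usable compete for installation.
    usable = [r for r in routes if r.prefix == prefix and r.reachable]
    # The lowest administrative distance wins; None if nothing is usable.
    return min(usable, key=lambda r: r.admin_distance, default=None)
```

With a primary default route at AD 1 and a backup at AD 5, the backup is returned only once the primary is marked unreachable. In the routing table, the floating route likewise stays hidden until it is needed.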

Vocabulary

RTO (Recovery Time Objective) - Maximum acceptable downtime duration
RPO (Recovery Point Objective) - Maximum acceptable data loss timeframe
MTTR (Mean Time To Repair) - Average time to restore service after failure
MTBF (Mean Time Between Failures) - Average operational time between failures
BFD (Bidirectional Forwarding Detection) - Fast failure detection protocol (sub-second)
Preemption - Higher priority device taking over from lower priority active device
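MTBF and MTTR combine into the familiar steady-state availability figure, MTBF / (MTBF + MTTR). The sketch below works through that arithmetic. The function names are hypothetical, and the redundancy calculation assumes component failures are independent, which real deployments only approximate.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


def parallel_availability(avail: float, copies: int) -> float:
    """Availability of redundant copies, assuming independent failures."""
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - avail) ** copies
```

As a worked example, an MTBF of 999 hours with an MTTR of 1 hour gives 99.9% availability, or about 8.76 hours of downtime per year. Under the independence assumption, two such devices in parallel reach roughly 99.9999%.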

Notes

FHRP virtual IPs must be in the same subnet as the hosts they serve - an FHRP group covers a single subnet, so each VLAN needs its own group and virtual IP
EtherChannel requires identical port configurations - mismatched settings can leave ports suspended or err-disabled
Legacy 802.1D STP convergence can take 30-50 seconds with default timers - use RSTP for faster recovery
BFD reduces failure detection from seconds to milliseconds when used with routing protocols
For Internet connectivity, dual ISPs with BGP provide better redundancy than a single ISP with dual circuits, because they also survive a provider-wide outage
Always test failover scenarios - automatic failover that’s never tested often fails when needed
Consider geographic redundancy for disaster recovery - local redundancy won’t help with site-wide outages
Monitor HA mechanisms actively - silent failures in standby systems are common and dangerous
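The timing figures in these notes follow from simple timer arithmetic, sketched below. The STP defaults (max age 20 s, forward delay 15 s) are the standard 802.1D values. The BFD calculation is a simplification that treats detection time as the transmit interval times the detect multiplier and ignores interval negotiation. Both function names are hypothetical.

```python
def stp_convergence_seconds(max_age: int = 20,
                            forward_delay: int = 15,
                            direct_failure: bool = False) -> int:
    """Worst-case legacy 802.1D convergence time with the given timers."""
    # A port spends one forward delay in listening and one in learning.
    listening_and_learning = 2 * forward_delay
    if direct_failure:
        # A directly detected link failure skips the max-age wait.
        return listening_and_learning
    # An indirect failure must first age out the stored BPDU.
    return max_age + listening_and_learning


def bfd_detection_ms(tx_interval_ms: int, multiplier: int) -> int:
    """Approximate BFD detection time: interval times detect multiplier."""
    return tx_interval_ms * multiplier
```

With default timers this gives 30 seconds for a direct failure and 50 seconds for an indirect one. A BFD session at a 50 ms interval with a multiplier of 3 detects a failure in about 150 ms.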