High Availability

High availability (HA) keeps network services operational with minimal downtime through redundancy, failover mechanisms, and fault tolerance. It is critical for business continuity wherever network outages directly affect revenue and operations.

Core HA Principles

  • Eliminate single points of failure - Every critical component needs backup
  • Detect failures quickly - Use monitoring protocols like BFD (Bidirectional Forwarding Detection)
  • Recover automatically - Manual intervention increases downtime
  • Maintain service during transitions - Users shouldn’t notice failover events

First Hop Redundancy Protocols (FHRP)

Default gateway redundancy prevents Layer 3 single point of failure:

  • HSRP (Hot Standby Router Protocol) - Cisco proprietary, uses virtual IP/MAC
    • Active/Standby model with priority-based election
    • Default priority 100; higher priority wins; preemption must be enabled for a higher-priority router to reclaim the active role
    • Hellos every 3 seconds, hold timer 10 seconds (defaults)
  • VRRP (Virtual Router Redundancy Protocol) - Open IETF standard (RFC 3768; VRRPv3 is RFC 5798)
    • Master/Backup terminology, uses virtual MAC 0000.5e00.01xx
    • Configurable priority range 1-254; the IP address owner uses priority 255 and automatically becomes master
  • GLBP (Gateway Load Balancing Protocol) - Cisco proprietary with load balancing
    • Single virtual IP with multiple virtual MACs for traffic distribution

Protocol  Standard           Load Balancing  Virtual MAC     Default Priority
HSRP      Cisco proprietary  No              0000.0c07.acxx  100
VRRP      IETF (RFC 3768)    No              0000.5e00.01xx  100
GLBP      Cisco proprietary  Yes             0007.b400.xxyy  100
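
The virtual MAC patterns and the priority/preemption rules above can be sketched in a few lines of Python. This is an illustrative model, not vendor code: the names `virtual_mac`, `Router`, and `elect_active` are hypothetical, groups are limited to 0-255 so they fit the MAC patterns in the table, and the highest-IP tie-breaker follows HSRP behavior.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


def virtual_mac(protocol: str, group: int, forwarder: int = 1) -> str:
    """Return the well-known virtual MAC for an FHRP group (Cisco dotted notation)."""
    if not 0 <= group <= 255:
        raise ValueError("this sketch only handles groups 0-255")
    if protocol == "hsrp":   # HSRP version 1: 0000.0c07.acXX
        return f"0000.0c07.ac{group:02x}"
    if protocol == "vrrp":   # VRRP for IPv4: 0000.5e00.01XX
        return f"0000.5e00.01{group:02x}"
    if protocol == "glbp":   # GLBP: 0007.b400.XXYY (XX = group, YY = forwarder)
        return f"0007.b400.{group:02x}{forwarder:02x}"
    raise ValueError(f"unknown protocol: {protocol}")


@dataclass(frozen=True)
class Router:
    name: str
    ip: str
    priority: int = 100     # default priority for HSRP, VRRP, and GLBP
    preempt: bool = False


def elect_active(current: Optional[Router], alive: list[Router]) -> Router:
    """Pick the active router among those still sending hellos."""
    # Highest priority wins; the highest interface IP breaks ties.
    best = max(alive, key=lambda r: (r.priority, ipaddress.ip_address(r.ip)))
    if current is None or current not in alive:
        return best          # no active router left: plain election
    if best.preempt and best.priority > current.priority:
        return best          # preemption enabled: higher priority takes over
    return current           # otherwise the incumbent stays active


r1 = Router("R1", "10.0.0.2", priority=110, preempt=True)
r2 = Router("R2", "10.0.0.3")
print(virtual_mac("hsrp", 10))          # 0000.0c07.ac0a
print(elect_active(r2, [r1, r2]).name)  # R1 (preempts the lower-priority active)
```

Without `preempt=True` on R1, the second call would leave R2 active even though R1 has the higher priority, which is why preemption is usually enabled on the intended primary gateway.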

Link and Layer 2 Redundancy

  • EtherChannel/Port Channel - Bundles multiple physical links into logical interface
    • LACP (802.3ad) industry standard vs PAgP (Cisco proprietary)
    • Load balances based on src-dst-ip, src-dst-mac, or src-dst-port
    • All links must have same speed, duplex, VLAN configuration
  • Spanning Tree Protocol (STP) - Prevents Layer 2 loops while maintaining redundancy
    • Blocks redundant paths until primary fails
    • RSTP (802.1w) typically converges within a few seconds, vs 30-50 seconds for legacy STP (802.1D)
    • PVST+ runs one STP instance per VLAN; MST (802.1s) maps groups of VLANs to a small number of instances
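
To illustrate how a src-dst-ip load-balancing method spreads traffic across an EtherChannel, here is a minimal sketch. Real switches use platform-specific hash functions; the XOR-and-modulo scheme and the name `select_link` are assumptions chosen for clarity.

```python
import ipaddress


def select_link(src_ip: str, dst_ip: str, link_count: int) -> int:
    """Map a flow to a member link index using a simple src-dst-ip hash."""
    if link_count < 1:
        raise ValueError("bundle needs at least one link")
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    # XOR makes the hash symmetric: both directions of a flow pick the same link.
    return (src ^ dst) % link_count


# Every packet of one flow lands on the same link, so ordering is preserved.
print(select_link("10.0.0.1", "10.0.0.2", 2))  # 1
print(select_link("10.0.0.1", "10.0.0.3", 2))  # 0
```

Because the choice is made per flow rather than per packet, a single large flow never uses more than one member link. Picking a hash input that varies across the actual traffic (for example ports when most traffic runs between two hosts) gives a more even spread.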

Router Redundancy Methods

  • Floating Static Routes - Backup static routes configured with a higher administrative distance (AD)
    • Primary route at AD 1 (the static default) and backup at AD 5 or higher; the backup is installed only when the primary is withdrawn
  • Dynamic Routing Convergence - OSPF, EIGRP automatically reroute around failures
    • OSPF LSA flooding triggers SPF recalculation on every router in the affected area
    • EIGRP feasible successors enable sub-second convergence
  • BGP Multihoming - Connections to multiple ISPs, with BGP attributes used to steer traffic
    • AS-path prepending influences inbound traffic; local preference selects the outbound exit
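
The floating static route behavior can be modeled as "install the usable route with the lowest administrative distance". The sketch below assumes a route is usable only while its next hop is reachable; `StaticRoute` and `best_route` are hypothetical names, and the addresses are documentation examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticRoute:
    prefix: str
    next_hop: str
    admin_distance: int = 1   # default AD for a static route


def best_route(routes: list[StaticRoute],
               reachable: set[str]) -> Optional[StaticRoute]:
    """Return the usable route with the lowest AD, or None if none is usable."""
    usable = [r for r in routes if r.next_hop in reachable]
    return min(usable, key=lambda r: r.admin_distance, default=None)


primary = StaticRoute("0.0.0.0/0", "192.0.2.1")                    # AD 1
backup = StaticRoute("0.0.0.0/0", "198.51.100.1", admin_distance=5)

# Both next hops up: the primary wins on AD.
print(best_route([primary, backup], {"192.0.2.1", "198.51.100.1"}).next_hop)
# Primary next hop lost: the floating route takes over.
print(best_route([primary, backup], {"198.51.100.1"}).next_hop)
```

On real routers the primary is only withdrawn when the router notices the failure, typically because the outgoing interface goes down or a tracked object fails, so a floating static route is only as fast as that detection.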

Vocabulary

  • RTO (Recovery Time Objective) - Maximum acceptable downtime duration
  • RPO (Recovery Point Objective) - Maximum acceptable data loss timeframe
  • MTTR (Mean Time To Repair) - Average time to restore service after failure
  • MTBF (Mean Time Between Failures) - Average operational time between failures
  • BFD (Bidirectional Forwarding Detection) - Fast failure detection protocol (sub-second)
  • Preemption - Higher priority device taking over from lower priority active device
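
MTBF and MTTR combine into the usual steady-state availability figure, availability = MTBF / (MTBF + MTTR). The sketch below also shows why redundancy helps, under the simplifying assumption that redundant components fail independently. The function names and the sample numbers are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - single) ** copies


def annual_downtime_minutes(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1 - avail) * 365 * 24 * 60


a = availability(mtbf_hours=999, mttr_hours=1)         # 0.999 ("three nines")
print(f"{a:.3f} -> {annual_downtime_minutes(a):.1f} min/year")
print(f"{parallel_availability(0.99, 2):.4f}")         # two 99% devices in parallel
```

Independent failure is an optimistic assumption: shared power, shared software bugs, or a shared site can take out both copies at once.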

Notes

  • The FHRP virtual IP must be in the same subnet as the hosts it serves - each VLAN/subnet needs its own FHRP group and virtual IP
  • EtherChannel requires identical port configurations - mismatched settings can leave member ports suspended or err-disabled
  • STP convergence can take 30-50 seconds with default timers - use RSTP for faster recovery
  • BFD reduces failure detection from seconds to milliseconds when used with routing protocols
  • For Internet connectivity, dual ISPs with BGP provide better redundancy than a single ISP with dual circuits
  • Always test failover scenarios - automatic failover that’s never tested often fails when needed
  • Consider geographic redundancy for disaster recovery - local redundancy won’t help with site-wide outages
  • Monitor HA mechanisms actively - silent failures in standby systems are common and dangerous
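
The timing figures in these notes follow from simple arithmetic, sketched below. The STP defaults (max age 20 s, forward delay 15 s) are the standard 802.1D values. The BFD interval and multiplier are example settings, not defaults from the text, and the formula ignores per-direction negotiation.

```python
def bfd_detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Simplified BFD detection time: negotiated interval x detect multiplier."""
    return interval_ms * multiplier


def stp_convergence_range_s(max_age: int = 20,
                            forward_delay: int = 15) -> tuple[int, int]:
    """Legacy STP convergence with given timers, as (direct, indirect) failure."""
    direct = 2 * forward_delay              # listening + learning
    indirect = max_age + 2 * forward_delay  # wait for max age to expire first
    return direct, indirect


print(bfd_detection_time_ms(50, 3))   # 150 ms with a 50 ms interval, multiplier 3
print(stp_convergence_range_s())      # (30, 50) seconds with default timers
```

For comparison, the HSRP default hold timer of 10 seconds is more than 60 times slower than the 150 ms BFD example, which is why BFD is commonly paired with routing protocols and FHRPs where the platform supports it.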