High-Availability Architecture

Motivation & Why We Need HA

Many clients have thousands to hundreds of thousands of subscribers — load and scale demands are real.
Single-server deployments lead to resource saturation (CPU, RAM, I/O) — risk of outages or performance degradation.
Growing demand for carrier-grade reliability and 24/7 uptime — especially for LTE (PCRF/HSS/OCS) and real-time services (IPTV, VoIP).
HA provides redundancy, failover, load balancing, and zero-downtime maintenance, which protects revenue streams and network stability.

Key Layers to Protect (What HA Covers in WISPGate Ecosystem)

Layer / Component	Role / What It Handles
AAA / Policy / Charging (PCRF/HSS/OCS / RADIUS / DHCP / SIP-auth)	Real-time subscriber control, authentication, policy, quota, billing triggers
BSS / OSS Core (WISPGate API, UI, Mediation, Rating, Billing)	Business logic: user management, billing, reporting, integrations, DB
Data-Plane (User Traffic) (EPC, BNG, BRAS, UPF, NAT, routing)	Not managed by WISPGate, but must remain unaffected if control-plane fails
Supporting Services (Cache, Queue, DB, Logs)	State, sessions, queues, data consistency, CDR persistence

Our HA design ensures that the failure of a single node does not affect subscriber control, billing, or service availability.

HA Option Matrix

#	Model	Best For
1	Vertical Scale (Single Large Server)	Small / low-criticality deployments
2	Single-Site Active/Passive (N+1)	Medium-size clients (~100–150k subs)
3	Single-Site Active/Active (Scale-Out)	Large clients (100k–500k+ subs)
4	Dual-Site Active/Passive (DR)	National operators needing DR/BC plans
5	Dual-Site Active/Active (Geo)	Carrier-scale multi-regional deployments

Option 1 — Vertical Scale (Single Large Server)

Pros

Fastest, cheapest initial step
No architectural complexity — minimal operational overhead

Cons

Single point of failure → full outage on crash or hardware fault
No redundancy, no failover, no rolling upgrade, limited scalability

When to Use

Short-term "bridge" until HA deployment
Very small subscriber base or low-criticality deployments

Option 2 — Single-Site, Active/Passive HA (N+1)

2 WISPGate app nodes behind a load-balancer (active/passive)
Primary DB + synchronous standby DB (auto-failover)
Clustered cache/queue layer (Redis, RabbitMQ)
2 AAA / RADIUS / DHCP / PCRF / HSS / OCS nodes — failover configured
All referencing the same shared DB & data store
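The active/passive behaviour described above can be illustrated with a minimal sketch. It is not WISPGate code: the Node class, the pick_active function, and the node names are hypothetical, and health is a plain flag rather than a real health check. The sketch shows one common policy, where the active node is kept while it is healthy and a standby is promoted only when it fails.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical app node; 'healthy' stands in for a real health check."""
    name: str
    healthy: bool = True


def pick_active(nodes: List[Node], current: Optional[Node]) -> Optional[Node]:
    """Keep the current active node while it is healthy; otherwise promote
    the first healthy node in the list. Returns None if nothing is healthy."""
    if current is not None and current.healthy:
        return current
    for node in nodes:
        if node.healthy:
            return node
    return None


app1, app2 = Node("app-1"), Node("app-2")
active = pick_active([app1, app2], None)    # app-1 becomes active
app1.healthy = False                        # simulate a crash of app-1
active = pick_active([app1, app2], active)  # app-2 is promoted
print(active.name)
```

Because the current node is preferred whenever it is healthy, a recovered app-1 does not automatically take traffic back, which avoids a second switchover. Whether automatic failback is wanted is a deployment decision.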

Pros

Strong availability improvement vs a single server
Simple to implement and manage
Works for LTE and fixed-access (non-LTE) environments
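As a rough worked example of the availability improvement, the standard formula for redundant nodes can be evaluated directly. The 99% single-node figure is a hypothetical input, not a measured WISPGate value, and the formula assumes independent failures and instantaneous failover, so real results will be lower.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a group that is up while at least one node is up,
    assuming independent failures and instantaneous failover."""
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99                             # hypothetical single-node availability
pair = combined_availability(single, 2)   # two nodes, either one is enough

print(f"single node: {downtime_hours_per_year(single):.1f} h/year")
print(f"N+1 pair:    {downtime_hours_per_year(pair):.2f} h/year")
```

Under these assumptions the expected downtime drops from about 87.6 hours per year to under one hour. Shared dependencies such as the database, power, or network make failures correlated, which is why the DB standby and the clustered cache/queue layer matter as much as the second app node.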

Cons

Failover may cause a short interruption (DB switchover)
Capacity still limited by a single active DB node
No protection against the failure of the entire data center

Best For

Medium-size customers (up to ~100–150k subs)
Clients needing reliable uptime but not yet at carrier-scale

Option 3 — Single-Site, Active/Active (Scale-Out HA)

Multiple stateless WISPGate app/worker nodes behind LB — horizontal scaling
DB cluster (multi-node, master + replicas / Galera / cluster manager)
Redis / queue cluster for caching, sessions, tasks
Multiple AAA / RADIUS / PCRF / HSS / OCS nodes all active — load distribution
Diameter / RADIUS / SIP endpoints configured across all nodes
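Load distribution across all-active nodes can be sketched as a health-aware round robin. This is an illustrative sketch only: the RoundRobinBalancer class and the node names are hypothetical, and a production load balancer or Diameter routing agent would add weights, connection draining, and real health probes.

```python
from itertools import count
from typing import Dict, List


class RoundRobinBalancer:
    """Hypothetical balancer that rotates requests over healthy nodes only."""

    def __init__(self, nodes: List[str]) -> None:
        self.health: Dict[str, bool] = {n: True for n in nodes}
        self._counter = count()

    def mark(self, node: str, healthy: bool) -> None:
        """Record the result of a health check for one node."""
        self.health[node] = healthy

    def next_node(self) -> str:
        """Return the next healthy node in rotation."""
        healthy = [n for n, ok in self.health.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        return healthy[next(self._counter) % len(healthy)]


lb = RoundRobinBalancer(["aaa-1", "aaa-2", "aaa-3"])
print([lb.next_node() for _ in range(3)])  # each node is used once
lb.mark("aaa-2", False)                    # aaa-2 fails its health check
print([lb.next_node() for _ in range(4)])  # only aaa-1 and aaa-3 are used
```

When a node is marked unhealthy, the remaining nodes absorb its share of traffic, so each node needs enough headroom to carry the load of at least one failed peer.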

Pros

Real throughput scaling — can handle large subscriber counts and high request rates
Smooth rolling upgrades, no downtime for maintenance
Better performance isolation (billing load, AAA load, UI load separated)

Cons

Increased complexity — requires mature DevOps / monitoring / orchestration
DB cluster management, split-brain risk, replication consistency challenges
Configuration errors can cause worse failures than in a single-node deployment

Best For

Large clients (100k–500k+ subs)
Deployments expecting growth, heavy traffic (especially LTE + real-time billing), or multiple services (IPTV, VoIP, Internet)

Option 4 — Dual-Site, Active/Passive (DR / Disaster Recovery + HA)

Primary data center with full HA stack (as Option 2 or 3)
Secondary data center (warm standby) — replicate DB asynchronously
Idle or limited service load on the secondary DC
Network elements (BNG, MME, NAS, Diameter peers) configured with primary + secondary endpoints

Pros

Protection against the failure of an entire data center
Secondary DC can be scaled down (cost-efficient warm standby)
Clear Disaster Recovery path

Cons

Potential data loss during failover due to async replication (RPO)
Requires disciplined DR procedures (manual or scripted switchover)
Extra cost for duplicate infrastructure and maintenance overhead
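The data-loss exposure from asynchronous replication can be estimated with simple arithmetic: the records at risk are roughly the replication lag multiplied by the write rate. The function names and the numbers below are hypothetical and only illustrate the calculation. Real lag varies over time and should be measured on the actual link between the two data centers.

```python
def records_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Approximate number of committed writes not yet on the standby
    if the primary site is lost at this moment."""
    return replication_lag_s * writes_per_s


def meets_rpo(replication_lag_s: float, rpo_target_s: float) -> bool:
    """True when the observed lag is within the agreed RPO target."""
    return replication_lag_s <= rpo_target_s


lag_s = 5.0           # hypothetical observed replication lag, in seconds
write_rate = 200.0    # hypothetical CDR/charging writes per second
rpo_target_s = 30.0   # hypothetical RPO agreed in the DR plan

print(records_at_risk(lag_s, write_rate))  # writes that could be lost
print(meets_rpo(lag_s, rpo_target_s))      # whether lag is within target
```

Monitoring the lag against the RPO target continuously, rather than only during DR drills, shows whether the promised recovery point is actually achievable at peak load.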

Best For

National / regional operators needing high business continuity guarantees
Clients who require formal DR / BC (business continuity) plans

Option 5 — Dual-Site, Active/Active (Geo-Distributed HA & Scale-Out)

Full HA stack (Active/Active) replicated in two or more data centers
Data layer, either:
- Multi-primary DB cluster across DCs (complex), or
- Sharding by region (DC1 handles region A subs, DC2 handles region B subs)
Regional AAA / PCRF / RADIUS / HSS / OCS services, local to each DC
Cross-DC replication or synchronization for global data (billing, CDRs, reporting)
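The region-sharding variant can be sketched as a small routing function. The region names, DC names, and both mapping tables are hypothetical, and the sketch assumes each subscriber has exactly one home region and that the surviving DC can serve the other region's subscribers after a site failure.

```python
from typing import Dict

# Hypothetical mapping of subscriber region to its home data center.
REGION_TO_DC: Dict[str, str] = {"region-a": "dc1", "region-b": "dc2"}

# Hypothetical fallback: which DC takes over when a home DC is down.
DC_FALLBACK: Dict[str, str] = {"dc1": "dc2", "dc2": "dc1"}


def serving_dc(region: str, dc_up: Dict[str, bool]) -> str:
    """Return the DC that should serve a subscriber from 'region':
    the home DC when it is up, otherwise its fallback DC."""
    home = REGION_TO_DC[region]
    if dc_up.get(home, False):
        return home
    fallback = DC_FALLBACK[home]
    if dc_up.get(fallback, False):
        return fallback
    raise RuntimeError(f"no data center available for {region}")


print(serving_dc("region-a", {"dc1": True, "dc2": True}))   # home DC
print(serving_dc("region-a", {"dc1": False, "dc2": True}))  # fallback DC
```

The fallback path only works if the other DC already holds a sufficiently fresh copy of the failed region's subscriber data, which is where the cross-DC replication comes in.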

Pros

Maximum scalability and resilience
Can survive losing an entire DC with only partial degradation
Ideal for multi-region / multi-country deployments

Cons

Very high architecture complexity — significant expertise needed
Data model must be carefully designed (sharding, replication, session affinity, data consistency)
Overkill (and cost-heavy) for most use-cases

Best For

Carrier-scale clients with a multi-regional footprint
Strategic long-term deployments where growth and uptime at scale are critical

LTE vs Non-LTE — How HA Differs in Practice

LTE (EPC + HSS/PCRF/OCS)

Key points:

Session state on Gx/Gy/S6a must not be lost on node failure
Charging reliability: you cannot lose CCR/CCA, CDRs, or quota changes

Minimum sane design (short-term LTE HA):

Option 2 (Single-Site Active/Passive) with:
- 2 PCRF nodes behind LB or with Diameter failover
- 2 HSS nodes sharing replicated DB
- OCS logic behind LB and DB cluster

Medium-term (proper LTE HA):

Option 3 (Single-Site Active/Active) for PCRF/HSS/OCS with:
- Stateless Diameter frontends
- All session state held in a DB/Redis cluster with proper durability
- EPC configured to load-balance across multiple peers
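The idea of stateless frontends with all state in a shared store can be sketched as follows. This is not Diameter code: SessionStore is an in-memory stand-in for the DB/Redis cluster, and the Frontend class, its handle_update method, and the field names are hypothetical. The point is only that any frontend can continue a session started by another one.

```python
from typing import Dict


class SessionStore:
    """In-memory stand-in for the replicated DB/Redis cluster."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def load(self, session_id: str) -> dict:
        return dict(self._data.get(session_id, {}))

    def save(self, session_id: str, state: dict) -> None:
        self._data[session_id] = dict(state)


class Frontend:
    """Hypothetical stateless frontend: keeps no session data of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle_update(self, session_id: str, used_units: int) -> dict:
        """Apply a usage update by reading and writing the shared store."""
        state = self.store.load(session_id)
        state["used_units"] = state.get("used_units", 0) + used_units
        state["last_handled_by"] = self.name
        self.store.save(session_id, state)
        return state


store = SessionStore()
fe1, fe2 = Frontend("pcrf-fe-1", store), Frontend("pcrf-fe-2", store)
fe1.handle_update("sess-42", 100)         # session starts on frontend 1
print(fe2.handle_update("sess-42", 50))   # frontend 2 continues the session
```

The load-then-save step here is not atomic. With several active frontends, a real store would need atomic increments or transactions so that concurrent updates to one session are not lost.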

Non-LTE (FWA, Fiber, DSL)

Key points:

RADIUS and DHCP are critical but easier to make highly available than Diameter
Payments and billing must not double charge or mis-rate under failover

Minimum HA

Two RADIUS/DHCP nodes (active/active) + DB primary/standby
BRAS/BNG/routers configured with both RADIUS servers
WISPGate app cluster in A/P or A/A behind LB
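Configuring a BRAS/BNG with both RADIUS servers amounts to client-side failover: try one server, and on timeout retry, then move to the next. The sketch below models that logic in isolation. The send_with_failover function, the server names, and the fake_send stub are hypothetical, and no real RADIUS library or packet format is involved.

```python
from typing import Callable, List, Optional, Sequence


def send_with_failover(
    servers: Sequence[str],
    send: Callable[[str], str],
    retries_per_server: int = 2,
) -> str:
    """Try each server in order, retrying on timeout, and return the first
    reply. Raises ConnectionError when every attempt has timed out."""
    last_error: Optional[Exception] = None
    for server in servers:
        for _ in range(retries_per_server):
            try:
                return send(server)
            except TimeoutError as exc:
                last_error = exc
    raise ConnectionError("all RADIUS servers timed out") from last_error


attempts: List[str] = []


def fake_send(server: str) -> str:
    """Stub transport: radius-1 never answers, radius-2 accepts."""
    attempts.append(server)
    if server == "radius-1":
        raise TimeoutError("no reply")
    return f"Access-Accept from {server}"


print(send_with_failover(["radius-1", "radius-2"], fake_send))
print(attempts)  # two timed-out tries on radius-1, then one on radius-2
```

Retries mean a server may receive the same request more than once, so accounting and charging on the server side should tolerate duplicates.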

Extended HA

Add mediation/rating workers running active/active
Dedicated DB read replicas for heavy reporting / BI to keep the primary fast

IPTV & VoIP Billing/Management

For IPTV and VoIP, the HA pattern is mostly:

Signalling / Control:

SIP/Softswitch/RADIUS/Diameter: 2+ nodes active/active
WISPGate provides AAA and charging decisions via RADIUS/Diameter/API

Billing / CDR Rating:

Mediation jobs run on multiple worker nodes with idempotent rating logic
Message queues (Kafka/RabbitMQ) replicated across nodes
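Idempotent rating means that processing the same CDR twice, for example after a worker crash or a queue redelivery, charges the subscriber only once. A minimal sketch, assuming each CDR carries a unique identifier: the Ledger class, its fields, and the per-minute pricing are hypothetical, and an in-memory set stands in for a durable deduplication table.

```python
from typing import Dict, Set


class Ledger:
    """Hypothetical rating ledger that ignores CDRs it has already rated."""

    def __init__(self) -> None:
        self.charged: Dict[str, float] = {}
        self._seen: Set[str] = set()

    def rate(self, cdr_id: str, subscriber: str,
             seconds: int, price_per_minute: float) -> bool:
        """Rate one CDR. Returns True if a charge was applied, False if
        this cdr_id was already processed (duplicate delivery)."""
        if cdr_id in self._seen:
            return False
        charge = seconds / 60 * price_per_minute
        self.charged[subscriber] = self.charged.get(subscriber, 0.0) + charge
        self._seen.add(cdr_id)
        return True


ledger = Ledger()
print(ledger.rate("cdr-1", "alice", 120, 0.5))  # first delivery is charged
print(ledger.rate("cdr-1", "alice", 120, 0.5))  # redelivery is ignored
print(ledger.charged["alice"])
```

In a multi-worker deployment the duplicate check and the charge would need to be committed in the same database transaction, otherwise two workers could both pass the check for the same CDR.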

In terms of HA architecture, IPTV and VoIP fall under the same BSS/AAA HA stack that is designed for LTE and fixed access. Rather than a separate, bespoke HA design per service, use one unified "Charging & AAA Cluster" architecture.

Summary / Conclusion

Recap:

✔ HA is mandatory for any mid-size or large deployment — keeping control, billing, and revenue streams safe.
✔ Single-server vertical scale is a temporary "stop-gap," not a long-term solution.
✔ Recommended standard: Single-Site Active/Passive (Option 2, Tier 1) or Single-Site Active/Active (Option 3, Tier 2) as the baseline HA offering.
✔ For large clients / strategic deployments: Dual-Site Active/Passive or Geo-Distributed Active/Active HA (Options 4 and 5).
