EdgeUp Colleges — AWS Solution Architecture

LIVE: load-tested to 2,000 concurrent users
EKS 1.35 · Karpenter + KEDA
Idle fleet: 5 nodes (2 on-demand + 3 spot)
Peak (2,000 concurrent): 20–22 nodes, scaled up automatically, then shrinks back
Aurora MySQL Serverless v2 · 0.5–24 ACU
VPC 10.2.0.0/16 · 2 AZs · 1 NAT


Design decisions a reviewer will ask about

Why 2 on-demand + 3 spot at idle: everything critical (controllers, the 5 in-cluster data stores) lives on the 2 on-demand nodes; all application pods are stateless and run on spot capacity at a ~60–70% discount, so a spot reclaim never touches state.
Five data stores on one node: MySQL, Redis, Kafka, Qdrant and Neo4j are bin-packed on the m6i.large infra node with a 150 GB gp3 volume. This single decision replaces 4–5 managed services and is the biggest cost saver in the stack.
Burst behaviour is proven, not theoretical: the June load test took the fleet from 5 to 20–22 nodes at 2,000 concurrent users and back down again, while the database auto-scaled from 0.5 to 24 ACU. You pay for peak hours, not peak size.
Edge protection: AWS WAF (web ACL, L7 filtering) sits in front of the shared ALB, so all inbound traffic is inspected before it reaches the cluster. The single NAT and 2 AZs are deliberate cost choices; the production variant adds a second NAT and a third AZ.
GitOps end-to-end: GitHub builds the images and pushes them to ECR, and ArgoCD syncs the cluster from Git. Nobody deploys by hand; a rollback is a git revert.
Monitoring is centralized: metrics and logs stream to the central observability server instead of running Grafana/Loki per cluster. The result is one pane of glass and no duplicated cost.
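
The cost reasoning behind the spot split and the burst behaviour can be sketched with a little arithmetic. The snippet below is a rough illustration only: it assumes every node has the same on-demand hourly price (normalised to 1.0), which a real mixed-instance fleet will not satisfy, and the function names and the `peak_hours=3` value are hypothetical, not figures from the load test.

```python
# Assumption: all nodes share one on-demand hourly price, normalised to 1.0.
# Real fleets mix instance types, so treat the results as rough ratios only.


def fleet_cost_units(on_demand: int, spot: int, spot_discount: float) -> float:
    """Hourly fleet cost, in units of one on-demand node-hour."""
    return on_demand + spot * (1.0 - spot_discount)


def idle_cost_ratio(spot_discount: float) -> float:
    """Idle fleet cost (2 on-demand + 3 spot) relative to 5 on-demand nodes."""
    return fleet_cost_units(2, 3, spot_discount) / 5


def daily_node_hours(idle_nodes: int, peak_nodes: int, peak_hours: float) -> float:
    """Node-hours per day: peak_nodes for peak_hours, idle_nodes for the rest."""
    return peak_nodes * peak_hours + idle_nodes * (24 - peak_hours)


if __name__ == "__main__":
    # Both ends of the ~60-70% spot discount range quoted above.
    for discount in (0.60, 0.70):
        ratio = idle_cost_ratio(discount)
        print(f"spot discount {discount:.0%}: idle fleet costs {ratio:.0%} of all on-demand")

    # peak_hours=3 is a hypothetical value chosen for illustration.
    burst = daily_node_hours(idle_nodes=5, peak_nodes=22, peak_hours=3)
    sized_for_peak = daily_node_hours(idle_nodes=22, peak_nodes=22, peak_hours=24)
    print(f"node-hours/day: {burst:.0f} with autoscaling vs {sized_for_peak:.0f} sized for peak")
```

Under these assumptions the idle fleet costs roughly 58–64% of an all-on-demand fleet of the same size, and a day with three peak hours consumes 171 node-hours instead of the 528 a permanently peak-sized fleet would use.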

Source of truth: Terraform/Terragrunt repo (01-foundation → 02-compute → 03-data → 04-platform) + live AWS discovery, 12-Jun-2026. This is the AWS reference design; other providers in the cost matrix map to equivalent services.

