Designing a system to handle a real‐time vote during a Super Bowl commercial—30 million votes cast in maybe 30 seconds (≈1 million votes/sec!)—requires careful attention to scalability, low latency, and resiliency. Here’s one possible end-to-end architecture:
1. Traffic Profile & SLAs
- Peak write load: 30 M votes over 30 s → ~1 M requests/sec.
- Read load: viewers want instant percentage updates (e.g. every 1–2 s).
- Latency target: end‐to‐end vote confirmation <100 ms; dashboard update <1 s.
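As a quick sanity check on those numbers, here is a back-of-envelope capacity calculation in Python; the 2× headroom factor anticipates the over-provisioning discussed in section 7, and the per-pod throughput is purely an assumed figure for illustration.

```python
# Back-of-envelope capacity math for the vote burst (illustrative assumptions).
TOTAL_VOTES = 30_000_000   # expected votes during the commercial
BURST_SECONDS = 30         # length of the voting window
HEADROOM = 2.0             # provision to >= 2x expected peak (see section 7)
PER_POD_RPS = 5_000        # assumed sustained requests/sec per ingress pod

peak_rps = TOTAL_VOTES / BURST_SECONDS            # ~1,000,000 requests/sec
provisioned_rps = peak_rps * HEADROOM             # ~2,000,000 requests/sec
pods_needed = int(provisioned_rps / PER_POD_RPS)  # ~400 pods at the assumed rate

print(f"peak: {peak_rps:,.0f} req/s, provisioned: {provisioned_rps:,.0f} req/s, pods: {pods_needed}")
```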
2. Ingestion Layer
- Global DNS + CDN (e.g. CloudFront, Fastly)
  - Direct users to the nearest edge PoP.
- API Gateway / Load Balancer
  - Autoscaling fleet of stateless vote-receiving endpoints (e.g. Kubernetes pods, AWS Lambda).
- Client SDK with Retry / Idempotency Key
  - Attach a unique client vote-ID to avoid duplicate counting on retry (see the sketch below).
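To make the idempotency key concrete, here is a minimal sketch of a stateless vote-receiving endpoint, assuming FastAPI for the HTTP layer and a Redis `SET NX` with TTL for duplicate suppression; the route, field names, and `enqueue_to_stream` helper are illustrative, not a prescribed API.

```python
# Minimal idempotent vote-ingestion endpoint (sketch; route, fields, and helper are illustrative).
import os

import redis.asyncio as redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))

class Vote(BaseModel):
    vote_id: str  # unique idempotency key generated by the client SDK
    option: str   # "green" or "blue"

@app.post("/vote")
async def cast_vote(vote: Vote):
    # SET NX succeeds only the first time this vote_id is seen, so client retries become no-ops.
    is_new = await r.set(f"vote:{vote.vote_id}", vote.option, nx=True, ex=600)
    if is_new:
        await enqueue_to_stream(vote)  # hand off to the stream buffer (section 3)
    return {"accepted": True, "duplicate": not bool(is_new)}

async def enqueue_to_stream(vote: Vote) -> None:
    """Placeholder for the stream-producer call sketched in the next section."""
```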
3. Stream Buffering
- Managed streaming system (e.g. Kafka, AWS Kinesis, Google Pub/Sub) sits behind the API layer.
  - Vote POST → enqueue onto a partitioned stream (partition by region or by vote option).
  - Automatically handles back-pressure and smooths out burstiness.
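One possible shape for that enqueue step, using the `confluent_kafka` producer; the topic name and broker settings are assumptions, and the partition key choice is discussed in the comments.

```python
# Produce each accepted vote onto a partitioned Kafka topic (sketch; topic/broker names assumed).
import json

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",  # assumed broker address
    "linger.ms": 5,                     # tiny batching window to smooth the burst
    "acks": "1",                        # trade a little durability for lower latency
})

def enqueue_vote(vote_id: str, option: str, region: str) -> None:
    # Keying by region spreads the ~1 M msg/s across many partitions; keying by
    # option (only two values) would concentrate the load on just two partitions.
    producer.produce(
        "votes",
        key=region.encode(),
        value=json.dumps({"vote_id": vote_id, "option": option}).encode(),
    )
    producer.poll(0)  # serve delivery callbacks without blocking

# Call producer.flush() on shutdown to drain any buffered messages.
```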
4. Real-Time Aggregation
- Stream Processing (e.g. Kafka Streams, Flink, Kinesis Data Analytics)
  - Windowed aggregations over 1–2 s hops.
  - Maintain running counters for “green” vs. “blue”.
- In-Memory Cache (e.g. Redis, ElastiCache)
  - Write aggregates every second into a highly available, in-memory store keyed by time window & option.
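A production job would use Kafka Streams or Flink, but the simplified consumer loop below shows the shape of the per-second counting, assuming the `confluent_kafka` and `redis` clients and an invented key convention of `votes:<epoch-second>:<option>`.

```python
# Simplified windowed aggregation: consume votes, bump per-second counters in Redis (sketch).
import json
import time

import redis
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "vote-aggregator",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["votes"])
r = redis.Redis(host="localhost", port=6379)

while True:
    msg = consumer.poll(timeout=0.1)
    if msg is None or msg.error():
        continue
    vote = json.loads(msg.value())
    window = int(time.time())                 # 1-second tumbling window
    key = f"votes:{window}:{vote['option']}"  # e.g. votes:1718000000:green
    pipe = r.pipeline()
    pipe.incr(key)                            # running counter for this window/option
    pipe.expire(key, 3600)                    # keep hot data around for an hour
    pipe.execute()
```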
5. Persistent Storage
- Time-series database (e.g. DynamoDB with TTLs, InfluxDB, Bigtable)
  - Write the final aggregate for each interval (e.g. every 5 s) for historical playback and post-game analysis.
  - Enable ad-hoc queries: “What was the vote split at second 15?”
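A sketch of the periodic flush into persistent storage, assuming DynamoDB via `boto3` and the per-second Redis keys from the previous sketch; the table name and key schema are invented for illustration.

```python
# Periodically persist per-interval aggregates from Redis into DynamoDB (sketch; schema assumed).
import boto3
import redis

r = redis.Redis(host="localhost", port=6379)
table = boto3.resource("dynamodb").Table("vote_aggregates")  # assumed table name

def flush_interval(interval_start: int, interval_seconds: int = 5) -> None:
    """Sum the per-second Redis counters for one interval and write a single row."""
    totals = {"green": 0, "blue": 0}
    for second in range(interval_start, interval_start + interval_seconds):
        for option in totals:
            value = r.get(f"votes:{second}:{option}")
            totals[option] += int(value or 0)
    table.put_item(Item={
        "poll_id": "superbowl-vote",       # partition key (assumed schema)
        "interval_start": interval_start,  # sort key: answers “what was the split at second 15?”
        "green": totals["green"],
        "blue": totals["blue"],
    })
```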
6. Front-End Dashboard / Mobile App
- WebSocket or Server-Sent Events
  - Clients subscribe to a “vote stream” channel.
  - Push updated percentages every 1–2 s (see the broadcast sketch below).
- Graceful degradation
  - If the WebSocket connection fails, fall back to polling a REST endpoint that returns the latest vote percentages.
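A minimal broadcast loop, assuming the `websockets` library (v10+) and the Redis counters written by the aggregation step; it fans the latest split out to every connected client once per second. A production version would also track cumulative totals and expose the REST fallback.

```python
# Push the latest vote percentages to all connected WebSocket clients (sketch).
import asyncio
import json
import time

import redis.asyncio as redis
import websockets

r = redis.from_url("redis://localhost:6379/0")
clients: set = set()

async def handler(ws):
    # Register each subscriber; remove it when the connection closes.
    clients.add(ws)
    try:
        await ws.wait_closed()
    finally:
        clients.discard(ws)

async def broadcaster():
    while True:
        window = int(time.time()) - 1  # last fully closed 1 s window
        green = int(await r.get(f"votes:{window}:green") or 0)
        blue = int(await r.get(f"votes:{window}:blue") or 0)
        total = green + blue
        payload = json.dumps({
            "green_pct": round(100 * green / total, 1) if total else 0,
            "blue_pct": round(100 * blue / total, 1) if total else 0,
        })
        websockets.broadcast(clients, payload)  # fan out to every subscriber
        await asyncio.sleep(1)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await broadcaster()

asyncio.run(main())
```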
7. Autoscaling & Resiliency
- Autoscale ingestion pods or serverless functions to ≥2× expected peak.
- Multi-AZ / Multi-Region deployment for fast failover if a zone or region goes down.
- Chaos testing before the event (e.g. inject latency, drop pods).
- Circuit breakers / bulkheads to isolate failures (e.g. if Redis is slow, degrade update frequency).
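To make the “degrade update frequency if Redis is slow” idea concrete, here is a toy latency guard in the spirit of a circuit breaker; it is purely illustrative, and libraries such as `pybreaker` provide real implementations.

```python
# Toy latency guard: if Redis reads get slow, widen the dashboard update interval (sketch).
class DegradingInterval:
    """Tracks recent call latency and stretches the refresh interval when it degrades."""

    def __init__(self, base_interval: float = 1.0, slow_threshold: float = 0.05):
        self.base_interval = base_interval    # normal refresh cadence (seconds)
        self.slow_threshold = slow_threshold  # consider a read "slow" above 50 ms
        self.current_interval = base_interval

    def record(self, elapsed: float) -> None:
        if elapsed > self.slow_threshold:
            # Back off: cap at 5 s so the dashboard still updates, just less often.
            self.current_interval = min(self.current_interval * 2, 5.0)
        else:
            # Recover gradually toward the normal cadence.
            self.current_interval = max(self.current_interval / 2, self.base_interval)

# Usage inside the broadcast loop (caller's code):
#   start = time.monotonic()
#   value = r.get(key)
#   guard.record(time.monotonic() - start)
#   await asyncio.sleep(guard.current_interval)
```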
8. Observability & Monitoring
- Metrics: QPS, error rates, processing latency in stream jobs, Redis latency.
- Dashboards: Grafana / CloudWatch showing real-time vote‐rate vs. capacity.
- Alerts on anomalies (e.g. queue backlog >10 s).
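On the instrumentation side, a minimal sketch using `prometheus_client`; the metric names are invented, and the Grafana/CloudWatch dashboards and backlog alerts above would be built on metrics like these.

```python
# Expose vote-ingestion metrics for Prometheus scraping (sketch; metric names are made up).
from prometheus_client import Counter, Gauge, Histogram, start_http_server

VOTES_TOTAL = Counter("votes_total", "Accepted votes", ["option"])
DUPLICATES_TOTAL = Counter("vote_duplicates_total", "Votes rejected as duplicates")
INGEST_LATENCY = Histogram("vote_ingest_seconds", "End-to-end ingest latency")
QUEUE_LAG = Gauge("stream_consumer_lag_seconds", "Estimated stream backlog in seconds")

start_http_server(9100)  # serves a /metrics endpoint for the Prometheus scraper

def record_vote(option: str, duplicate: bool, elapsed: float) -> None:
    # Called once per handled request in the ingestion layer.
    if duplicate:
        DUPLICATES_TOTAL.inc()
    else:
        VOTES_TOTAL.labels(option=option).inc()
    INGEST_LATENCY.observe(elapsed)
```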
Vote-Counting Data Flow (Simplified)
- User taps “Green” → HTTPS POST /vote →
- API Gateway → Ingress pods → Push to Kafka/Kinesis →
- Stream Processor aggregates → writes to Redis →
- WebSocket server reads from Redis → broadcasts to clients
Key Design Considerations
- Idempotency: ensure retries don’t double-count
- Partitioning: evenly distribute load across shards
- Back-pressure: stream buffer absorbs bursts; avoid overwhelming DB
- Latency vs. consistency: eventual consistency is fine for 1–2 s update windows
With this setup you can support 30 M near‐simultaneous votes, provide sub-second feedback to users, and keep the system robust even under the legendary Super Bowl traffic spike.
