# Production Architecture Design - mockupAWS v1.0.0

> **Version:** 1.0.0
> **Author:** @spec-architect
> **Date:** 2026-04-07
> **Status:** DRAFT - Ready for Review

---

## Executive Summary

This document defines the production architecture for mockupAWS v1.0.0, transforming the current single-node development setup into an enterprise-grade, scalable, and highly available system.

### Key Architectural Decisions

| Decision | Rationale |
|----------|-----------|
| **Nginx Load Balancer** | Battle-tested, extensive configuration options, SSL termination |
| **PostgreSQL Primary-Replica** | Read scaling for analytics workloads, failover capability |
| **Redis Cluster** | Distributed caching, session storage, rate limiting |
| **Container Orchestration** | Docker Compose for simplicity, Kubernetes-ready design |
| **Multi-Region Active-Passive** | Cost-effective HA, 99.9% uptime target |

---

## 1. Scalability Architecture

### 1.1 System Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                           CLIENT LAYER                           │
│   Web Browser  │  Mobile App  │  API Clients  │  CI/CD           │
└───────────────────────────────┬──────────────────────────────────┘
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      EDGE LAYER (CDN + WAF)                      │
│   CloudFront / Cloudflare CDN                                    │
│   • Static assets caching (React bundle, images, reports)        │
│   • DDoS protection                                              │
│   • Geo-routing                                                  │
└───────────────────────────────┬──────────────────────────────────┘
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                       LOAD BALANCER LAYER                        │
│   Nginx Load Balancer (Active-Standby)                           │
│   • SSL termination (TLS 1.3)                                    │
│   • Health checks: /health endpoint                              │
│   • Sticky sessions (for WebSocket support)                      │
│   • Rate limiting: 1000 req/min per IP                           │
│   • Circuit breaker: 5xx threshold detection                     │
└──────────┬──────────────────────┬─────────────────────┬──────────┘
           ▼                      ▼                     ▼
┌──────────────────────────────────────────────────────────────────┐
│                  APPLICATION LAYER (3x replicas)                 │
│   Backend API          Backend API          Backend API          │
│   Instance 1           Instance 2           Instance 3           │
│   (Port 8000)          (Port 8000)          (Port 8000)          │
│   FastAPI + Uvicorn, 4 workers each                              │
└──────────┬──────────────────────┬─────────────────────┬──────────┘
           └──────────────────────┼─────────────────────┘
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│                            DATA LAYER                            │
│   Redis Cluster                  PostgreSQL Primary-Replica      │
│   M1  M2  M3                     Primary (RW) ──► Replica 1 (RO) │
│   S1  S2  S3                              └─────► Replica 2 (RO) │
│   (3 Masters + 3 Slaves)                                         │
└──────────────────────────────────────────────────────────────────┘
```

### 1.2 Load Balancer Configuration (Nginx)

```nginx
# /etc/nginx/conf.d/mockupaws.conf

# Rate-limiting zones must be declared at http level; files under conf.d/
# are included into the http context, so they live at the top of this file.
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
limit_req_zone $binary_remote_addr zone=auth:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=ingest:10m rate=1000r/m;

upstream backend {
    least_conn;  # Least-connections load balancing
    server backend-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-3:8000 weight=1 max_fails=3 fail_timeout=30s backup;
    keepalive 32;  # Keepalive connections to upstreams
}

server {
    listen 80;
    server_name api.mockupaws.com;
    return 301 https://$server_name$request_uri;  # Force HTTPS
}

server {
    listen 443 ssl http2;
    server_name api.mockupaws.com;

    # SSL Configuration
    ssl_certificate /etc/ssl/certs/mockupaws.crt;
    ssl_certificate_key /etc/ssl/private/mockupaws.key;
    ssl_protocols TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Health Check Endpoint
    location /health {
        access_log off;
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_send_timeout 5s;
        proxy_read_timeout 5s;
    }

    # API Endpoints with Circuit Breaker
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Circuit Breaker Pattern
        proxy_next_upstream error timeout
            http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }

    # Auth Endpoints - Stricter Rate Limit
    location /api/v1/auth/ {
        limit_req zone=auth burst=5 nodelay;
        proxy_pass http://backend;
    }

    # Ingest Endpoints - Higher Throughput
    location /api/v1/ingest/ {
        limit_req zone=ingest burst=100 nodelay;
        client_max_body_size 10M;
        proxy_pass http://backend;
    }

    # Static Files (if served from backend)
    location /static/ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        proxy_pass http://backend;
    }
}
```

### 1.3 Horizontal Scaling Strategy

#### Scaling Triggers

| Metric | Scale Out Threshold | Scale In Threshold | Action |
|--------|---------------------|--------------------|--------|
| CPU Usage | >70% for 5 min | <30% for 10 min | ±1 instance |
| Memory Usage | >80% for 5 min | <40% for 10 min | ±1 instance |
| Request Latency (p95) | >500ms for 3 min | <200ms for 10 min | +1 instance |
| Queue Depth (Celery) | >1000 jobs | <100 jobs | ±1 worker |
| DB Connections | >80% pool | <50% pool | Review query optimization |

#### Auto-Scaling Configuration (Docker Swarm)

```yaml
# docker-compose.prod.yml - Scaling Configuration
version: '3.8'

services:
  backend:
    image: mockupaws/backend:v1.0.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G
      labels:
        - "prometheus-job=backend"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  nginx:
    image: nginx:alpine
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == manager
    ports:
      - "80:80"
      - "443:443"
```

#### Kubernetes HPA Alternative

```yaml
# k8s/hpa-backend.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```

### 1.4 Database Read Replicas

#### PostgreSQL Primary-Replica Setup

```
                    ┌─────────────────┐
   Read/Write ────► │     Primary     │
                    │  (postgres-1)   │
                    │  • All writes   │
                    │  • WAL shipping │
                    │  • Sync commit  │
                    └────────┬────────┘
               streaming replication
              ┌─────────────┴─────────────┐
              ▼                           ▼
   ┌─────────────────┐          ┌─────────────────┐
   │    Replica 1    │          │    Replica 2    │
   │  (postgres-2)   │          │  (postgres-3)   │
   │  • Read-only    │          │  • Read-only    │
   │  • Async replica│          │  • Async replica│
   │  • Hot standby  │          │  • Hot standby  │
   └────────┬────────┘          └────────┬────────┘
            └─────────────┬──────────────┘
                          ▼
         ┌─────────────────────────────────┐
         │   PgBouncer Connection Pool     │
         │   Pool Mode: Transaction        │
         │   Max Connections: 1000         │
         │   Default Pool: 25 per db/user  │
         └─────────────────────────────────┘
```

#### Connection Pooling (PgBouncer)

```ini
; /etc/pgbouncer/pgbouncer.ini
[databases]
mockupaws = host=postgres-primary port=5432 dbname=mockupaws
mockupaws_replica = host=postgres-replica-1 port=5432 dbname=mockupaws

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Pool settings
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Timeouts
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 0
query_wait_timeout = 120

; Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60
```

#### Application-Level Read/Write Splitting

```python
# src/core/database.py - Enhanced with read replica support
import os

from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession, async_sessionmaker
from sqlalchemy.orm import declarative_base

# Primary (RW) database
PRIMARY_DATABASE_URL = os.getenv(
    "DATABASE_URL",
    "postgresql+asyncpg://postgres:postgres@localhost:5432/mockupaws"
)

# Replica (RO) databases
REPLICA_DATABASE_URLS = os.getenv(
    "REPLICA_DATABASE_URLS", ""
).split(",") if os.getenv("REPLICA_DATABASE_URLS") else []

# Primary engine (RW)
primary_engine = create_async_engine(
    PRIMARY_DATABASE_URL,
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
    pool_recycle=3600,
)

# Replica engines (RO)
replica_engines = [
    create_async_engine(url, pool_size=5, max_overflow=10, pool_pre_ping=True)
    for url in REPLICA_DATABASE_URLS if url
]

# Session factories
PrimarySessionLocal = async_sessionmaker(primary_engine, class_=AsyncSession)
ReplicaSessionLocal = async_sessionmaker(
    replica_engines[0] if replica_engines else primary_engine,
    class_=AsyncSession
)

Base = declarative_base()


async def get_db(write: bool = False) -> AsyncSession:
    """Get a database session with automatic read/write splitting."""
    if write:
        async with PrimarySessionLocal() as session:
            yield session
    else:
        async with ReplicaSessionLocal() as session:
            yield session


class DatabaseRouter:
    """Route queries to the appropriate database based on operation type."""

    @staticmethod
    def get_engine(operation: str = "read"):
        """Get the appropriate engine for an operation."""
        if operation in ("write", "insert", "update", "delete"):
            return primary_engine
        return replica_engines[0] if replica_engines else primary_engine
```

---

## 2. High Availability Design

### 2.1 Multi-Region Deployment Strategy

#### Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                   GLOBAL TRAFFIC MANAGER                    │
│            (Route53 / Cloudflare Load Balancing)            │
│                                                             │
│   Health checks: /health endpoint every 30s                 │
│   Failover: automatic on 3 consecutive failures             │
│   Latency-based routing: route to nearest healthy region    │
└─────────────┬──────────────────────────────┬────────────────┘
              ▼                              ▼
┌──────────────────────────┐    ┌──────────────────────────┐
│      PRIMARY REGION      │    │      STANDBY REGION      │
│       (us-east-1)        │    │       (eu-west-1)        │
│                          │    │                          │
│  Application Stack       │    │  Application Stack       │
│  (3x backend, 2x LB)     │    │  (2x backend, 2x LB)     │
│                          │    │                          │
│  PostgreSQL Primary ─────┼────┼─► PostgreSQL Replica     │
│  + 2 Replicas            │    │   (Hot Standby)          │
│                          │    │                          │
│  Redis Cluster ──────────┼────┼─► Redis Replica          │
│  (3 Masters)             │    │   (Read-only)            │
│                          │    │                          │
│  S3 Bucket (Primary) ◄───┼────┼─► S3 Cross-Region        │
│                          │    │   Replication            │
└────────────┬─────────────┘    └─────────────┬────────────┘
             │        ┌──────────────┐        │
             └───────►│    BACKUP    │◄───────┘
                      │  S3 Bucket   │
                      │ (3rd Region) │
                      └──────────────┘
```

#### Failover Mechanisms

**Database Failover (Automatic)**

```python
# scripts/db-failover.py
"""Automated database failover script."""
import os

import asyncpg


class DatabaseFailoverManager:
    """Manage PostgreSQL failover."""

    async def check_primary_health(self, primary_host: str) -> bool:
        """Check whether the primary database is healthy."""
        try:
            conn = await asyncpg.connect(
                host=primary_host,
                database="mockupaws",
                user="healthcheck",
                password=os.getenv("DB_HEALTH_PASSWORD"),
                timeout=5
            )
            result = await conn.fetchval("SELECT 1")
            await conn.close()
            return result == 1
        except Exception:
            return False

    async def promote_replica(self, replica_host: str) -> bool:
        """Promote a replica to primary."""
        # Execute pg_ctl promote on the replica
        # Update connection strings in application config
        # Notify the application to reconnect
        ...

    async def run_failover(self) -> bool:
        """Execute the full failover procedure."""
        # 1. Verify the primary is truly down (avoid split-brain)
        # 2. Promote the best replica to primary
        # 3. Update DNS/load balancer configuration
        # 4. Notify on-call engineers
        # 5. Begin recovery of the old primary as a new replica
        ...


# Health check endpoint for the load balancer
# (lives in the FastAPI app, not in this script)
@app.get("/health/db")
async def database_health_check():
    """Deep health check including database connectivity."""
    try:
        # Quick query to verify the DB connection
        result = await db.execute("SELECT 1")
        return {"status": "healthy", "database": "connected"}
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail={"status": "unhealthy", "database": str(e)}
        )
```

**Redis Failover (Redis Sentinel)**

```
# redis-sentinel.conf
# (placeholders such as ${REDIS_PASSWORD} are substituted at deploy time;
# Sentinel itself does not expand environment variables)
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster ${REDIS_PASSWORD}

# Notification
sentinel notification-script mymaster /usr/local/bin/notify.sh
```

### 2.2 Circuit Breaker Pattern

```python
# src/core/circuit_breaker.py
"""Circuit breaker pattern implementation."""
import asyncio
import time
from enum import Enum
from functools import wraps
from typing import Any, Callable


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing whether the service recovered


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Circuit breaker for external service calls."""

    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        half_open_max_calls: int = 3
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = asyncio.Lock()

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute a function with circuit breaker protection."""
        async with self._lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time >= self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpen(f"Circuit {self.name} is OPEN")
            if self.state == CircuitState.HALF_OPEN and self.success_count >= self.half_open_max_calls:
                raise CircuitBreakerOpen(f"Circuit {self.name} HALF_OPEN limit reached")

        # Invoke outside the lock: asyncio.Lock is not reentrant, and
        # _on_success/_on_failure re-acquire it.
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception:
            await self._on_failure()
            raise

    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self._reset()
            else:
                self.failure_count = 0

    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

    def _reset(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None


def circuit_breaker(
    name: str,
    failure_threshold: int = 5,
    recovery_timeout: int = 60
):
    """Decorator for the circuit breaker pattern."""
    breaker = CircuitBreaker(name, failure_threshold, recovery_timeout)

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator


# Usage example
import httpx

@circuit_breaker(name="aws_pricing_api", failure_threshold=3, recovery_timeout=30)
async def fetch_aws_pricing(service: str, region: str):
    """Fetch AWS pricing with circuit breaker protection."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://pricing.us-east-1.amazonaws.com/{service}/{region}",
            timeout=10.0
        )
        return response.json()
```

### 2.3 Graceful Degradation

```python
# src/core/degradation.py
"""Graceful degradation strategies."""
import asyncio
import logging
from functools import wraps
from typing import Any

logger = logging.getLogger(__name__)


class DegradationStrategy:
    """Base class for degradation strategies."""

    async def fallback(self, *args, **kwargs) -> Any:
        """Return a fallback value when the primary fails."""
        raise NotImplementedError


class CacheFallback(DegradationStrategy):
    """Fall back to cached data."""

    def __init__(self, cache_key: str, max_age: int = 3600):
        self.cache_key = cache_key
        self.max_age = max_age

    async def fallback(self, *args, **kwargs) -> Any:
        # Return stale cache data (redis is the application's shared async client)
        return await redis.get(f"stale:{self.cache_key}")


class StaticFallback(DegradationStrategy):
    """Fall back to static/default data."""

    def __init__(self, default_value: Any):
        self.default_value = default_value

    async def fallback(self, *args, **kwargs) -> Any:
        return self.default_value


class EmptyFallback(DegradationStrategy):
    """Fall back to an empty result."""

    async def fallback(self, *args, **kwargs) -> Any:
        return []


def with_degradation(
    strategy: DegradationStrategy,
    timeout: float = 5.0,
    exceptions: tuple = (Exception,)
):
    """Decorator for graceful degradation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await asyncio.wait_for(
                    func(*args, **kwargs), timeout=timeout
                )
            except exceptions as e:
                logger.warning(
                    f"Primary function failed, using fallback: {e}",
                    extra={"function": func.__name__}
                )
                return await strategy.fallback(*args, **kwargs)
        return wrapper
    return decorator


# Usage examples
@with_degradation(
    strategy=CacheFallback(cache_key="aws_pricing", max_age=86400),
    timeout=3.0
)
async def get_aws_pricing(service: str, region: str):
    """Get AWS pricing with a cache fallback."""
    # Primary: fetch from the AWS API
    ...


@with_degradation(
    strategy=StaticFallback(default_value={"status": "degraded", "metrics": []}),
    timeout=2.0
)
async def get_dashboard_metrics(scenario_id: str):
    """Get metrics with a static fallback on failure."""
    # Primary: fetch from the database
    ...


@with_degradation(
    strategy=EmptyFallback(),
    timeout=1.0
)
async def get_recommendations(scenario_id: str):
    """Get recommendations with an empty fallback."""
    # Primary: ML-based recommendation engine
    ...
```

---

## 3. Data Architecture

### 3.1 Database Partitioning Strategy

#### Time-Based Partitioning for Logs and Metrics

```sql
-- Enable the pg_partman extension
CREATE EXTENSION IF NOT EXISTS pg_partman;

-- Partitioned scenario_logs table
CREATE TABLE scenario_logs_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) DEFAULT 'unknown',
    size_bytes INTEGER DEFAULT 0,
    has_pii BOOLEAN DEFAULT FALSE,
    token_count INTEGER DEFAULT 0,
    sqs_blocks INTEGER DEFAULT 1,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Create partitions (monthly)
SELECT partman.create_parent('public.scenario_logs_partitioned', 'received_at', 'native', 'monthly');

-- Partitioned scenario_metrics table
CREATE TABLE scenario_metrics_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15, 6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);
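
-- Illustrative addition (not in the original spec; index names are assumptions):
-- indexes created on a partitioned parent cascade to every partition, which
-- supports the common per-scenario time-range queries.
CREATE INDEX IF NOT EXISTS idx_logs_scenario_time
    ON scenario_logs_partitioned (scenario_id, received_at DESC);
CREATE INDEX IF NOT EXISTS idx_metrics_scenario_time
    ON scenario_metrics_partitioned (scenario_id, timestamp DESC);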
SELECT partman.create_parent('public.scenario_metrics_partitioned', 'timestamp', 'native', 'daily');

-- Automated partition maintenance
SELECT partman.run_maintenance('scenario_logs_partitioned');
```

#### Tenant Isolation Strategy

```sql
-- Row-Level Security for multi-tenant support
ALTER TABLE scenarios ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_logs ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_metrics ENABLE ROW LEVEL SECURITY;

-- Add tenant_id column
ALTER TABLE scenarios ADD COLUMN tenant_id UUID NOT NULL
    DEFAULT '00000000-0000-0000-0000-000000000000';
ALTER TABLE scenario_logs ADD COLUMN tenant_id UUID NOT NULL
    DEFAULT '00000000-0000-0000-0000-000000000000';

-- Create RLS policies
CREATE POLICY tenant_isolation_scenarios ON scenarios
    USING (tenant_id = current_setting('app.current_tenant')::UUID);
CREATE POLICY tenant_isolation_logs ON scenario_logs
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Set tenant context per session
SET app.current_tenant = 'tenant-uuid-here';
```

### 3.2 Data Archive Strategy

#### Archive Policy

| Data Type | Retention Hot | Retention Warm | Archive To | Compression |
|-----------|---------------|----------------|------------|-------------|
| Scenario Logs | 90 days | 1 year | S3 Glacier | GZIP |
| Scenario Metrics | 30 days | 90 days | S3 Standard-IA | Parquet |
| Reports | 30 days | 6 months | S3 Glacier | None (PDF) |
| Audit Logs | 1 year | 7 years | S3 Glacier Deep | GZIP |

#### Archive Implementation

```python
# src/services/archive_service.py
"""Data archiving service for old records."""
import gzip
import os
from datetime import datetime, timedelta
from typing import List

import aioboto3


class ArchiveService:
    """Service for archiving old data to S3."""

    def __init__(self):
        self.s3_bucket = os.getenv("ARCHIVE_S3_BUCKET")
        self.s3_prefix = os.getenv("ARCHIVE_S3_PREFIX", "archives/")
        self.session = aioboto3.Session()

    async def archive_old_logs(self, days: int = 365) -> dict:
        """Archive logs
        older than specified days."""
        cutoff_date = datetime.utcnow() - timedelta(days=days)

        # Query old logs
        query = """
            SELECT * FROM scenario_logs
            WHERE received_at < :cutoff_date
              AND archived = FALSE
            LIMIT 100000
        """
        result = await db.execute(query, {"cutoff_date": cutoff_date})
        logs = result.fetchall()

        if not logs:
            return {"archived": 0, "bytes": 0}

        # Group by month for efficient storage
        logs_by_month = self._group_by_month(logs)

        total_archived = 0
        total_bytes = 0

        async with self.session.client("s3") as s3:
            for month_key, month_logs in logs_by_month.items():
                # Serialize to JSON Lines and compress
                data = self._serialize_logs(month_logs)
                compressed = gzip.compress(data.encode())

                # Upload to S3
                s3_key = f"{self.s3_prefix}logs/{month_key}.jsonl.gz"
                await s3.put_object(
                    Bucket=self.s3_bucket,
                    Key=s3_key,
                    Body=compressed,
                    StorageClass="GLACIER"
                )

                # Mark as archived in the database
                await self._mark_archived([log.id for log in month_logs])

                total_archived += len(month_logs)
                total_bytes += len(compressed)

        return {
            "archived": total_archived,
            "bytes": total_bytes,
            "months": len(logs_by_month)
        }

    async def query_archive(
        self,
        scenario_id: str,
        start_date: datetime,
        end_date: datetime
    ) -> List[dict]:
        """Query archived data (transparent to the application)."""
        # Determine which months to query
        months = self._get_months_between(start_date, end_date)

        # Query hot data from the database
        hot_data = await self._query_hot_data(scenario_id, start_date, end_date)

        # Query archived data from S3
        archived_data = []
        for month in months:
            if self._is_archived(month):
                data = await self._fetch_from_s3(month)
                archived_data.extend(data)

        # Merge and return
        return hot_data + archived_data


# Nightly archive job
async def run_nightly_archive():
    """Run the archive process nightly."""
    service = ArchiveService()

    # Archive logs > 1 year
    logs_result = await service.archive_old_logs(days=365)
    logger.info(f"Archived {logs_result['archived']} logs")

    # Archive metrics > 2 years (aggregate first)
    metrics_result = await service.archive_old_metrics(days=730)
    logger.info(f"Archived {metrics_result['archived']} metrics")

    # Compress old reports > 6 months
    reports_result = await service.compress_old_reports(days=180)
    logger.info(f"Compressed {reports_result['compressed']} reports")
```

#### Archive Table Schema

```sql
-- Archive tracking table
CREATE TABLE archive_metadata (
    id SERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    archive_date TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    date_from DATE NOT NULL,
    date_to DATE NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    s3_bucket VARCHAR(100) NOT NULL,
    record_count INTEGER NOT NULL,
    compressed_size_bytes BIGINT NOT NULL,
    uncompressed_size_bytes BIGINT NOT NULL,
    compression_ratio DECIMAL(5,2),
    verification_hash VARCHAR(64),
    restored BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for archive queries
CREATE INDEX idx_archive_table ON archive_metadata(table_name);
CREATE INDEX idx_archive_dates ON archive_metadata(date_from, date_to);
```

### 3.3 CDN Configuration

#### CloudFront Distribution

```hcl
# terraform/cdn.tf
resource "aws_cloudfront_distribution" "mockupaws" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"
  price_class         = "PriceClass_100"  # North America and Europe

  # Origin for static assets
  origin {
    domain_name = aws_s3_bucket.static_assets.bucket_regional_domain_name
    origin_id   = "S3-static"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }

  # Origin for the API (if caching API responses)
  origin {
    domain_name = aws_lb.main.dns_name
    origin_id   = "ALB-api"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Default cache behavior for static assets
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-static"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
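
    # Note (assumption about newer AWS provider versions, not in the original
    # spec): forwarded_values is deprecated in favor of managed cache policies,
    # e.g.:
    #   cache_policy_id = aws_cloudfront_cache_policy.static.id
    # The two styles are mutually exclusive within one cache behavior.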
viewer_protocol_policy = "redirect-to-https" min_ttl = 86400 # 1 day default_ttl = 604800 # 1 week max_ttl = 31536000 # 1 year compress = true } # Cache behavior for API (selective caching) ordered_cache_behavior { path_pattern = "/api/v1/pricing/*" allowed_methods = ["GET", "HEAD", "OPTIONS"] cached_methods = ["GET", "HEAD"] target_origin_id = "ALB-api" forwarded_values { query_string = true headers = ["Origin", "Access-Control-Request-Headers", "Access-Control-Request-Method"] cookies { forward = "none" } } viewer_protocol_policy = "https-only" min_ttl = 3600 # 1 hour default_ttl = 86400 # 24 hours (AWS pricing changes slowly) max_ttl = 604800 # 7 days } # Custom error responses for SPA custom_error_response { error_code = 403 response_code = 200 response_page_path = "/index.html" } custom_error_response { error_code = 404 response_code = 200 response_page_path = "/index.html" } restrictions { geo_restriction { restriction_type = "none" } } viewer_certificate { acm_certificate_arn = aws_acm_certificate.main.arn ssl_support_method = "sni-only" minimum_protocol_version = "TLSv1.2_2021" } } ``` --- ## 4. 
Capacity Planning ### 4.1 Resource Estimates #### Base Capacity (1000 Concurrent Users) | Component | Instance Type | Count | vCPU | Memory | Storage | |-----------|---------------|-------|------|--------|---------| | Load Balancer | t3.medium | 2 | 2 | 4 GB | 20 GB | | Backend API | t3.large | 3 | 8 | 12 GB | 50 GB | | PostgreSQL Primary | r6g.xlarge | 1 | 4 | 32 GB | 500 GB SSD | | PostgreSQL Replica | r6g.large | 2 | 2 | 16 GB | 500 GB SSD | | Redis | cache.r6g.large | 3 | 2 | 13 GB | - | | PgBouncer | t3.small | 2 | 2 | 2 GB | 20 GB | #### Scaling Projections | Users | Backend Instances | DB Connections | Redis Memory | Storage/Month | |-------|-------------------|----------------|--------------|---------------| | 1,000 | 3 | 100 | 10 GB | 100 GB | | 5,000 | 6 | 300 | 25 GB | 400 GB | | 10,000 | 12 | 600 | 50 GB | 800 GB | | 50,000 | 30 | 1500 | 150 GB | 3 TB | ### 4.2 Storage Estimates | Data Type | Daily Volume | Monthly Volume | Annual Volume | Compression | |-----------|--------------|----------------|---------------|-------------| | Logs | 10 GB | 300 GB | 3.6 TB | 70% | | Metrics | 2 GB | 60 GB | 720 GB | 50% | | Reports | 1 GB | 30 GB | 360 GB | 0% | | Backups | - | 500 GB | 6 TB | 80% | | **Total** | **13 GB** | **~900 GB** | **~10 TB** | - | ### 4.3 Network Bandwidth | Traffic Type | Daily | Monthly | Peak (Gbps) | |--------------|-------|---------|-------------| | Ingress (API) | 100 GB | 3 TB | 1 Gbps | | Egress (API) | 500 GB | 15 TB | 5 Gbps | | CDN (Static) | 1 TB | 30 TB | 10 Gbps | ### 4.4 Cost Estimates (AWS) | Service | Monthly Cost (1K users) | Monthly Cost (10K users) | |---------|------------------------|--------------------------| | EC2 (Compute) | $450 | $2,000 | | RDS (PostgreSQL) | $800 | $2,500 | | ElastiCache (Redis) | $400 | $1,200 | | S3 (Storage) | $200 | $800 | | CloudFront (CDN) | $300 | $1,500 | | ALB (Load Balancer) | $100 | $200 | | CloudWatch (Monitoring) | $100 | $300 | | **Total** | **~$2,350** | **~$8,500** | --- ## 5. 
Scaling Thresholds & Triggers ### 5.1 Auto-Scaling Rules ```yaml # Scaling policies scaling_policies: backend_scale_out: metric: cpu_utilization threshold: 70 duration: 300 # 5 minutes adjustment: +1 instance cooldown: 300 backend_scale_in: metric: cpu_utilization threshold: 30 duration: 600 # 10 minutes adjustment: -1 instance cooldown: 600 db_connection_scale: metric: database_connections threshold: 80 duration: 180 action: alert_and_review memory_pressure: metric: memory_utilization threshold: 85 duration: 120 adjustment: +1 instance cooldown: 300 ``` ### 5.2 Alert Thresholds | Metric | Warning | Critical | Emergency | |--------|---------|----------|-----------| | CPU Usage | >60% | >80% | >95% | | Memory Usage | >70% | >85% | >95% | | Disk Usage | >70% | >85% | >95% | | Response Time (p95) | >200ms | >500ms | >1000ms | | Error Rate | >0.1% | >1% | >5% | | DB Connections | >70% | >85% | >95% | | Queue Depth | >500 | >1000 | >5000 | --- ## 6. Component Interactions ### 6.1 Request Flow ``` 1. Client Request └──► CDN (CloudFront) └──► Nginx Load Balancer └──► Backend API (Round-robin) ├──► FastAPI Route Handler ├──► Authentication (JWT/API Key) ├──► Rate Limiting (Redis) ├──► Caching Check (Redis) ├──► Database Query (PgBouncer → PostgreSQL) ├──► Cache Update (Redis) └──► Response ``` ### 6.2 Data Flow ``` 1. Log Ingestion └──► API Endpoint (/api/v1/ingest) ├──► Validation (Pydantic) ├──► Rate Limit Check (Redis) ├──► PII Detection ├──► Token Counting ├──► Async DB Write └──► Background Metric Update 2. 
```
2. Report Generation
   └──► API Request
        ├──► Queue Job (Celery)
        ├──► Worker Processing
        ├──► Data Aggregation
        ├──► PDF Generation
        ├──► Upload to S3
        └──► Notification
```

### 6.3 Failure Scenarios

| Failure | Impact | Mitigation |
|---------|--------|------------|
| Single backend down | 33% capacity loss | Auto-restart, health check removal |
| Primary DB down | Read-only mode | Automatic failover to replica |
| Redis down | No caching | Degrade to DB queries, queue to memory |
| Nginx down | No traffic | Standby takeover (VIP) |
| Region down | Full outage | DNS failover to standby region |

---

## 7. Critical Path for Other Teams

### 7.1 Dependencies

```
SPEC-001 (This Document)
│
├──► @db-engineer - DB-001, DB-002, DB-003
│    (Waiting for: partitioning strategy, connection pooling config)
│
├──► @backend-dev - BE-PERF-004, BE-PERF-005
│    (Waiting for: Redis config, async optimization guidelines)
│
├──► @devops-engineer - DEV-DEPLOY-013, DEV-INFRA-014
│    (Waiting for: infrastructure specs, scaling thresholds)
│
└──► @qa-engineer - QA-PERF-017
     (Waiting for: capacity targets, performance benchmarks)
```

### 7.2 Blocking Items (MUST COMPLETE FIRST)

1. **Load Balancer Configuration** → Blocks: DEV-INFRA-014
2. **Database Connection Pool Settings** → Blocks: DB-001
3. **Redis Cluster Configuration** → Blocks: BE-PERF-004
4. **Scaling Thresholds** → Blocks: QA-PERF-017

### 7.3 Handoff Checklist

Before other teams can proceed:

- [x] Architecture diagrams complete
- [x] Component specifications defined
- [x] Capacity planning estimates provided
- [x] Scaling thresholds documented
- [x] Configuration templates ready
- [ ] Review meeting completed (scheduled)
- [ ] Feedback incorporated
- [ ] Architecture frozen for v1.0.0

---

## Appendix A: Configuration Templates

### Docker Compose Production

```yaml
# docker-compose.prod.yml
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - backend
    networks:
      - frontend
    deploy:
      replicas: 2
      restart_policy:
        condition: any

  backend:
    image: mockupaws/backend:v1.0.0
    environment:
      - DATABASE_URL=postgresql+asyncpg://app:${DB_PASSWORD}@pgbouncer:6432/mockupaws
      - REPLICA_DATABASE_URLS=${REPLICA_URLS}
      - REDIS_URL=redis://redis-cluster:6379
      - JWT_SECRET_KEY=${JWT_SECRET}
    depends_on:
      - pgbouncer
      - redis-cluster
    networks:
      - frontend
      - backend
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
      update_config:
        parallelism: 1
        delay: 10s

  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      - DATABASES_HOST=postgres-primary
      - DATABASES_PORT=5432
      - DATABASES_DATABASE=mockupaws
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=1000
    networks:
      - backend

  redis-cluster:
    image: redis:7-alpine
    command: redis-server /usr/local/etc/redis/redis.conf
    volumes:
      - ./redis/redis.conf:/usr/local/etc/redis/redis.conf
    networks:
      - backend
    deploy:
      replicas: 3

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    internal: true
```

### Environment Variables Template

```bash
# .env.production

# Application
APP_ENV=production
DEBUG=false
LOG_LEVEL=INFO

# Database
DATABASE_URL=postgresql+asyncpg://app:secure_password@pgbouncer:6432/mockupaws
```

```bash
# Database (continued)
REPLICA_DATABASE_URLS=postgresql+asyncpg://app:secure_password@pgbouncer-replica-1:6432/mockupaws,postgresql+asyncpg://app:secure_password@pgbouncer-replica-2:6432/mockupaws
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=10

# Redis
REDIS_URL=redis://redis-cluster:6379
REDIS_CLUSTER_NODES=redis-1:6379,redis-2:6379,redis-3:6379

# Security
JWT_SECRET_KEY=change_me_in_production_32_chars_min
JWT_ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=30
BCRYPT_ROUNDS=12

# Rate Limiting
RATE_LIMIT_GENERAL=100/minute
RATE_LIMIT_AUTH=5/minute
RATE_LIMIT_INGEST=1000/minute

# AWS/S3
AWS_REGION=us-east-1
S3_BUCKET=mockupaws-production
ARCHIVE_S3_BUCKET=mockupaws-archives
CLOUDFRONT_DOMAIN=cdn.mockupaws.com

# Monitoring
SENTRY_DSN=https://xxx@yyy.ingest.sentry.io/zzz
PROMETHEUS_ENABLED=true
JAEGER_ENDPOINT=http://jaeger:14268/api/traces
```

---

*Document Version: 1.0.0-Draft*
*Last Updated: 2026-04-07*
*Owner: @spec-architect*