# Production Architecture Design - mockupAWS v1.0.0

> **Version:** 1.0.0
> **Author:** @spec-architect
> **Date:** 2026-04-07
> **Status:** DRAFT - Ready for Review

---

## Executive Summary

This document defines the production architecture for mockupAWS v1.0.0, transforming the current single-node development setup into an enterprise-grade, scalable, and highly available system.

### Key Architectural Decisions

| Decision | Rationale |
|----------|-----------|
| **Nginx Load Balancer** | Battle-tested, extensive configuration options, SSL termination |
| **PostgreSQL Primary-Replica** | Read scaling for analytics workloads, failover capability |
| **Redis Cluster** | Distributed caching, session storage, rate limiting |
| **Container Orchestration** | Docker Compose for simplicity, Kubernetes-ready design |
| **Multi-Region Active-Passive** | Cost-effective HA, 99.9% uptime target |

---

## 1. Scalability Architecture

### 1.1 System Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                           CLIENT LAYER                           │
│   Web Browser  │  Mobile App  │  API Clients  │  CI/CD           │
└───────────────────────────────┬──────────────────────────────────┘
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      EDGE LAYER (CDN + WAF)                      │
│   CloudFront / Cloudflare CDN                                    │
│   • Static assets caching (React bundle, images, reports)        │
│   • DDoS protection                                              │
│   • Geo-routing                                                  │
└───────────────────────────────┬──────────────────────────────────┘
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                       LOAD BALANCER LAYER                        │
│   Nginx Load Balancer (Active-Standby)                           │
│   • SSL termination (TLS 1.3)                                    │
│   • Health checks: /health endpoint                              │
│   • Sticky sessions (for WebSocket support)                      │
│   • Rate limiting: 1000 req/min per IP                           │
│   • Circuit breaker: 5xx threshold detection                     │
└──────────┬──────────────────────┬─────────────────────┬──────────┘
           ▼                      ▼                     ▼
┌──────────────────────────────────────────────────────────────────┐
│                  APPLICATION LAYER (3x replicas)                 │
│   Backend API          Backend API          Backend API          │
│   Instance 1           Instance 2           Instance 3           │
│   (Port 8000)          (Port 8000)          (Port 8000)          │
│   FastAPI + Uvicorn, 4 workers each                              │
└──────────┬──────────────────────┬─────────────────────┬──────────┘
           └──────────────────────┼─────────────────────┘
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│                            DATA LAYER                            │
│   Redis Cluster                  PostgreSQL Primary-Replica      │
│   M1  M2  M3                     Primary (RW) ──► Replica 1 (RO) │
│   S1  S2  S3                              └─────► Replica 2 (RO) │
│   (3 Masters + 3 Slaves)                                         │
└──────────────────────────────────────────────────────────────────┘
```

### 1.2 Load Balancer Configuration (Nginx)

```nginx
# /etc/nginx/conf.d/mockupaws.conf

# Rate-limiting zones must be declared at http level; files under conf.d/
# are included into the http context, so they live at the top of this file.
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
limit_req_zone $binary_remote_addr zone=auth:10m rate=10r/m;
limit_req_zone $binary_remote_addr zone=ingest:10m rate=1000r/m;

upstream backend {
    least_conn;  # Least-connections load balancing
    server backend-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-3:8000 weight=1 max_fails=3 fail_timeout=30s backup;
    keepalive 32;  # Keepalive connections to upstreams
}

server {
    listen 80;
    server_name api.mockupaws.com;
    return 301 https://$server_name$request_uri;  # Force HTTPS
}

server {
    listen 443 ssl http2;
    server_name api.mockupaws.com;

    # SSL Configuration
    ssl_certificate /etc/ssl/certs/mockupaws.crt;
    ssl_certificate_key /etc/ssl/private/mockupaws.key;
    ssl_protocols TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Health Check Endpoint
    location /health {
        access_log off;
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_send_timeout 5s;
        proxy_read_timeout 5s;
    }

    # API Endpoints with Circuit Breaker
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Circuit Breaker Pattern
        proxy_next_upstream error timeout
            http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }

    # Auth Endpoints - Stricter Rate Limit
    location /api/v1/auth/ {
        limit_req zone=auth burst=5 nodelay;
        proxy_pass http://backend;
    }

    # Ingest Endpoints - Higher Throughput
    location /api/v1/ingest/ {
        limit_req zone=ingest burst=100 nodelay;
        client_max_body_size 10M;
        proxy_pass http://backend;
    }

    # Static Files (if served from backend)
    location /static/ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        proxy_pass http://backend;
    }
}
```

### 1.3 Horizontal Scaling Strategy

#### Scaling Triggers

| Metric | Scale Out Threshold | Scale In Threshold | Action |
|--------|---------------------|--------------------|--------|
| CPU Usage | >70% for 5 min | <30% for 10 min | ±1 instance |
| Memory Usage | >80% for 5 min | <40% for 10 min | ±1 instance |
| Request Latency (p95) | >500ms for 3 min | <200ms for 10 min | +1 instance |
| Queue Depth (Celery) | >1000 jobs | <100 jobs | ±1 worker |
| DB Connections | >80% pool | <50% pool | Review query optimization |

#### Auto-Scaling Configuration (Docker Swarm)

```yaml
# docker-compose.prod.yml - Scaling Configuration
version: '3.8'

services:
  backend:
    image: mockupaws/backend:v1.0.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G
      labels:
        - "prometheus-job=backend"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  nginx:
    image: nginx:alpine
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == manager
    ports:
      - "80:80"
      - "443:443"
```

#### Kubernetes HPA Alternative

```yaml
# k8s/hpa-backend.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```

### 1.4 Database Read Replicas

#### PostgreSQL Primary-Replica Setup

```
                    ┌─────────────────┐
   Read/Write ────► │     Primary     │
                    │  (postgres-1)   │
                    │  • All writes   │
                    │  • WAL shipping │
                    │  • Sync commit  │
                    └────────┬────────┘
               streaming replication
              ┌─────────────┴─────────────┐
              ▼                           ▼
   ┌─────────────────┐          ┌─────────────────┐
   │    Replica 1    │          │    Replica 2    │
   │  (postgres-2)   │          │  (postgres-3)   │
   │  • Read-only    │          │  • Read-only    │
   │  • Async replica│          │  • Async replica│
   │  • Hot standby  │          │  • Hot standby  │
   └────────┬────────┘          └────────┬────────┘
            └─────────────┬──────────────┘
                          ▼
         ┌─────────────────────────────────┐
         │   PgBouncer Connection Pool     │
         │   Pool Mode: Transaction        │
         │   Max Connections: 1000         │
         │   Default Pool: 25 per db/user  │
         └─────────────────────────────────┘
```

#### Connection Pooling (PgBouncer)

```ini
; /etc/pgbouncer/pgbouncer.ini
[databases]
mockupaws = host=postgres-primary port=5432 dbname=mockupaws
mockupaws_replica = host=postgres-replica-1 port=5432 dbname=mockupaws

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Pool settings
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Timeouts
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 0
query_wait_timeout = 120

; Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60
```

#### Application-Level Read/Write Splitting

```python
# src/core/database.py - Enhanced with read replica support
import os

from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession, async_sessionmaker
from sqlalchemy.orm import declarative_base

# Primary (RW) database
PRIMARY_DATABASE_URL = os.getenv(
    "DATABASE_URL",
    "postgresql+asyncpg://postgres:postgres@localhost:5432/mockupaws"
)

# Replica (RO) databases
REPLICA_DATABASE_URLS = os.getenv(
    "REPLICA_DATABASE_URLS", ""
).split(",") if os.getenv("REPLICA_DATABASE_URLS") else []

# Primary engine (RW)
primary_engine = create_async_engine(
    PRIMARY_DATABASE_URL,
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
    pool_recycle=3600,
)

# Replica engines (RO)
replica_engines = [
    create_async_engine(url, pool_size=5, max_overflow=10, pool_pre_ping=True)
    for url in REPLICA_DATABASE_URLS if url
]

# Session factories
PrimarySessionLocal = async_sessionmaker(primary_engine, class_=AsyncSession)
ReplicaSessionLocal = async_sessionmaker(
    replica_engines[0] if replica_engines else primary_engine,
    class_=AsyncSession
)

Base = declarative_base()


async def get_db(write: bool = False) -> AsyncSession:
    """Get a database session with automatic read/write splitting."""
    if write:
        async with PrimarySessionLocal() as session:
            yield session
    else:
        async with ReplicaSessionLocal() as session:
            yield session


class DatabaseRouter:
    """Route queries to the appropriate database based on operation type."""

    @staticmethod
    def get_engine(operation: str = "read"):
        """Get the appropriate engine for an operation."""
        if operation in ("write", "insert", "update", "delete"):
            return primary_engine
        return replica_engines[0] if replica_engines else primary_engine
```

---

## 2. High Availability Design

### 2.1 Multi-Region Deployment Strategy

#### Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                   GLOBAL TRAFFIC MANAGER                    │
│            (Route53 / Cloudflare Load Balancing)            │
│                                                             │
│   Health checks: /health endpoint every 30s                 │
│   Failover: automatic on 3 consecutive failures             │
│   Latency-based routing: route to nearest healthy region    │
└─────────────┬──────────────────────────────┬────────────────┘
              ▼                              ▼
┌──────────────────────────┐    ┌──────────────────────────┐
│      PRIMARY REGION      │    │      STANDBY REGION      │
│       (us-east-1)        │    │       (eu-west-1)        │
│                          │    │                          │
│  Application Stack       │    │  Application Stack       │
│  (3x backend, 2x LB)     │    │  (2x backend, 2x LB)     │
│                          │    │                          │
│  PostgreSQL Primary ─────┼────┼─► PostgreSQL Replica     │
│  + 2 Replicas            │    │   (Hot Standby)          │
│                          │    │                          │
│  Redis Cluster ──────────┼────┼─► Redis Replica          │
│  (3 Masters)             │    │   (Read-only)            │
│                          │    │                          │
│  S3 Bucket (Primary) ◄───┼────┼─► S3 Cross-Region        │
│                          │    │   Replication            │
└────────────┬─────────────┘    └─────────────┬────────────┘
             │        ┌──────────────┐        │
             └───────►│    BACKUP    │◄───────┘
                      │  S3 Bucket   │
                      │ (3rd Region) │
                      └──────────────┘
```

#### Failover Mechanisms

**Database Failover (Automatic)**

```python
# scripts/db-failover.py
"""Automated database failover script."""
import os

import asyncpg


class DatabaseFailoverManager:
    """Manage PostgreSQL failover."""

    async def check_primary_health(self, primary_host: str) -> bool:
        """Check whether the primary database is healthy."""
        try:
            conn = await asyncpg.connect(
                host=primary_host,
                database="mockupaws",
                user="healthcheck",
                password=os.getenv("DB_HEALTH_PASSWORD"),
                timeout=5
            )
            result = await conn.fetchval("SELECT 1")
            await conn.close()
            return result == 1
        except Exception:
            return False

    async def promote_replica(self, replica_host: str) -> bool:
        """Promote a replica to primary."""
        # Execute pg_ctl promote on the replica
        # Update connection strings in application config
        # Notify the application to reconnect
        ...

    async def run_failover(self) -> bool:
        """Execute the full failover procedure."""
        # 1. Verify the primary is truly down (avoid split-brain)
        # 2. Promote the best replica to primary
        # 3. Update DNS/load balancer configuration
        # 4. Notify on-call engineers
        # 5. Begin recovery of the old primary as a new replica
        ...


# Health check endpoint for the load balancer
# (lives in the FastAPI app, not in this script)
@app.get("/health/db")
async def database_health_check():
    """Deep health check including database connectivity."""
    try:
        # Quick query to verify the DB connection
        result = await db.execute("SELECT 1")
        return {"status": "healthy", "database": "connected"}
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail={"status": "unhealthy", "database": str(e)}
        )
```

**Redis Failover (Redis Sentinel)**

```
# redis-sentinel.conf
# (placeholders such as ${REDIS_PASSWORD} are substituted at deploy time;
# Sentinel itself does not expand environment variables)
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster ${REDIS_PASSWORD}

# Notification
sentinel notification-script mymaster /usr/local/bin/notify.sh
```

### 2.2 Circuit Breaker Pattern

```python
# src/core/circuit_breaker.py
"""Circuit breaker pattern implementation."""
import asyncio
import time
from enum import Enum
from functools import wraps
from typing import Any, Callable


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing whether the service recovered


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Circuit breaker for external service calls."""

    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        half_open_max_calls: int = 3
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = asyncio.Lock()

    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute a function with circuit breaker protection."""
        async with self._lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time >= self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpen(f"Circuit {self.name} is OPEN")
            if self.state == CircuitState.HALF_OPEN and self.success_count >= self.half_open_max_calls:
                raise CircuitBreakerOpen(f"Circuit {self.name} HALF_OPEN limit reached")

        # Invoke outside the lock: asyncio.Lock is not reentrant, and
        # _on_success/_on_failure re-acquire it.
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception:
            await self._on_failure()
            raise

    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self._reset()
            else:
                self.failure_count = 0

    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

    def _reset(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None


def circuit_breaker(
    name: str,
    failure_threshold: int = 5,
    recovery_timeout: int = 60
):
    """Decorator for the circuit breaker pattern."""
    breaker = CircuitBreaker(name, failure_threshold, recovery_timeout)

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator


# Usage example
import httpx

@circuit_breaker(name="aws_pricing_api", failure_threshold=3, recovery_timeout=30)
async def fetch_aws_pricing(service: str, region: str):
    """Fetch AWS pricing with circuit breaker protection."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://pricing.us-east-1.amazonaws.com/{service}/{region}",
            timeout=10.0
        )
        return response.json()
```

### 2.3 Graceful Degradation

```python
# src/core/degradation.py
"""Graceful degradation strategies."""
import asyncio
import logging
from functools import wraps
from typing import Any

logger = logging.getLogger(__name__)


class DegradationStrategy:
    """Base class for degradation strategies."""

    async def fallback(self, *args, **kwargs) -> Any:
        """Return a fallback value when the primary fails."""
        raise NotImplementedError


class CacheFallback(DegradationStrategy):
    """Fall back to cached data."""

    def __init__(self, cache_key: str, max_age: int = 3600):
        self.cache_key = cache_key
        self.max_age = max_age

    async def fallback(self, *args, **kwargs) -> Any:
        # Return stale cache data (redis is the application's shared async client)
        return await redis.get(f"stale:{self.cache_key}")


class StaticFallback(DegradationStrategy):
    """Fall back to static/default data."""

    def __init__(self, default_value: Any):
        self.default_value = default_value

    async def fallback(self, *args, **kwargs) -> Any:
        return self.default_value


class EmptyFallback(DegradationStrategy):
    """Fall back to an empty result."""

    async def fallback(self, *args, **kwargs) -> Any:
        return []


def with_degradation(
    strategy: DegradationStrategy,
    timeout: float = 5.0,
    exceptions: tuple = (Exception,)
):
    """Decorator for graceful degradation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await asyncio.wait_for(
                    func(*args, **kwargs), timeout=timeout
                )
            except exceptions as e:
                logger.warning(
                    f"Primary function failed, using fallback: {e}",
                    extra={"function": func.__name__}
                )
                return await strategy.fallback(*args, **kwargs)
        return wrapper
    return decorator


# Usage examples
@with_degradation(
    strategy=CacheFallback(cache_key="aws_pricing", max_age=86400),
    timeout=3.0
)
async def get_aws_pricing(service: str, region: str):
    """Get AWS pricing with a cache fallback."""
    # Primary: fetch from the AWS API
    ...


@with_degradation(
    strategy=StaticFallback(default_value={"status": "degraded", "metrics": []}),
    timeout=2.0
)
async def get_dashboard_metrics(scenario_id: str):
    """Get metrics with a static fallback on failure."""
    # Primary: fetch from the database
    ...


@with_degradation(
    strategy=EmptyFallback(),
    timeout=1.0
)
async def get_recommendations(scenario_id: str):
    """Get recommendations with an empty fallback."""
    # Primary: ML-based recommendation engine
    ...
```

---

## 3. Data Architecture

### 3.1 Database Partitioning Strategy

#### Time-Based Partitioning for Logs and Metrics

```sql
-- Enable the pg_partman extension
CREATE EXTENSION IF NOT EXISTS pg_partman;

-- Partitioned scenario_logs table
CREATE TABLE scenario_logs_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) DEFAULT 'unknown',
    size_bytes INTEGER DEFAULT 0,
    has_pii BOOLEAN DEFAULT FALSE,
    token_count INTEGER DEFAULT 0,
    sqs_blocks INTEGER DEFAULT 1,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Create partitions (monthly)
SELECT partman.create_parent('public.scenario_logs_partitioned', 'received_at', 'native', 'monthly');

-- Partitioned scenario_metrics table
CREATE TABLE scenario_metrics_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15, 6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);
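
-- Illustrative addition (not in the original spec; index names are assumptions):
-- indexes created on a partitioned parent cascade to every partition, which
-- supports the common per-scenario time-range queries.
CREATE INDEX IF NOT EXISTS idx_logs_scenario_time
    ON scenario_logs_partitioned (scenario_id, received_at DESC);
CREATE INDEX IF NOT EXISTS idx_metrics_scenario_time
    ON scenario_metrics_partitioned (scenario_id, timestamp DESC);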
SELECT partman.create_parent('public.scenario_metrics_partitioned', 'timestamp', 'native', 'daily');

-- Automated partition maintenance
SELECT partman.run_maintenance('scenario_logs_partitioned');
```

#### Tenant Isolation Strategy

```sql
-- Row-Level Security for multi-tenant support
ALTER TABLE scenarios ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_logs ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_metrics ENABLE ROW LEVEL SECURITY;

-- Add tenant_id column
ALTER TABLE scenarios ADD COLUMN tenant_id UUID NOT NULL
    DEFAULT '00000000-0000-0000-0000-000000000000';
ALTER TABLE scenario_logs ADD COLUMN tenant_id UUID NOT NULL
    DEFAULT '00000000-0000-0000-0000-000000000000';

-- Create RLS policies
CREATE POLICY tenant_isolation_scenarios ON scenarios
    USING (tenant_id = current_setting('app.current_tenant')::UUID);
CREATE POLICY tenant_isolation_logs ON scenario_logs
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Set tenant context per session
SET app.current_tenant = 'tenant-uuid-here';
```

### 3.2 Data Archive Strategy

#### Archive Policy

| Data Type | Retention Hot | Retention Warm | Archive To | Compression |
|-----------|---------------|----------------|------------|-------------|
| Scenario Logs | 90 days | 1 year | S3 Glacier | GZIP |
| Scenario Metrics | 30 days | 90 days | S3 Standard-IA | Parquet |
| Reports | 30 days | 6 months | S3 Glacier | None (PDF) |
| Audit Logs | 1 year | 7 years | S3 Glacier Deep | GZIP |

#### Archive Implementation

```python
# src/services/archive_service.py
"""Data archiving service for old records."""
import gzip
import os
from datetime import datetime, timedelta
from typing import List

import aioboto3


class ArchiveService:
    """Service for archiving old data to S3."""

    def __init__(self):
        self.s3_bucket = os.getenv("ARCHIVE_S3_BUCKET")
        self.s3_prefix = os.getenv("ARCHIVE_S3_PREFIX", "archives/")
        self.session = aioboto3.Session()

    async def archive_old_logs(self, days: int = 365) -> dict:
        """Archive logs
        older than specified days."""
        cutoff_date = datetime.utcnow() - timedelta(days=days)

        # Query old logs
        query = """
            SELECT * FROM scenario_logs
            WHERE received_at < :cutoff_date
              AND archived = FALSE
            LIMIT 100000
        """
        result = await db.execute(query, {"cutoff_date": cutoff_date})
        logs = result.fetchall()

        if not logs:
            return {"archived": 0, "bytes": 0}

        # Group by month for efficient storage
        logs_by_month = self._group_by_month(logs)

        total_archived = 0
        total_bytes = 0

        async with self.session.client("s3") as s3:
            for month_key, month_logs in logs_by_month.items():
                # Serialize to JSON Lines and compress
                data = self._serialize_logs(month_logs)
                compressed = gzip.compress(data.encode())

                # Upload to S3
                s3_key = f"{self.s3_prefix}logs/{month_key}.jsonl.gz"
                await s3.put_object(
                    Bucket=self.s3_bucket,
                    Key=s3_key,
                    Body=compressed,
                    StorageClass="GLACIER"
                )

                # Mark as archived in the database
                await self._mark_archived([log.id for log in month_logs])

                total_archived += len(month_logs)
                total_bytes += len(compressed)

        return {
            "archived": total_archived,
            "bytes": total_bytes,
            "months": len(logs_by_month)
        }

    async def query_archive(
        self,
        scenario_id: str,
        start_date: datetime,
        end_date: datetime
    ) -> List[dict]:
        """Query archived data (transparent to the application)."""
        # Determine which months to query
        months = self._get_months_between(start_date, end_date)

        # Query hot data from the database
        hot_data = await self._query_hot_data(scenario_id, start_date, end_date)

        # Query archived data from S3
        archived_data = []
        for month in months:
            if self._is_archived(month):
                data = await self._fetch_from_s3(month)
                archived_data.extend(data)

        # Merge and return
        return hot_data + archived_data


# Nightly archive job
async def run_nightly_archive():
    """Run the archive process nightly."""
    service = ArchiveService()

    # Archive logs > 1 year
    logs_result = await service.archive_old_logs(days=365)
    logger.info(f"Archived {logs_result['archived']} logs")

    # Archive metrics > 2 years (aggregate first)
    metrics_result = await service.archive_old_metrics(days=730)
    logger.info(f"Archived {metrics_result['archived']} metrics")

    # Compress old reports > 6 months
    reports_result = await service.compress_old_reports(days=180)
    logger.info(f"Compressed {reports_result['compressed']} reports")
```

#### Archive Table Schema

```sql
-- Archive tracking table
CREATE TABLE archive_metadata (
    id SERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    archive_date TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    date_from DATE NOT NULL,
    date_to DATE NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    s3_bucket VARCHAR(100) NOT NULL,
    record_count INTEGER NOT NULL,
    compressed_size_bytes BIGINT NOT NULL,
    uncompressed_size_bytes BIGINT NOT NULL,
    compression_ratio DECIMAL(5,2),
    verification_hash VARCHAR(64),
    restored BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for archive queries
CREATE INDEX idx_archive_table ON archive_metadata(table_name);
CREATE INDEX idx_archive_dates ON archive_metadata(date_from, date_to);
```

### 3.3 CDN Configuration

#### CloudFront Distribution

```hcl
# terraform/cdn.tf
resource "aws_cloudfront_distribution" "mockupaws" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"
  price_class         = "PriceClass_100"  # North America and Europe

  # Origin for static assets
  origin {
    domain_name = aws_s3_bucket.static_assets.bucket_regional_domain_name
    origin_id   = "S3-static"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }

  # Origin for the API (if caching API responses)
  origin {
    domain_name = aws_lb.main.dns_name
    origin_id   = "ALB-api"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Default cache behavior for static assets
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-static"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
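
    # Note (assumption about newer AWS provider versions, not in the original
    # spec): forwarded_values is deprecated in favor of managed cache policies,
    # e.g.:
    #   cache_policy_id = aws_cloudfront_cache_policy.static.id
    # The two styles are mutually exclusive within one cache behavior.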
viewer_protocol_policy = "redirect-to-https" min_ttl = 86400 # 1 day default_ttl = 604800 # 1 week max_ttl = 31536000 # 1 year compress = true } # Cache behavior for API (selective caching) ordered_cache_behavior { path_pattern = "/api/v1/pricing/*" allowed_methods = ["GET", "HEAD", "OPTIONS"] cached_methods = ["GET", "HEAD"] target_origin_id = "ALB-api" forwarded_values { query_string = true headers = ["Origin", "Access-Control-Request-Headers", "Access-Control-Request-Method"] cookies { forward = "none" } } viewer_protocol_policy = "https-only" min_ttl = 3600 # 1 hour default_ttl = 86400 # 24 hours (AWS pricing changes slowly) max_ttl = 604800 # 7 days } # Custom error responses for SPA custom_error_response { error_code = 403 response_code = 200 response_page_path = "/index.html" } custom_error_response { error_code = 404 response_code = 200 response_page_path = "/index.html" } restrictions { geo_restriction { restriction_type = "none" } } viewer_certificate { acm_certificate_arn = aws_acm_certificate.main.arn ssl_support_method = "sni-only" minimum_protocol_version = "TLSv1.2_2021" } } ``` --- ## 4. 
Capacity Planning ### 4.1 Resource Estimates #### Base Capacity (1000 Concurrent Users) | Component | Instance Type | Count | vCPU | Memory | Storage | |-----------|---------------|-------|------|--------|---------| | Load Balancer | t3.medium | 2 | 2 | 4 GB | 20 GB | | Backend API | t3.large | 3 | 8 | 12 GB | 50 GB | | PostgreSQL Primary | r6g.xlarge | 1 | 4 | 32 GB | 500 GB SSD | | PostgreSQL Replica | r6g.large | 2 | 2 | 16 GB | 500 GB SSD | | Redis | cache.r6g.large | 3 | 2 | 13 GB | - | | PgBouncer | t3.small | 2 | 2 | 2 GB | 20 GB | #### Scaling Projections | Users | Backend Instances | DB Connections | Redis Memory | Storage/Month | |-------|-------------------|----------------|--------------|---------------| | 1,000 | 3 | 100 | 10 GB | 100 GB | | 5,000 | 6 | 300 | 25 GB | 400 GB | | 10,000 | 12 | 600 | 50 GB | 800 GB | | 50,000 | 30 | 1500 | 150 GB | 3 TB | ### 4.2 Storage Estimates | Data Type | Daily Volume | Monthly Volume | Annual Volume | Compression | |-----------|--------------|----------------|---------------|-------------| | Logs | 10 GB | 300 GB | 3.6 TB | 70% | | Metrics | 2 GB | 60 GB | 720 GB | 50% | | Reports | 1 GB | 30 GB | 360 GB | 0% | | Backups | - | 500 GB | 6 TB | 80% | | **Total** | **13 GB** | **~900 GB** | **~10 TB** | - | ### 4.3 Network Bandwidth | Traffic Type | Daily | Monthly | Peak (Gbps) | |--------------|-------|---------|-------------| | Ingress (API) | 100 GB | 3 TB | 1 Gbps | | Egress (API) | 500 GB | 15 TB | 5 Gbps | | CDN (Static) | 1 TB | 30 TB | 10 Gbps | ### 4.4 Cost Estimates (AWS) | Service | Monthly Cost (1K users) | Monthly Cost (10K users) | |---------|------------------------|--------------------------| | EC2 (Compute) | $450 | $2,000 | | RDS (PostgreSQL) | $800 | $2,500 | | ElastiCache (Redis) | $400 | $1,200 | | S3 (Storage) | $200 | $800 | | CloudFront (CDN) | $300 | $1,500 | | ALB (Load Balancer) | $100 | $200 | | CloudWatch (Monitoring) | $100 | $300 | | **Total** | **~$2,350** | **~$8,500** | --- ## 5. 
Scaling Thresholds & Triggers ### 5.1 Auto-Scaling Rules ```yaml # Scaling policies scaling_policies: backend_scale_out: metric: cpu_utilization threshold: 70 duration: 300 # 5 minutes adjustment: +1 instance cooldown: 300 backend_scale_in: metric: cpu_utilization threshold: 30 duration: 600 # 10 minutes adjustment: -1 instance cooldown: 600 db_connection_scale: metric: database_connections threshold: 80 duration: 180 action: alert_and_review memory_pressure: metric: memory_utilization threshold: 85 duration: 120 adjustment: +1 instance cooldown: 300 ``` ### 5.2 Alert Thresholds | Metric | Warning | Critical | Emergency | |--------|---------|----------|-----------| | CPU Usage | >60% | >80% | >95% | | Memory Usage | >70% | >85% | >95% | | Disk Usage | >70% | >85% | >95% | | Response Time (p95) | >200ms | >500ms | >1000ms | | Error Rate | >0.1% | >1% | >5% | | DB Connections | >70% | >85% | >95% | | Queue Depth | >500 | >1000 | >5000 | --- ## 6. Component Interactions ### 6.1 Request Flow ``` 1. Client Request └──► CDN (CloudFront) └──► Nginx Load Balancer └──► Backend API (Round-robin) ├──► FastAPI Route Handler ├──► Authentication (JWT/API Key) ├──► Rate Limiting (Redis) ├──► Caching Check (Redis) ├──► Database Query (PgBouncer → PostgreSQL) ├──► Cache Update (Redis) └──► Response ``` ### 6.2 Data Flow ``` 1. Log Ingestion └──► API Endpoint (/api/v1/ingest) ├──► Validation (Pydantic) ├──► Rate Limit Check (Redis) ├──► PII Detection ├──► Token Counting ├──► Async DB Write └──► Background Metric Update 2. 
```
2. Report Generation
   └──► API Request
        ├──► Queue Job (Celery)
        ├──► Worker Processing
        ├──► Data Aggregation
        ├──► PDF Generation
        ├──► Upload to S3
        └──► Notification
```

### 6.3 Failure Scenarios

| Failure | Impact | Mitigation |
|---------|--------|------------|
| Single backend down | 33% capacity loss | Auto-restart, health check removal |
| Primary DB down | Read-only mode | Automatic failover to replica |
| Redis down | No caching | Degrade to DB queries, queue to memory |
| Nginx down | No traffic | Standby takeover (VIP) |
| Region down | Full outage | DNS failover to standby region |

---

## 7. Critical Path for Other Teams

### 7.1 Dependencies

```
SPEC-001 (This Document)
│
├──► @db-engineer - DB-001, DB-002, DB-003
│    (Waiting for: partitioning strategy, connection pooling config)
│
├──► @backend-dev - BE-PERF-004, BE-PERF-005
│    (Waiting for: Redis config, async optimization guidelines)
│
├──► @devops-engineer - DEV-DEPLOY-013, DEV-INFRA-014
│    (Waiting for: infrastructure specs, scaling thresholds)
│
└──► @qa-engineer - QA-PERF-017
     (Waiting for: capacity targets, performance benchmarks)
```

### 7.2 Blocking Items (MUST COMPLETE FIRST)

1. **Load Balancer Configuration** → Blocks: DEV-INFRA-014
2. **Database Connection Pool Settings** → Blocks: DB-001
3. **Redis Cluster Configuration** → Blocks: BE-PERF-004
4. **Scaling Thresholds** → Blocks: QA-PERF-017

### 7.3 Handoff Checklist

Before other teams can proceed:

- [x] Architecture diagrams complete
- [x] Component specifications defined
- [x] Capacity planning estimates provided
- [x] Scaling thresholds documented
- [x] Configuration templates ready
- [ ] Review meeting completed (scheduled)
- [ ] Feedback incorporated
- [ ] Architecture frozen for v1.0.0

---

## Appendix A: Configuration Templates

### Docker Compose Production

```yaml
# docker-compose.prod.yml
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - backend
    networks:
      - frontend
    deploy:
      replicas: 2
      restart_policy:
        condition: any

  backend:
    image: mockupaws/backend:v1.0.0
    environment:
      - DATABASE_URL=postgresql+asyncpg://app:${DB_PASSWORD}@pgbouncer:6432/mockupaws
      - REPLICA_DATABASE_URLS=${REPLICA_URLS}
      - REDIS_URL=redis://redis-cluster:6379
      - JWT_SECRET_KEY=${JWT_SECRET}
    depends_on:
      - pgbouncer
      - redis-cluster
    networks:
      - frontend
      - backend
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
      update_config:
        parallelism: 1
        delay: 10s

  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      - DATABASES_HOST=postgres-primary
      - DATABASES_PORT=5432
      - DATABASES_DATABASE=mockupaws
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=1000
    networks:
      - backend

  redis-cluster:
    image: redis:7-alpine
    command: redis-server /usr/local/etc/redis/redis.conf
    volumes:
      - ./redis/redis.conf:/usr/local/etc/redis/redis.conf
    networks:
      - backend
    deploy:
      replicas: 3

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    internal: true
```

### Environment Variables Template

```bash
# .env.production

# Application
APP_ENV=production
DEBUG=false
LOG_LEVEL=INFO

# Database
DATABASE_URL=postgresql+asyncpg://app:secure_password@pgbouncer:6432/mockupaws
```

```bash
# Database (continued)
REPLICA_DATABASE_URLS=postgresql+asyncpg://app:secure_password@pgbouncer-replica-1:6432/mockupaws,postgresql+asyncpg://app:secure_password@pgbouncer-replica-2:6432/mockupaws
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=10

# Redis
REDIS_URL=redis://redis-cluster:6379
REDIS_CLUSTER_NODES=redis-1:6379,redis-2:6379,redis-3:6379

# Security
JWT_SECRET_KEY=change_me_in_production_32_chars_min
JWT_ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=30
BCRYPT_ROUNDS=12

# Rate Limiting
RATE_LIMIT_GENERAL=100/minute
RATE_LIMIT_AUTH=5/minute
RATE_LIMIT_INGEST=1000/minute

# AWS/S3
AWS_REGION=us-east-1
S3_BUCKET=mockupaws-production
ARCHIVE_S3_BUCKET=mockupaws-archives
CLOUDFRONT_DOMAIN=cdn.mockupaws.com

# Monitoring
SENTRY_DSN=https://xxx@yyy.ingest.sentry.io/zzz
PROMETHEUS_ENABLED=true
JAEGER_ENDPOINT=http://jaeger:14268/api/traces
```

---

*Document Version: 1.0.0-Draft*
*Last Updated: 2026-04-07*
*Owner: @spec-architect*