lucasacchi/mockupAWS

Fork 0

Files

Luca Sacchi Ricciardi 38fd6cb562

CI/CD - Build & Test / Backend Tests (push) Has been cancelled

Details

CI/CD - Build & Test / Frontend Tests (push) Has been cancelled

Details

CI/CD - Build & Test / Security Scans (push) Has been cancelled

Details

CI/CD - Build & Test / Docker Build Test (push) Has been cancelled

Details

CI/CD - Build & Test / Terraform Validate (push) Has been cancelled

Details

Deploy to Production / Build & Test (push) Has been cancelled

Details

Deploy to Production / Security Scan (push) Has been cancelled

Details

Deploy to Production / Build Docker Images (push) Has been cancelled

Details

Deploy to Production / Deploy to Staging (push) Has been cancelled

Details

Deploy to Production / E2E Tests (push) Has been cancelled

Details

Deploy to Production / Deploy to Production (push) Has been cancelled

Details

E2E Tests / Run E2E Tests (push) Has been cancelled

Details

E2E Tests / Visual Regression Tests (push) Has been cancelled

Details

E2E Tests / Smoke Tests (push) Has been cancelled

Details

release: v1.0.0 - Production Ready

Complete production-ready release with all v1.0.0 features:

Architecture & Planning (@spec-architect):
- Production architecture design with scalability and HA
- Security audit plan and compliance review
- Technical debt assessment and refactoring roadmap

Database (@db-engineer):
- 17 performance indexes and 3 materialized views
- PgBouncer connection pooling
- Automated backup/restore with PITR (RTO<1h, RPO<5min)
- Data archiving strategy (~65% storage savings)

Backend (@backend-dev):
- Redis caching layer with 3-tier strategy
- Celery async jobs with Flower monitoring
- API v2 with rate limiting (tiered: free/premium/enterprise)
- Prometheus metrics and OpenTelemetry tracing
- Security hardening (headers, audit logging)

Frontend (@frontend-dev):
- Bundle optimization: 308KB (code splitting, lazy loading)
- Onboarding tutorial (react-joyride)
- Command palette (Cmd+K) and keyboard shortcuts
- Analytics dashboard with cost predictions
- i18n (English + Italian) and WCAG 2.1 AA compliance

DevOps (@devops-engineer):
- Complete deployment guide (Docker, K8s, AWS ECS)
- Terraform AWS infrastructure (Multi-AZ RDS, ElastiCache, ECS)
- CI/CD pipelines with blue-green deployment
- Prometheus + Grafana monitoring with 15+ alert rules
- SLA definition and incident response procedures

QA (@qa-engineer):
- 153+ E2E test cases (85% coverage)
- k6 performance tests (1000+ concurrent users, p95<200ms)
- Security testing (0 critical vulnerabilities)
- Cross-browser and mobile testing
- Official QA sign-off

Production Features:
✅ Horizontal scaling ready
✅ 99.9% uptime target
✅ <200ms response time (p95)
✅ Enterprise-grade security
✅ Complete observability
✅ Disaster recovery
✅ SLA monitoring

Ready for production deployment! 🚀

2026-04-07 20:14:51 +02:00

52 KiB

Raw Blame History

Production Architecture Design - mockupAWS v1.0.0

Version: 1.0.0
Author: @spec-architect
Date: 2026-04-07
Status: DRAFT - Ready for Review

Executive Summary

This document defines the production architecture for mockupAWS v1.0.0, transforming the current single-node development setup into an enterprise-grade, scalable, and highly available system.

Key Architectural Decisions

Decision	Rationale
Nginx Load Balancer	Battle-tested, extensive configuration options, SSL termination
PostgreSQL Primary-Replica	Read scaling for analytics workloads, failover capability
Redis Cluster	Distributed caching, session storage, rate limiting
Container Orchestration	Docker Compose for simplicity, Kubernetes-ready design
Multi-Region Active-Passive	Cost-effective HA, 99.9% uptime target

1. Scalability Architecture

1.1 System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Web Browser  │  │ Mobile App   │  │ API Clients  │  │ CI/CD        │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
└─────────┼──────────────────┼──────────────────┼──────────────────┼───────────┘
          │                  │                  │                  │
          └──────────────────┴──────────────────┴──────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           EDGE LAYER (CDN + WAF)                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ CloudFront / Cloudflare CDN                                         │    │
│  │ • Static assets caching (React bundle, images, reports)            │    │
│  │ • DDoS protection                                                   │    │
│  │ • Geo-routing                                                       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          LOAD BALANCER LAYER                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ Nginx Load Balancer (Active-Standby)                                │    │
│  │ • SSL Termination (TLS 1.3)                                         │    │
│  │ • Health checks: /health endpoint                                   │    │
│  │ • Sticky sessions (for WebSocket support)                          │    │
│  │ • Rate limiting: 1000 req/min per IP                                │    │
│  │ • Circuit breaker: 5xx threshold detection                         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER (3x replicas)                       │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐          │
│  │   Backend API    │  │   Backend API    │  │   Backend API    │          │
│  │   Instance 1     │  │   Instance 2     │  │   Instance 3     │          │
│  │   (Port 8000)    │  │   (Port 8000)    │  │   (Port 8000)    │          │
│  ├──────────────────┤  ├──────────────────┤  ├──────────────────┤          │
│  │ • FastAPI        │  │ • FastAPI        │  │ • FastAPI        │          │
│  │ • Uvicorn        │  │ • Uvicorn        │  │ • Uvicorn        │          │
│  │ • 4 Workers      │  │ • 4 Workers      │  │ • 4 Workers      │          │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘          │
└───────────┼─────────────────────┼─────────────────────┼────────────────────┘
            │                     │                     │
            └─────────────────────┼─────────────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    ▼                           ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        DATA LAYER                                            │
│  ┌─────────────────────────┐    ┌────────────────────────────────────────┐  │
│  │    Redis Cluster        │    │     PostgreSQL Primary-Replica         │  │
│  │  ┌─────┐ ┌─────┐ ┌────┐│    │  ┌──────────┐      ┌──────────────┐   │  │
│  │  │ M1  │ │ M2  │ │ M3 ││    │  │ Primary  │◄────►│  Replica 1   │   │  │
│  │  └──┬──┘ └──┬──┘ └──┬─┘│    │  │  (RW)    │  Sync │   (RO)       │   │  │
│  │     └───────┴───────┘  │    │  └────┬─────┘      └──────────────┘   │  │
│  │  ┌─────┐ ┌─────┐ ┌────┐│    │       │            ┌──────────────┐   │  │
│  │  │ S1  │ │ S2  │ │ S3 ││    │       └───────────►│  Replica 2   │   │  │
│  │  └─────┘ └─────┘ └────┘│    │                    │   (RO)       │   │  │
│  │  (3 Masters + 3 Slaves) │    │                    └──────────────┘   │  │
│  └─────────────────────────┘    └────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Load Balancer Configuration (Nginx)

# /etc/nginx/conf.d/mockupaws.conf

upstream backend {
    least_conn;  # Least connections load balancing
    server backend-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    server backend-3:8000 weight=1 max_fails=3 fail_timeout=30s backup;
    
    keepalive 32;  # Keepalive connections
}

server {
    listen 80;
    server_name api.mockupaws.com;
    return 301 https://$server_name$request_uri;  # Force HTTPS
}

server {
    listen 443 ssl http2;
    server_name api.mockupaws.com;
    
    # SSL Configuration
    ssl_certificate /etc/ssl/certs/mockupaws.crt;
    ssl_certificate_key /etc/ssl/private/mockupaws.key;
    ssl_protocols TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    
    # Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    
    # Rate Limiting Zones
    limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
    limit_req_zone $binary_remote_addr zone=auth:10m rate=10r/m;
    limit_req_zone $binary_remote_addr zone=ingest:10m rate=1000r/m;
    
    # Health Check Endpoint
    location /health {
        access_log off;
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_send_timeout 5s;
        proxy_read_timeout 5s;
    }
    
    # API Endpoints with Circuit Breaker
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
        
        # Circuit Breaker Pattern
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
    
    # Auth Endpoints - Stricter Rate Limit
    location /api/v1/auth/ {
        limit_req zone=auth burst=5 nodelay;
        proxy_pass http://backend;
    }
    
    # Ingest Endpoints - Higher Throughput
    location /api/v1/ingest/ {
        limit_req zone=ingest burst=100 nodelay;
        client_max_body_size 10M;
        proxy_pass http://backend;
    }
    
    # Static Files (if served from backend)
    location /static/ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        proxy_pass http://backend;
    }
}

1.3 Horizontal Scaling Strategy

Scaling Triggers

Metric	Scale Out Threshold	Scale In Threshold	Action
CPU Usage	>70% for 5 min	<30% for 10 min	±1 instance
Memory Usage	>80% for 5 min	<40% for 10 min	±1 instance
Request Latency (p95)	>500ms for 3 min	<200ms for 10 min	+1 instance
Queue Depth (Celery)	>1000 jobs	<100 jobs	±1 worker
DB Connections	>80% pool	<50% pool	Review query optimization

Auto-Scaling Configuration (Docker Swarm)

# docker-compose.prod.yml - Scaling Configuration
version: '3.8'

services:
  backend:
    image: mockupaws/backend:v1.0.0
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G
      labels:
        - "prometheus-job=backend"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    
  nginx:
    image: nginx:alpine
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == manager
    ports:
      - "80:80"
      - "443:443"

Kubernetes HPA Alternative

# k8s/hpa-backend.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

1.4 Database Read Replicas

PostgreSQL Primary-Replica Setup

┌─────────────────────────────────────────────────────────────┐
│                  PostgreSQL Cluster                          │
│                                                              │
│  ┌─────────────────┐                                        │
│  │    Primary      │◄── Read/Write Operations               │
│  │  (postgres-1)   │                                        │
│  │                 │                                        │
│  │  • All writes   │                                        │
│  │  • WAL shipping │───┬────────────────────────┐          │
│  │  • Sync commit  │   │  Streaming Replication │          │
│  └─────────────────┘   │                        │          │
│                        ▼                        ▼          │
│              ┌─────────────────┐  ┌─────────────────┐      │
│              │  Replica 1      │  │  Replica 2      │      │
│              │ (postgres-2)    │  │ (postgres-3)    │      │
│              │                 │  │                 │      │
│              │  • Read-only    │  │  • Read-only    │      │
│              │  • Async replica│  │  • Async replica│      │
│              │  • Hot standby  │  │  • Hot standby  │      │
│              └─────────────────┘  └─────────────────┘      │
│                        │              │                    │
│                        └──────────────┘                    │
│                               │                            │
│                               ▼                            │
│              ┌─────────────────────────────────┐           │
│              │   PgBouncer Connection Pool     │           │
│              │                                 │           │
│              │  Pool Mode: Transaction         │           │
│              │  Max Connections: 1000          │           │
│              │  Default Pool: 25 per db/user   │           │
│              └─────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

Connection Pooling (PgBouncer)

; /etc/pgbouncer/pgbouncer.ini
[databases]
mockupaws = host=postgres-primary port=5432 dbname=mockupaws
mockupaws_replica = host=postgres-replica-1 port=5432 dbname=mockupaws

[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Pool settings
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Timeouts
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 0
query_wait_timeout = 120

; Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 60

Application-Level Read/Write Splitting

# src/core/database.py - Enhanced with read replica support
import os
from contextlib import asynccontextmanager
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession, async_sessionmaker
from sqlalchemy.orm import declarative_base

# Primary (RW) database
PRIMARY_DATABASE_URL = os.getenv(
    "DATABASE_URL", 
    "postgresql+asyncpg://postgres:postgres@localhost:5432/mockupaws"
)

# Replica (RO) databases
REPLICA_DATABASE_URLS = os.getenv(
    "REPLICA_DATABASE_URLS", 
    ""
).split(",") if os.getenv("REPLICA_DATABASE_URLS") else []

# Primary engine (RW)
primary_engine = create_async_engine(
    PRIMARY_DATABASE_URL,
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
    pool_recycle=3600,
)

# Replica engines (RO)
replica_engines = [
    create_async_engine(url, pool_size=5, max_overflow=10, pool_pre_ping=True)
    for url in REPLICA_DATABASE_URLS if url
]

# Session factories
PrimarySessionLocal = async_sessionmaker(primary_engine, class_=AsyncSession)
ReplicaSessionLocal = async_sessionmaker(
    replica_engines[0] if replica_engines else primary_engine, 
    class_=AsyncSession
)

Base = declarative_base()


async def get_db(write: bool = False) -> AsyncSession:
    """Get database session with automatic read/write splitting."""
    if write:
        async with PrimarySessionLocal() as session:
            yield session
    else:
        async with ReplicaSessionLocal() as session:
            yield session


class DatabaseRouter:
    """Route queries to appropriate database based on operation type."""
    
    @staticmethod
    def get_engine(operation: str = "read"):
        """Get appropriate engine for operation."""
        if operation in ("write", "insert", "update", "delete"):
            return primary_engine
        return replica_engines[0] if replica_engines else primary_engine

2. High Availability Design

2.1 Multi-Region Deployment Strategy

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           GLOBAL TRAFFIC MANAGER                             │
│                     (Route53 / Cloudflare Load Balancing)                    │
│                                                                              │
│  Health Checks: /health endpoint every 30s                                   │
│  Failover: Automatic on 3 consecutive failures                              │
│  Latency-based Routing: Route to nearest healthy region                      │
└─────────────────────────────────────────────────────────────────────────────┘
              │                              │
              ▼                              ▼
┌──────────────────────────────┐  ┌──────────────────────────────┐
│      PRIMARY REGION          │  │      STANDBY REGION          │
│      (us-east-1)             │  │      (eu-west-1)             │
│                              │  │                              │
│  ┌────────────────────────┐  │  │  ┌────────────────────────┐  │
│  │   Application Stack    │  │  │  │   Application Stack    │  │
│  │   (3x backend, 2x LB)  │  │  │  │   (2x backend, 2x LB)  │  │
│  └────────────────────────┘  │  │  └────────────────────────┘  │
│                              │  │                              │
│  ┌────────────────────────┐  │  │  ┌────────────────────────┐  │
│  │   PostgreSQL Primary   │──┼──┼──►│   PostgreSQL Replica   │  │
│  │   + 2 Replicas         │  │  │  │   (Hot Standby)        │  │
│  └────────────────────────┘  │  │  └────────────────────────┘  │
│                              │  │                              │
│  ┌────────────────────────┐  │  │  ┌────────────────────────┐  │
│  │   Redis Cluster        │──┼──┼──►│   Redis Replica        │  │
│  │   (3 Masters)          │  │  │  │   (Read-only)          │  │
│  └────────────────────────┘  │  │  └────────────────────────┘  │
│                              │  │                              │
│  ┌────────────────────────┐  │  │  ┌────────────────────────┐  │
│  │   S3 Bucket            │◄─┼──┼──►│   S3 Cross-Region      │  │
│  │   (Primary)            │  │  │  │   Replication          │  │
│  └────────────────────────┘  │  │  └────────────────────────┘  │
└──────────────────────────────┘  └──────────────────────────────┘
              │                              │
              │        ┌──────────────┐      │
              └───────►│   BACKUP     │◄─────┘
                       │   S3 Bucket  │
                       │  (3rd Region)│
                       └──────────────┘

Failover Mechanisms

Database Failover (Automatic)

# scripts/db-failover.py
"""Automated database failover script."""

import asyncio
import asyncpg
from typing import Optional


class DatabaseFailoverManager:
    """Manage PostgreSQL failover."""
    
    async def check_primary_health(self, primary_host: str) -> bool:
        """Check if primary database is healthy."""
        try:
            conn = await asyncpg.connect(
                host=primary_host,
                database="mockupaws",
                user="healthcheck",
                password=os.getenv("DB_HEALTH_PASSWORD"),
                timeout=5
            )
            result = await conn.fetchval("SELECT 1")
            await conn.close()
            return result == 1
        except Exception:
            return False
    
    async def promote_replica(self, replica_host: str) -> bool:
        """Promote replica to primary."""
        # Execute pg_ctl promote on replica
        # Update connection strings in application config
        # Notify application to reconnect
        pass
    
    async def run_failover(self) -> bool:
        """Execute full failover procedure."""
        # 1. Verify primary is truly down (avoid split-brain)
        # 2. Promote best replica to primary
        # 3. Update DNS/load balancer configuration
        # 4. Notify on-call engineers
        # 5. Begin recovery of old primary as new replica
        pass


# Health check endpoint for load balancer
@app.get("/health/db")
async def database_health_check():
    """Deep health check including database connectivity."""
    try:
        # Quick query to verify DB connection
        result = await db.execute("SELECT 1")
        return {"status": "healthy", "database": "connected"}
    except Exception as e:
        raise HTTPException(
            status_code=503,
            detail={"status": "unhealthy", "database": str(e)}
        )

Redis Failover (Redis Sentinel)

# redis-sentinel.conf
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster ${REDIS_PASSWORD}

# Notification
sentinel notification-script mymaster /usr/local/bin/notify.sh

2.2 Circuit Breaker Pattern

# src/core/circuit_breaker.py
"""Circuit breaker pattern implementation."""

import time
from enum import Enum
from functools import wraps
from typing import Callable, Any
import asyncio


class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered


class CircuitBreaker:
    """Circuit breaker for external service calls."""
    
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        half_open_max_calls: int = 3
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = asyncio.Lock()
    
    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Execute function with circuit breaker protection."""
        async with self._lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time >= self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpen(f"Circuit {self.name} is OPEN")
            
            if self.state == CircuitState.HALF_OPEN and self.success_count >= self.half_open_max_calls:
                raise CircuitBreakerOpen(f"Circuit {self.name} HALF_OPEN limit reached")
        
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception as e:
            await self._on_failure()
            raise
    
    async def _on_success(self):
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self._reset()
            else:
                self.failure_count = 0
    
    async def _on_failure(self):
        async with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
    
    def _reset(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None


def circuit_breaker(
    name: str,
    failure_threshold: int = 5,
    recovery_timeout: int = 60
):
    """Decorator for circuit breaker pattern."""
    breaker = CircuitBreaker(name, failure_threshold, recovery_timeout)
    
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await breaker.call(func, *args, **kwargs)
        return wrapper
    return decorator


# Usage example
@circuit_breaker(name="aws_pricing_api", failure_threshold=3, recovery_timeout=30)
async def fetch_aws_pricing(service: str, region: str):
    """Fetch AWS pricing with circuit breaker protection."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://pricing.us-east-1.amazonaws.com/{service}/{region}",
            timeout=10.0
        )
        return response.json()

2.3 Graceful Degradation

# src/core/degradation.py
"""Graceful degradation strategies."""

from functools import wraps
from typing import Optional, Any
import asyncio


class DegradationStrategy:
    """Base class for degradation strategies."""
    
    async def fallback(self, *args, **kwargs) -> Any:
        """Return fallback value when primary fails."""
        raise NotImplementedError


class CacheFallback(DegradationStrategy):
    """Fallback to cached data."""
    
    def __init__(self, cache_key: str, max_age: int = 3600):
        self.cache_key = cache_key
        self.max_age = max_age
    
    async def fallback(self, *args, **kwargs) -> Any:
        # Return stale cache data
        return await redis.get(f"stale:{self.cache_key}")


class StaticFallback(DegradationStrategy):
    """Fallback to static/default data."""
    
    def __init__(self, default_value: Any):
        self.default_value = default_value
    
    async def fallback(self, *args, **kwargs) -> Any:
        return self.default_value


class EmptyFallback(DegradationStrategy):
    """Fallback to empty result."""
    
    async def fallback(self, *args, **kwargs) -> Any:
        return []


def with_degradation(
    strategy: DegradationStrategy,
    timeout: float = 5.0,
    exceptions: tuple = (Exception,)
):
    """Decorator for graceful degradation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await asyncio.wait_for(
                    func(*args, **kwargs),
                    timeout=timeout
                )
            except exceptions as e:
                logger.warning(
                    f"Primary function failed, using fallback: {e}",
                    extra={"function": func.__name__}
                )
                return await strategy.fallback(*args, **kwargs)
        return wrapper
    return decorator


# Usage examples

@with_degradation(
    strategy=CacheFallback(cache_key="aws_pricing", max_age=86400),
    timeout=3.0
)
async def get_aws_pricing(service: str, region: str):
    """Get AWS pricing with cache fallback."""
    # Primary: fetch from AWS API
    pass


@with_degradation(
    strategy=StaticFallback(default_value={"status": "degraded", "metrics": []}),
    timeout=2.0
)
async def get_dashboard_metrics(scenario_id: str):
    """Get metrics with empty fallback on failure."""
    # Primary: fetch from database
    pass


@with_degradation(
    strategy=EmptyFallback(),
    timeout=1.0
)
async def get_recommendations(scenario_id: str):
    """Get recommendations with empty fallback."""
    # Primary: ML-based recommendation engine
    pass

3. Data Architecture

3.1 Database Partitioning Strategy

Time-Based Partitioning for Logs and Metrics

-- Enable pg_partman extension
CREATE EXTENSION IF NOT EXISTS pg_partman;

-- Partitioned scenario_logs table
CREATE TABLE scenario_logs_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) DEFAULT 'unknown',
    size_bytes INTEGER DEFAULT 0,
    has_pii BOOLEAN DEFAULT FALSE,
    token_count INTEGER DEFAULT 0,
    sqs_blocks INTEGER DEFAULT 1,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Create partitions (monthly)
SELECT create_parent('public.scenario_logs_partitioned', 'received_at', 'native', 'monthly');

-- Partitioned scenario_metrics table
CREATE TABLE scenario_metrics_partitioned (
    id UUID DEFAULT gen_random_uuid(),
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15, 6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

SELECT create_parent('public.scenario_metrics_partitioned', 'timestamp', 'native', 'daily');

-- Automated partition maintenance
SELECT partman.run_maintenance('scenario_logs_partitioned');

Tenant Isolation Strategy

-- Row-Level Security for multi-tenant support
ALTER TABLE scenarios ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_logs ENABLE ROW LEVEL SECURITY;
ALTER TABLE scenario_metrics ENABLE ROW LEVEL SECURITY;

-- Add tenant_id column
ALTER TABLE scenarios ADD COLUMN tenant_id UUID NOT NULL DEFAULT '00000000-0000-0000-0000-000000000000';
ALTER TABLE scenario_logs ADD COLUMN tenant_id UUID NOT NULL DEFAULT '00000000-0000-0000-0000-000000000000';

-- Create RLS policies
CREATE POLICY tenant_isolation_scenarios ON scenarios
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

CREATE POLICY tenant_isolation_logs ON scenario_logs
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Set tenant context per session
SET app.current_tenant = 'tenant-uuid-here';

3.2 Data Archive Strategy

Archive Policy

Data Type	Retention Hot	Retention Warm	Archive To	Compression
Scenario Logs	90 days	1 year	S3 Glacier	GZIP
Scenario Metrics	30 days	90 days	S3 Standard-IA	Parquet
Reports	30 days	6 months	S3 Glacier	None (PDF)
Audit Logs	1 year	7 years	S3 Glacier Deep	GZIP

Archive Implementation

# src/services/archive_service.py
"""Data archiving service for old records."""

from datetime import datetime, timedelta
from typing import List
import asyncio
import aioboto3
import gzip
import io


class ArchiveService:
    """Service for archiving old data to S3."""
    
    def __init__(self):
        self.s3_bucket = os.getenv("ARCHIVE_S3_BUCKET")
        self.s3_prefix = os.getenv("ARCHIVE_S3_PREFIX", "archives/")
        self.session = aioboto3.Session()
    
    async def archive_old_logs(self, days: int = 365) -> dict:
        """Archive logs older than specified days."""
        cutoff_date = datetime.utcnow() - timedelta(days=days)
        
        # Query old logs
        query = """
            SELECT * FROM scenario_logs
            WHERE received_at < :cutoff_date
            AND archived = FALSE
            LIMIT 100000
        """
        
        result = await db.execute(query, {"cutoff_date": cutoff_date})
        logs = result.fetchall()
        
        if not logs:
            return {"archived": 0, "bytes": 0}
        
        # Group by month for efficient storage
        logs_by_month = self._group_by_month(logs)
        
        total_archived = 0
        total_bytes = 0
        
        async with self.session.client("s3") as s3:
            for month_key, month_logs in logs_by_month.items():
                # Convert to Parquet/JSON Lines
                data = self._serialize_logs(month_logs)
                compressed = gzip.compress(data.encode())
                
                # Upload to S3
                s3_key = f"{self.s3_prefix}logs/{month_key}.jsonl.gz"
                await s3.put_object(
                    Bucket=self.s3_bucket,
                    Key=s3_key,
                    Body=compressed,
                    StorageClass="GLACIER"
                )
                
                # Mark as archived in database
                await self._mark_archived([log.id for log in month_logs])
                
                total_archived += len(month_logs)
                total_bytes += len(compressed)
        
        return {
            "archived": total_archived,
            "bytes": total_bytes,
            "months": len(logs_by_month)
        }
    
    async def query_archive(
        self,
        scenario_id: str,
        start_date: datetime,
        end_date: datetime
    ) -> List[dict]:
        """Query archived data (transparent to application)."""
        # Determine which months to query
        months = self._get_months_between(start_date, end_date)
        
        # Query hot data from database
        hot_data = await self._query_hot_data(scenario_id, start_date, end_date)
        
        # Query archived data from S3
        archived_data = []
        for month in months:
            if self._is_archived(month):
                data = await self._fetch_from_s3(month)
                archived_data.extend(data)
        
        # Merge and return
        return hot_data + archived_data


# Nightly archive job
async def run_nightly_archive():
    """Run archive process nightly."""
    service = ArchiveService()
    
    # Archive logs > 1 year
    logs_result = await service.archive_old_logs(days=365)
    logger.info(f"Archived {logs_result['archived']} logs")
    
    # Archive metrics > 2 years (aggregate first)
    metrics_result = await service.archive_old_metrics(days=730)
    logger.info(f"Archived {metrics_result['archived']} metrics")
    
    # Compress old reports > 6 months
    reports_result = await service.compress_old_reports(days=180)
    logger.info(f"Compressed {reports_result['compressed']} reports")

Archive Table Schema

-- Archive tracking table
CREATE TABLE archive_metadata (
    id SERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    archive_date TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    date_from DATE NOT NULL,
    date_to DATE NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    s3_bucket VARCHAR(100) NOT NULL,
    record_count INTEGER NOT NULL,
    compressed_size_bytes BIGINT NOT NULL,
    uncompressed_size_bytes BIGINT NOT NULL,
    compression_ratio DECIMAL(5,2),
    verification_hash VARCHAR(64),
    restored BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for archive queries
CREATE INDEX idx_archive_table ON archive_metadata(table_name);
CREATE INDEX idx_archive_dates ON archive_metadata(date_from, date_to);

3.3 CDN Configuration

CloudFront Distribution

# terraform/cdn.tf
resource "aws_cloudfront_distribution" "mockupaws" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"
  price_class         = "PriceClass_100"  # North America and Europe

  # Origin for static assets
  origin {
    domain_name = aws_s3_bucket.static_assets.bucket_regional_domain_name
    origin_id   = "S3-static"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }
  }

  # Origin for API (if caching API responses)
  origin {
    domain_name = aws_lb.main.dns_name
    origin_id   = "ALB-api"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Default cache behavior for static assets
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-static"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 86400      # 1 day
    default_ttl            = 604800     # 1 week
    max_ttl                = 31536000   # 1 year
    compress               = true
  }

  # Cache behavior for API (selective caching)
  ordered_cache_behavior {
    path_pattern     = "/api/v1/pricing/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "ALB-api"

    forwarded_values {
      query_string = true
      headers      = ["Origin", "Access-Control-Request-Headers", "Access-Control-Request-Method"]
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "https-only"
    min_ttl                = 3600       # 1 hour
    default_ttl            = 86400      # 24 hours (AWS pricing changes slowly)
    max_ttl                = 604800     # 7 days
  }

  # Custom error responses for SPA
  custom_error_response {
    error_code         = 403
    response_code      = 200
    response_page_path = "/index.html"
  }

  custom_error_response {
    error_code         = 404
    response_code      = 200
    response_page_path = "/index.html"
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.main.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}

4. Capacity Planning

4.1 Resource Estimates

Base Capacity (1000 Concurrent Users)

Component	Instance Type	Count	vCPU	Memory	Storage
Load Balancer	t3.medium	2	2	4 GB	20 GB
Backend API	t3.large	3	8	12 GB	50 GB
PostgreSQL Primary	r6g.xlarge	1	4	32 GB	500 GB SSD
PostgreSQL Replica	r6g.large	2	2	16 GB	500 GB SSD
Redis	cache.r6g.large	3	2	13 GB	-
PgBouncer	t3.small	2	2	2 GB	20 GB

Scaling Projections

Users	Backend Instances	DB Connections	Redis Memory	Storage/Month
1,000	3	100	10 GB	100 GB
5,000	6	300	25 GB	400 GB
10,000	12	600	50 GB	800 GB
50,000	30	1500	150 GB	3 TB

4.2 Storage Estimates

Data Type	Daily Volume	Monthly Volume	Annual Volume	Compression
Logs	10 GB	300 GB	3.6 TB	70%
Metrics	2 GB	60 GB	720 GB	50%
Reports	1 GB	30 GB	360 GB	0%
Backups	-	500 GB	6 TB	80%
Total	13 GB	~900 GB	~10 TB	-

4.3 Network Bandwidth

Traffic Type	Daily	Monthly	Peak (Gbps)
Ingress (API)	100 GB	3 TB	1 Gbps
Egress (API)	500 GB	15 TB	5 Gbps
CDN (Static)	1 TB	30 TB	10 Gbps

4.4 Cost Estimates (AWS)

Service	Monthly Cost (1K users)	Monthly Cost (10K users)
EC2 (Compute)	$450	$2,000
RDS (PostgreSQL)	$800	$2,500
ElastiCache (Redis)	$400	$1,200
S3 (Storage)	$200	$800
CloudFront (CDN)	$300	$1,500
ALB (Load Balancer)	$100	$200
CloudWatch (Monitoring)	$100	$300
Total	~$2,350	~$8,500

5. Scaling Thresholds & Triggers

5.1 Auto-Scaling Rules

# Scaling policies
scaling_policies:
  backend_scale_out:
    metric: cpu_utilization
    threshold: 70
    duration: 300  # 5 minutes
    adjustment: +1 instance
    cooldown: 300
    
  backend_scale_in:
    metric: cpu_utilization
    threshold: 30
    duration: 600  # 10 minutes
    adjustment: -1 instance
    cooldown: 600
    
  db_connection_scale:
    metric: database_connections
    threshold: 80
    duration: 180
    action: alert_and_review
    
  memory_pressure:
    metric: memory_utilization
    threshold: 85
    duration: 120
    adjustment: +1 instance
    cooldown: 300

5.2 Alert Thresholds

Metric	Warning	Critical	Emergency
CPU Usage	>60%	>80%	>95%
Memory Usage	>70%	>85%	>95%
Disk Usage	>70%	>85%	>95%
Response Time (p95)	>200ms	>500ms	>1000ms
Error Rate	>0.1%	>1%	>5%
DB Connections	>70%	>85%	>95%
Queue Depth	>500	>1000	>5000

6. Component Interactions

6.1 Request Flow

1. Client Request
   └──► CDN (CloudFront)
        └──► Nginx Load Balancer
             └──► Backend API (Round-robin)
                  ├──► FastAPI Route Handler
                  ├──► Authentication (JWT/API Key)
                  ├──► Rate Limiting (Redis)
                  ├──► Caching Check (Redis)
                  ├──► Database Query (PgBouncer → PostgreSQL)
                  ├──► Cache Update (Redis)
                  └──► Response

6.2 Data Flow

1. Log Ingestion
   └──► API Endpoint (/api/v1/ingest)
        ├──► Validation (Pydantic)
        ├──► Rate Limit Check (Redis)
        ├──► PII Detection
        ├──► Token Counting
        ├──► Async DB Write
        └──► Background Metric Update

2. Report Generation
   └──► API Request
        ├──► Queue Job (Celery)
        ├──► Worker Processing
        ├──► Data Aggregation
        ├──► PDF Generation
        ├──► Upload to S3
        └──► Notification

6.3 Failure Scenarios

Failure	Impact	Mitigation
Single backend down	33% capacity	Auto-restart, health check removal
Primary DB down	Read-only mode	Automatic failover to replica
Redis down	No caching	Degrade to DB queries, queue to memory
Nginx down	No traffic	Standby takeover (VIP)
Region down	Full outage	DNS failover to standby region

7. Critical Path for Other Teams

7.1 Dependencies

SPEC-001 (This Document)
    │
    ├──► @db-engineer - DB-001, DB-002, DB-003
    │     (Waiting for: partitioning strategy, connection pooling config)
    │
    ├──► @backend-dev - BE-PERF-004, BE-PERF-005
    │     (Waiting for: Redis config, async optimization guidelines)
    │
    ├──► @devops-engineer - DEV-DEPLOY-013, DEV-INFRA-014
    │     (Waiting for: infrastructure specs, scaling thresholds)
    │
    └──► @qa-engineer - QA-PERF-017
          (Waiting for: capacity targets, performance benchmarks)

7.2 Blocking Items (MUST COMPLETE FIRST)

Load Balancer Configuration → Blocks: DEV-INFRA-014
Database Connection Pool Settings → Blocks: DB-001
Redis Cluster Configuration → Blocks: BE-PERF-004
Scaling Thresholds → Blocks: QA-PERF-017

7.3 Handoff Checklist

Before other teams can proceed:

Architecture diagrams complete
Component specifications defined
Capacity planning estimates provided
Scaling thresholds documented
Configuration templates ready
Review meeting completed (scheduled)
Feedback incorporated
Architecture frozen for v1.0.0

Appendix A: Configuration Templates

Docker Compose Production

# docker-compose.prod.yml
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - backend
    networks:
      - frontend
    deploy:
      replicas: 2
      restart_policy:
        condition: any

  backend:
    image: mockupaws/backend:v1.0.0
    environment:
      - DATABASE_URL=postgresql+asyncpg://app:${DB_PASSWORD}@pgbouncer:6432/mockupaws
      - REPLICA_DATABASE_URLS=${REPLICA_URLS}
      - REDIS_URL=redis://redis-cluster:6379
      - JWT_SECRET_KEY=${JWT_SECRET}
    depends_on:
      - pgbouncer
      - redis-cluster
    networks:
      - frontend
      - backend
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
      update_config:
        parallelism: 1
        delay: 10s

  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      - DATABASES_HOST=postgres-primary
      - DATABASES_PORT=5432
      - DATABASES_DATABASE=mockupaws
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=1000
    networks:
      - backend

  redis-cluster:
    image: redis:7-alpine
    command: redis-server /usr/local/etc/redis/redis.conf
    volumes:
      - ./redis/redis.conf:/usr/local/etc/redis/redis.conf
    networks:
      - backend
    deploy:
      replicas: 3

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    internal: true

Environment Variables Template

# .env.production

# Application
APP_ENV=production
DEBUG=false
LOG_LEVEL=INFO

# Database
DATABASE_URL=postgresql+asyncpg://app:secure_password@pgbouncer:6432/mockupaws
REPLICA_DATABASE_URLS=postgresql+asyncpg://app:secure_password@pgbouncer-replica-1:6432/mockupaws,postgresql+asyncpg://app:secure_password@pgbouncer-replica-2:6432/mockupaws
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=10

# Redis
REDIS_URL=redis://redis-cluster:6379
REDIS_CLUSTER_NODES=redis-1:6379,redis-2:6379,redis-3:6379

# Security
JWT_SECRET_KEY=change_me_in_production_32_chars_min
JWT_ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=30
BCRYPT_ROUNDS=12

# Rate Limiting
RATE_LIMIT_GENERAL=100/minute
RATE_LIMIT_AUTH=5/minute
RATE_LIMIT_INGEST=1000/minute

# AWS/S3
AWS_REGION=us-east-1
S3_BUCKET=mockupaws-production
ARCHIVE_S3_BUCKET=mockupaws-archives
CLOUDFRONT_DOMAIN=cdn.mockupaws.com

# Monitoring
SENTRY_DSN=https://xxx@yyy.ingest.sentry.io/zzz
PROMETHEUS_ENABLED=true
JAEGER_ENDPOINT=http://jaeger:14268/api/traces

Document Version: 1.0.0-Draft
Last Updated: 2026-04-07
Owner: @spec-architect

52 KiB Raw Blame History

Production Architecture Design - mockupAWS v1.0.0

Executive Summary

Key Architectural Decisions

1. Scalability Architecture

1.1 System Overview

1.2 Load Balancer Configuration (Nginx)

1.3 Horizontal Scaling Strategy

Scaling Triggers

Auto-Scaling Configuration (Docker Swarm)

Kubernetes HPA Alternative

1.4 Database Read Replicas

PostgreSQL Primary-Replica Setup

Connection Pooling (PgBouncer)

Application-Level Read/Write Splitting

2. High Availability Design

2.1 Multi-Region Deployment Strategy

Architecture Overview

Failover Mechanisms

2.2 Circuit Breaker Pattern

2.3 Graceful Degradation

3. Data Architecture

3.1 Database Partitioning Strategy

Time-Based Partitioning for Logs and Metrics

Tenant Isolation Strategy

3.2 Data Archive Strategy

Archive Policy

Archive Implementation

Archive Table Schema

3.3 CDN Configuration

CloudFront Distribution

4. Capacity Planning

4.1 Resource Estimates

Base Capacity (1000 Concurrent Users)

Scaling Projections

4.2 Storage Estimates

4.3 Network Bandwidth

4.4 Cost Estimates (AWS)

5. Scaling Thresholds & Triggers

5.1 Auto-Scaling Rules

5.2 Alert Thresholds

6. Component Interactions

6.1 Request Flow

6.2 Data Flow

6.3 Failure Scenarios

7. Critical Path for Other Teams

7.1 Dependencies

7.2 Blocking Items (MUST COMPLETE FIRST)

7.3 Handoff Checklist

Appendix A: Configuration Templates

Docker Compose Production

Environment Variables Template

52 KiB

Raw Blame History