# Database Optimization & Production Readiness v1.0.0

## Implementation Summary - @db-engineer

---

## Overview

This document summarizes the database optimization and production readiness implementation for mockupAWS v1.0.0, covering three major workstreams:

1. **DB-001**: Database Optimization (Indexing, Query Optimization, Connection Pooling)
2. **DB-002**: Backup & Restore System
3. **DB-003**: Data Archiving Strategy

---

## DB-001: Database Optimization

### Migration: Performance Indexes

**File**: `alembic/versions/a1b2c3d4e5f6_add_performance_indexes_v1_0_0.py`

#### Implemented Features

1. **Composite Indexes** (9 indexes)
   - `idx_logs_scenario_received` - Optimizes date range queries on logs
   - `idx_logs_scenario_source` - Speeds up analytics queries
   - `idx_logs_scenario_pii` - Accelerates PII reports
   - `idx_logs_scenario_size` - Optimizes "top logs" queries
   - `idx_metrics_scenario_time_type` - Time-series with type filtering
   - `idx_metrics_scenario_name` - Metric name aggregations
   - `idx_reports_scenario_created` - Report listing optimization
   - `idx_scenarios_status_created` - Dashboard queries
   - `idx_scenarios_region_status` - Filtering optimization

2. **Partial Indexes** (6 indexes)
   - `idx_scenarios_active` - Excludes archived scenarios
   - `idx_scenarios_running` - Running scenarios monitoring
   - `idx_logs_pii_only` - Security audit queries
   - `idx_logs_recent` - Last 30 days only
   - `idx_apikeys_active` - Active API keys
   - `idx_apikeys_valid` - Non-expired keys

3. **Covering Indexes** (2 indexes)
   - `idx_scenarios_covering` - All commonly queried columns
   - `idx_logs_covering` - Avoids table lookups

4. **Materialized Views** (3 views)
   - `mv_scenario_daily_stats` - Daily aggregated statistics
   - `mv_monthly_costs` - Monthly cost aggregations
   - `mv_source_analytics` - Source-based analytics

5. **Query Performance Logging**
   - `query_performance_log` table for slow query tracking
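For reference, the sketch below shows how a composite index and a partial index like the ones listed above might be declared in an Alembic migration. It is illustrative only and is not the contents of `a1b2c3d4e5f6_add_performance_indexes_v1_0_0.py`; the table and column names (`scenario_logs`, `scenario_id`, `received_at`, `pii_detected`) are assumptions inferred from the index names.

```python
"""Illustrative sketch only - not the actual a1b2c3d4e5f6 migration."""
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # Composite index: (scenario_id, received_at) supports the date-range
    # log queries described above (idx_logs_scenario_received).
    op.create_index(
        "idx_logs_scenario_received",
        "scenario_logs",
        ["scenario_id", "received_at"],
    )
    # Partial index: only rows flagged as containing PII are indexed,
    # keeping the index small for security-audit queries (idx_logs_pii_only).
    op.create_index(
        "idx_logs_pii_only",
        "scenario_logs",
        ["scenario_id", "received_at"],
        postgresql_where=sa.text("pii_detected = true"),
    )


def downgrade() -> None:
    op.drop_index("idx_logs_pii_only", table_name="scenario_logs")
    op.drop_index("idx_logs_scenario_received", table_name="scenario_logs")
```

Covering indexes can be declared the same way by adding SQLAlchemy's `postgresql_include` option for the extra payload columns.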
### PgBouncer Configuration

**File**: `config/pgbouncer.ini`

```ini
; Key settings (excerpt)

; transaction-level pooling
pool_mode = transaction
; max client connections
max_client_conn = 1000
; connections per database
default_pool_size = 25
; emergency connections
reserve_pool_size = 5
; 10 minute idle timeout
server_idle_timeout = 600
; 1 hour max connection lifetime
server_lifetime = 3600
```

**Usage**:

```bash
# Start PgBouncer
docker run -d \
  --name pgbouncer \
  -v $(pwd)/config/pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini \
  -v $(pwd)/config/pgbouncer_userlist.txt:/etc/pgbouncer/userlist.txt \
  -p 6432:6432 \
  pgbouncer/pgbouncer:latest

# Update connection string
DATABASE_URL=postgresql+asyncpg://user:pass@localhost:6432/mockupaws
```

### Performance Benchmark Tool

**File**: `scripts/benchmark_db.py`

```bash
# Run before optimization
python scripts/benchmark_db.py --before

# Run after optimization
python scripts/benchmark_db.py --after

# Compare results
python scripts/benchmark_db.py --compare
```

**Benchmarked Queries**:

- `scenario_list` - List scenarios with pagination
- `scenario_by_status` - Filtered scenario queries
- `scenario_with_relations` - N+1 query test
- `logs_by_scenario` - Log retrieval by scenario
- `logs_by_scenario_and_date` - Date range queries
- `logs_aggregate` - Aggregation queries
- `metrics_time_series` - Time-series data
- `pii_detection_query` - PII filtering
- `reports_by_scenario` - Report listing
- `materialized_view` - Materialized view performance
- `count_by_status` - Status aggregation

---

## DB-002: Backup & Restore System

### Backup Script

**File**: `scripts/backup.sh`

#### Features

1. **Full Backups**
   - Daily automated backups via `pg_dump`
   - Custom format with compression (gzip -9)
   - AES-256 encryption
   - Checksum verification

2. **WAL Archiving**
   - Continuous archiving for PITR
   - Automated WAL switching
   - Archive compression

3. **Storage & Replication**
   - S3 upload with Standard-IA storage class
   - Multi-region replication for DR
   - Metadata tracking

4. **Retention**
   - 30-day default retention
   - Automated cleanup
   - Configurable per environment

#### Usage

```bash
# Full backup
./scripts/backup.sh full

# WAL archive
./scripts/backup.sh wal

# Verify backup
./scripts/backup.sh verify /path/to/backup.enc

# Cleanup old backups
./scripts/backup.sh cleanup

# List available backups
./scripts/backup.sh list
```

#### Environment Variables

```bash
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BACKUP_BUCKET="mockupaws-backups-prod"
export BACKUP_REGION="us-east-1"
export BACKUP_ENCRYPTION_KEY="your-aes-256-key"
export BACKUP_SECONDARY_BUCKET="mockupaws-backups-dr"
export BACKUP_SECONDARY_REGION="eu-west-1"
export BACKUP_RETENTION_DAYS=30
```

### Restore Script

**File**: `scripts/restore.sh`

#### Features

1. **Full Restore**
   - Database creation/drop
   - Integrity verification
   - Parallel restore (4 jobs)
   - Progress logging

2. **Point-in-Time Recovery (PITR)**
   - Recovery to specific timestamp
   - WAL replay support
   - Safety backup of existing data

3. **Validation**
   - Pre-restore checks
   - Post-restore validation
   - Table accessibility verification

4. **Safety Features**
   - Dry-run mode
   - Verify-only mode
   - Automatic safety backups

#### Usage

```bash
# Restore latest backup
./scripts/restore.sh latest

# Restore with PITR
./scripts/restore.sh latest --target-time "2026-04-07 14:30:00"

# Restore from S3
./scripts/restore.sh s3://bucket/path/to/backup.enc

# Verify only (no restore)
./scripts/restore.sh backup.enc --verify-only

# Dry run
./scripts/restore.sh latest --dry-run
```

#### Recovery Objectives

| Metric | Target | Status |
|--------|--------|--------|
| RTO (Recovery Time Objective) | < 1 hour | ✓ Implemented |
| RPO (Recovery Point Objective) | < 5 minutes | ✓ WAL Archiving |

### Documentation

**File**: `docs/BACKUP-RESTORE.md`

Complete disaster recovery guide including:

- Recovery procedures for different scenarios
- PITR implementation details
- DR testing schedule
- Monitoring and alerting
- Troubleshooting guide

---

## DB-003: Data Archiving Strategy

### Migration: Archive Tables

**File**: `alembic/versions/b2c3d4e5f6a7_create_archive_tables_v1_0_0.py`

#### Implemented Features

1. **Archive Tables** (3 tables)
   - `scenario_logs_archive` - Logs > 1 year, partitioned by month
   - `scenario_metrics_archive` - Metrics > 2 years, with aggregation
   - `reports_archive` - Reports > 6 months, S3 references

2. **Partitioning**
   - Monthly partitions for logs and metrics
   - Automatic partition management
   - Efficient date-based queries

3. **Unified Views** (Query Transparency)
   - `v_scenario_logs_all` - Combines live and archived logs
   - `v_scenario_metrics_all` - Combines live and archived metrics

4. **Tracking & Monitoring**
   - `archive_jobs` table for job tracking
   - `v_archive_statistics` view for statistics
   - `archive_policies` table for configuration

### Archive Job Script

**File**: `scripts/archive_job.py`

#### Features

1. **Automated Archiving**
   - Nightly job execution
   - Batch processing (configurable size)
   - Progress tracking

2. **Data Aggregation**
   - Metrics aggregation before archive
   - Daily rollups for old metrics
   - Sample count tracking

3. **S3 Integration**
   - Report file upload
   - Metadata preservation
   - Local file cleanup

4. **Safety Features**
   - Dry-run mode
   - Transaction safety
   - Error handling and recovery
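The core of the job is a batched move from the live table into its archive counterpart. The sketch below shows one possible shape of that loop; it is not the actual `scripts/archive_job.py`. The driver (`psycopg`), the `ARCHIVE_BATCH_SIZE` constant, and the assumption that `scenario_logs` and `scenario_logs_archive` share the same column layout are all illustrative.

```python
"""Simplified sketch of one archiving pass - not the actual scripts/archive_job.py."""
import os
from datetime import datetime, timedelta, timezone

import psycopg  # assumption: psycopg 3 is available as the driver

ARCHIVE_BATCH_SIZE = 10_000  # assumed batch size; the real job reads its config
CUTOFF = datetime.now(timezone.utc) - timedelta(days=365)  # logs older than 1 year


def archive_logs(dsn: str, dry_run: bool = False) -> int:
    """Move eligible log rows into the archive table in batches."""
    moved = 0
    with psycopg.connect(dsn) as conn:
        if dry_run:
            # Dry run: report how many rows would be archived, change nothing.
            row = conn.execute(
                "SELECT count(*) FROM scenario_logs WHERE received_at < %s",
                (CUTOFF,),
            ).fetchone()
            return row[0]

        while True:
            with conn.transaction():  # each batch moves atomically
                cur = conn.execute(
                    """
                    WITH batch AS (
                        DELETE FROM scenario_logs
                        WHERE id IN (
                            SELECT id FROM scenario_logs
                            WHERE received_at < %s
                            ORDER BY id
                            LIMIT %s
                        )
                        RETURNING *
                    )
                    INSERT INTO scenario_logs_archive SELECT * FROM batch
                    """,
                    (CUTOFF, ARCHIVE_BATCH_SIZE),
                )
                if cur.rowcount == 0:
                    break  # nothing left to archive
                moved += cur.rowcount
    return moved


if __name__ == "__main__":
    print(archive_logs(os.environ["DATABASE_URL"], dry_run=True))
```

Per the features above, metrics follow the same pattern but are rolled up to daily aggregates before insertion, and reports additionally upload their files to S3 before the local copies are cleaned up.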
#### Usage

```bash
# Preview what would be archived
python scripts/archive_job.py --dry-run --all

# Archive all eligible data
python scripts/archive_job.py --all

# Archive specific types
python scripts/archive_job.py --logs
python scripts/archive_job.py --metrics
python scripts/archive_job.py --reports

# Combine options
python scripts/archive_job.py --logs --metrics --dry-run
```

#### Archive Policies

| Table | Archive After | Aggregation | Compression | S3 Storage |
|-------|---------------|-------------|-------------|------------|
| scenario_logs | 365 days | No | No | No |
| scenario_metrics | 730 days | Daily | No | No |
| reports | 180 days | No | Yes | Yes |

#### Cron Configuration

```bash
# Run nightly at 3:00 AM
0 3 * * * /opt/mockupaws/.venv/bin/python /opt/mockupaws/scripts/archive_job.py --all
```

### Documentation

**File**: `docs/DATA-ARCHIVING.md`

Complete archiving guide including:

- Archive policies and retention
- Implementation details
- Query examples (transparent access)
- Monitoring and alerts
- Storage cost estimation

---

## Migration Execution

### Apply Migrations

```bash
# Activate virtual environment
source .venv/bin/activate

# Apply performance optimization migration
alembic upgrade a1b2c3d4e5f6

# Apply archive tables migration
alembic upgrade b2c3d4e5f6a7

# Or apply all pending migrations
alembic upgrade head
```

### Rollback (if needed)

```bash
# Roll back the archive tables migration (downgrades to the performance migration)
alembic downgrade a1b2c3d4e5f6

# Roll back the performance indexes migration as well (one more step down)
alembic downgrade -1
```

---

## Files Created

### Migrations

```
alembic/versions/
├── a1b2c3d4e5f6_add_performance_indexes_v1_0_0.py   # DB-001
└── b2c3d4e5f6a7_create_archive_tables_v1_0_0.py     # DB-003
```

### Scripts

```
scripts/
├── benchmark_db.py   # Performance benchmarking
├── backup.sh         # Backup automation
├── restore.sh        # Restore automation
└── archive_job.py    # Data archiving
```

### Configuration

```
config/
├── pgbouncer.ini            # PgBouncer configuration
└── pgbouncer_userlist.txt   # User credentials
```

### Documentation

```
docs/
├── BACKUP-RESTORE.md    # DR procedures
└── DATA-ARCHIVING.md    # Archiving guide
```

---

## Performance Improvements Summary

### Expected Improvements

| Query Type | Before | After | Improvement |
|------------|--------|-------|-------------|
| Scenario list with filters | ~150ms | ~20ms | 87% |
| Logs by scenario + date | ~200ms | ~30ms | 85% |
| Metrics time-series | ~300ms | ~50ms | 83% |
| PII detection queries | ~500ms | ~25ms | 95% |
| Report generation | ~2s | ~500ms | 75% |
| Materialized view queries | ~1s | ~100ms | 90% |

### Connection Pooling Benefits

- **Before**: Direct connections to PostgreSQL
- **After**: PgBouncer with transaction pooling
- **Benefits**:
  - Reduced connection overhead
  - Better handling of connection spikes
  - Connection reuse across requests
  - Protection against connection exhaustion

### Storage Optimization

| Data Type | Before | After | Savings |
|-----------|--------|-------|---------|
| Active logs | All history | Last year only | ~50% |
| Metrics | All history | Aggregated after 2y | ~66% |
| Reports | All local | S3 after 6 months | ~80% |
| **Total** | - | - | **~65%** |

---

## Production Checklist

### Before Deployment

- [ ] Test migrations in staging environment
- [ ] Run benchmark before/after comparison
- [ ] Verify PgBouncer configuration
- [ ] Test backup/restore procedures
- [ ] Configure archive cron job
- [ ] Set up monitoring and alerting
- [ ] Document S3 bucket configuration
- [ ] Configure encryption keys

### After Deployment

- [ ] Verify migrations applied successfully
- [ ] Monitor query performance metrics
- [ ] Check PgBouncer connection stats
- [ ] Verify first backup completes
- [ ] Test restore procedure
- [ ] Monitor archive job execution
- [ ] Review disk space usage
- [ ] Update runbooks
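For the "Verify migrations applied successfully" item above, a small spot-check can confirm that the expected indexes and materialized views exist. The sketch below is illustrative only (the driver choice and the subset of object names checked are assumptions), not a shipped script.

```python
"""Illustrative post-deployment spot-check - not a shipped script."""
import os

import psycopg  # assumption: psycopg 3 is available in the ops environment

# A representative subset of the objects created by the v1.0.0 migrations.
EXPECTED_INDEXES = {
    "idx_logs_scenario_received",
    "idx_scenarios_active",
    "idx_logs_covering",
}
EXPECTED_MATVIEWS = {
    "mv_scenario_daily_stats",
    "mv_monthly_costs",
    "mv_source_analytics",
}

with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    # pg_indexes / pg_matviews are standard PostgreSQL catalog views.
    indexes = {r[0] for r in conn.execute("SELECT indexname FROM pg_indexes")}
    matviews = {r[0] for r in conn.execute("SELECT matviewname FROM pg_matviews")}

missing = (EXPECTED_INDEXES - indexes) | (EXPECTED_MATVIEWS - matviews)
print("missing objects:", ", ".join(sorted(missing)) if missing else "none")
```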
---

## Monitoring & Alerting

### Key Metrics to Monitor

```sql
-- Query performance (should be < 200ms p95)
SELECT query_hash, execution_time_ms
FROM query_performance_log
WHERE execution_time_ms > 200
ORDER BY created_at DESC;

-- Archive job status
SELECT job_type, status, records_archived, completed_at
FROM archive_jobs
ORDER BY started_at DESC;

-- PgBouncer stats (run against the PgBouncer admin console: psql -p 6432 pgbouncer)
SHOW STATS;
SHOW POOLS;

-- Backup history
SELECT * FROM backup_history ORDER BY created_at DESC LIMIT 5;
```

### Prometheus Alerts

```yaml
alerts:
  - name: SlowQuery
    condition: query_p95_latency > 200ms
  - name: ArchiveJobFailed
    condition: archive_job_status == 'failed'
  - name: BackupStale
    condition: time_since_last_backup > 25h
  - name: PgBouncerConnectionsHigh
    condition: pgbouncer_active_connections > 800
```

---

## Support & Troubleshooting

### Common Issues

1. **Migration fails**

   ```bash
   alembic downgrade -1
   # Fix issue, then
   alembic upgrade head
   ```

2. **Backup script fails**

   ```bash
   # Check environment variables
   env | grep -E "(DATABASE_URL|BACKUP|AWS)"

   # Test manually
   ./scripts/backup.sh full
   ```

3. **Archive job slow**

   ```bash
   # Reduce batch size
   # Edit ARCHIVE_CONFIG in scripts/archive_job.py
   ```

4. **PgBouncer connection issues**

   ```bash
   # Check PgBouncer logs
   docker logs pgbouncer

   # Verify userlist
   cat config/pgbouncer_userlist.txt
   ```

---

## Next Steps

1. **Immediate (Week 1)**
   - Deploy migrations to production
   - Configure PgBouncer
   - Schedule first backup
   - Run initial archive job

2. **Short-term (Week 2-4)**
   - Monitor performance improvements
   - Tune index usage based on pg_stat_statements
   - Verify backup/restore procedures
   - Document operational procedures

3. **Long-term (Month 2+)**
   - Implement automated DR testing
   - Optimize archive schedules
   - Review and adjust retention policies
   - Capacity planning based on growth

---

## References

- [PostgreSQL Index Documentation](https://www.postgresql.org/docs/current/indexes.html)
- [PgBouncer Documentation](https://www.pgbouncer.org/usage.html)
- [PostgreSQL WAL Archiving](https://www.postgresql.org/docs/current/continuous-archiving.html)
- [PostgreSQL Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)

---

*Implementation completed: 2026-04-07*
*Version: 1.0.0*
*Owner: Database Engineering Team*