# Backup & Restore Documentation

## mockupAWS v1.0.0 - Database Disaster Recovery Guide

---

## Table of Contents

1. [Overview](#overview)
2. [Recovery Objectives](#recovery-objectives)
3. [Backup Strategy](#backup-strategy)
4. [Restore Procedures](#restore-procedures)
5. [Point-in-Time Recovery (PITR)](#point-in-time-recovery-pitr)
6. [Disaster Recovery Procedures](#disaster-recovery-procedures)
7. [Monitoring & Alerting](#monitoring--alerting)
8. [Troubleshooting](#troubleshooting)

---

## Overview

This document describes the backup, restore, and disaster recovery procedures for the mockupAWS PostgreSQL database.

### Components

- **Automated Backups**: Daily full backups via `pg_dump`
- **WAL Archiving**: Continuous archiving for Point-in-Time Recovery
- **Encryption**: AES-256 encryption for all backups
- **Storage**: S3 with cross-region replication
- **Retention**: 30 days for daily backups, 7 days for WAL archives

---

## Recovery Objectives

| Metric | Target | Description |
|--------|--------|-------------|
| **RTO** | < 1 hour | Time to restore service after failure |
| **RPO** | < 5 minutes | Maximum acceptable data loss |
| **Backup Window** | 02:00-04:00 UTC | Daily backup execution time |
| **Retention** | 30 days | Backup retention period |

---

## Backup Strategy

### Backup Types

#### 1. Full Backups (Daily)

- **Schedule**: Daily at 02:00 UTC
- **Tool**: `pg_dump` with custom format
- **Compression**: gzip level 9
- **Encryption**: AES-256-CBC
- **Retention**: 30 days

#### 2. WAL Archiving (Continuous)

- **Method**: PostgreSQL `archive_command`
- **Frequency**: Every WAL segment (16MB)
- **Storage**: S3 Standard
- **Retention**: 7 days

#### 3. Configuration Backups

- **Files**: `postgresql.conf`, `pg_hba.conf`
- **Schedule**: Weekly
- **Storage**: Version control + S3

### Storage Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Primary Region │────▶│   S3 Standard   │────▶│   S3 Glacier    │
│   (us-east-1)   │     │    (30 days)    │     │   (long-term)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐
│ Secondary Region│
│   (eu-west-1)   │ ← Cross-region replication for DR
└─────────────────┘
```

### Required Environment Variables

```bash
# Required
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BACKUP_BUCKET="mockupaws-backups-prod"
export BACKUP_ENCRYPTION_KEY="your-256-bit-key-here"

# Optional
export BACKUP_REGION="us-east-1"
export BACKUP_SECONDARY_REGION="eu-west-1"
export BACKUP_SECONDARY_BUCKET="mockupaws-backups-dr"
export BACKUP_RETENTION_DAYS="30"
```

---

## Restore Procedures

### Quick Reference

| Scenario | Command | ETA |
|----------|---------|-----|
| Latest full backup | `./scripts/restore.sh latest` | 15-30 min |
| Specific backup | `./scripts/restore.sh s3://bucket/path` | 15-30 min |
| Point-in-Time | `./scripts/restore.sh latest --target-time "..."` | 30-60 min |
| Verify only | `./scripts/restore.sh --verify-only` | 5-10 min |

### Step-by-Step Restore

#### 1. Pre-Restore Checklist

- [ ] Identify target database (should be empty or disposable)
- [ ] Ensure sufficient disk space (2x database size)
- [ ] Verify backup integrity: `./scripts/restore.sh --verify-only`
- [ ] Notify team about maintenance window
- [ ] Document current database state

#### 2. Full Restore from Latest Backup

```bash
# Set environment variables
export DATABASE_URL="postgresql://postgres:password@localhost:5432/mockupaws"
export BACKUP_ENCRYPTION_KEY="your-encryption-key"
export BACKUP_BUCKET="mockupaws-backups-prod"

# Perform restore
./scripts/restore.sh latest
```

#### 3. Restore from Specific Backup

```bash
# From S3
./scripts/restore.sh s3://mockupaws-backups-prod/backups/full/20260407/backup.enc

# From local file
./scripts/restore.sh /path/to/backup/mockupaws_full_20260407_120000.sql.gz.enc
```

#### 4. Post-Restore Verification

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM scenarios;"

# Verify key tables
psql "$DATABASE_URL" -c "\dt"

# Check recent data
psql "$DATABASE_URL" -c "SELECT MAX(created_at) FROM scenario_logs;"
```

---

## Point-in-Time Recovery (PITR)

### Prerequisites

1. **Base Backup**: Physical base backup taken before the target time (WAL cannot be replayed on top of a `pg_dump` archive)
2. **WAL Archives**: All WAL segments from backup time to target time
3. **Configuration**: PostgreSQL configured for archiving

### PostgreSQL Configuration

Add to `postgresql.conf`:

```ini
# WAL Archiving
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://mockupaws-wal-archive/wal/%f'
archive_timeout = 60

# Recovery settings (applied during restore)
recovery_target_time = '2026-04-07 14:30:00 UTC'
recovery_target_action = promote
```

### PITR Procedure

```bash
# Restore to specific point in time
./scripts/restore.sh latest --target-time "2026-04-07 14:30:00"
```

### Manual PITR (Advanced)

```bash
# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clear data directory
sudo rm -rf /var/lib/postgresql/data/*

# 3. Restore base backup
pg_basebackup -h primary -D /var/lib/postgresql/data -Fp -Xs -P

# 4. Create recovery signal
touch /var/lib/postgresql/data/recovery.signal

# 5. Configure recovery
cat >> /var/lib/postgresql/data/postgresql.conf < missing_data.sql

# 4. Restore to current and import missing data
```

### DR Testing Schedule

| Test Type | Frequency | Responsible |
|-----------|-----------|-------------|
| Backup verification | Daily | Automated |
| Restore test (dev) | Weekly | DevOps |
| Full DR drill | Monthly | SRE Team |
| Cross-region failover | Quarterly | Platform Team |

---

## Monitoring & Alerting

### Backup Monitoring

```sql
-- Check backup history
SELECT
    backup_type,
    created_at,
    status,
    EXTRACT(EPOCH FROM (NOW() - created_at)) / 3600 AS hours_since_backup
FROM backup_history
ORDER BY created_at DESC
LIMIT 10;
```

### Prometheus Alerts

```yaml
# backup-alerts.yml
groups:
  - name: backup_alerts
    rules:
      - alert: BackupNotRun
        expr: time() - max(backup_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Database backup has not run in 25 hours"

      - alert: BackupFailed
        expr: increase(backup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database backup failed"

      - alert: LowBackupStorage
        expr: s3_bucket_free_bytes / s3_bucket_total_bytes < 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup storage free capacity < 10%"
```

### Health Checks

```bash
# Check backup status
curl -f http://localhost:8000/health/backup || echo "Backup check failed"

# Check WAL archiving
psql -c "SELECT archived_count, failed_count FROM pg_stat_archiver;"

# Check replication lag (if applicable)
psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```

---

## Troubleshooting

### Common Issues

#### Issue: Backup fails with "disk full"

```bash
# Check disk space
df -h

# Clean old backups
./scripts/backup.sh cleanup

# Or manually remove local backup files older than 7 days
find /path/to/backups -type f -mtime +7 -delete
```

#### Issue: Decryption fails

```bash
# Verify encryption key matches
export BACKUP_ENCRYPTION_KEY="correct-key"

# Test decryption (env: reads the key from the environment, not the command line)
openssl enc -aes-256-cbc -d -pbkdf2 -in backup.enc -out backup.sql -pass env:BACKUP_ENCRYPTION_KEY
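
# Optional sanity check before suspecting the key. This is a sketch, not part
# of the shipped scripts: salted `openssl enc` output starts with the 8-byte
# magic "Salted__", so a file without it was probably truncated, corrupted, or
# written by a different tool. The helper name is hypothetical.
is_openssl_salted() {
  # Compare the first 8 bytes with the OpenSSL salt header; NUL bytes are
  # stripped so the shell can hold the value in a string.
  [ "$(head -c 8 "$1" 2>/dev/null | tr -d '\0')" = "Salted__" ]
}
is_openssl_salted backup.enc || echo "backup.enc does not start with the OpenSSL salt header"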
```

#### Issue: Restore fails with "database in use"

```bash
# Terminate connections
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'mockupaws' AND pid <> pg_backend_pid();"

# Retry restore
./scripts/restore.sh latest
```

#### Issue: S3 upload fails

```bash
# Check AWS credentials
aws sts get-caller-identity

# Test S3 access
aws s3 ls "s3://$BACKUP_BUCKET/"

# Check bucket permissions
aws s3api get-bucket-acl --bucket "$BACKUP_BUCKET"
```

### Log Files

| Log File | Purpose |
|----------|---------|
| `storage/logs/backup_*.log` | Backup execution logs |
| `storage/logs/restore_*.log` | Restore execution logs |
| `/var/log/postgresql/*.log` | PostgreSQL server logs |

### Getting Help

1. Check this documentation
2. Review logs in `storage/logs/`
3. Contact: #database-ops Slack channel
4. Escalate to: on-call SRE (PagerDuty)

---

## Appendix

### A. Backup Retention Policy

| Backup Type | Retention | Storage Class |
|-------------|-----------|---------------|
| Daily Full | 30 days | S3 Standard-IA |
| Weekly Full | 12 weeks | S3 Standard-IA |
| Monthly Full | 12 months | S3 Glacier |
| Yearly Full | 7 years | S3 Glacier Deep Archive |
| WAL Archives | 7 days | S3 Standard |

### B. Backup Encryption

```bash
# Generate encryption key
BACKUP_ENCRYPTION_KEY="$(openssl rand -base64 32)"

# Store the same key in Secrets Manager
aws secretsmanager create-secret \
  --name mockupaws/backup-encryption-key \
  --secret-string "$BACKUP_ENCRYPTION_KEY"
```

### C. Cron Configuration

```bash
# /etc/cron.d/mockupaws-backup

# Daily full backup at 02:00 UTC
0 2 * * * root /opt/mockupaws/scripts/backup.sh full >> /var/log/mockupaws/backup.log 2>&1

# Hourly WAL archive
0 * * * * root /opt/mockupaws/scripts/backup.sh wal >> /var/log/mockupaws/wal.log 2>&1

# Daily cleanup at 04:00 UTC
0 4 * * * root /opt/mockupaws/scripts/backup.sh cleanup >> /var/log/mockupaws/cleanup.log 2>&1
```

---

## Document History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-04-07 | DB Team | Initial release |

---

*For questions or updates to this document, contact the Database Engineering team.*