mockupAWS/docs/BACKUP-RESTORE.md
Luca Sacchi Ricciardi 38fd6cb562
2026-04-07 20:14:51 +02:00


# Backup & Restore Documentation
## mockupAWS v1.0.0 - Database Disaster Recovery Guide
---
## Table of Contents
1. [Overview](#overview)
2. [Recovery Objectives](#recovery-objectives)
3. [Backup Strategy](#backup-strategy)
4. [Restore Procedures](#restore-procedures)
5. [Point-in-Time Recovery (PITR)](#point-in-time-recovery-pitr)
6. [Disaster Recovery Procedures](#disaster-recovery-procedures)
7. [Monitoring & Alerting](#monitoring--alerting)
8. [Troubleshooting](#troubleshooting)
9. [Appendix](#appendix)
10. [Document History](#document-history)
---
## Overview
This document describes the backup, restore, and disaster recovery procedures for the mockupAWS PostgreSQL database.
### Components
- **Automated Backups**: Daily full backups via `pg_dump`
- **WAL Archiving**: Continuous archiving for Point-in-Time Recovery
- **Encryption**: AES-256 encryption for all backups
- **Storage**: S3 with cross-region replication
- **Retention**: 30 days for daily backups, 7 days for WAL archives
---
## Recovery Objectives
| Metric | Target | Description |
|--------|--------|-------------|
| **RTO** | < 1 hour | Time to restore service after failure |
| **RPO** | < 5 minutes | Maximum acceptable data loss |
| **Backup Window** | 02:00-04:00 UTC | Daily backup execution time |
| **Retention** | 30 days | Backup retention period |
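The two time-based targets can be checked mechanically after a restore drill. The following is a minimal sketch, assuming both measurements are available in seconds; the function name `meets_target` and the sample figures are hypothetical and not part of the project's scripts.

```bash
#!/usr/bin/env bash
# Targets from the table above, expressed in seconds.
RTO_SECONDS=3600   # < 1 hour
RPO_SECONDS=300    # < 5 minutes

# meets_target MEASURED LIMIT
# Succeeds when the measured duration is strictly below the limit.
meets_target() {
  local measured=$1 limit=$2
  [ "$measured" -lt "$limit" ]
}

# Example: a drill that took 42 minutes to restore service, where the
# last archived WAL segment was 90 seconds older than the failure.
if meets_target 2520 "$RTO_SECONDS" && meets_target 90 "$RPO_SECONDS"; then
  echo "drill within RTO and RPO"
else
  echo "drill missed a recovery objective"
fi
```

Both comparisons are strict because the targets are stated as "less than", so a restore that takes exactly one hour counts as a miss.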
---
## Backup Strategy
### Backup Types
#### 1. Full Backups (Daily)
- **Schedule**: Daily at 02:00 UTC
- **Tool**: `pg_dump` with custom format
- **Compression**: gzip level 9
- **Encryption**: AES-256-CBC
- **Retention**: 30 days
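
The settings above could be combined into a single streaming pipeline. This is only a sketch of how such a pipeline might look, not the contents of `scripts/backup.sh`: the function names are hypothetical, the object key layout is an assumption modelled on the paths shown elsewhere in this guide, and a plain-format dump is assumed because those file names end in `.sql.gz.enc`.

```bash
#!/usr/bin/env bash
set -o pipefail  # a failure in any pipeline stage fails the whole pipeline

# backup_key DAY TIME -> object key for a full backup (assumed layout).
backup_key() {
  local day=$1 time=$2
  printf 'backups/full/%s/mockupaws_full_%s_%s.sql.gz.enc\n' "$day" "$day" "$time"
}

# encrypt_stream: gzip level 9, then AES-256-CBC with a PBKDF2-derived key.
# The passphrase is read from the exported BACKUP_ENCRYPTION_KEY variable,
# so it does not appear in the process list.
encrypt_stream() {
  gzip -9 | openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_ENCRYPTION_KEY
}

# run_full_backup: dump, compress, encrypt, and upload without temp files.
run_full_backup() {
  local stamp key
  stamp=$(date -u +%Y%m%d_%H%M%S)
  key=$(backup_key "${stamp%_*}" "${stamp#*_}")
  pg_dump "$DATABASE_URL" \
    | encrypt_stream \
    | aws s3 cp - "s3://${BACKUP_BUCKET}/${key}"
}
```

Streaming avoids writing an unencrypted dump to local disk, at the cost of having to rerun the whole dump if the upload fails.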
#### 2. WAL Archiving (Continuous)
- **Method**: PostgreSQL `archive_command`
- **Frequency**: Every WAL segment (16MB)
- **Storage**: S3 Standard
- **Retention**: 7 days
#### 3. Configuration Backups
- **Files**: `postgresql.conf`, `pg_hba.conf`
- **Schedule**: Weekly
- **Storage**: Version control + S3
### Storage Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Primary Region │────▶│ S3 Standard │────▶│ S3 Glacier │
│ (us-east-1) │ │ (30 days) │ │ (long-term) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
┌─────────────────┐
│ Secondary Region│
│ (eu-west-1) │ ← Cross-region replication for DR
└─────────────────┘
```
### Required Environment Variables
```bash
# Required
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BACKUP_BUCKET="mockupaws-backups-prod"
export BACKUP_ENCRYPTION_KEY="your-256-bit-key-here"
# Optional
export BACKUP_REGION="us-east-1"
export BACKUP_SECONDARY_REGION="eu-west-1"
export BACKUP_SECONDARY_BUCKET="mockupaws-backups-dr"
export BACKUP_RETENTION_DAYS="30"
```
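
A script can fail early when a required variable is missing instead of failing midway through a backup. The sketch below is one way to do that, assuming bash; `missing_vars` and `check_backup_env` are hypothetical helper names, and treating the example values above as defaults for the optional variables is an assumption.

```bash
#!/usr/bin/env bash
# missing_vars NAME... -> prints each named variable that is unset or empty.
missing_vars() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# check_backup_env: return non-zero when a required variable is absent,
# otherwise fill in assumed defaults for two optional variables.
check_backup_env() {
  local missing
  missing=$(missing_vars DATABASE_URL BACKUP_BUCKET BACKUP_ENCRYPTION_KEY)
  if [ -n "$missing" ]; then
    echo "Missing required variables:" $missing >&2
    return 1
  fi
  : "${BACKUP_REGION:=us-east-1}"
  : "${BACKUP_RETENTION_DAYS:=30}"
}
```

Calling `check_backup_env || exit 1` at the top of a backup or restore script would stop the run before any data is touched.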
---
## Restore Procedures
### Quick Reference
| Scenario | Command | ETA |
|----------|---------|-----|
| Latest full backup | `./scripts/restore.sh latest` | 15-30 min |
| Specific backup | `./scripts/restore.sh s3://bucket/path` | 15-30 min |
| Point-in-Time | `./scripts/restore.sh latest --target-time "..."` | 30-60 min |
| Verify only | `./scripts/restore.sh <file> --verify-only` | 5-10 min |
### Step-by-Step Restore
#### 1. Pre-Restore Checklist
- [ ] Identify target database (should be empty or disposable)
- [ ] Ensure sufficient disk space (2x database size)
- [ ] Verify backup integrity: `./scripts/restore.sh <backup> --verify-only`
- [ ] Notify team about maintenance window
- [ ] Document current database state
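
The disk space item in the checklist can be automated. This is a minimal sketch, assuming sizes are compared in bytes; `has_restore_space` and `free_bytes_on` are hypothetical helpers, not part of `restore.sh`.

```bash
#!/usr/bin/env bash
# has_restore_space DB_BYTES FREE_BYTES
# Succeeds when free space is at least twice the database size.
has_restore_space() {
  local db_bytes=$1 free_bytes=$2
  [ "$free_bytes" -ge $((db_bytes * 2)) ]
}

# free_bytes_on PATH -> bytes available on the filesystem holding PATH.
free_bytes_on() {
  # -P forces single-line POSIX output; -k reports 1024-byte blocks,
  # with the available space in column 4.
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# Usage (requires a reachable server; the data directory path is the one
# used in the manual PITR procedure of this guide):
#   db_bytes=$(psql "$DATABASE_URL" -Atc "SELECT pg_database_size(current_database());")
#   has_restore_space "$db_bytes" "$(free_bytes_on /var/lib/postgresql/data)" \
#     || echo "Not enough disk space for restore" >&2
```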
#### 2. Full Restore from Latest Backup
```bash
# Set environment variables
export DATABASE_URL="postgresql://postgres:password@localhost:5432/mockupaws"
export BACKUP_ENCRYPTION_KEY="your-encryption-key"
export BACKUP_BUCKET="mockupaws-backups-prod"
# Perform restore
./scripts/restore.sh latest
```
#### 3. Restore from Specific Backup
```bash
# From S3
./scripts/restore.sh s3://mockupaws-backups-prod/backups/full/20260407/backup.enc
# From local file
./scripts/restore.sh /path/to/backup/mockupaws_full_20260407_120000.sql.gz.enc
```
#### 4. Post-Restore Verification
```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM scenarios;"
# Verify key tables
psql "$DATABASE_URL" -c "\dt"
# Check recent data
psql "$DATABASE_URL" -c "SELECT MAX(created_at) FROM scenario_logs;"
```
---
## Point-in-Time Recovery (PITR)
### Prerequisites
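1. **Base Backup**: A physical base backup (for example, one taken with `pg_basebackup`) that completed before the target time. A logical `pg_dump` backup cannot serve as the starting point for WAL replay.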
2. **WAL Archives**: All WAL segments from backup time to target time
3. **Configuration**: PostgreSQL configured for archiving
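
Whether a requested target time is recoverable follows from these prerequisites and the 7-day WAL retention. The sketch below checks that, assuming all times are given as Unix timestamps in seconds; `pitr_target_valid` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# WAL archives are retained for 7 days.
WAL_RETENTION_SECONDS=$((7 * 24 * 3600))

# pitr_target_valid BASE_EPOCH TARGET_EPOCH NOW_EPOCH
# Succeeds when the target time can be reached from the given base backup.
pitr_target_valid() {
  local base=$1 target=$2 now=$3
  # The base backup must have completed before the target time.
  [ "$base" -lt "$target" ] || return 1
  # The target cannot lie in the future.
  [ "$target" -le "$now" ] || return 1
  # Every WAL segment since the base backup must still be retained.
  [ $((now - base)) -le "$WAL_RETENTION_SECONDS" ]
}
```

The last check is deliberately conservative: it requires the whole span from the base backup to now to be inside the retention window, even though only the segments up to the target time are replayed.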
### PostgreSQL Configuration
Add to `postgresql.conf`:
```ini
# WAL Archiving
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://mockupaws-wal-archive/wal/%f'
archive_timeout = 60
# Recovery settings: set only for the duration of a restore, then remove
restore_command = 'aws s3 cp s3://mockupaws-wal-archive/wal/%f %p'
recovery_target_time = '2026-04-07 14:30:00 UTC'
recovery_target_action = promote
```
### PITR Procedure
```bash
# Restore to specific point in time
./scripts/restore.sh latest --target-time "2026-04-07 14:30:00"
```
### Manual PITR (Advanced)
```bash
# 1. Stop PostgreSQL
sudo systemctl stop postgresql
# 2. Clear data directory
sudo rm -rf /var/lib/postgresql/data/*
# 3. Restore a base backup taken BEFORE the target time
#    (placeholder path; a tar-format archive from `pg_basebackup -Ft -z` is assumed)
sudo -u postgres tar -xzf /path/to/base_backup/base.tar.gz -C /var/lib/postgresql/data
# 4. Create recovery signal
sudo -u postgres touch /var/lib/postgresql/data/recovery.signal
# 5. Configure recovery
sudo -u postgres tee -a /var/lib/postgresql/data/postgresql.conf > /dev/null <<EOF
restore_command = 'aws s3 cp s3://mockupaws-wal-archive/wal/%f %p'
recovery_target_time = '2026-04-07 14:30:00 UTC'
recovery_target_action = promote
EOF
# 6. Start PostgreSQL
sudo systemctl start postgresql
# 7. Monitor recovery (pg_is_in_recovery() returns false once promotion completes)
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
```
---
## Disaster Recovery Procedures
### DR Scenarios
#### Scenario 1: Database Corruption
```bash
# 1. Isolate corrupted database
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'mockupaws' AND pid <> pg_backend_pid();"
# 2. Restore from latest backup
./scripts/restore.sh latest
# 3. Verify data integrity
./scripts/verify-data.sh
# 4. Resume application traffic
```
#### Scenario 2: Complete Region Failure
```bash
# 1. Activate DR region
export BACKUP_BUCKET="mockupaws-backups-dr"
export AWS_REGION="eu-west-1"
# 2. Restore to DR database
./scripts/restore.sh latest
# 3. Update DNS/application configuration
# Point to DR region database endpoint
# 4. Verify application functionality
```
#### Scenario 3: Accidental Data Deletion
```bash
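# 1. Identify deletion timestamp (from logs)
DELETION_TIME="2026-04-07 15:23:00"
# 2. Restore to a point just before the deletion (here, one minute earlier),
#    with DATABASE_URL pointing at a scratch database rather than production
TARGET_TIME="2026-04-07 15:22:00"
./scripts/restore.sh latest --target-time "$TARGET_TIME"
# 3. Export missing data from the restored copy
pg_dump --data-only --table=deleted_table "$DATABASE_URL" > missing_data.sql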
# 4. Restore to current and import missing data
```
### DR Testing Schedule
| Test Type | Frequency | Responsible |
|-----------|-----------|-------------|
| Backup verification | Daily | Automated |
| Restore test (dev) | Weekly | DevOps |
| Full DR drill | Monthly | SRE Team |
| Cross-region failover | Quarterly | Platform Team |
---
## Monitoring & Alerting
### Backup Monitoring
```sql
-- Check backup history
SELECT
    backup_type,
    created_at,
    status,
    EXTRACT(EPOCH FROM (NOW() - created_at)) / 3600 AS hours_since_backup
FROM backup_history
ORDER BY created_at DESC
LIMIT 10;
```
### Prometheus Alerts
```yaml
# backup-alerts.yml
groups:
  - name: backup_alerts
    rules:
      - alert: BackupNotRun
        expr: time() - max(backup_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Database backup has not run in 25 hours"
      - alert: BackupFailed
        expr: increase(backup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database backup failed"
      - alert: LowBackupStorage
        expr: s3_bucket_free_bytes / s3_bucket_total_bytes < 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup storage capacity < 10%"
```
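
The `BackupNotRun` threshold can also be checked from a shell script, for example as a fallback when Prometheus itself is unavailable. This is a sketch under that assumption; `backup_overdue` is a hypothetical helper, and how the last success timestamp is obtained is left open.

```bash
#!/usr/bin/env bash
# 90000 seconds = 25 hours, the same threshold as the BackupNotRun rule.
BACKUP_MAX_AGE_SECONDS=90000

# backup_overdue LAST_SUCCESS_EPOCH [NOW_EPOCH]
# Succeeds when the last successful backup is older than the threshold.
# NOW_EPOCH defaults to the current time.
backup_overdue() {
  local last_success=$1 now=${2:-$(date +%s)}
  [ $((now - last_success)) -gt "$BACKUP_MAX_AGE_SECONDS" ]
}

# Usage:
#   if backup_overdue "$last_success_epoch"; then
#     echo "No successful backup in the last 25 hours" >&2
#   fi
```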
### Health Checks
```bash
# Check backup status
curl -f http://localhost:8000/health/backup || echo "Backup check failed"
# Check WAL archiving
psql -c "SELECT archived_count, failed_count FROM pg_stat_archiver;"
# Check replication lag (if applicable)
psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```
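
Because `failed_count` in `pg_stat_archiver` is cumulative, a non-zero value does not by itself mean archiving is failing now. Comparing `last_failed_time` with `last_archived_time` gives a clearer signal. The sketch below does that; `archiver_failing` and `ARCHIVER_SQL` are hypothetical names.

```bash
#!/usr/bin/env bash
# archiver_failing "LAST_ARCHIVED_EPOCH|LAST_FAILED_EPOCH"
# Succeeds when the most recent failure is newer than the most recent
# successful archive, i.e. archiving is failing right now.
archiver_failing() {
  local last_archived=${1%%|*} last_failed=${1##*|}
  [ "$last_failed" -gt "$last_archived" ]
}

# Query that produces the input. With `psql -At` the two columns are
# printed as "archived|failed"; NULL timestamps are mapped to 0.
ARCHIVER_SQL="SELECT COALESCE(EXTRACT(EPOCH FROM last_archived_time), 0)::bigint,
                     COALESCE(EXTRACT(EPOCH FROM last_failed_time), 0)::bigint
              FROM pg_stat_archiver;"

# Usage (requires a reachable server):
#   if archiver_failing "$(psql -Atc "$ARCHIVER_SQL")"; then
#     echo "WAL archiving is failing" >&2
#   fi
```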
---
## Troubleshooting
### Common Issues
#### Issue: Backup fails with "disk full"
```bash
# Check disk space
df -h
# Clean old backups
./scripts/backup.sh cleanup
# Or manually remove old local backups
find /path/to/backups -type f -mtime +7 -delete
```
#### Issue: Decryption fails
```bash
# Verify encryption key matches
export BACKUP_ENCRYPTION_KEY="correct-key"
# Test decryption (the output is still gzip-compressed; the key is read from the environment)
openssl enc -aes-256-cbc -d -pbkdf2 -in backup.enc -out backup.sql.gz -pass env:BACKUP_ENCRYPTION_KEY
```
#### Issue: Restore fails with "database in use"
```bash
# Terminate connections
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'mockupaws' AND pid <> pg_backend_pid();"
# Retry restore
./scripts/restore.sh latest
```
#### Issue: S3 upload fails
```bash
# Check AWS credentials
aws sts get-caller-identity
# Test S3 access
aws s3 ls s3://$BACKUP_BUCKET/
# Check bucket permissions
aws s3api get-bucket-acl --bucket $BACKUP_BUCKET
```
### Log Files
| Log File | Purpose |
|----------|---------|
| `storage/logs/backup_*.log` | Backup execution logs |
| `storage/logs/restore_*.log` | Restore execution logs |
| `/var/log/postgresql/*.log` | PostgreSQL server logs |
### Getting Help
1. Check this documentation
2. Review logs in `storage/logs/`
3. Contact: #database-ops Slack channel
4. Escalate to: on-call SRE (PagerDuty)
---
## Appendix
### A. Backup Retention Policy
| Backup Type | Retention | Storage Class |
|-------------|-----------|---------------|
| Daily Full | 30 days | S3 Standard-IA |
| Weekly Full | 12 weeks | S3 Standard-IA |
| Monthly Full | 12 months | S3 Glacier |
| Yearly Full | 7 years | S3 Glacier Deep Archive |
| WAL Archives | 7 days | S3 Standard |
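
A cleanup job could encode this policy as a lookup. The sketch below is an illustration only: `retention_days` and `backup_expired` are hypothetical helpers, and the day counts for the monthly and yearly tiers are approximations (12 months taken as 365 days, 7 years as 7 × 365 days).

```bash
#!/usr/bin/env bash
# retention_days TYPE -> retention period in days for a backup type.
retention_days() {
  case $1 in
    daily)   echo 30 ;;
    weekly)  echo $((12 * 7)) ;;
    monthly) echo 365 ;;
    yearly)  echo $((7 * 365)) ;;
    wal)     echo 7 ;;
    *)       echo "unknown backup type: $1" >&2; return 1 ;;
  esac
}

# backup_expired TYPE AGE_DAYS
# Succeeds when a backup of the given type and age is past its retention.
# Returns 2 for an unknown backup type.
backup_expired() {
  local limit
  limit=$(retention_days "$1") || return 2
  [ "$2" -gt "$limit" ]
}
```

In practice S3 lifecycle rules are a common way to enforce retention and storage-class transitions; a script like this is mainly useful for local copies or for auditing.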
### B. Backup Encryption
```bash
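# Generate a 256-bit key and store it directly in Secrets Manager
# (the key is never printed or written to disk)
aws secretsmanager create-secret \
--name mockupaws/backup-encryption-key \
--secret-string "$(openssl rand -base64 32)"
# Load the stored key before running a backup or restore
export BACKUP_ENCRYPTION_KEY="$(aws secretsmanager get-secret-value \
--secret-id mockupaws/backup-encryption-key \
--query SecretString --output text)"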
```
### C. Cron Configuration
```bash
# /etc/cron.d/mockupaws-backup
# Daily full backup at 02:00 UTC
0 2 * * * root /opt/mockupaws/scripts/backup.sh full >> /var/log/mockupaws/backup.log 2>&1
# Hourly WAL archive
0 * * * * root /opt/mockupaws/scripts/backup.sh wal >> /var/log/mockupaws/wal.log 2>&1
# Daily cleanup
0 4 * * * root /opt/mockupaws/scripts/backup.sh cleanup >> /var/log/mockupaws/cleanup.log 2>&1
```
---
## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-04-07 | DB Team | Initial release |
---
*For questions or updates to this document, contact the Database Engineering team.*