mockupAWS/docs/BACKUP-RESTORE.md
Luca Sacchi Ricciardi 38fd6cb562
2026-04-07 20:14:51 +02:00


Backup & Restore Documentation

mockupAWS v1.0.0 - Database Disaster Recovery Guide


Table of Contents

  1. Overview
  2. Recovery Objectives
  3. Backup Strategy
  4. Restore Procedures
  5. Point-in-Time Recovery (PITR)
  6. Disaster Recovery Procedures
  7. Monitoring & Alerting
  8. Troubleshooting

Overview

This document describes the backup, restore, and disaster recovery procedures for the mockupAWS PostgreSQL database.

Components

  • Automated Backups: Daily full backups via pg_dump
  • WAL Archiving: Continuous archiving for Point-in-Time Recovery
  • Encryption: AES-256 encryption for all backups
  • Storage: S3 with cross-region replication
  • Retention: 30 days for daily backups, 7 days for WAL archives

Recovery Objectives

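| Metric        | Target          | Description                           |
|---------------|-----------------|---------------------------------------|
| RTO           | < 1 hour        | Time to restore service after failure |
| RPO           | < 5 minutes     | Maximum data loss acceptable          |
| Backup Window | 02:00-04:00 UTC | Daily backup execution time           |
| Retention     | 30 days         | Backup retention period               |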

Backup Strategy

Backup Types

1. Full Backups (Daily)

  • Schedule: Daily at 02:00 UTC
  • Tool: pg_dump with custom format
  • Compression: gzip level 9
  • Encryption: AES-256-CBC
  • Retention: 30 days
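
The settings above could be wired together roughly as follows. This is a minimal sketch, not the project's actual scripts/backup.sh: the function names backup_object_key and run_full_backup are hypothetical, the object key layout is inferred from the example paths used elsewhere in this guide, and it assumes pg_dump, openssl, and the AWS CLI are on the PATH.

```bash
# Hypothetical sketch; not the project's real scripts/backup.sh.

# Build the S3 object key for a backup taken at a UTC timestamp (YYYYMMDD_HHMMSS).
backup_object_key() {
    ts="$1"
    day="${ts%%_*}"   # date part before the underscore
    printf 'backups/full/%s/mockupaws_full_%s.sql.gz.enc\n' "$day" "$ts"
}

# Dump, encrypt, and upload. Each stage writes to a private temp directory so
# that a failure in any stage stops the chain before anything is uploaded.
run_full_backup() (
    ts="$(date -u +%Y%m%d_%H%M%S)"
    work="$(mktemp -d)" || exit 1
    trap 'rm -rf "$work"' EXIT
    # Custom format with gzip level 9 applied inside the archive
    pg_dump --format=custom --compress=9 --file="$work/dump" "$DATABASE_URL" &&
    # The key is read from the environment, so it never shows up in the process list
    openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_ENCRYPTION_KEY \
        -in "$work/dump" -out "$work/dump.enc" &&
    aws s3 cp "$work/dump.enc" "s3://${BACKUP_BUCKET}/$(backup_object_key "$ts")"
)
```

Staging to temp files costs extra disk space but avoids relying on pipeline failure detection, which differs between shells.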

2. WAL Archiving (Continuous)

  • Method: PostgreSQL archive_command
  • Frequency: Every WAL segment (16MB)
  • Storage: S3 Standard (see Appendix A)
  • Retention: 7 days
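
PostgreSQL expects archive_command to return zero only when the segment is safely stored, and recommends refusing to overwrite a segment that already exists in the archive. A wrapper along these lines is one way to get both properties. It is a sketch under assumptions: the function name archive_wal and the WAL_BUCKET variable are hypothetical, and the bucket name is taken from the archive_command example in this guide.

```bash
# Hypothetical wrapper, meant to be invoked by archive_command with %p and %f.
archive_wal() {
    wal_path="$1"   # %p: path of the WAL segment to archive
    wal_name="$2"   # %f: file name of the segment
    bucket="${WAL_BUCKET:-mockupaws-wal-archive}"
    key="wal/${wal_name}"

    # head-object succeeds only if this exact key already exists
    if aws s3api head-object --bucket "$bucket" --key "$key" >/dev/null 2>&1; then
        echo "refusing to overwrite s3://${bucket}/${key}" >&2
        return 1
    fi

    # The exit status of the copy becomes the status PostgreSQL sees
    aws s3 cp "$wal_path" "s3://${bucket}/${key}"
}
```

A non-zero return makes PostgreSQL keep the segment and retry later, so a transient S3 error does not lose WAL.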

3. Configuration Backups

  • Files: postgresql.conf, pg_hba.conf
  • Schedule: Weekly
  • Storage: Version control + S3

Storage Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Primary Region │────▶│  S3 Standard    │────▶│  S3 Glacier     │
│  (us-east-1)    │     │  (30 days)      │     │  (long-term)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐
│ Secondary Region│
│ (eu-west-1)     │  ← Cross-region replication for DR
└─────────────────┘

Required Environment Variables

# Required
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BACKUP_BUCKET="mockupaws-backups-prod"
export BACKUP_ENCRYPTION_KEY="your-256-bit-key-here"

# Optional
export BACKUP_REGION="us-east-1"
export BACKUP_SECONDARY_REGION="eu-west-1"
export BACKUP_SECONDARY_BUCKET="mockupaws-backups-dr"
export BACKUP_RETENTION_DAYS="30"
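
A script that depends on these variables can fail fast when a required one is missing. The check below is a minimal sketch: the function name check_backup_env is hypothetical, and treating the example values for BACKUP_REGION and BACKUP_RETENTION_DAYS as defaults is an assumption, not something this guide states.

```bash
# Report required variables that are unset or empty; return 1 if any are missing.
check_backup_env() {
    missing=""
    for name in DATABASE_URL BACKUP_BUCKET BACKUP_ENCRYPTION_KEY; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing required variables:$missing" >&2
        return 1
    fi
    # Assumed defaults for two of the optional settings
    : "${BACKUP_REGION:=us-east-1}"
    : "${BACKUP_RETENTION_DAYS:=30}"
}
```

Calling check_backup_env at the top of a backup or restore run turns a confusing mid-run failure into a clear message before any work starts.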

Restore Procedures

Quick Reference

| Scenario           | Command                                         | ETA       |
|--------------------|-------------------------------------------------|-----------|
| Latest full backup | ./scripts/restore.sh latest                     | 15-30 min |
| Specific backup    | ./scripts/restore.sh s3://bucket/path           | 15-30 min |
| Point-in-Time      | ./scripts/restore.sh latest --target-time "..." | 30-60 min |
| Verify only        | ./scripts/restore.sh <file> --verify-only       | 5-10 min  |

Step-by-Step Restore

1. Pre-Restore Checklist

  • Identify target database (should be empty or disposable)
  • Ensure sufficient disk space (2x database size)
  • Verify backup integrity: ./scripts/restore.sh <backup> --verify-only
  • Notify team about maintenance window
  • Document current database state
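
The disk space item in the checklist can be automated. The helper below is a sketch with a hypothetical name, has_restore_headroom; the commented wiring assumes psql access to the source database and GNU df.

```bash
# Succeed only if free space is at least twice the database size (both in bytes).
has_restore_headroom() {
    db_bytes="$1"
    free_bytes="$2"
    [ "$free_bytes" -ge $((db_bytes * 2)) ]
}

# Example wiring (assumptions: psql can reach the database, GNU df is available):
#   db=$(psql "$DATABASE_URL" -At -c "SELECT pg_database_size(current_database());")
#   free=$(df --output=avail -B1 /var/lib/postgresql | tail -n 1 | tr -d ' ')
#   has_restore_headroom "$db" "$free" || echo "not enough disk space" >&2
```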

2. Full Restore from Latest Backup

# Set environment variables
export DATABASE_URL="postgresql://postgres:password@localhost:5432/mockupaws"
export BACKUP_ENCRYPTION_KEY="your-encryption-key"
export BACKUP_BUCKET="mockupaws-backups-prod"

# Perform restore
./scripts/restore.sh latest

3. Restore from Specific Backup

# From S3
./scripts/restore.sh s3://mockupaws-backups-prod/backups/full/20260407/backup.enc

# From local file
./scripts/restore.sh /path/to/backup/mockupaws_full_20260407_120000.sql.gz.enc

4. Post-Restore Verification

# Check database connectivity
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM scenarios;"

# Verify key tables
psql "$DATABASE_URL" -c "\dt"

# Check recent data
psql "$DATABASE_URL" -c "SELECT MAX(created_at) FROM scenario_logs;"

Point-in-Time Recovery (PITR)

Prerequisites

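  1. Base Backup: Physical base backup (for example, taken with pg_basebackup) from before the target time; WAL cannot be replayed on top of a pg_dump archive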
  2. WAL Archives: All WAL segments from backup time to target time
  3. Configuration: PostgreSQL configured for archiving
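
Choosing the base backup for a given target time amounts to picking the newest backup that is not later than the target. A sketch, assuming backups are identified by UTC timestamps in the YYYYMMDD_HHMMSS form seen in the file names in this guide (the function name latest_backup_before is hypothetical):

```bash
# Read backup timestamps (YYYYMMDD_HHMMSS, one per line) on stdin and print the
# newest one that is not later than the target. Prints nothing if none qualifies.
latest_backup_before() {
    target="$1"
    # Fixed-width timestamps sort chronologically as plain strings
    LC_ALL=C sort | awk -v t="$target" '
        $0 <= t { best = $0 }
        END { if (best != "") print best }
    '
}
```

For example, piping a list of available backup timestamps into `latest_backup_before 20260407_143000` yields the one to restore before replaying WAL up to 14:30 on that day.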

PostgreSQL Configuration

Add to postgresql.conf:

# WAL Archiving
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://mockupaws-wal-archive/wal/%f'
archive_timeout = 60

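# Recovery settings (set only for a restore; they apply only while the server
# is in recovery and should be removed once the restore is complete)
recovery_target_time = '2026-04-07 14:30:00 UTC'
recovery_target_action = promote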

PITR Procedure

# Restore to specific point in time
./scripts/restore.sh latest --target-time "2026-04-07 14:30:00"

Manual PITR (Advanced)

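# 1. Stop PostgreSQL
sudo systemctl stop postgresql

# 2. Clear data directory (destructive: copy it elsewhere first if it may be needed later)
#    find runs as root, so it works even when the calling user cannot read the directory
sudo find /var/lib/postgresql/data -mindepth 1 -delete

# 3. Restore a physical base backup taken BEFORE the target time.
#    pg_basebackup copies the primary's current state, so the command below is only
#    suitable when the target time is later than the moment the copy is taken; for an
#    earlier target, unpack an archived base backup into the data directory instead.
sudo -u postgres pg_basebackup -h primary -D /var/lib/postgresql/data -Fp -Xs -P

# 4. Create recovery signal (PostgreSQL 12 and later)
sudo -u postgres touch /var/lib/postgresql/data/recovery.signal

# 5. Configure recovery
sudo -u postgres tee -a /var/lib/postgresql/data/postgresql.conf > /dev/null <<EOF
restore_command = 'aws s3 cp s3://mockupaws-wal-archive/wal/%f %p'
recovery_target_time = '2026-04-07 14:30:00 UTC'
recovery_target_action = promote
EOF

# 6. Start PostgreSQL
sudo systemctl start postgresql

# 7. Monitor recovery (pg_is_in_recovery() turns false once the server has been promoted)
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"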
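
Rather than re-running the monitoring query by hand, a small polling loop can wait until the server has left recovery. This is a sketch: the function name wait_for_promotion is hypothetical, the 3600 second default simply mirrors the RTO target, and psql is assumed to connect with default settings.

```bash
# Poll until pg_is_in_recovery() reports false, or give up after a timeout.
wait_for_promotion() {
    timeout_s="${1:-3600}"   # overall limit in seconds
    interval_s="${2:-10}"    # pause between checks in seconds
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        # -At prints the bare value: "t" while in recovery, "f" after promotion.
        # A failed connection (server still starting) yields an empty string.
        state="$(psql -At -c 'SELECT pg_is_in_recovery();' 2>/dev/null)"
        if [ "$state" = "f" ]; then
            return 0
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    echo "still in recovery after ${timeout_s}s" >&2
    return 1
}
```

A restore runbook could call `wait_for_promotion 3600 10` before running the post-restore verification queries.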

Disaster Recovery Procedures

DR Scenarios

Scenario 1: Database Corruption

# 1. Isolate corrupted database (terminate every session except this one)
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'mockupaws' AND pid <> pg_backend_pid();"

# 2. Restore from latest backup
./scripts/restore.sh latest

# 3. Verify data integrity
./scripts/verify-data.sh

# 4. Resume application traffic

Scenario 2: Complete Region Failure

# 1. Activate DR region
export BACKUP_BUCKET="mockupaws-backups-dr"
export AWS_REGION="eu-west-1"

# 2. Restore to DR database
./scripts/restore.sh latest

# 3. Update DNS/application configuration
# Point to DR region database endpoint

# 4. Verify application functionality

Scenario 3: Accidental Data Deletion

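# 1. Identify deletion timestamp (from logs)
DELETION_TIME="2026-04-07 15:23:00"

# 2. Restore to a point just before the deletion, with DATABASE_URL pointing at a
#    disposable database. Use a target slightly earlier than DELETION_TIME so the
#    deleting transaction is not replayed.
TARGET_TIME="2026-04-07 15:22:00"
./scripts/restore.sh latest --target-time "$TARGET_TIME"

# 3. Export missing data from the restored copy
pg_dump --data-only --table=deleted_table "$DATABASE_URL" > missing_data.sql

# 4. Return to the current database and import the missing data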

DR Testing Schedule

| Test Type             | Frequency | Responsible   |
|-----------------------|-----------|---------------|
| Backup verification   | Daily     | Automated     |
| Restore test (dev)    | Weekly    | DevOps        |
| Full DR drill         | Monthly   | SRE Team      |
| Cross-region failover | Quarterly | Platform Team |

Monitoring & Alerting

Backup Monitoring

-- Check backup history
SELECT 
    backup_type,
    created_at,
    status,
    EXTRACT(EPOCH FROM (NOW() - created_at))/3600 as hours_since_backup
FROM backup_history 
ORDER BY created_at DESC 
LIMIT 10;

Prometheus Alerts

# backup-alerts.yml
groups:
  - name: backup_alerts
    rules:
      - alert: BackupNotRun
        expr: time() - max(backup_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Database backup has not run in 25 hours"
          
      - alert: BackupFailed
        expr: increase(backup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database backup failed"
          
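      # S3 buckets have no fixed capacity, so the two metrics below presumably come
      # from a custom exporter (for example, one that tracks a storage quota).
      - alert: LowBackupStorage
        expr: s3_bucket_free_bytes / s3_bucket_total_bytes < 0.1
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup storage free capacity below 10%"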
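
This guide does not say how backup_last_success_timestamp and backup_failures_total reach Prometheus. One common option is the node_exporter textfile collector, which reads metric files from a configured directory. The writer below is a sketch under that assumption; the function name write_backup_metrics and its arguments are hypothetical.

```bash
# Write the two backup metrics in Prometheus text exposition format.
# The file is written to a temp name and renamed, so a scrape never sees a partial file.
write_backup_metrics() {
    out_file="$1"     # a *.prom file inside the textfile collector directory
    success_ts="$2"   # Unix time of the last successful backup
    failures="$3"     # running count of failed backups
    tmp="${out_file}.$$"
    {
        echo '# TYPE backup_last_success_timestamp gauge'
        echo "backup_last_success_timestamp ${success_ts}"
        echo '# TYPE backup_failures_total counter'
        echo "backup_failures_total ${failures}"
    } > "$tmp" && mv "$tmp" "$out_file"
}
```

A backup script could call `write_backup_metrics "$metrics_dir/backup.prom" "$(date +%s)" "$failure_count"` after each run, where metrics_dir and failure_count are placeholders for values the script would track itself.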

Health Checks

# Check backup status
curl -f http://localhost:8000/health/backup || echo "Backup check failed"

# Check WAL archiving
psql -c "SELECT archived_count, failed_count FROM pg_stat_archiver;"

# Check replication lag (if applicable)
psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

Troubleshooting

Common Issues

Issue: Backup fails with "disk full"

# Check disk space
df -h

# Clean old backups
./scripts/backup.sh cleanup

# Or manually remove local backup files older than 7 days (files only, not directories)
find /path/to/backups -type f -mtime +7 -delete

Issue: Decryption fails

# Verify encryption key matches
export BACKUP_ENCRYPTION_KEY="correct-key"

# Test decryption (-pass env: reads the key from the environment and keeps it out of the process list)
openssl enc -aes-256-cbc -d -pbkdf2 -in backup.enc -out backup.sql -pass env:BACKUP_ENCRYPTION_KEY

Issue: Restore fails with "database in use"

# Terminate connections
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'mockupaws' AND pid <> pg_backend_pid();"

# Retry restore
./scripts/restore.sh latest

Issue: S3 upload fails

# Check AWS credentials
aws sts get-caller-identity

# Test S3 access
aws s3 ls "s3://$BACKUP_BUCKET/"

# Check bucket permissions
aws s3api get-bucket-acl --bucket "$BACKUP_BUCKET"

Log Files

| Log File                   | Purpose                |
|----------------------------|------------------------|
| storage/logs/backup_*.log  | Backup execution logs  |
| storage/logs/restore_*.log | Restore execution logs |
| /var/log/postgresql/*.log  | PostgreSQL server logs |

Getting Help

  1. Check this documentation
  2. Review logs in storage/logs/
  3. Contact: #database-ops Slack channel
  4. Escalate to: on-call SRE (PagerDuty)

Appendix

A. Backup Retention Policy

| Backup Type  | Retention | Storage Class           |
|--------------|-----------|-------------------------|
| Daily Full   | 30 days   | S3 Standard-IA          |
| Weekly Full  | 12 weeks  | S3 Standard-IA          |
| Monthly Full | 12 months | S3 Glacier              |
| Yearly Full  | 7 years   | S3 Glacier Deep Archive |
| WAL Archives | 7 days    | S3 Standard             |
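
For a local cleanup job, the retention policy can be expressed as a lookup. The sketch below uses hypothetical function names (retention_days, is_expired) and converts weeks, months, and years to days with 7, 365/12, and 365 day approximations, which is an assumption. Objects stored in S3 could instead be expired with bucket lifecycle rules.

```bash
# Map a backup type to its retention period in days.
retention_days() {
    case "$1" in
        daily)   echo 30 ;;
        weekly)  echo 84 ;;     # 12 weeks
        monthly) echo 365 ;;    # 12 months
        yearly)  echo 2555 ;;   # 7 years
        wal)     echo 7 ;;
        *)       echo "unknown backup type: $1" >&2; return 1 ;;
    esac
}

# Succeed if a backup of the given age (in days) and type is past its retention.
# Returns 2 for an unknown backup type.
is_expired() {
    age_days="$1"
    limit="$(retention_days "$2")" || return 2
    [ "$age_days" -gt "$limit" ]
}
```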

B. Backup Encryption

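# Generate a 256-bit encryption key (printed to stdout)
openssl rand -base64 32

# Or generate a key and store it directly in Secrets Manager.
# Each openssl call produces a different key, so this does not store the key printed above.
aws secretsmanager create-secret \
  --name mockupaws/backup-encryption-key \
  --secret-string "$(openssl rand -base64 32)"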

C. Cron Configuration

# /etc/cron.d/mockupaws-backup
# Daily full backup at 02:00 UTC
0 2 * * * root /opt/mockupaws/scripts/backup.sh full >> /var/log/mockupaws/backup.log 2>&1

# Hourly WAL archive
0 * * * * root /opt/mockupaws/scripts/backup.sh wal >> /var/log/mockupaws/wal.log 2>&1

# Daily cleanup
0 4 * * * root /opt/mockupaws/scripts/backup.sh cleanup >> /var/log/mockupaws/cleanup.log 2>&1

Document History

| Version | Date       | Author  | Changes         |
|---------|------------|---------|-----------------|
| 1.0.0   | 2026-04-07 | DB Team | Initial release |

For questions or updates to this document, contact the Database Engineering team.