# Data Archiving Strategy

## mockupAWS v1.0.0 - Data Lifecycle Management

---

## Table of Contents

1. [Overview](#overview)
2. [Archive Policies](#archive-policies)
3. [Implementation](#implementation)
4. [Archive Job](#archive-job)
5. [Querying Archived Data](#querying-archived-data)
6. [Monitoring](#monitoring)
7. [Storage Estimation](#storage-estimation)
8. [Maintenance](#maintenance)
9. [Troubleshooting](#troubleshooting)

---

## Overview

As mockupAWS accumulates data over time, we implement an automated archiving strategy to:

- **Reduce storage costs** by moving old data to archive tables
- **Improve query performance** on active data
- **Maintain data accessibility** through unified views
- **Comply with data retention policies**

### Archive Strategy Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         Data Lifecycle                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Active Data (Hot)           │   Archive Data (Cold)           │
│   ─────────────────           │   ──────────────────            │
│   • Fast queries              │   • Partitioned by month        │
│   • Full indexing             │   • Compressed                  │
│   • Real-time writes          │   • S3 for large files          │
│                               │                                 │
│   scenario_logs               │ → scenario_logs_archive         │
│   (> 1 year old)              │   (> 1 year, partitioned)       │
│                               │                                 │
│   scenario_metrics            │ → scenario_metrics_archive      │
│   (> 2 years old)             │   (> 2 years, aggregated)       │
│                               │                                 │
│   reports                     │ → reports_archive               │
│   (> 6 months old)            │   (> 6 months, S3 storage)      │
│                               │                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Archive Policies

### Policy Configuration

| Table | Archive After | Aggregation | Compression | S3 Storage |
|-------|---------------|-------------|-------------|------------|
| `scenario_logs` | 365 days | No | No | No |
| `scenario_metrics` | 730 days | Daily | No | No |
| `reports` | 180 days | No | Yes | Yes |

### Detailed Policies
#### 1. Scenario Logs Archive (> 1 year)

**Criteria:**
- Records older than 365 days
- Move to `scenario_logs_archive` table
- Partitioned by month for efficient querying

**Retention:**
- Archive table: 7 years
- After 7 years: delete or move to long-term storage

#### 2. Scenario Metrics Archive (> 2 years)

**Criteria:**
- Records older than 730 days
- Aggregate to daily values before archiving
- Store aggregated data in `scenario_metrics_archive`

**Aggregation:**
- Group by: scenario_id, metric_type, metric_name, day
- Aggregate: AVG(value), COUNT(samples)

**Retention:**
- Archive table: 5 years
- Aggregated data only (original samples deleted)

#### 3. Reports Archive (> 6 months)

**Criteria:**
- Reports older than 180 days
- Compress PDF/CSV files
- Upload to S3
- Keep metadata in `reports_archive` table

**Retention:**
- S3 storage: 3 years with lifecycle to Glacier
- Metadata: 5 years

---

## Implementation

### Database Schema

#### Archive Tables

```sql
-- Scenario logs archive (partitioned by month).
-- Note: PostgreSQL requires the partition key to be part of the primary key,
-- and partition key expressions must be IMMUTABLE (DATE_TRUNC on timestamptz
-- is not), so we partition on the raw timestamp and create monthly range
-- partitions.
CREATE TABLE scenario_logs_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) NOT NULL,
    size_bytes INTEGER NOT NULL,
    has_pii BOOLEAN NOT NULL,
    token_count INTEGER NOT NULL,
    sqs_blocks INTEGER NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Scenario metrics archive (with aggregation support)
CREATE TABLE scenario_metrics_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15,6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    is_aggregated BOOLEAN DEFAULT FALSE,
    aggregation_period VARCHAR(20),
    sample_count INTEGER,
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

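-- A partitioned table needs child partitions before it can accept rows.
-- An illustrative monthly partition follows; the name and bounds are
-- examples (in practice the archive job would create these on demand):
CREATE TABLE scenario_logs_archive_2025_01
    PARTITION OF scenario_logs_archive
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
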
-- Reports archive (S3 references)
CREATE TABLE reports_archive (
    id UUID PRIMARY KEY,
    scenario_id UUID NOT NULL,
    format VARCHAR(10) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    file_size_bytes INTEGER,
    generated_by VARCHAR(100),
    extra_data JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    s3_location VARCHAR(500),
    deleted_locally BOOLEAN DEFAULT FALSE,
    archive_batch_id UUID
);
```

#### Unified Views (Query Transparency)

```sql
-- View combining live and archived logs
CREATE VIEW v_scenario_logs_all AS
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    NULL::timestamptz as archived_at,
    false as is_archived
FROM scenario_logs
UNION ALL
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    archived_at,
    true as is_archived
FROM scenario_logs_archive;

-- View combining live and archived metrics
CREATE VIEW v_scenario_metrics_all AS
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    NULL::timestamptz as archived_at,
    false as is_aggregated,
    false as is_archived
FROM scenario_metrics
UNION ALL
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    archived_at,
    is_aggregated,
    true as is_archived
FROM scenario_metrics_archive;
```

### Archive Job Tracking

```sql
-- Archive jobs table
CREATE TABLE archive_jobs (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    records_processed INTEGER DEFAULT 0,
    records_archived INTEGER DEFAULT 0,
    records_deleted INTEGER DEFAULT 0,
    bytes_archived BIGINT DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Archive statistics view
CREATE VIEW v_archive_statistics AS
SELECT
    'logs' as archive_type,
    COUNT(*) as total_records,
    MIN(received_at) as oldest_record,
    MAX(received_at) as newest_record,
    SUM(size_bytes) as total_bytes
FROM scenario_logs_archive
UNION ALL
SELECT
    'metrics' as archive_type,
    COUNT(*) as total_records,
    MIN(timestamp) as oldest_record,
    MAX(timestamp) as newest_record,
    0 as total_bytes
FROM scenario_metrics_archive
UNION ALL
SELECT
    'reports' as archive_type,
    COUNT(*) as total_records,
    MIN(created_at) as oldest_record,
    MAX(created_at) as newest_record,
    SUM(file_size_bytes) as total_bytes
FROM reports_archive;
```

---

## Archive Job

### Running the Archive Job

```bash
# Preview what would be archived (dry run)
python scripts/archive_job.py --dry-run --all

# Archive all eligible data
python scripts/archive_job.py --all

# Archive specific types only
python scripts/archive_job.py --logs
python scripts/archive_job.py --metrics
python scripts/archive_job.py --reports

# Combine options
python scripts/archive_job.py --logs --metrics --dry-run
```

### Cron Configuration

```bash
# Run archive job nightly at 3:00 AM
0 3 * * * /opt/mockupaws/.venv/bin/python /opt/mockupaws/scripts/archive_job.py --all >> /var/log/mockupaws/archive.log 2>&1
```

### Environment Variables

```bash
# Required
export DATABASE_URL="postgresql+asyncpg://user:pass@host:5432/mockupaws"

# For reports S3 archiving
export REPORTS_ARCHIVE_BUCKET="mockupaws-reports-archive"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-east-1"
```

---

## Querying Archived Data

### Transparent Access

Use the unified views for automatic access to both live and archived data:

```sql
-- Query all logs (live + archived)
SELECT * FROM v_scenario_logs_all
WHERE scenario_id = 'uuid-here'
ORDER BY received_at DESC
LIMIT 1000;

-- Query all metrics (live + archived)
SELECT * FROM v_scenario_metrics_all
WHERE scenario_id = 'uuid-here'
  AND timestamp > NOW() - INTERVAL '2 years'
ORDER BY timestamp;
```

### Optimized Queries

```sql
-- Query only live data (faster)
SELECT * FROM scenario_logs
WHERE scenario_id =
  'uuid-here'
ORDER BY received_at DESC;

-- Query only archived data
SELECT * FROM scenario_logs_archive
WHERE scenario_id = 'uuid-here'
  AND received_at < NOW() - INTERVAL '1 year'
ORDER BY received_at DESC;

-- Query a specific month partition (most efficient)
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
  AND received_at < '2025-02-01'
  AND scenario_id = 'uuid-here';
```

### Application Code Example

```python
from uuid import UUID

from sqlalchemy import select, text
from sqlalchemy.ext.asyncio import AsyncSession

from src.models.scenario_log import ScenarioLog


async def get_logs(db: AsyncSession, scenario_id: UUID, include_archived: bool = False):
    """Get scenario logs with optional archive inclusion."""
    if include_archived:
        # Use unified view for complete history
        result = await db.execute(
            text("""
                SELECT * FROM v_scenario_logs_all
                WHERE scenario_id = :sid
                ORDER BY received_at DESC
            """),
            {"sid": scenario_id},
        )
        # A raw SQL query yields rows, not ORM objects
        return result.mappings().all()

    # Query only live data (faster)
    result = await db.execute(
        select(ScenarioLog)
        .where(ScenarioLog.scenario_id == scenario_id)
        .order_by(ScenarioLog.received_at.desc())
    )
    return result.scalars().all()
```

---

## Monitoring

### Archive Job Status

```sql
-- Check recent archive jobs
SELECT
    job_type,
    status,
    started_at,
    completed_at,
    records_archived,
    records_deleted,
    pg_size_pretty(bytes_archived) as space_saved
FROM archive_jobs
ORDER BY started_at DESC
LIMIT 10;

-- Check for failed jobs
SELECT * FROM archive_jobs
WHERE status = 'failed'
ORDER BY started_at DESC;
```

### Archive Statistics

```sql
-- View archive statistics
SELECT * FROM v_archive_statistics;

-- Archive growth over time. v_archive_statistics has no per-record
-- archived_at column, so query the archive tables directly
-- (shown for logs; repeat per archive table):
SELECT
    DATE_TRUNC('month', archived_at) as archive_month,
    COUNT(*) as records_archived,
    pg_size_pretty(SUM(size_bytes)) as bytes_archived
FROM scenario_logs_archive
GROUP BY DATE_TRUNC('month', archived_at)
ORDER BY archive_month DESC;
```

### Alerts

```yaml
# archive-alerts.yml
groups:
  - name: archive_alerts
    rules:
      - alert: ArchiveJobFailed
        expr: increase(archive_job_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data archive job failed"

      - alert: ArchiveJobNotRunning
        expr: time() - max(archive_job_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Archive job has not run in 25 hours"

      - alert: ArchiveStorageGrowing
        expr: increase(archive_bytes_total[1d]) > 1073741824  # > 1 GiB growth per day
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Archive storage growing rapidly"
```

---

## Storage Estimation

### Projected Storage Savings

Assuming typical usage patterns:

| Data Type | Daily Volume | Annual Volume | After Archive | Savings |
|-----------|--------------|---------------|---------------|---------|
| Logs | 1M records/day | 365M records | 365M in archive | 0 old rows in main table |
| Metrics | 500K records/day | 182M records | 60M aggregated | ~67% row reduction |
| Reports | 100/day (50MB each) | 1.8TB | 1.8TB in S3 | 100% local reduction |

### Cost Analysis (Monthly)

| Storage Type | Before Archive | After Archive | Monthly Savings |
|--------------|----------------|---------------|-----------------|
| PostgreSQL (hot) | $200 | $50 | $150 |
| PostgreSQL (archive) | $0 | $30 | -$30 |
| S3 Standard | $0 | $20 | -$20 |
| S3 Glacier | $0 | $5 | -$5 |
| **Total** | **$200** | **$105** | **$95** |

*Estimates are based on AWS us-east-1 pricing; actual costs may vary.*

---

## Maintenance

### Monthly Tasks

1. **Review archive statistics**

   ```sql
   SELECT * FROM v_archive_statistics;
   ```

2. **Check for old archive partitions**

   ```sql
   SELECT
       schemaname,
       tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
   FROM pg_tables
   WHERE tablename LIKE 'scenario_logs_archive_%'
   ORDER BY tablename;
   ```

3. **Clean up old S3 files** (after the retention period)

   ```bash
   aws s3 rm s3://mockupaws-reports-archive/archived-reports/ \
     --recursive \
     --exclude '*' \
     --include '*2023*'
   ```

### Quarterly Tasks

1. **Archive job performance review**
   - Check execution times
   - Optimize batch sizes if needed
2. **Storage cost review**
   - Verify S3 lifecycle policies
   - Consider Glacier transition for old archives

3. **Data retention compliance**
   - Verify deletion of data past the retention period
   - Update policies as needed

---

## Troubleshooting

### Archive Job Fails

```bash
# Check logs
tail -f storage/logs/archive_*.log

# Run with verbose output
python scripts/archive_job.py --all --verbose

# Check database connectivity (psql does not understand the "+asyncpg"
# driver suffix in the SQLAlchemy URL, so strip it first)
psql "${DATABASE_URL/+asyncpg/}" -c "SELECT COUNT(*) FROM archive_jobs;"
```

### S3 Upload Fails

```bash
# Verify AWS credentials
aws sts get-caller-identity

# Test S3 access
aws s3 ls s3://mockupaws-reports-archive/

# Check bucket policy
aws s3api get-bucket-policy --bucket mockupaws-reports-archive
```

### Query Performance Issues

```sql
-- Check if indexes exist on archive tables
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename LIKE '%_archive%';

-- Analyze archive tables
ANALYZE scenario_logs_archive;
ANALYZE scenario_metrics_archive;

-- Check partition pruning
EXPLAIN ANALYZE
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
  AND received_at < '2025-02-01';
```

---

## References

- [PostgreSQL Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
- [AWS S3 Lifecycle Policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- [Database Migration](alembic/versions/b2c3d4e5f6a7_create_archive_tables_v1_0_0.py)
- [Archive Job Script](../scripts/archive_job.py)

---

*Document Version: 1.0.0*
*Last Updated: 2026-04-07*