mockupAWS/docs/DATA-ARCHIVING.md
Luca Sacchi Ricciardi 38fd6cb562
# Data Archiving Strategy
## mockupAWS v1.0.0 - Data Lifecycle Management
---
## Table of Contents
1. [Overview](#overview)
2. [Archive Policies](#archive-policies)
3. [Implementation](#implementation)
4. [Archive Job](#archive-job)
5. [Querying Archived Data](#querying-archived-data)
6. [Monitoring](#monitoring)
7. [Storage Estimation](#storage-estimation)
---
## Overview
As mockupAWS accumulates data over time, we implement an automated archiving strategy to:
- **Reduce storage costs** by moving old data to archive tables
- **Improve query performance** on active data
- **Maintain data accessibility** through unified views
- **Comply with data retention policies**
### Archive Strategy Overview
```
┌──────────────────────────────────────────────────────────────┐
│                       Data Lifecycle                         │
├──────────────────────────────────────────────────────────────┤
│  Active Data (Hot)         │  Archive Data (Cold)            │
│  ─────────────────         │  ──────────────────             │
│  • Fast queries            │  • Partitioned by month         │
│  • Full indexing           │  • Compressed                   │
│  • Real-time writes        │  • S3 for large files           │
│                            │                                 │
│  scenario_logs             │  → scenario_logs_archive        │
│  (> 1 year old)            │    (partitioned by month)       │
│                            │                                 │
│  scenario_metrics          │  → scenario_metrics_archive     │
│  (> 2 years old)           │    (aggregated to daily)        │
│                            │                                 │
│  reports                   │  → reports_archive              │
│  (> 6 months old)          │    (metadata + S3 storage)      │
└──────────────────────────────────────────────────────────────┘
```
---
## Archive Policies
### Policy Configuration
| Table | Archive After | Aggregation | Compression | S3 Storage |
|-------|--------------|-------------|-------------|------------|
| `scenario_logs` | 365 days | No | No | No |
| `scenario_metrics` | 730 days | Daily | No | No |
| `reports` | 180 days | No | Yes | Yes |
### Detailed Policies
#### 1. Scenario Logs Archive (> 1 year)
**Criteria:**
- Records older than 365 days
- Move to `scenario_logs_archive` table
- Partitioned by month for efficient querying
**Retention:**
- Archive table: 7 years
- After 7 years: Delete or move to long-term storage
#### 2. Scenario Metrics Archive (> 2 years)
**Criteria:**
- Records older than 730 days
- Aggregate to daily values before archiving
- Store aggregated data in `scenario_metrics_archive`
**Aggregation:**
- Group by: scenario_id, metric_type, metric_name, day
- Aggregate: AVG(value), COUNT(samples)
**Retention:**
- Archive table: 5 years
- Aggregated data only (original samples deleted)
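The daily roll-up described above can be sketched in plain Python (a minimal sketch; the row shape mirrors the `scenario_metrics` columns and the helper name is illustrative, not the project's actual code):

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def aggregate_daily(rows):
    """Roll raw metric samples up to one row per (scenario, type, name, day).

    Each input row is a dict with keys: scenario_id, metric_type,
    metric_name, timestamp (datetime), value (float).  Output carries
    AVG(value) and COUNT(samples), matching the aggregation policy.
    """
    buckets = defaultdict(list)
    for r in rows:
        key = (r["scenario_id"], r["metric_type"], r["metric_name"],
               r["timestamp"].date())
        buckets[key].append(r["value"])
    return [
        {
            "scenario_id": sid,
            "metric_type": mtype,
            "metric_name": mname,
            "day": day,
            "value": mean(values),        # AVG(value)
            "sample_count": len(values),  # COUNT(samples)
        }
        for (sid, mtype, mname, day), values in buckets.items()
    ]
```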
#### 3. Reports Archive (> 6 months)
**Criteria:**
- Reports older than 180 days
- Compress PDF/CSV files
- Upload to S3
- Keep metadata in `reports_archive` table
**Retention:**
- S3 storage: 3 years with lifecycle to Glacier
- Metadata: 5 years
---
## Implementation
### Database Schema
#### Archive Tables
```sql
-- Scenario logs archive (partitioned by month).
-- On a partitioned table the primary key must include the partition
-- key, so (id, received_at) is used; monthly partitions are declared
-- with FOR VALUES FROM/TO bounds on received_at.
CREATE TABLE scenario_logs_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) NOT NULL,
    size_bytes INTEGER NOT NULL,
    has_pii BOOLEAN NOT NULL,
    token_count INTEGER NOT NULL,
    sqs_blocks INTEGER NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Scenario metrics archive (with aggregation support)
CREATE TABLE scenario_metrics_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15,6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    is_aggregated BOOLEAN DEFAULT FALSE,
    aggregation_period VARCHAR(20),
    sample_count INTEGER,
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Reports archive (S3 references)
CREATE TABLE reports_archive (
    id UUID PRIMARY KEY,
    scenario_id UUID NOT NULL,
    format VARCHAR(10) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    file_size_bytes INTEGER,
    generated_by VARCHAR(100),
    extra_data JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    s3_location VARCHAR(500),
    deleted_locally BOOLEAN DEFAULT FALSE,
    archive_batch_id UUID
);
```
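Monthly partitions have to be created explicitly with `FOR VALUES` bounds. A small helper like the following can emit that DDL (an illustrative sketch, not the project's actual migration code):

```python
from datetime import date

def month_partitions(table: str, start: date, months: int):
    """Yield CREATE TABLE ... PARTITION OF statements for consecutive
    calendar months, using half-open [first-of-month, first-of-next) bounds."""
    y, m = start.year, start.month
    for _ in range(months):
        lo = date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        hi = date(y, m, 1)
        yield (
            f"CREATE TABLE {table}_{lo:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
```

Generating a year of partitions ahead of time (e.g. from the nightly job) avoids insert failures when a batch crosses a month boundary.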
#### Unified Views (Query Transparency)
```sql
-- View combining live and archived logs
CREATE VIEW v_scenario_logs_all AS
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    NULL::timestamptz AS archived_at,
    false AS is_archived
FROM scenario_logs
UNION ALL
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    archived_at,
    true AS is_archived
FROM scenario_logs_archive;

-- View combining live and archived metrics
CREATE VIEW v_scenario_metrics_all AS
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    NULL::timestamptz AS archived_at,
    false AS is_aggregated,
    false AS is_archived
FROM scenario_metrics
UNION ALL
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    archived_at,
    is_aggregated,
    true AS is_archived
FROM scenario_metrics_archive;
```
### Archive Job Tracking
```sql
-- Archive jobs table
CREATE TABLE archive_jobs (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    records_processed INTEGER DEFAULT 0,
    records_archived INTEGER DEFAULT 0,
    records_deleted INTEGER DEFAULT 0,
    bytes_archived BIGINT DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Archive statistics view
CREATE VIEW v_archive_statistics AS
SELECT
    'logs' AS archive_type,
    COUNT(*) AS total_records,
    MIN(received_at) AS oldest_record,
    MAX(received_at) AS newest_record,
    SUM(size_bytes) AS total_bytes
FROM scenario_logs_archive
UNION ALL
SELECT
    'metrics' AS archive_type,
    COUNT(*) AS total_records,
    MIN(timestamp) AS oldest_record,
    MAX(timestamp) AS newest_record,
    0 AS total_bytes
FROM scenario_metrics_archive
UNION ALL
SELECT
    'reports' AS archive_type,
    COUNT(*) AS total_records,
    MIN(created_at) AS oldest_record,
    MAX(created_at) AS newest_record,
    SUM(file_size_bytes) AS total_bytes
FROM reports_archive;
```
---
## Archive Job
### Running the Archive Job
```bash
# Preview what would be archived (dry run)
python scripts/archive_job.py --dry-run --all
# Archive all eligible data
python scripts/archive_job.py --all
# Archive specific types only
python scripts/archive_job.py --logs
python scripts/archive_job.py --metrics
python scripts/archive_job.py --reports
# Combine options
python scripts/archive_job.py --logs --metrics --dry-run
```
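The flag behavior above could be handled with a standard `argparse` skeleton (an illustrative sketch of the CLI surface, not the actual contents of `scripts/archive_job.py`):

```python
import argparse

def parse_args(argv=None):
    """Parse archive-job flags; --all implies every data type."""
    p = argparse.ArgumentParser(description="Archive eligible data")
    p.add_argument("--all", action="store_true",
                   help="archive logs, metrics and reports")
    p.add_argument("--logs", action="store_true")
    p.add_argument("--metrics", action="store_true")
    p.add_argument("--reports", action="store_true")
    p.add_argument("--dry-run", action="store_true",
                   help="preview what would be archived, write nothing")
    args = p.parse_args(argv)
    if args.all:
        args.logs = args.metrics = args.reports = True
    return args
```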
### Cron Configuration
```bash
# Run archive job nightly at 3:00 AM
0 3 * * * /opt/mockupaws/.venv/bin/python /opt/mockupaws/scripts/archive_job.py --all >> /var/log/mockupaws/archive.log 2>&1
```
### Environment Variables
```bash
# Required
export DATABASE_URL="postgresql+asyncpg://user:pass@host:5432/mockupaws"
# For reports S3 archiving
export REPORTS_ARCHIVE_BUCKET="mockupaws-reports-archive"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-east-1"
```
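Since the S3 variables are only needed when reports are being archived, a fail-fast check at job startup can report exactly what is missing (a sketch; the helper name is illustrative):

```python
import os

REQUIRED = ["DATABASE_URL"]
S3_REQUIRED = ["REPORTS_ARCHIVE_BUCKET", "AWS_ACCESS_KEY_ID",
               "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]

def missing_env(env=None, archive_reports=False):
    """Return the names of unset variables; S3 vars only matter for reports."""
    env = os.environ if env is None else env
    needed = REQUIRED + (S3_REQUIRED if archive_reports else [])
    return [name for name in needed if not env.get(name)]
```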
---
## Querying Archived Data
### Transparent Access
Use the unified views for automatic access to both live and archived data:
```sql
-- Query all logs (live + archived)
SELECT * FROM v_scenario_logs_all
WHERE scenario_id = 'uuid-here'
ORDER BY received_at DESC
LIMIT 1000;
-- Query all metrics (live + archived)
SELECT * FROM v_scenario_metrics_all
WHERE scenario_id = 'uuid-here'
AND timestamp > NOW() - INTERVAL '2 years'
ORDER BY timestamp;
```
### Optimized Queries
```sql
-- Query only live data (faster)
SELECT * FROM scenario_logs
WHERE scenario_id = 'uuid-here'
ORDER BY received_at DESC;
-- Query only archived data
SELECT * FROM scenario_logs_archive
WHERE scenario_id = 'uuid-here'
AND received_at < NOW() - INTERVAL '1 year'
ORDER BY received_at DESC;
-- Query specific month partition (most efficient)
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
AND received_at < '2025-02-01'
AND scenario_id = 'uuid-here';
```
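The month-bounded query above relies on half-open date ranges that line up with the partition bounds; computing them in application code keeps the planner able to prune to a single partition (helper name illustrative):

```python
from datetime import date

def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Half-open [first day of month, first day of next month) range,
    matching how the monthly archive partitions are bounded."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end
```

Pass the two dates as bind parameters on `received_at >= :start AND received_at < :end` so only the matching partition is scanned.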
### Application Code Example
```python
from uuid import UUID

from sqlalchemy import select, text
from sqlalchemy.ext.asyncio import AsyncSession

from src.models.scenario_log import ScenarioLog


async def get_logs(db: AsyncSession, scenario_id: UUID, include_archived: bool = False):
    """Get scenario logs with optional archive inclusion."""
    if include_archived:
        # Use the unified view for the complete history (live + archived)
        result = await db.execute(
            text("""
                SELECT * FROM v_scenario_logs_all
                WHERE scenario_id = :sid
                ORDER BY received_at DESC
            """),
            {"sid": scenario_id},
        )
        # text() rows are not ORM objects; return them as mappings
        return result.mappings().all()
    # Query only live data (faster)
    result = await db.execute(
        select(ScenarioLog)
        .where(ScenarioLog.scenario_id == scenario_id)
        .order_by(ScenarioLog.received_at.desc())
    )
    return result.scalars().all()
```
---
## Monitoring
### Archive Job Status
```sql
-- Check recent archive jobs
SELECT
    job_type,
    status,
    started_at,
    completed_at,
    records_archived,
    records_deleted,
    pg_size_pretty(bytes_archived) AS space_saved
FROM archive_jobs
ORDER BY started_at DESC
LIMIT 10;

-- Check for failed jobs
SELECT * FROM archive_jobs
WHERE status = 'failed'
ORDER BY started_at DESC;
```
### Archive Statistics
```sql
-- View archive statistics
SELECT * FROM v_archive_statistics;
-- Archive growth over time, from the job log
-- (v_archive_statistics holds only per-type totals, not per-record dates)
SELECT
    DATE_TRUNC('month', started_at) AS archive_month,
    job_type,
    SUM(records_archived) AS records_archived,
    pg_size_pretty(SUM(bytes_archived)) AS bytes_archived
FROM archive_jobs
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', started_at), job_type
ORDER BY archive_month DESC;
```
### Alerts
```yaml
# archive-alerts.yml
groups:
  - name: archive_alerts
    rules:
      - alert: ArchiveJobFailed
        expr: increase(archive_job_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data archive job failed"

      - alert: ArchiveJobNotRunning
        expr: time() - max(archive_job_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Archive job has not run in 25 hours"

      - alert: ArchiveStorageGrowing
        expr: rate(archive_bytes_total[1d]) > 1073741824  # 1 GiB/day
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Archive storage growing rapidly"
```
---
## Storage Estimation
### Projected Storage Savings
Assuming typical usage patterns:
| Data Type | Daily Volume | Annual Volume | After Archive | Savings |
|-----------|--------------|---------------|---------------|---------|
| Logs | 1M records/day | 365M records | 0 in main table (365M archived) | 100% hot-table reduction |
| Metrics | 500K records/day | ~182M records | ~60M aggregated rows | ~66% row reduction |
| Reports | 100/day (50 MB each) | ~1.8 TB | metadata only in PostgreSQL; files in S3 | ~100% local reduction |
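The annual figures follow from simple arithmetic on the daily rates (a sanity check; the ~60M aggregated-row count is the table's own assumption):

```python
# Annual volumes implied by the daily rates in the table (365-day year).
logs_annual = 1_000_000 * 365             # 365M log records
metrics_annual = 500_000 * 365            # ~182M metric samples
reports_annual_tb = 100 * 50 * 365 / 1e6  # MB per year -> TB, ~1.8 TB

# Row reduction from daily aggregation down to ~60M archived rows.
metrics_reduction = 1 - 60_000_000 / metrics_annual  # ~0.67
```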
### Cost Analysis (Monthly)
| Storage Type | Before Archive | After Archive | Monthly Savings |
|--------------|----------------|---------------|-----------------|
| PostgreSQL (hot) | $200 | $50 | $150 |
| PostgreSQL (archive) | $0 | $30 | -$30 |
| S3 Standard | $0 | $20 | -$20 |
| S3 Glacier | $0 | $5 | -$5 |
| **Total** | **$200** | **$105** | **$95** |
*Estimates based on AWS us-east-1 pricing, actual costs may vary.*
---
## Maintenance
### Monthly Tasks
1. **Review archive statistics**
```sql
SELECT * FROM v_archive_statistics;
```
2. **Check for old archive partitions**
```sql
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE tablename LIKE 'scenario_logs_archive_%'
ORDER BY tablename;
```
3. **Clean up old S3 files** (after retention period)
```bash
aws s3 rm s3://mockupaws-reports-archive/archived-reports/ \
--recursive \
--exclude '*' \
--include '*2023*'
```
### Quarterly Tasks
1. **Archive job performance review**
- Check execution times
- Optimize batch sizes if needed
2. **Storage cost review**
- Verify S3 lifecycle policies
- Consider Glacier transition for old archives
3. **Data retention compliance**
- Verify deletion of data past retention period
- Update policies as needed
---
## Troubleshooting
### Archive Job Fails
```bash
# Check logs
tail -f storage/logs/archive_*.log
# Run with verbose output
python scripts/archive_job.py --all --verbose
# Check database connectivity
psql $DATABASE_URL -c "SELECT COUNT(*) FROM archive_jobs;"
```
### S3 Upload Fails
```bash
# Verify AWS credentials
aws sts get-caller-identity
# Test S3 access
aws s3 ls s3://mockupaws-reports-archive/
# Check bucket policy
aws s3api get-bucket-policy --bucket mockupaws-reports-archive
```
### Query Performance Issues
```sql
-- Check if indexes exist on archive tables
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename LIKE '%_archive%';
-- Analyze archive tables
ANALYZE scenario_logs_archive;
ANALYZE scenario_metrics_archive;
-- Check partition pruning
EXPLAIN ANALYZE
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
AND received_at < '2025-02-01';
```
---
## References
- [PostgreSQL Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
- [AWS S3 Lifecycle Policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- [Database Migration](alembic/versions/b2c3d4e5f6a7_create_archive_tables_v1_0_0.py)
- [Archive Job Script](../scripts/archive_job.py)
---
*Document Version: 1.0.0*
*Last Updated: 2026-04-07*