mockupAWS/docs/DATA-ARCHIVING.md
Luca Sacchi Ricciardi 38fd6cb562
# Data Archiving Strategy
## mockupAWS v1.0.0 - Data Lifecycle Management
---
## Table of Contents
1. [Overview](#overview)
2. [Archive Policies](#archive-policies)
3. [Implementation](#implementation)
4. [Archive Job](#archive-job)
5. [Querying Archived Data](#querying-archived-data)
6. [Monitoring](#monitoring)
7. [Storage Estimation](#storage-estimation)
---
## Overview
As mockupAWS accumulates data over time, we implement an automated archiving strategy to:
- **Reduce storage costs** by moving old data to archive tables
- **Improve query performance** on active data
- **Maintain data accessibility** through unified views
- **Comply with data retention policies**
### Archive Strategy Overview
```
┌──────────────────────────────────────────────────────────────┐
│                       Data Lifecycle                         │
├──────────────────────────────────────────────────────────────┤
│  Active Data (Hot)         │  Archive Data (Cold)            │
│  ─────────────────         │  ──────────────────             │
│  • Fast queries            │  • Partitioned by month         │
│  • Full indexing           │  • Compressed                   │
│  • Real-time writes        │  • S3 for large files           │
│                            │                                 │
│  scenario_logs             │  → scenario_logs_archive        │
│  (> 1 year old)            │    (partitioned by month)       │
│                            │                                 │
│  scenario_metrics          │  → scenario_metrics_archive     │
│  (> 2 years old)           │    (aggregated to daily)        │
│                            │                                 │
│  reports                   │  → reports_archive              │
│  (> 6 months old)          │    (metadata + S3 storage)      │
└──────────────────────────────────────────────────────────────┘
```
---
## Archive Policies
### Policy Configuration
| Table | Archive After | Aggregation | Compression | S3 Storage |
|-------|--------------|-------------|-------------|------------|
| `scenario_logs` | 365 days | No | No | No |
| `scenario_metrics` | 730 days | Daily | No | No |
| `reports` | 180 days | No | Yes | Yes |
### Detailed Policies
#### 1. Scenario Logs Archive (> 1 year)
**Criteria:**
- Records older than 365 days
- Move to `scenario_logs_archive` table
- Partitioned by month for efficient querying
**Retention:**
- Archive table: 7 years
- After 7 years: Delete or move to long-term storage
#### 2. Scenario Metrics Archive (> 2 years)
**Criteria:**
- Records older than 730 days
- Aggregate to daily values before archiving
- Store aggregated data in `scenario_metrics_archive`
**Aggregation:**
- Group by: scenario_id, metric_type, metric_name, day
- Aggregate: AVG(value), COUNT(samples)
**Retention:**
- Archive table: 5 years
- Aggregated data only (original samples deleted)
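The daily roll-up described above can be sketched in plain Python (a minimal sketch; the row shape mirrors the `scenario_metrics` columns and the helper name is illustrative, not the project's actual code):

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def aggregate_daily(rows):
    """Roll raw metric samples up to one row per (scenario, type, name, day).

    Each input row is a dict with keys: scenario_id, metric_type,
    metric_name, timestamp (datetime), value (float).  Output carries
    AVG(value) and COUNT(samples), matching the aggregation policy.
    """
    buckets = defaultdict(list)
    for r in rows:
        key = (r["scenario_id"], r["metric_type"], r["metric_name"],
               r["timestamp"].date())
        buckets[key].append(r["value"])
    return [
        {
            "scenario_id": sid,
            "metric_type": mtype,
            "metric_name": mname,
            "day": day,
            "value": mean(values),        # AVG(value)
            "sample_count": len(values),  # COUNT(samples)
        }
        for (sid, mtype, mname, day), values in buckets.items()
    ]
```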
#### 3. Reports Archive (> 6 months)
**Criteria:**
- Reports older than 180 days
- Compress PDF/CSV files
- Upload to S3
- Keep metadata in `reports_archive` table
**Retention:**
- S3 storage: 3 years with lifecycle to Glacier
- Metadata: 5 years
---
## Implementation
### Database Schema
#### Archive Tables
```sql
-- Scenario logs archive (partitioned by month).
-- On a partitioned table the primary key must include the partition
-- key, so (id, received_at) is used; monthly partitions are declared
-- with FOR VALUES FROM/TO bounds on received_at.
CREATE TABLE scenario_logs_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    received_at TIMESTAMPTZ NOT NULL,
    message_hash VARCHAR(64) NOT NULL,
    message_preview VARCHAR(500),
    source VARCHAR(100) NOT NULL,
    size_bytes INTEGER NOT NULL,
    has_pii BOOLEAN NOT NULL,
    token_count INTEGER NOT NULL,
    sqs_blocks INTEGER NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    PRIMARY KEY (id, received_at)
) PARTITION BY RANGE (received_at);

-- Scenario metrics archive (with aggregation support)
CREATE TABLE scenario_metrics_archive (
    id UUID NOT NULL,
    scenario_id UUID NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    value DECIMAL(15,6) NOT NULL,
    unit VARCHAR(20) NOT NULL,
    extra_data JSONB DEFAULT '{}',
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    archive_batch_id UUID,
    is_aggregated BOOLEAN DEFAULT FALSE,
    aggregation_period VARCHAR(20),
    sample_count INTEGER,
    PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Reports archive (S3 references)
CREATE TABLE reports_archive (
    id UUID PRIMARY KEY,
    scenario_id UUID NOT NULL,
    format VARCHAR(10) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    file_size_bytes INTEGER,
    generated_by VARCHAR(100),
    extra_data JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL,
    archived_at TIMESTAMPTZ DEFAULT NOW(),
    s3_location VARCHAR(500),
    deleted_locally BOOLEAN DEFAULT FALSE,
    archive_batch_id UUID
);
```
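Monthly partitions have to be created explicitly with `FOR VALUES` bounds. A small helper like the following can emit that DDL (an illustrative sketch, not the project's actual migration code):

```python
from datetime import date

def month_partitions(table: str, start: date, months: int):
    """Yield CREATE TABLE ... PARTITION OF statements for consecutive
    calendar months, using half-open [first-of-month, first-of-next) bounds."""
    y, m = start.year, start.month
    for _ in range(months):
        lo = date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        hi = date(y, m, 1)
        yield (
            f"CREATE TABLE {table}_{lo:%Y_%m} PARTITION OF {table} "
            f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
```

Generating a year of partitions ahead of time (e.g. from the nightly job) avoids insert failures when a batch crosses a month boundary.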
#### Unified Views (Query Transparency)
```sql
-- View combining live and archived logs
CREATE VIEW v_scenario_logs_all AS
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    NULL::timestamptz AS archived_at,
    false AS is_archived
FROM scenario_logs
UNION ALL
SELECT
    id, scenario_id, received_at, message_hash, message_preview,
    source, size_bytes, has_pii, token_count, sqs_blocks,
    archived_at,
    true AS is_archived
FROM scenario_logs_archive;

-- View combining live and archived metrics
CREATE VIEW v_scenario_metrics_all AS
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    NULL::timestamptz AS archived_at,
    false AS is_aggregated,
    false AS is_archived
FROM scenario_metrics
UNION ALL
SELECT
    id, scenario_id, timestamp, metric_type, metric_name,
    value, unit, extra_data,
    archived_at,
    is_aggregated,
    true AS is_archived
FROM scenario_metrics_archive;
```
### Archive Job Tracking
```sql
-- Archive jobs table
CREATE TABLE archive_jobs (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    job_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    records_processed INTEGER DEFAULT 0,
    records_archived INTEGER DEFAULT 0,
    records_deleted INTEGER DEFAULT 0,
    bytes_archived BIGINT DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Archive statistics view
CREATE VIEW v_archive_statistics AS
SELECT
    'logs' AS archive_type,
    COUNT(*) AS total_records,
    MIN(received_at) AS oldest_record,
    MAX(received_at) AS newest_record,
    SUM(size_bytes) AS total_bytes
FROM scenario_logs_archive
UNION ALL
SELECT
    'metrics' AS archive_type,
    COUNT(*) AS total_records,
    MIN(timestamp) AS oldest_record,
    MAX(timestamp) AS newest_record,
    0 AS total_bytes
FROM scenario_metrics_archive
UNION ALL
SELECT
    'reports' AS archive_type,
    COUNT(*) AS total_records,
    MIN(created_at) AS oldest_record,
    MAX(created_at) AS newest_record,
    SUM(file_size_bytes) AS total_bytes
FROM reports_archive;
```
---
## Archive Job
### Running the Archive Job
```bash
# Preview what would be archived (dry run)
python scripts/archive_job.py --dry-run --all
# Archive all eligible data
python scripts/archive_job.py --all
# Archive specific types only
python scripts/archive_job.py --logs
python scripts/archive_job.py --metrics
python scripts/archive_job.py --reports
# Combine options
python scripts/archive_job.py --logs --metrics --dry-run
```
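The flag behavior above could be handled with a standard `argparse` skeleton (an illustrative sketch of the CLI surface, not the actual contents of `scripts/archive_job.py`):

```python
import argparse

def parse_args(argv=None):
    """Parse archive-job flags; --all implies every data type."""
    p = argparse.ArgumentParser(description="Archive eligible data")
    p.add_argument("--all", action="store_true",
                   help="archive logs, metrics and reports")
    p.add_argument("--logs", action="store_true")
    p.add_argument("--metrics", action="store_true")
    p.add_argument("--reports", action="store_true")
    p.add_argument("--dry-run", action="store_true",
                   help="preview what would be archived, write nothing")
    args = p.parse_args(argv)
    if args.all:
        args.logs = args.metrics = args.reports = True
    return args
```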
### Cron Configuration
```bash
# Run archive job nightly at 3:00 AM
0 3 * * * /opt/mockupaws/.venv/bin/python /opt/mockupaws/scripts/archive_job.py --all >> /var/log/mockupaws/archive.log 2>&1
```
### Environment Variables
```bash
# Required
export DATABASE_URL="postgresql+asyncpg://user:pass@host:5432/mockupaws"
# For reports S3 archiving
export REPORTS_ARCHIVE_BUCKET="mockupaws-reports-archive"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-east-1"
```
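Since the S3 variables are only needed when reports are being archived, a fail-fast check at job startup can report exactly what is missing (a sketch; the helper name is illustrative):

```python
import os

REQUIRED = ["DATABASE_URL"]
S3_REQUIRED = ["REPORTS_ARCHIVE_BUCKET", "AWS_ACCESS_KEY_ID",
               "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]

def missing_env(env=None, archive_reports=False):
    """Return the names of unset variables; S3 vars only matter for reports."""
    env = os.environ if env is None else env
    needed = REQUIRED + (S3_REQUIRED if archive_reports else [])
    return [name for name in needed if not env.get(name)]
```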
---
## Querying Archived Data
### Transparent Access
Use the unified views for automatic access to both live and archived data:
```sql
-- Query all logs (live + archived)
SELECT * FROM v_scenario_logs_all
WHERE scenario_id = 'uuid-here'
ORDER BY received_at DESC
LIMIT 1000;
-- Query all metrics (live + archived)
SELECT * FROM v_scenario_metrics_all
WHERE scenario_id = 'uuid-here'
AND timestamp > NOW() - INTERVAL '2 years'
ORDER BY timestamp;
```
### Optimized Queries
```sql
-- Query only live data (faster)
SELECT * FROM scenario_logs
WHERE scenario_id = 'uuid-here'
ORDER BY received_at DESC;
-- Query only archived data
SELECT * FROM scenario_logs_archive
WHERE scenario_id = 'uuid-here'
AND received_at < NOW() - INTERVAL '1 year'
ORDER BY received_at DESC;
-- Query specific month partition (most efficient)
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
AND received_at < '2025-02-01'
AND scenario_id = 'uuid-here';
```
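The month-bounded query above relies on half-open date ranges that line up with the partition bounds; computing them in application code keeps the planner able to prune to a single partition (helper name illustrative):

```python
from datetime import date

def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Half-open [first day of month, first day of next month) range,
    matching how the monthly archive partitions are bounded."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end
```

Pass the two dates as bind parameters on `received_at >= :start AND received_at < :end` so only the matching partition is scanned.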
### Application Code Example
```python
from uuid import UUID

from sqlalchemy import select, text
from sqlalchemy.ext.asyncio import AsyncSession

from src.models.scenario_log import ScenarioLog


async def get_logs(db: AsyncSession, scenario_id: UUID, include_archived: bool = False):
    """Get scenario logs with optional archive inclusion."""
    if include_archived:
        # Use the unified view for the complete history (live + archived)
        result = await db.execute(
            text("""
                SELECT * FROM v_scenario_logs_all
                WHERE scenario_id = :sid
                ORDER BY received_at DESC
            """),
            {"sid": scenario_id},
        )
        # text() rows are not ORM objects; return them as mappings
        return result.mappings().all()
    # Query only live data (faster)
    result = await db.execute(
        select(ScenarioLog)
        .where(ScenarioLog.scenario_id == scenario_id)
        .order_by(ScenarioLog.received_at.desc())
    )
    return result.scalars().all()
```
---
## Monitoring
### Archive Job Status
```sql
-- Check recent archive jobs
SELECT
    job_type,
    status,
    started_at,
    completed_at,
    records_archived,
    records_deleted,
    pg_size_pretty(bytes_archived) AS space_saved
FROM archive_jobs
ORDER BY started_at DESC
LIMIT 10;

-- Check for failed jobs
SELECT * FROM archive_jobs
WHERE status = 'failed'
ORDER BY started_at DESC;
```
### Archive Statistics
```sql
-- View archive statistics
SELECT * FROM v_archive_statistics;
-- Archive growth over time, from the job log
-- (v_archive_statistics holds only per-type totals, not per-record dates)
SELECT
    DATE_TRUNC('month', started_at) AS archive_month,
    job_type,
    SUM(records_archived) AS records_archived,
    pg_size_pretty(SUM(bytes_archived)) AS bytes_archived
FROM archive_jobs
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', started_at), job_type
ORDER BY archive_month DESC;
```
### Alerts
```yaml
# archive-alerts.yml
groups:
  - name: archive_alerts
    rules:
      - alert: ArchiveJobFailed
        expr: increase(archive_job_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data archive job failed"

      - alert: ArchiveJobNotRunning
        expr: time() - max(archive_job_last_success_timestamp) > 90000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Archive job has not run in 25 hours"

      - alert: ArchiveStorageGrowing
        expr: rate(archive_bytes_total[1d]) > 1073741824  # 1 GiB/day
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Archive storage growing rapidly"
```
---
## Storage Estimation
### Projected Storage Savings
Assuming typical usage patterns:
| Data Type | Daily Volume | Annual Volume | After Archive | Savings |
|-----------|--------------|---------------|---------------|---------|
| Logs | 1M records/day | 365M records | 0 in main table (365M archived) | 100% hot-table reduction |
| Metrics | 500K records/day | ~182M records | ~60M aggregated rows | ~66% row reduction |
| Reports | 100/day (50 MB each) | ~1.8 TB | metadata only in PostgreSQL; files in S3 | ~100% local reduction |
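The annual figures follow from simple arithmetic on the daily rates (a sanity check; the ~60M aggregated-row count is the table's own assumption):

```python
# Annual volumes implied by the daily rates in the table (365-day year).
logs_annual = 1_000_000 * 365             # 365M log records
metrics_annual = 500_000 * 365            # ~182M metric samples
reports_annual_tb = 100 * 50 * 365 / 1e6  # MB per year -> TB, ~1.8 TB

# Row reduction from daily aggregation down to ~60M archived rows.
metrics_reduction = 1 - 60_000_000 / metrics_annual  # ~0.67
```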
### Cost Analysis (Monthly)
| Storage Type | Before Archive | After Archive | Monthly Savings |
|--------------|----------------|---------------|-----------------|
| PostgreSQL (hot) | $200 | $50 | $150 |
| PostgreSQL (archive) | $0 | $30 | -$30 |
| S3 Standard | $0 | $20 | -$20 |
| S3 Glacier | $0 | $5 | -$5 |
| **Total** | **$200** | **$105** | **$95** |
*Estimates based on AWS us-east-1 pricing, actual costs may vary.*
---
## Maintenance
### Monthly Tasks
1. **Review archive statistics**
```sql
SELECT * FROM v_archive_statistics;
```
2. **Check for old archive partitions**
```sql
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE tablename LIKE 'scenario_logs_archive_%'
ORDER BY tablename;
```
3. **Clean up old S3 files** (after retention period)
```bash
aws s3 rm s3://mockupaws-reports-archive/archived-reports/ \
--recursive \
--exclude '*' \
--include '*2023*'
```
### Quarterly Tasks
1. **Archive job performance review**
- Check execution times
- Optimize batch sizes if needed
2. **Storage cost review**
- Verify S3 lifecycle policies
- Consider Glacier transition for old archives
3. **Data retention compliance**
- Verify deletion of data past retention period
- Update policies as needed
---
## Troubleshooting
### Archive Job Fails
```bash
# Check logs
tail -f storage/logs/archive_*.log
# Run with verbose output
python scripts/archive_job.py --all --verbose
# Check database connectivity
psql $DATABASE_URL -c "SELECT COUNT(*) FROM archive_jobs;"
```
### S3 Upload Fails
```bash
# Verify AWS credentials
aws sts get-caller-identity
# Test S3 access
aws s3 ls s3://mockupaws-reports-archive/
# Check bucket policy
aws s3api get-bucket-policy --bucket mockupaws-reports-archive
```
### Query Performance Issues
```sql
-- Check if indexes exist on archive tables
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename LIKE '%_archive%';
-- Analyze archive tables
ANALYZE scenario_logs_archive;
ANALYZE scenario_metrics_archive;
-- Check partition pruning
EXPLAIN ANALYZE
SELECT * FROM scenario_logs_archive
WHERE received_at >= '2025-01-01'
AND received_at < '2025-02-01';
```
---
## References
- [PostgreSQL Table Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
- [AWS S3 Lifecycle Policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- [Database Migration](alembic/versions/b2c3d4e5f6a7_create_archive_tables_v1_0_0.py)
- [Archive Job Script](../scripts/archive_job.py)
---
*Document Version: 1.0.0*
*Last Updated: 2026-04-07*