mockupAWS/docs/runbooks/incident-response.md

# Incident Response Runbook

> **Version:** 1.0.0
> **Last Updated:** 2026-04-07
> **Owner:** DevOps Team

---

## Table of Contents

1. [Incident Severity Levels](#1-incident-severity-levels)
2. [Response Procedures](#2-response-procedures)
3. [Communication Templates](#3-communication-templates)
4. [Post-Incident Review](#4-post-incident-review)
5. [Common Incidents](#5-common-incidents)

---

## 1. Incident Severity Levels

### P1 - Critical (Service Down)

**Criteria:**
- Complete service unavailability
- Data loss or corruption
- Security breach
- >50% of users affected

**Response Time:** 15 minutes
**Resolution Target:** 2 hours

**Actions:**
1. Page on-call engineer immediately
2. Create incident channel/war room
3. Notify stakeholders within 15 minutes
4. Begin rollback if applicable
5. Post to status page

### P2 - High (Major Impact)

**Criteria:**
- Core functionality impaired
- 25-50% of users affected
- Workaround available
- Performance severely degraded

**Response Time:** 1 hour
**Resolution Target:** 8 hours

### P3 - Medium (Partial Impact)

**Criteria:**
- Non-critical features affected
- <25% of users affected
- Workaround available

**Response Time:** 4 hours
**Resolution Target:** 24 hours

### P4 - Low (Minimal Impact)

**Criteria:**
- General questions
- Feature requests
- Minor cosmetic issues

**Response Time:** 24 hours
**Resolution Target:** Best effort
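
The user-impact thresholds above can be turned into a small triage helper. This is a minimal sketch, not part of any existing tooling: the function name `classify_severity` and the `critical` flag are hypothetical, exactly 25% is assumed to count as P2, and 0% affected is assumed to map to P4.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: maps user impact to a severity level.
# Usage: classify_severity <percent_users_affected> [critical]
# Pass "critical" for an outage, data loss/corruption, or a security breach.
classify_severity() {
  local pct="$1" flag="${2:-}"
  if [[ ! "$pct" =~ ^[0-9]+$ ]] || (( 10#$pct > 100 )); then
    echo "usage: classify_severity <0-100> [critical]" >&2
    return 2
  fi
  pct=$((10#$pct))   # force base 10 so "08" is not parsed as octal
  if [[ "$flag" == "critical" ]] || (( pct > 50 )); then
    echo "P1"
  elif (( pct >= 25 )); then   # assumption: exactly 25% counts as P2
    echo "P2"
  elif (( pct > 0 )); then
    echo "P3"
  else                         # assumption: no users affected maps to P4
    echo "P4"
  fi
}
```

For example, `classify_severity 30` prints `P2`, while `classify_severity 5 critical` prints `P1` because a critical condition overrides the percentage.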

---

## 2. Response Procedures

### 2.1 Initial Response Checklist

```markdown
□ Acknowledge incident (within SLA)
□ Create incident ticket (PagerDuty/Opsgenie)
□ Join/create incident Slack channel
□ Identify severity level
□ Begin incident log
□ Notify stakeholders if P1/P2
```

### 2.2 Investigation Steps

```bash
# 1. Check service health
curl -f https://mockupaws.com/api/v1/health
curl -f https://api.mockupaws.com/api/v1/health

# 2. Check CloudWatch metrics for the last hour (date -d requires GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=mockupaws-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Average

# 3. Check ECS service status
aws ecs describe-services \
  --cluster mockupaws-production \
  --services backend

# 4. Check logs
aws logs tail /ecs/mockupaws-production --follow

# 5. Check database cluster status and members
aws rds describe-db-clusters \
  --db-cluster-identifier mockupaws-production
```
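
The `date -u -d '1 hour ago'` expression in step 2 works with GNU date only. If responders may run these commands from macOS, a helper along the following lines could compute the time window on both platforms. The name `utc_hours_ago` is hypothetical, and the BSD fallback assumes the stock macOS `date`.

```bash
#!/usr/bin/env bash
# Prints a UTC ISO 8601 timestamp N hours in the past (default: 1).
# Tries GNU date first, then falls back to BSD/macOS date.
utc_hours_ago() {
  local hours="${1:-1}"
  date -u -d "${hours} hours ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v "-${hours}H" +%Y-%m-%dT%H:%M:%SZ
}

# Example (not run here): CPU utilization for the last 3 hours
# aws cloudwatch get-metric-statistics \
#   --namespace AWS/ECS --metric-name CPUUtilization \
#   --dimensions Name=ClusterName,Value=mockupaws-production \
#   --start-time "$(utc_hours_ago 3)" --end-time "$(utc_hours_ago 0)" \
#   --period 300 --statistics Average
```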

### 2.3 Escalation Path

```
0-15 min:  On-call Engineer
15-30 min: Senior Engineer
30-60 min: Engineering Manager
60+ min:   VP Engineering / CTO
```
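
The escalation windows above can be encoded in a small lookup, for instance to label reminders in an incident channel. This is an illustrative sketch: the function name `escalation_tier` is hypothetical, and because the ranges in the table share their boundary minutes, each boundary is assumed to belong to the later tier.

```bash
#!/usr/bin/env bash
# Hypothetical helper: who should own the incident after N minutes unresolved.
# Assumption: a boundary minute belongs to the later tier (15 -> Senior Engineer).
escalation_tier() {
  local minutes="$1"
  if [[ ! "$minutes" =~ ^[0-9]+$ ]]; then
    echo "usage: escalation_tier <minutes_elapsed>" >&2
    return 2
  fi
  minutes=$((10#$minutes))   # force base 10 so "08" is not parsed as octal
  if   (( minutes < 15 )); then echo "On-call Engineer"
  elif (( minutes < 30 )); then echo "Senior Engineer"
  elif (( minutes < 60 )); then echo "Engineering Manager"
  else                          echo "VP Engineering / CTO"
  fi
}
```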

### 2.4 Resolution & Recovery

1. **Immediate Mitigation**
   - Enable circuit breakers
   - Scale up resources
   - Enable maintenance mode

2. **Root Cause Fix**
   - Deploy hotfix
   - Database recovery
   - Infrastructure changes

3. **Verification**
   - Run smoke tests
   - Monitor metrics
   - Confirm user impact resolved

4. **Closeout**
   - Update status page
   - Notify stakeholders
   - Schedule post-mortem
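
For the verification step, a single passing health check says little about stability. One possible approach is to require several consecutive successes before declaring recovery. The helper below is a sketch under that assumption; `wait_until_stable` and its parameters are hypothetical names, not existing tooling.

```bash
#!/usr/bin/env bash
# Succeeds only after <required> consecutive successful runs of a command.
# Gives up (non-zero exit) after <max_attempts> runs in total.
# Usage: wait_until_stable <required> <max_attempts> <interval_seconds> <command...>
wait_until_stable() {
  local required="$1" max_attempts="$2" interval="$3"
  shift 3
  local streak=0 attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if (( streak >= required )); then
        return 0
      fi
    else
      streak=0   # any failure resets the streak
    fi
    if (( attempt < max_attempts )); then
      sleep "$interval"
    fi
  done
  return 1
}

# Example (not run here): 5 consecutive healthy responses, checked every 10 s
# wait_until_stable 5 30 10 curl -fsS --max-time 5 https://mockupaws.com/api/v1/health
```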

---

## 3. Communication Templates

### 3.1 Internal Notification (P1)

```
Subject: [INCIDENT] P1 - mockupAWS Service Down

Incident ID: INC-YYYY-MM-DD-XXX
Severity: P1 - Critical
Started: YYYY-MM-DD HH:MM UTC
Impact: Complete service unavailability

Description:
[Detailed description of the issue]

Actions Taken:
- [ ] Initial investigation
- [ ] Rollback initiated
- [ ] [Other actions]

Next Update: +30 minutes
Incident Commander: [Name]
Slack: #incident-XXX
```
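
Incident IDs in the `INC-YYYY-MM-DD-XXX` format can be generated rather than typed by hand. The sketch below assumes that `XXX` is a zero-padded sequence number for the day, which this runbook does not state explicitly; the function name `incident_id` is hypothetical.

```bash
#!/usr/bin/env bash
# Builds an incident ID in the INC-YYYY-MM-DD-XXX format used in the templates.
# Assumption: XXX is a zero-padded per-day sequence number.
# Usage: incident_id <sequence_number> [YYYY-MM-DD]   (date defaults to today, UTC)
incident_id() {
  local seq="$1" day="${2:-$(date -u +%Y-%m-%d)}"
  if [[ ! "$seq" =~ ^[0-9]{1,3}$ ]] || [[ ! "$day" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
    echo "usage: incident_id <0-999> [YYYY-MM-DD]" >&2
    return 2
  fi
  # 10# forces base 10 so inputs like "008" are not parsed as octal
  printf 'INC-%s-%03d\n' "$day" "$((10#$seq))"
}
```

For example, `incident_id 7 2026-04-07` prints `INC-2026-04-07-007`.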

### 3.2 Customer Notification

```
Subject: Service Disruption - mockupAWS

We are currently investigating an issue affecting mockupAWS service availability.

Impact: Users may be unable to access the platform
Started: HH:MM UTC
Status: Investigating

We will provide updates every 30 minutes.

Track status: https://status.mockupaws.com

We apologize for any inconvenience.
```

### 3.3 Status Page Update

```markdown
**Investigating** - We are investigating reports of service unavailability.
Posted HH:MM UTC

**Update** - We have identified the root cause and are implementing a fix.
Posted HH:MM UTC

**Resolved** - Service has been fully restored. We will provide a post-mortem within 24 hours.
Posted HH:MM UTC
```

### 3.4 Post-Incident Communication

```
Subject: Post-Incident Review: INC-YYYY-MM-DD-XXX

Summary:
[One paragraph summary]

Timeline:
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Service restored

Root Cause:
[Detailed explanation]

Impact:
- Duration: X minutes
- Users affected: X%
- Data loss: None / X records

Lessons Learned:
1. [Lesson 1]
2. [Lesson 2]

Action Items:
1. [Owner] - [Action] - [Due Date]
2. [Owner] - [Action] - [Due Date]
```

---

## 4. Post-Incident Review

### 4.1 Post-Mortem Template

```markdown
# Post-Mortem: INC-YYYY-MM-DD-XXX

## Metadata
- **Incident ID:** INC-YYYY-MM-DD-XXX
- **Date:** YYYY-MM-DD
- **Severity:** P1/P2/P3
- **Duration:** XX minutes
- **Reporter:** [Name]
- **Reviewers:** [Names]

## Summary
[2-3 sentence summary]

## Timeline
| Time (UTC) | Event |
|-----------|-------|
| 00:00 | Issue detected by monitoring |
| 00:05 | On-call paged |
| 00:15 | Investigation started |
| 00:45 | Root cause identified |
| 01:00 | Fix deployed |
| 01:30 | Service confirmed stable |

## Root Cause Analysis
### What happened?
[Detailed description]

### Why did it happen?
[5 Whys analysis]

### How did we detect it?
[Monitoring/alert details]

## Impact Assessment
- **Users affected:** X%
- **Features affected:** [List]
- **Data impact:** [None/Description]
- **SLA impact:** [None/X minutes downtime]

## Response Assessment
### What went well?
1.
2.

### What could have gone better?
1.
2.

### What did we learn?
1.
2.

## Action Items
| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| 1 | | | High | |
| 2 | | | Medium | |
| 3 | | | Low | |

## Attachments
- [Logs]
- [Metrics]
- [Screenshots]
```
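
The **Duration** field in the template can be derived from the first and last timeline entries instead of being computed by hand. A minimal sketch follows; the function name `duration_minutes` is hypothetical, and it assumes both times are `HH:MM` in UTC and that the incident lasted less than 24 hours.

```bash
#!/usr/bin/env bash
# Minutes between two HH:MM (UTC) timestamps, e.g. for the Duration field.
# Assumption: if the end time is earlier than the start, the incident crossed
# midnight once; incidents longer than 24 hours are not handled.
duration_minutes() {
  local start="$1" end="$2"
  local re='^([01][0-9]|2[0-3]):([0-5][0-9])$'
  local s e
  [[ "$start" =~ $re ]] || { echo "bad start time: $start" >&2; return 2; }
  s=$(( 10#${BASH_REMATCH[1]} * 60 + 10#${BASH_REMATCH[2]} ))
  [[ "$end" =~ $re ]] || { echo "bad end time: $end" >&2; return 2; }
  e=$(( 10#${BASH_REMATCH[1]} * 60 + 10#${BASH_REMATCH[2]} ))
  if (( e < s )); then
    e=$(( e + 24 * 60 ))
  fi
  echo $(( e - s ))
}
```

With the sample timeline in the template, `duration_minutes 00:00 01:30` prints `90`.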

### 4.2 Review Meeting

**Attendees:**
- Incident Commander
- Engineers involved
- Engineering Manager
- Optional: Product Manager, Customer Success

**Agenda (30 minutes):**
1. Timeline review (5 min)
2. Root cause discussion (10 min)
3. Response assessment (5 min)
4. Action item assignment (5 min)
5. Lessons learned (5 min)

---

## 5. Common Incidents

### 5.1 Database Connection Pool Exhaustion

**Symptoms:**
- API timeouts
- "too many connections" errors
- Latency spikes

**Diagnosis:**
```bash
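# List the cluster's member instances
aws rds describe-db-clusters \
  --db-cluster-identifier mockupaws-production \
  --query 'DBClusters[0].DBClusterMembers[*].DBInstanceIdentifier'

# Check connection counts over the last hour (date -d requires GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBClusterIdentifier,Value=mockupaws-production \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average Maximum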
```

**Resolution:**
1. Scale ECS tasks down temporarily
2. Kill idle connections
3. Increase max_connections
4. Implement connection pooling
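
Step 1 can be sketched as a thin wrapper around `aws ecs update-service`. Because scaling production is disruptive, this example defaults to a dry run that only prints the command. The `run` and `scale_service` helpers and the `DRY_RUN` variable are hypothetical; the cluster and service names are the ones used in section 2.2.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: prints the command unless DRY_RUN=0 is set explicitly.
run() {
  if [[ "${DRY_RUN:-1}" == "0" ]]; then
    "$@"
  else
    echo "$*"
  fi
}

# Sets the desired task count for an ECS service in the production cluster.
# Usage: scale_service <service> <desired_count>
scale_service() {
  local service="$1" count="$2"
  if [[ ! "$count" =~ ^[0-9]+$ ]]; then
    echo "usage: scale_service <service> <desired_count>" >&2
    return 2
  fi
  run aws ecs update-service \
    --cluster mockupaws-production \
    --service "$service" \
    --desired-count "$count"
}

# Example: preview scaling backend down to 2 tasks, then apply it
# scale_service backend 2
# DRY_RUN=0 scale_service backend 2
```

Record the original desired count in the incident log before scaling down so it can be restored once connections recover.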

### 5.2 High Memory Usage

**Symptoms:**
- OOM kills
- Container restarts
- Performance degradation

**Diagnosis:**
```bash
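# Check container metrics (last hour; date -d requires GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=mockupaws-production \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average Maximum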
```

**Resolution:**
1. Identify memory leak (heap dump)
2. Restart affected tasks
3. Increase memory limits
4. Deploy fix
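
To decide quickly whether memory pressure is the likely cause, the datapoints returned by CloudWatch can be compared against a threshold. The filter below is a sketch: `any_above` is a hypothetical name, and any threshold passed to it (85 in the note that follows) is an assumption rather than a value defined in this runbook.

```bash
#!/usr/bin/env bash
# Reads whitespace-separated numbers on stdin; exits 0 if any exceeds the threshold.
# Usage: <numbers> | any_above <threshold>
any_above() {
  awk -v limit="$1" '
    { for (i = 1; i <= NF; i++) if ($i + 0 > limit + 0) found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

Running the `MemoryUtilization` query with `--statistics Maximum --query 'Datapoints[].Maximum' --output text` and piping the result into `any_above 85` would succeed when any five-minute peak in the window exceeded 85%.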

### 5.3 Redis Connection Issues

**Symptoms:**
- Cache misses increasing
- API latency spikes
- Connection errors

**Resolution:**
1. Check ElastiCache status
2. Verify security group rules
3. Restart Redis if needed
4. Implement circuit breaker
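
When checking security group rules, a plain TCP probe run from a host inside the VPC helps separate network problems from Redis problems. The sketch below relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command; `tcp_check` is a hypothetical helper, and the Redis endpoint must be supplied by the operator because this runbook does not list one.

```bash
#!/usr/bin/env bash
# Returns 0 if a TCP connection to host:port opens within the timeout.
# Uses bash's /dev/tcp redirection and the coreutils `timeout` command.
# Usage: tcp_check <host> <port> [timeout_seconds]
tcp_check() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (not run here; REDIS_HOST is a placeholder for the real endpoint):
# tcp_check "$REDIS_HOST" 6379 && echo "port reachable" || echo "blocked or down"
```

A timeout usually points to security groups or routing, whereas an immediate refusal suggests that nothing is listening on the port.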

### 5.4 SSL Certificate Expiry

**Symptoms:**
- HTTPS errors
- Certificate warnings

**Prevention:**
- Set alert 30 days before expiry
- Use ACM with auto-renewal

**Resolution:**
1. Renew certificate
2. Update ALB/CloudFront
3. Verify SSL Labs rating
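
The 30-day alert in the prevention list can be approximated with a scheduled check that computes the days remaining from a certificate's `notAfter` date. This is a minimal sketch that requires GNU `date`; `days_until_expiry` is a hypothetical helper, and the commented `openssl` pipeline shows one common way to read the date from a live endpoint.

```bash
#!/usr/bin/env bash
# Whole days from <now_epoch> (default: current time) until an OpenSSL-style
# notAfter date such as "Jun  1 12:00:00 2026 GMT". Requires GNU date.
# Usage: days_until_expiry "<notAfter>" [now_epoch]
days_until_expiry() {
  local not_after="$1" now="${2:-$(date -u +%s)}" expiry
  expiry=$(date -u -d "$not_after" +%s) || return 2
  echo $(( (expiry - now) / 86400 ))
}

# Example (not run here): read the live certificate's expiry date
# end=$(openssl s_client -connect mockupaws.com:443 -servername mockupaws.com \
#         </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
# (( $(days_until_expiry "$end") < 30 )) && echo "certificate expires in under 30 days"
```

The result is truncated toward zero, so a certificate with less than one day left reports `0`.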

---

## Quick Reference

| Resource | URL/Command |
|----------|-------------|
| Status Page | https://status.mockupaws.com |
| PagerDuty | https://mockupaws.pagerduty.com |
| CloudWatch | AWS Console > CloudWatch |
| ECS Console | AWS Console > ECS |
| RDS Console | AWS Console > RDS |
| Logs | `aws logs tail /ecs/mockupaws-production --follow` |
| Emergency Hotline | +1-555-MOCKUP |

---

*This runbook should be reviewed quarterly and updated after each significant incident.*