# Incident Response Runbook

> **Version:** 1.0.0
> **Last Updated:** 2026-04-07
> **Owner:** DevOps Team

---

## Table of Contents

1. [Incident Severity Levels](#1-incident-severity-levels)
2. [Response Procedures](#2-response-procedures)
3. [Communication Templates](#3-communication-templates)
4. [Post-Incident Review](#4-post-incident-review)
5. [Common Incidents](#5-common-incidents)

---

## 1. Incident Severity Levels

### P1 - Critical (Service Down)

**Criteria:**
- Complete service unavailability
- Data loss or corruption
- Security breach
- >50% of users affected

**Response Time:** 15 minutes
**Resolution Target:** 2 hours

**Actions:**
1. Page on-call engineer immediately
2. Create incident channel/war room
3. Notify stakeholders within 15 minutes
4. Begin rollback if applicable
5. Post to status page

### P2 - High (Major Impact)

**Criteria:**
- Core functionality impaired
- >25% of users affected
- Workaround available
- Performance severely degraded

**Response Time:** 1 hour
**Resolution Target:** 8 hours

### P3 - Medium (Partial Impact)

**Criteria:**
- Non-critical features affected
- <25% of users affected
- Workaround available

**Response Time:** 4 hours
**Resolution Target:** 24 hours

### P4 - Low (Minimal Impact)

**Criteria:**
- General questions
- Feature requests
- Minor cosmetic issues

**Response Time:** 24 hours
**Resolution Target:** Best effort

---

## 2. Response Procedures

### 2.1 Initial Response Checklist

```markdown
□ Acknowledge incident (within SLA)
□ Create incident ticket (PagerDuty/Opsgenie)
□ Join/create incident Slack channel
□ Identify severity level
□ Begin incident log
□ Notify stakeholders if P1/P2
```

### 2.2 Investigation Steps

```bash
# 1. Check service health
curl -f https://mockupaws.com/api/v1/health
curl -f https://api.mockupaws.com/api/v1/health
```
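If the health probe fails, it is worth ruling out a transient network blip before escalating. The wrapper below is illustrative only — the attempt count and two-second pause are arbitrary defaults, not team policy:

```shell
# retry_n: run a command up to N times, pausing briefly between
# attempts, so a single flaky probe is not mistaken for an outage.
# Illustrative helper, not part of the standard on-call tooling.
retry_n() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                       # command succeeded
    fi
    [ "$i" -lt "$attempts" ] && sleep 2   # pause before retrying
    i=$((i + 1))
  done
  return 1                           # every attempt failed
}

# Example: probe the health endpoint up to 3 times
# retry_n 3 curl -fsS https://api.mockupaws.com/api/v1/health
```

If `retry_n` still fails after several attempts, treat the endpoint as genuinely down and continue with the steps below.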
```bash
# 2. Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=mockupaws-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Average

# 3. Check ECS service status
aws ecs describe-services \
  --cluster mockupaws-production \
  --services backend

# 4. Check logs
aws logs tail /ecs/mockupaws-production --follow

# 5. Check database connections
aws rds describe-db-clusters \
  --db-cluster-identifier mockupaws-production
```

### 2.3 Escalation Path

```
0-15 min:  On-call Engineer
15-30 min: Senior Engineer
30-60 min: Engineering Manager
60+ min:   VP Engineering / CTO
```

### 2.4 Resolution & Recovery

1. **Immediate Mitigation**
   - Enable circuit breakers
   - Scale up resources
   - Enable maintenance mode
2. **Root Cause Fix**
   - Deploy hotfix
   - Database recovery
   - Infrastructure changes
3. **Verification**
   - Run smoke tests
   - Monitor metrics
   - Confirm user impact resolved
4. **Closeout**
   - Update status page
   - Notify stakeholders
   - Schedule post-mortem

---

## 3. Communication Templates

### 3.1 Internal Notification (P1)

```
Subject: [INCIDENT] P1 - mockupAWS Service Down

Incident ID: INC-YYYY-MM-DD-XXX
Severity: P1 - Critical
Started: YYYY-MM-DD HH:MM UTC
Impact: Complete service unavailability

Description:
[Detailed description of the issue]

Actions Taken:
- [ ] Initial investigation
- [ ] Rollback initiated
- [ ] [Other actions]

Next Update: +30 minutes
Incident Commander: [Name]
Slack: #incident-XXX
```

### 3.2 Customer Notification

```
Subject: Service Disruption - mockupAWS

We are currently investigating an issue affecting mockupAWS service availability.

Impact: Users may be unable to access the platform
Started: HH:MM UTC
Status: Investigating

We will provide updates every 30 minutes.
Track status: https://status.mockupaws.com

We apologize for any inconvenience.
```
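The escalation path in section 2.3 can also be encoded as a quick lookup, so the incident commander doesn't have to consult the table mid-incident. This is a sketch; the role names simply mirror that table:

```shell
# escalate_to: given minutes elapsed since incident start, print
# the role to page next, per the escalation path in section 2.3.
escalate_to() {
  mins=$1
  if [ "$mins" -lt 15 ]; then
    echo "On-call Engineer"
  elif [ "$mins" -lt 30 ]; then
    echo "Senior Engineer"
  elif [ "$mins" -lt 60 ]; then
    echo "Engineering Manager"
  else
    echo "VP Engineering / CTO"
  fi
}

# Example: 45 minutes in, page the Engineering Manager
# escalate_to 45
```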
### 3.3 Status Page Update

```markdown
**Investigating** - We are investigating reports of service unavailability.
Posted HH:MM UTC

**Update** - We have identified the root cause and are implementing a fix.
Posted HH:MM UTC

**Resolved** - Service has been fully restored. We will provide a post-mortem within 24 hours.
Posted HH:MM UTC
```

### 3.4 Post-Incident Communication

```
Subject: Post-Incident Review: INC-YYYY-MM-DD-XXX

Summary:
[One paragraph summary]

Timeline:
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Service restored

Root Cause:
[Detailed explanation]

Impact:
- Duration: X minutes
- Users affected: X%
- Data loss: None / X records

Lessons Learned:
1. [Lesson 1]
2. [Lesson 2]

Action Items:
1. [Owner] - [Action] - [Due Date]
2. [Owner] - [Action] - [Due Date]
```

---

## 4. Post-Incident Review

### 4.1 Post-Mortem Template

```markdown
# Post-Mortem: INC-YYYY-MM-DD-XXX

## Metadata
- **Incident ID:** INC-YYYY-MM-DD-XXX
- **Date:** YYYY-MM-DD
- **Severity:** P1/P2/P3
- **Duration:** XX minutes
- **Reporter:** [Name]
- **Reviewers:** [Names]

## Summary
[2-3 sentence summary]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| 00:00 | Issue detected by monitoring |
| 00:05 | On-call paged |
| 00:15 | Investigation started |
| 00:45 | Root cause identified |
| 01:00 | Fix deployed |
| 01:30 | Service confirmed stable |

## Root Cause Analysis

### What happened?
[Detailed description]

### Why did it happen?
[5 Whys analysis]

### How did we detect it?
[Monitoring/alert details]

## Impact Assessment
- **Users affected:** X%
- **Features affected:** [List]
- **Data impact:** [None/Description]
- **SLA impact:** [None/X minutes downtime]

## Response Assessment

### What went well?
1.
2.

### What could have gone better?
1.
2.

### What did we learn?
1.
2.
## Action Items
| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| 1  |        |       | High     |          |
| 2  |        |       | Medium   |          |
| 3  |        |       | Low      |          |

## Attachments
- [Logs]
- [Metrics]
- [Screenshots]
```

### 4.2 Review Meeting

**Attendees:**
- Incident Commander
- Engineers involved
- Engineering Manager
- Optional: Product Manager, Customer Success

**Agenda (30 minutes):**
1. Timeline review (5 min)
2. Root cause discussion (10 min)
3. Response assessment (5 min)
4. Action item assignment (5 min)
5. Lessons learned (5 min)

---

## 5. Common Incidents

### 5.1 Database Connection Pool Exhaustion

**Symptoms:**
- API timeouts
- "too many connections" errors
- Latency spikes

**Diagnosis:**
```bash
# Check active connections
aws rds describe-db-clusters \
  --query 'DBClusters[0].DBClusterMembers[*].DBInstanceIdentifier'

# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections
```

**Resolution:**
1. Scale ECS tasks down temporarily
2. Kill idle connections
3. Increase max_connections
4. Implement connection pooling

### 5.2 High Memory Usage

**Symptoms:**
- OOM kills
- Container restarts
- Performance degradation

**Diagnosis:**
```bash
# Check container metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization
```

**Resolution:**
1. Identify memory leak (heap dump)
2. Restart affected tasks
3. Increase memory limits
4. Deploy fix

### 5.3 Redis Connection Issues

**Symptoms:**
- Cache misses increasing
- API latency spikes
- Connection errors

**Resolution:**
1. Check ElastiCache status
2. Verify security group rules
3. Restart Redis if needed
4. Implement circuit breaker

### 5.4 SSL Certificate Expiry

**Symptoms:**
- HTTPS errors
- Certificate warnings

**Prevention:**
- Set alert 30 days before expiry
- Use ACM with auto-renewal

**Resolution:**
1. Renew certificate
2. Update ALB/CloudFront
3. Verify SSL Labs rating
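The 30-day expiry threshold in section 5.4 can be checked by hand during an incident. The helper below does only the date arithmetic; it assumes GNU `date -d`, which the investigation commands in section 2.2 already rely on, and the `openssl` commands for fetching the live certificate are left as comments because they need network access:

```shell
# days_until_expiry: given an expiry timestamp in the format printed by
# `openssl x509 -noout -enddate` (e.g. "Dec 31 23:59:59 2030 GMT"),
# print the whole number of days until it lapses (negative if past).
# Assumes GNU date (-d), as used elsewhere in this runbook.
days_until_expiry() {
  expiry_epoch=$(date -u -d "$1" +%s)
  now_epoch=$(date -u +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# To fetch the live certificate's expiry (requires network):
#   openssl s_client -connect mockupaws.com:443 -servername mockupaws.com \
#     </dev/null 2>/dev/null | openssl x509 -noout -enddate
# This prints a line like "notAfter=<expiry timestamp>"; strip the
# "notAfter=" prefix and pass the remainder to days_until_expiry.
```

If the result is below 30, the prevention threshold in section 5.4 has already been breached and renewal should start immediately.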
---

## Quick Reference

| Resource | URL/Command |
|----------|-------------|
| Status Page | https://status.mockupaws.com |
| PagerDuty | https://mockupaws.pagerduty.com |
| CloudWatch | AWS Console > CloudWatch |
| ECS Console | AWS Console > ECS |
| RDS Console | AWS Console > RDS |
| Logs | `aws logs tail /ecs/mockupaws-production --follow` |
| Emergency Hotline | +1-555-MOCKUP |

---

*This runbook should be reviewed quarterly and updated after each significant incident.*