LogWhispererAI/docs/specs/bash_ingestion_secure.md

# Technical Specification - Secure Bash Log Ingestion (Sprint 2)

**Status:** 🟡 In Review
**Sprint:** 2
**Priority:** 🔴 Critical - Security Fix
**Author:** @tech-lead
**Date:** 2026-04-02
**Security Review:** Required before implementation

---

## 1. Overview

Riscrittura dello script di log ingestion con focus sulla sicurezza, risolvendo le vulnerabilità HIGH identificate nella Sprint 1 Review. Lo script deve essere resistente a Command Injection, JSON Injection, e Path Traversal.

### 1.1 Vulnerabilità Addressate (da Sprint 1 Review)

| Vulnerabilità | Severità | Stato Sprint 1 | Mitigazione Sprint 2 |
|---------------|----------|----------------|---------------------|
| JSON Injection via Log Content | 🔴 HIGH | Incomplete escaping | jq-based JSON encoding |
| Path Traversal via LOG_SOURCES | 🔴 HIGH | Weak validation | Whitelist /var/log only |
| Command Injection | 🔴 HIGH | Implicit risk | Array-based commands, no eval |
| Race Condition offset files | 🟡 MEDIUM | No atomicity | Atomic write (tmp + mv) |
| Information Disclosure | 🟡 MEDIUM | Full values logged | Masked sensitive data |
| No Webhook Authentication | 🔴 HIGH | None | HMAC-SHA256 signature |

---

## 2. Architecture

### 2.1 Modular Structure

```
secure_logwhisperer.sh
│
├── Configuration & Validation
│   ├── load_config()              # Load with validation
│   ├── validate_environment()     # Check jq, curl, permissions
│   └── validate_log_source()      # Whitelist /var/log paths
│
├── Input Sanitization
│   ├── sanitize_path()            # Path traversal prevention
│   ├── sanitize_log_line()        # DLP + control char removal
│   └── validate_line_length()     # MAX_LINE_LENGTH enforcement
│
├── Security Functions
│   ├── encode_json_payload()      # jq-based safe JSON encoding
│   ├── generate_hmac_signature()  # HMAC-SHA256 for webhook auth
│   └── sanitize_for_display()     # Mask sensitive data in logs
│
├── Core Logic
│   ├── tail_log_safe()            # Read logs without injection
│   ├── atomic_write_offset()      # Atomic file operations
│   └── dispatch_webhook_secure()  # Authenticated HTTP POST
│
└── Main Loop
    └── monitor_loop()             # Safe monitoring with error handling
```

### 2.2 Data Flow (Secure)

```
┌─────────────────┐
│   Log Source    │ /var/log/* only
│  (read-only)    │
└────────┬────────┘
         │
         ▼
┌──────────────────────────────────────┐
│     validate_log_source()            │
│  - Check path starts with /var/log   │
│  - Verify file is readable           │
│  - Reject symlinks outside /var/log  │
└────────┬─────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────┐
│      sanitize_log_line()             │
│  - Remove control characters         │
│  - DLP: mask PII/secrets             │
│  - Truncate to MAX_LINE_LENGTH       │
└────────┬─────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────┐
│     encode_json_payload()            │
│  - Use jq for safe JSON encoding     │
│  - No manual string escaping         │
└────────┬─────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────┐
│   generate_hmac_signature()          │
│  - HMAC-SHA256(payload + timestamp)  │
│  - Prevent replay attacks            │
└────────┬─────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────┐
│    dispatch_webhook_secure()         │
│  - HTTPS only                        │
│  - X-LogWhisperer-Signature header   │
│  - Timeout and retry with backoff    │
└──────────────────────────────────────┘
```

---

## 3. Security Requirements

### 3.1 Input Validation

#### Path Validation (ANTI-PATH TRAVERSAL)
```bash
validate_log_source() {
    local path="$1"

    # MUST start with /var/log/
    if [[ ! "$path" =~ ^/var/log/ ]]; then
        log_error "Invalid log source path: $path (must be under /var/log/)"
        return 1
    fi

    # MUST be a regular file or fifo (no symlinks outside /var/log)
    if [[ -L "$path" ]]; then
        local realpath
        realpath=$(readlink -f "$path")
        if [[ ! "$realpath" =~ ^/var/log/ ]]; then
            log_error "Symlink target outside /var/log: $realpath"
            return 1
        fi
    fi

    # MUST be readable
    if [[ ! -r "$path" ]]; then
        log_error "Log source not readable: $path"
        return 1
    fi

    return 0
}
```

#### Log Line Sanitization (DLP + ANTI-INJECTION)
```bash
sanitize_log_line() {
    local line="$1"

    # Remove control characters (keep only printable ASCII + newline)
    line=$(printf '%s' "$line" | tr -d '\x00-\x08\x0b-\x0c\x0e-\x1f\x7f')

    # Truncate to MAX_LINE_LENGTH
    if [[ ${#line} -gt $MAX_LINE_LENGTH ]]; then
        line="${line:0:$MAX_LINE_LENGTH}...[truncated]"
    fi

    # DLP: Mask sensitive patterns
    # Passwords
    line=$(printf '%s' "$line" | sed -E 's/(password|passwd|pwd)=[^[:space:]]+/\1=***/gi')
    # Email addresses
    line=$(printf '%s' "$line" | sed -E 's/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/[EMAIL]/g')
    # API Keys and Tokens (16+ alphanumeric chars)
    line=$(printf '%s' "$line" | sed -E 's/(api[_-]?key|token|secret)=[a-zA-Z0-9]{16,}/\1=***/gi')
    # IP addresses
    line=$(printf '%s' "$line" | sed -E 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/[IP]/g')

    printf '%s' "$line"
}
```

### 3.2 Safe JSON Encoding

#### ANTI-JSON INJECTION: Use jq
```bash
encode_json_payload() {
    local client_id="$1"
    local hostname="$2"
    local source="$3"
    local severity="$4"
    local raw_log="$5"
    local pattern="$6"
    local timestamp
    timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

    # Use jq for safe JSON encoding - no manual escaping
    jq -n \
        --arg client_id "$client_id" \
        --arg hostname "$hostname" \
        --arg source "$source" \
        --arg severity "$severity" \
        --arg timestamp "$timestamp" \
        --arg raw_log "$raw_log" \
        --arg pattern "$pattern" \
        '{
            client_id: $client_id,
            hostname: $hostname,
            source: $source,
            severity: $severity,
            timestamp: $timestamp,
            raw_log: $raw_log,
            matched_pattern: $pattern
        }'
}
```

**Requirement:** `jq` must be installed. Script exits with error if missing.

### 3.3 Webhook Authentication

#### HMAC-SHA256 Signature
```bash
generate_hmac_signature() {
    local payload="$1"
    local timestamp
    timestamp=$(date +%s)

    # Generate signature: HMAC-SHA256(payload + timestamp)
    local signature
    signature=$(printf '%s:%s' "$timestamp" "$payload" | \
        openssl dgst -sha256 -hmac "$CLIENT_SECRET" | \
        sed 's/^.* //')

    printf '%s:%s' "$timestamp" "$signature"
}

dispatch_webhook_secure() {
    local payload="$1"
    local sig_data
    sig_data=$(generate_hmac_signature "$payload")
    local timestamp=${sig_data%%:*}
    local signature=${sig_data#*:}

    # Enforce HTTPS
    if [[ ! "$WEBHOOK_URL" =~ ^https:// ]]; then
        log_error "Webhook URL must use HTTPS"
        return 1
    fi

    # Send with signature header
    curl -s -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -H "X-LogWhisperer-Signature: $signature" \
        -H "X-LogWhisperer-Timestamp: $timestamp" \
        -d "$payload" \
        --max-time 30 \
        --retry 3 \
        --retry-delay 1
}
```

**New Configuration:** `CLIENT_SECRET` (shared secret for HMAC)

### 3.4 Atomic File Operations

#### ANTI-RACE CONDITION
```bash
atomic_write_offset() {
    local offset_file="$1"
    local offset_value="$2"
    local tmp_file="${offset_file}.tmp.$$"

    # Write to temp file with PID suffix
    printf '%s' "$offset_value" > "$tmp_file"

    # Atomic move
    mv "$tmp_file" "$offset_file"
}
```

### 3.5 Safe Command Execution

#### ANTI-COMMAND INJECTION
```bash
# WRONG: vulnerable to injection
tail -n 0 -F "$log_source" 2>/dev/null | while read -r line; do ... done

# CORRECT: array-based, no interpretation
local tail_cmd=("tail" "-n" "0" "-F" "$log_source")
"${tail_cmd[@]}" 2>/dev/null | while IFS= read -r line; do ... done
```

**Rules:**
- No `eval` anywhere
- No backtick command substitution on user input
- Use `printf %q` if variable must be in command
- Use arrays for complex commands

---

## 4. Configuration

### 4.1 New Config Parameters

```bash
# config.env
WEBHOOK_URL="https://your-n8n-instance.com/webhook/logwhisperer"
CLIENT_ID="unique-client-uuid"
CLIENT_SECRET="shared-secret-for-hmac"  # NEW
LOG_SOURCES="/var/log/syslog,/var/log/nginx/error.log"
POLL_INTERVAL=5
MAX_LINE_LENGTH=2000
OFFSET_DIR="/var/lib/logwhisperer"
```

### 4.2 Validation Requirements

| Parameter | Validation | Failure Action |
|-----------|------------|----------------|
| `WEBHOOK_URL` | MUST be HTTPS | Exit with error |
| `CLIENT_ID` | Valid UUID format | Exit with error |
| `CLIENT_SECRET` | Min 32 chars, no spaces | Exit with error |
| `LOG_SOURCES` | All paths MUST be under /var/log | Skip invalid paths, log warning |
| `MAX_LINE_LENGTH` | Integer between 500-10000 | Use default 2000 |

---

## 5. Dependencies

### 5.1 Required

| Tool | Purpose | Check in Script |
|------|---------|-----------------|
| `jq` | Safe JSON encoding | Exit if missing |
| `curl` | HTTP POST | Exit if missing |
| `openssl` | HMAC-SHA256 | Exit if missing |
| `date` | Timestamp generation | Exit if missing |

### 5.2 Optional

| Tool | Purpose | Fallback |
|------|---------|----------|
| `systemctl` | Service management | Skip systemd setup |

---

## 6. Error Handling

### 6.1 Error Levels

| Level | Description | Action |
|-------|-------------|--------|
| `FATAL` | Config invalid, security violation | Exit immediately |
| `ERROR` | Single log source unreadable | Skip source, continue |
| `WARN` | Retryable error (network) | Retry with backoff |
| `INFO` | Normal operation | Log and continue |

### 6.2 Graceful Degradation

```bash
# If one log source fails, continue with others
for source in "${LOG_SOURCES_ARRAY[@]}"; do
    if ! validate_log_source "$source"; then
        log_error "Skipping invalid source: $source"
        continue
    fi
    monitor_source "$source" &
done
```

---

## 7. Testing Strategy

### 7.1 Security Test Cases (RED Phase)

| Test ID | Description | Expected Behavior |
|---------|-------------|-------------------|
| `SEC-001` | Path `/etc/passwd` in LOG_SOURCES | Rejected, logged as error |
| `SEC-002` | Path `../../../etc/shadow` | Rejected, logged as error |
| `SEC-003` | Symlink to `/etc/shadow` from /var/log | Rejected, logged as error |
| `SEC-004` | Log line with `"; rm -rf /;"` | Sanitized, no command execution |
| `SEC-005` | Log line with `password=secret123` | Masked as `password=***` in payload |
| `SEC-006` | Log line with `user@example.com` | Masked as `[EMAIL]` in payload |
| `SEC-007` | Missing jq binary | Exit with clear error message |
| `SEC-008` | HTTP webhook URL (non HTTPS) | Exit with error |
| `SEC-009` | Payload tampering (wrong HMAC) | Webhook rejects (tested server-side) |
| `SEC-010` | Offset file corruption | Detected, reset to 0 (safe) |

### 7.2 Integration Tests

| Test ID | Description | Expected |
|---------|-------------|----------|
| `INT-001` | End-to-end with valid log | Payload delivered with HMAC |
| `INT-002` | Network timeout | Retry 3x, then skip |
| `INT-003` | Webhook returns 4xx | Stop retry, log error |
| `INT-004` | Multiple concurrent log sources | All monitored correctly |

---

## 8. Acceptance Criteria

### 8.1 Security

- [ ] All log sources validated against /var/log whitelist
- [ ] JSON encoding uses jq (no manual escaping)
- [ ] All payloads signed with HMAC-SHA256
- [ ] HTTPS enforced for webhooks
- [ ] DLP masking applied to PII/secrets
- [ ] Atomic writes for offset files
- [ ] No eval or command substitution on user input

### 8.2 Functionality

- [ ] Backward compatible with Sprint 1 config (minus security fixes)
- [ ] All Sprint 1 tests still pass (except where behavior changed for security)
- [ ] New security tests pass
- [ ] Graceful handling of missing jq/curl/openssl

### 8.3 Performance

- [ ] No significant slowdown (< 10% overhead)
- [ ] Sanitization completes in < 10ms per line
- [ ] HMAC generation < 5ms per payload

---

## 9. Migration from Sprint 1

### 9.1 Breaking Changes

| Aspect | Sprint 1 | Sprint 2 | Migration |
|--------|----------|----------|-----------|
| JSON Encoding | Manual sed | jq required | Install jq |
| Webhook Auth | None | HMAC | Add CLIENT_SECRET |
| Path Validation | None | /var/log only | Update config if needed |
| Dependencies | bash, curl | + jq, openssl | Update install.sh |

### 9.2 Upgrade Path

```bash
# install.sh will:
1. Check for jq, install if missing
2. Generate CLIENT_SECRET if not present
3. Validate existing LOG_SOURCES
4. Warn about paths outside /var/log
```

---

## 10. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| jq not available on target | Medium | High | Fallback to Python JSON encoding |
| Performance degradation | Low | Medium | Benchmark tests |
| False positives in DLP | Medium | Low | Configurable DLP patterns |
| Backward compatibility | Medium | Medium | Major version bump, migration guide |

---

## 11. Notes for Implementation

### 11.1 @context-auditor Checklist

Before implementation, verify:
- [ ] Latest jq documentation for JSON encoding options
- [ ] Best practices for HMAC-SHA256 in bash
- [ ] curl security flags for production use

### 11.2 @security-auditor Pre-implementation Review

Required before GREEN phase:
- [ ] Review validate_log_source() logic
- [ ] Verify sanitize_log_line() regex patterns
- [ ] Check HMAC implementation for timing attacks
- [ ] Confirm atomic write implementation

### 11.3 @qa-engineer Test Requirements

Create tests for:
- [ ] All SEC-* test cases (RED phase)
- [ ] Integration with webhook signature verification
- [ ] Performance benchmarks

---

*Security First. Safety Always.*