2:I[7012,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],"MarkdownRenderer"]
4:I[9856,["4765","static/chunks/4765-f5afdf8061f456f3.js","9856","static/chunks/9856-3b185291364d9bef.js","6687","static/chunks/app/docs/%5B...slug%5D/page-e07536548216bee4.js"],""]
5:I[4126,[],""]
7:I[9630,[],""]
8:I[4278,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"HeadingProvider"]
9:I[1476,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Header"]
a:I[3167,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"Sidebar"]
b:I[7409,["9856","static/chunks/9856-3b185291364d9bef.js","8172","static/chunks/8172-b3a2d6fe4ae10d40.js","3185","static/chunks/app/layout-2814fa5d15b84fe4.js"],"PageFrame"]
3:T3ff1,
# Chaos Engineering Guide

**Last Updated**: 2025-11-21 (Phase 7 - P3.5)
**Purpose**: Guide for running chaos experiments to validate VoiceAssist V2 resilience

---

## Overview

Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. VoiceAssist V2 uses the **Chaos Toolkit** to systematically test resilience.

**Philosophy**: "Break things on purpose to learn how to make them more resilient."

**Benefits**:

- Discover failure modes before production incidents
- Validate graceful degradation strategies
- Build confidence in system resilience
- Improve incident response procedures

---

## Architecture

```
┌──────────────────────┐
│  Chaos Toolkit       │
│  (Experiment Runner) │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Steady State        │
│  Hypothesis          │◄──── Validate before/after
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Method              │
│  (Inject Chaos)      │◄──── Stop containers, add latency, etc.
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Rollbacks           │
│  (Restore System)    │◄──── Always restore to normal
└──────────────────────┘
```

**Components**:

1. **Steady State Hypothesis**: Define what "normal" looks like
2. **Method**: Actions and probes to inject chaos
3. **Rollbacks**: Restore system to normal state
4. **Journal**: JSON report of experiment results

---

## Setup

### 1. Install Chaos Toolkit

```bash
# Install chaos dependencies
pip install -r chaos/chaos-requirements.txt

# Verify installation
chaos --version
# Expected: chaostoolkit 1.17.1
```

### 2. Verify System is Running

```bash
# Start all services
docker compose up -d

# Verify health
curl http://localhost:8000/health
# Expected: {"status":"healthy",...}
```

### 3. Run Your First Experiment

```bash
# Run database failure experiment
./scripts/run-chaos-tests.sh database-failure

# Or run all experiments
./scripts/run-chaos-tests.sh
```

---

## Available Experiments

### 1. Database Failure (`database-failure.yaml`)

**What it tests**: PostgreSQL becomes unavailable

**Expected behavior**:

- API returns 503 Service Unavailable
- Errors are logged appropriately
- No 500 Internal Server Errors
- System recovers when database returns

**Run**:

```bash
chaos run chaos/experiments/database-failure.yaml
```

**What happens**:

1. Verifies API is healthy
2. Stops PostgreSQL container
3. Checks API responds with graceful error
4. Restarts PostgreSQL
5. Verifies full recovery

**Success criteria**:

- No crashes or panics
- Errors are logged with correlation IDs
- Health check reflects degraded state
- Recovery is automatic

---

### 2. Redis Unavailability (`redis-unavailable.yaml`)

**What it tests**: Redis cache becomes unavailable

**Expected behavior**:

- API continues to function (degraded performance)
- Cache misses are handled gracefully
- Sessions may be lost but no errors
- System recovers when Redis returns

**Run**:

```bash
chaos run chaos/experiments/redis-unavailable.yaml
```

**What happens**:

1. Verifies API serves requests
2. Stops Redis container
3. Tests API without cache
4. Verifies no crashes
5. Restarts Redis
6. Verifies cache is restored

**Success criteria**:

- API remains available
- Slower response times acceptable
- No 500 errors from cache failures
- Automatic reconnection to Redis

---

### 3. Network Latency (`network-latency.yaml`)

**What it tests**: High network latency (500ms)

**Expected behavior**:

- API responds within timeout limits
- No connection timeouts
- Increased response times acceptable
- Monitoring reflects slow responses

**Prerequisites**:

```bash
# Requires Toxiproxy for network chaos
docker compose up -d toxiproxy
```

**Run**:

```bash
chaos run chaos/experiments/network-latency.yaml
```

**What happens**:

1. Measures baseline response time
2. Injects 500ms latency via Toxiproxy
3. Verifies API still responds
4. Checks no timeouts occur
5. Removes latency
6. Verifies performance restored

**Success criteria**:

- Timeouts are appropriately configured
- Circuit breakers don't trigger unnecessarily
- Metrics show increased latency
- No request failures

---

### 4. Resource Exhaustion (`resource-exhaustion.yaml`)

**What it tests**: High CPU usage and memory pressure

**Expected behavior**:

- API slows down but remains stable
- No out-of-memory crashes
- Graceful degradation under load
- Recovery after stress removed

**Run**:

```bash
chaos run chaos/experiments/resource-exhaustion.yaml
```

**What happens**:

1. Verifies container is healthy
2. Applies CPU stress (stress-ng)
3. Tests API under load
4. Checks metrics endpoint
5. Waits for stress to complete
6. Verifies no memory leaks

**Success criteria**:

- No container restarts
- API remains responsive (slower OK)
- Memory usage returns to baseline
- No resource leaks

---

## Running Experiments

### Run Single Experiment

```bash
# Using convenience script
./scripts/run-chaos-tests.sh database-failure

# Or directly with chaos toolkit
chaos run chaos/experiments/database-failure.yaml
```

### Run All Experiments

```bash
# Runs all experiments sequentially
./scripts/run-chaos-tests.sh

# Expected output:
# ========================================
# Running: database-failure
# ========================================
# ✓ database-failure PASSED
# ...
# Passed: 4
# Failed: 0
```

### Run with Custom Configuration

```bash
# Override API URL
chaos run chaos/experiments/database-failure.yaml \
  --var api_url=http://production:8000

# Save detailed journal
chaos run chaos/experiments/redis-unavailable.yaml \
  --journal-path=./reports/redis-test.json
```

### Dry Run (No Actions)

```bash
# Show what would happen without executing
chaos run chaos/experiments/database-failure.yaml \
  --dry
```

---

## Interpreting Results

### Successful Experiment

```
[2025-11-21 12:00:00 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:05 INFO] Action: stop-postgres-container succeeded
[2025-11-21 12:00:10 INFO] Probe: verify-graceful-degradation succeeded
[2025-11-21 12:00:15 INFO] Rollback: restart-postgres-container succeeded
[2025-11-21 12:00:25 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:25 INFO] Experiment ended with status: completed
```

**Interpretation**: System behaved as expected under chaos.

### Failed Experiment

```
[2025-11-21 12:00:00 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:05 INFO] Action: stop-postgres-container succeeded
[2025-11-21 12:00:10 ERROR] Probe: verify-graceful-degradation failed
[2025-11-21 12:00:10 ERROR] Expected status [503, 500] but got 200
[2025-11-21 12:00:15 INFO] Rollback: restart-postgres-container succeeded
[2025-11-21 12:00:25 ERROR] Experiment ended with status: failed
```

**Interpretation**: API didn't respond appropriately to database failure. Need to improve error handling.

### Reading Journal Reports

Experiment results are saved as JSON in `chaos/reports/`:

```bash
# View latest report
cat chaos/reports/database-failure-20251121-120000.json | jq .

# Check if experiment passed
cat chaos/reports/database-failure-20251121-120000.json | jq '.status'
# "completed" = passed, "failed" = failed

# See which probes failed
cat chaos/reports/database-failure-20251121-120000.json | jq '.run[].status'
```

---

## Creating New Experiments

### Experiment Template

```yaml
# chaos/experiments/my-experiment.yaml

version: 1.0.0
title: "Short Title"
description: "What are we testing?"

tags:
  - category
  - component

configuration:
  api_url: "http://localhost:8000"

steady-state-hypothesis:
  title: "System is healthy"
  probes:
    - name: "api-responds"
      type: probe
      provider:
        type: http
        url: "${api_url}/health"
        expect:
          - status: 200

method:
  - name: "inject-chaos"
    type: action
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "stop", "service-name"]
    pauses:
      after: 5

  - name: "verify-behavior"
    type: probe
    provider:
      type: http
      url: "${api_url}/endpoint"
      expect:
        - status: [200, 503]

rollbacks:
  - name: "restore-system"
    type: action
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "start", "service-name"]
    pauses:
      after: 10

  - name: "verify-recovery"
    type: probe
    provider:
      type: http
      url: "${api_url}/health"
      expect:
        - status: 200
```

### Common Chaos Patterns

#### Stop a Container

```yaml
- name: "stop-container"
  type: action
  provider:
    type: process
    path: "docker"
    arguments: ["compose", "stop", "container-name"]
```

#### Kill a Process

```yaml
- name: "kill-process"
  type: action
  provider:
    type: process
    path: "pkill"
    arguments: ["-9", "python"]
```

#### Fill Disk Space

```yaml
- name: "fill-disk"
  type: action
  provider:
    type: process
    path: "docker"
    arguments:
      - "compose"
      - "exec"
      - "voiceassist-server"
      - "dd"
      - "if=/dev/zero"
      - "of=/tmp/fill"
      - "bs=1M"
      - "count=1000"
```

#### Inject Network Packet Loss

```yaml
- name: "add-packet-loss"
  type: action
  provider:
    type: http
    url: "http://localhost:8474/proxies/api/toxics"
    method: POST
    body:
      type: "loss_downstream"
      attributes:
        probability: 0.3 # 30% packet loss
```

---

## Best Practices

### 1. Always Start Small

**Bad**: Test all failures simultaneously
**Good**: Test one failure mode at a time

Start with:

1. Single container failure
2. Brief network issues
3. Light resource pressure

Then progress to:

1. Multiple simultaneous failures
2. Extended outages
3. Severe resource exhaustion

### 2. Run in Isolated Environment First

**Never run chaos experiments in production without**:

- Testing in development environment
- Understanding potential impact
- Having rollback procedures
- Notifying team members

Progression:

1. Local development → 2. Staging → 3. Production (controlled)

### 3. Validate Monitoring

Every experiment should verify:

- Metrics reflect the chaos (latency spikes, error rates)
- Alerts fire appropriately
- Logs contain useful information

```yaml
- name: "verify-alert-fired"
  type: probe
  provider:
    type: http
    url: "http://localhost:9093/api/v2/alerts"
    expect:
      - json: "$.length"
        operator: gt
        value: 0
```

### 4. Document Learnings

After each experiment:

1. Document what broke
2. Identify improvements
3. Update runbooks
4. Fix issues found
5. Re-run to verify fix

### 5. Automate in CI/CD

```yaml
# .github/workflows/chaos.yml

name: Chaos Tests

on:
  schedule:
    - cron: "0 2 * * *" # Run daily at 2 AM

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start services
        run: docker compose up -d
      - name: Run chaos tests
        run: ./scripts/run-chaos-tests.sh
      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: chaos-reports
          path: chaos/reports/
```

---

## Troubleshooting

### Issue: Rollback Doesn't Restore System

**Symptoms**: System remains broken after experiment

**Solutions**:

```bash
# Manually restart all services
docker compose restart

# Check service status
docker compose ps

# View logs
docker compose logs voiceassist-server --tail=50
```

### Issue: Experiment Hangs

**Symptoms**: Chaos toolkit stops responding

**Solutions**:

```bash
# Kill chaos process
pkill -f "chaos run"

# Ensure services are running
docker compose up -d

# Check for stuck containers
docker compose ps -a
```

### Issue: False Positive Failures

**Symptoms**: Experiment fails but system is actually fine

**Root Causes**:

- Timeouts too aggressive
- Probes check wrong condition
- Timing issues (race conditions)

**Fix**:

```yaml
# Increase timeouts
provider:
  type: http
  url: "${api_url}/health"
  timeout: 30 # Longer timeout

# Add delays
pauses:
  after: 10 # Wait longer for changes to propagate
```

### Issue: Toxiproxy Not Available

**Symptoms**: `network-latency.yaml` fails with connection refused

**Solutions**:

```bash
# Start Toxiproxy
docker compose up -d toxiproxy

# Verify running
curl http://localhost:8474/version

# Configure proxy for API
curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "voiceassist-api",
    "listen": "0.0.0.0:8001",
    "upstream": "voiceassist-server:8000"
  }'
```

---

## Advanced Topics

### Multi-Target Experiments

Test multiple failures simultaneously:

```yaml
method:
  - name: "stop-postgres-and-redis"
    type: action
    background: true # Run in parallel
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "stop", "postgres", "redis"]
```

### Gradual Chaos Injection

Slowly increase chaos intensity:

```yaml
method:
  - name: "inject-10-percent-packet-loss"
    type: action
    provider:
      type: http
      url: "${toxiproxy_url}/proxies/api/toxics"
      body:
        attributes:
          probability: 0.1
    pauses:
      after: 30

  - name: "increase-to-30-percent"
    type: action
    provider:
      type: http
      url: "${toxiproxy_url}/proxies/api/toxics/packet_loss"
      method: POST
      body:
        attributes:
          probability: 0.3
```

### Production Chaos

Running in production requires:

1. **Blast Radius Limits**: Affect small percentage of traffic
2. **Business Hours Only**: Run during low-traffic periods
3. **Automated Rollback**: Stop immediately if SLOs breached
4. **Notifications**: Alert team before/during/after

```yaml
configuration:
  blast_radius: 0.01 # 1% of traffic
  max_latency_ms: 500
  max_error_rate: 0.05

steady-state-hypothesis:
  title: "SLOs are met"
  probes:
    - name: "error-rate-acceptable"
      type: probe
      provider:
        type: python
        module: custom_probes
        func: check_error_rate
        arguments:
          threshold: ${max_error_rate}
      tolerance: true
```

---

## Chaos Engineering Culture

### Principles

1. **Build a hypothesis**: What do you expect to happen?
2. **Vary real-world events**: Mimic actual failure modes
3. **Run experiments in production**: Eventually (safely)
4. **Automate experiments**: Continuous chaos
5. **Minimize blast radius**: Affect smallest scope possible

### GameDays

Regular chaos engineering exercises:

**Monthly GameDay Schedule**:

- Week 1: Database failures
- Week 2: Network issues
- Week 3: Resource exhaustion
- Week 4: Multi-component failures

**GameDay Checklist**:

- [ ] Schedule 2-hour block
- [ ] Notify all team members
- [ ] Prepare rollback procedures
- [ ] Set up monitoring dashboard
- [ ] Document observations
- [ ] Create action items for issues found

---

## Metrics and Success

### Track Resilience Over Time

```promql
# Mean Time To Recovery (MTTR)
avg(chaos_experiment_recovery_duration_seconds)

# Experiment Success Rate
sum(chaos_experiment_success_total) /
sum(chaos_experiment_total)

# Issues Found per Month
sum(increase(chaos_issues_discovered_total[30d]))
```

### Goals

- **MTTR < 5 minutes**: System recovers quickly from failures
- **Success Rate > 90%**: Most experiments pass
- **Zero Production Incidents**: From tested failure modes

---

## Related Documentation

- [Incident Response Runbook](operations/runbooks/INCIDENT_RESPONSE.md)
- [Monitoring Guide](operations/runbooks/MONITORING.md)
- [SLO Definitions](operations/SLO_DEFINITIONS.md)
- [Troubleshooting Guide](operations/runbooks/TROUBLESHOOTING.md)

---

**Document Version**: 1.0
**Last Updated**: 2025-11-21
**Maintained By**: VoiceAssist SRE Team
**Review Cycle**: Monthly or after major system changes
6:["slug","CHAOS_ENGINEERING","c"]
0:["X7oMT3VrOffzp0qvbeOas",[[["",{"children":["docs",{"children":[["slug","CHAOS_ENGINEERING","c"],{"children":["__PAGE__?{\"slug\":[\"CHAOS_ENGINEERING\"]}",{}]}]}]},"$undefined","$undefined",true],["",{"children":["docs",{"children":[["slug","CHAOS_ENGINEERING","c"],{"children":["__PAGE__",{},[["$L1",["$","div",null,{"children":[["$","div",null,{"className":"mb-6 flex items-center justify-between gap-4","children":[["$","div",null,{"children":[["$","p",null,{"className":"text-sm text-gray-500 dark:text-gray-400","children":"Docs / Raw"}],["$","h1",null,{"className":"text-3xl font-bold text-gray-900 dark:text-white","children":"Chaos Engineering"}],["$","p",null,{"className":"text-sm text-gray-600 dark:text-gray-400","children":["Sourced from"," ",["$","code",null,{"className":"font-mono text-xs","children":["docs/","CHAOS_ENGINEERING.md"]}]]}]]}],["$","a",null,{"href":"https://github.com/mohammednazmy/VoiceAssist/edit/main/docs/CHAOS_ENGINEERING.md","target":"_blank","rel":"noreferrer","className":"inline-flex items-center gap-2 rounded-md border border-gray-200 dark:border-gray-700 px-3 py-1.5 text-sm text-gray-700 dark:text-gray-200 hover:border-primary-500 dark:hover:border-primary-400 hover:text-primary-700 dark:hover:text-primary-300","children":"Edit on GitHub"}]]}],["$","div",null,{"className":"rounded-lg border border-gray-200 dark:border-gray-800 bg-white dark:bg-gray-900 p-6","children":["$","$L2",null,{"content":"$3"}]}],["$","div",null,{"className":"mt-6 flex flex-wrap gap-2 text-sm","children":[["$","$L4",null,{"href":"/reference/all-docs","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 dark:hover:bg-gray-700","children":"← All documentation"}],["$","$L4",null,{"href":"/","className":"inline-flex items-center gap-1 rounded-md bg-gray-100 px-3 py-1 text-gray-700 hover:bg-gray-200 dark:bg-gray-800 dark:text-gray-200 dark:hover:bg-gray-700","children":"Home"}]]}]]}],null],null],null]},[null,["$","$L5",null,{"parallelRouterKey":"children","segmentPath":["children","docs","children","$6","children"],"error":"$undefined","errorStyles":"$undefined","errorScripts":"$undefined","template":["$","$L7",null,{}],"templateStyles":"$undefined","templateScripts":"$undefined","notFound":"$undefined","notFoundStyles":"$undefined"}]],null]},[null,["$","$L5",null,{"parallelRouterKey":"children","segmentPath":["children","docs","children"],"error":"$undefined","errorStyles":"$undefined","errorScripts":"$undefined","template":["$","$L7",null,{}],"templateStyles":"$undefined","templateScripts":"$undefined","notFound":"$undefined","notFoundStyles":"$undefined"}]],null]},[[[["$","link","0",{"rel":"stylesheet","href":"/_next/static/css/7f586cdbbaa33ff7.css","precedence":"next","crossOrigin":"$undefined"}]],["$","html",null,{"lang":"en","className":"h-full","children":["$","body",null,{"className":"__className_f367f3 h-full bg-white dark:bg-gray-900","children":[["$","a",null,{"href":"#main-content","className":"skip-to-content","children":"Skip to main content"}],["$","$L8",null,{"children":[["$","$L9",null,{}],["$","$La",null,{}],["$","main",null,{"id":"main-content","className":"lg:pl-64","role":"main","aria-label":"Documentation content","children":["$","$Lb",null,{"children":["$","$L5",null,{"parallelRouterKey":"children","segmentPath":["children"],"error":"$undefined","errorStyles":"$undefined","errorScripts":"$undefined","template":["$","$L7",null,{}],"templateStyles":"$undefined","templateScripts":"$undefined","notFound":[["$","title",null,{"children":"404: This page could not be found."}],["$","div",null,{"style":{"fontFamily":"system-ui,\"Segoe UI\",Roboto,Helvetica,Arial,sans-serif,\"Apple Color Emoji\",\"Segoe UI Emoji\"","height":"100vh","textAlign":"center","display":"flex","flexDirection":"column","alignItems":"center","justifyContent":"center"},"children":["$","div",null,{"children":[["$","style",null,{"dangerouslySetInnerHTML":{"__html":"body{color:#000;background:#fff;margin:0}.next-error-h1{border-right:1px solid rgba(0,0,0,.3)}@media (prefers-color-scheme:dark){body{color:#fff;background:#000}.next-error-h1{border-right:1px solid rgba(255,255,255,.3)}}"}}],["$","h1",null,{"className":"next-error-h1","style":{"display":"inline-block","margin":"0 20px 0 0","padding":"0 23px 0 0","fontSize":24,"fontWeight":500,"verticalAlign":"top","lineHeight":"49px"},"children":"404"}],["$","div",null,{"style":{"display":"inline-block"},"children":["$","h2",null,{"style":{"fontSize":14,"fontWeight":400,"lineHeight":"49px","margin":0},"children":"This page could not be found."}]}]]}]}]],"notFoundStyles":[]}]}]}]]}]]}]}]],null],null],["$Lc",null]]]]
c:[["$","meta","0",{"name":"viewport","content":"width=device-width, initial-scale=1"}],["$","meta","1",{"charSet":"utf-8"}],["$","title","2",{"children":"Chaos Engineering | Docs | VoiceAssist Docs"}],["$","meta","3",{"name":"description","content":"**Last Updated**: 2025-11-21 (Phase 7 - P3.5)"}],["$","meta","4",{"name":"keywords","content":"VoiceAssist,documentation,medical AI,voice assistant,healthcare,HIPAA,API"}],["$","meta","5",{"name":"robots","content":"index, follow"}],["$","meta","6",{"name":"googlebot","content":"index, follow"}],["$","link","7",{"rel":"canonical","href":"https://assistdocs.asimo.io"}],["$","meta","8",{"property":"og:title","content":"VoiceAssist Documentation"}],["$","meta","9",{"property":"og:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","10",{"property":"og:url","content":"https://assistdocs.asimo.io"}],["$","meta","11",{"property":"og:site_name","content":"VoiceAssist Docs"}],["$","meta","12",{"property":"og:type","content":"website"}],["$","meta","13",{"name":"twitter:card","content":"summary"}],["$","meta","14",{"name":"twitter:title","content":"VoiceAssist Documentation"}],["$","meta","15",{"name":"twitter:description","content":"Comprehensive documentation for VoiceAssist - Enterprise Medical AI Assistant"}],["$","meta","16",{"name":"next-size-adjust"}]]
1:null