Chaos Engineering Guide
Last Updated: 2025-11-21 (Phase 7 - P3.5)
Purpose: Guide for running chaos experiments to validate VoiceAssist V2 resilience
Overview
Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. VoiceAssist V2 uses the Chaos Toolkit to systematically test resilience.
Philosophy: "Break things on purpose to learn how to make them more resilient."
Benefits:
- Discover failure modes before production incidents
- Validate graceful degradation strategies
- Build confidence in system resilience
- Improve incident response procedures
Architecture
┌──────────────────────┐
│ Chaos Toolkit │
│ (Experiment Runner) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Steady State │
│ Hypothesis │◄──── Validate before/after
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Method │
│ (Inject Chaos) │◄──── Stop containers, add latency, etc.
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Rollbacks │
│ (Restore System) │◄──── Always restore to normal
└──────────────────────┘
Components:
- Steady State Hypothesis: Define what "normal" looks like
- Method: Actions and probes to inject chaos
- Rollbacks: Restore system to normal state
- Journal: JSON report of experiment results
Setup
1. Install Chaos Toolkit
# Install chaos dependencies
pip install -r chaos/chaos-requirements.txt

# Verify installation
chaos --version
# Expected: chaostoolkit 1.17.1
2. Verify System is Running
# Start all services
docker compose up -d

# Verify health
curl http://localhost:8000/health
# Expected: {"status":"healthy",...}
3. Run Your First Experiment
# Run database failure experiment
./scripts/run-chaos-tests.sh database-failure

# Or run all experiments
./scripts/run-chaos-tests.sh
Available Experiments
1. Database Failure (database-failure.yaml)
What it tests: PostgreSQL becomes unavailable
Expected behavior:
- API returns 503 Service Unavailable
- Errors are logged appropriately
- No 500 Internal Server Errors
- System recovers when database returns
Run:
chaos run chaos/experiments/database-failure.yaml
What happens:
- Verifies API is healthy
- Stops PostgreSQL container
- Checks API responds with graceful error
- Restarts PostgreSQL
- Verifies full recovery
Success criteria:
- No crashes or panics
- Errors are logged with correlation IDs
- Health check reflects degraded state
- Recovery is automatic
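The graceful-degradation check can also be written as a custom Python probe instead of an HTTP probe. The sketch below is illustrative only: the module path and function name are hypothetical, and the shipped database-failure.yaml uses an HTTP probe rather than this helper.

# Hypothetical helper (e.g. chaos/probes.py); not the probe shipped with database-failure.yaml.
import requests

def verify_graceful_degradation(api_url="http://localhost:8000"):
    """True if the API answers with 503 (not 500, not a crash) while PostgreSQL is down."""
    try:
        response = requests.get(f"{api_url}/health", timeout=10)
    except requests.RequestException:
        # No answer at all means the API itself fell over -- that is not graceful degradation.
        return False
    return response.status_code == 503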
2. Redis Unavailability (redis-unavailable.yaml)
What it tests: Redis cache becomes unavailable
Expected behavior:
- API continues to function (degraded performance)
- Cache misses are handled gracefully
- Sessions may be lost, but no errors are surfaced to clients
- System recovers when Redis returns
Run:
chaos run chaos/experiments/redis-unavailable.yaml
What happens:
- Verifies API serves requests
- Stops Redis container
- Tests API without cache
- Verifies no crashes
- Restarts Redis
- Verifies cache is restored
Success criteria:
- API remains available
- Slower response times acceptable
- No 500 errors from cache failures
- Automatic reconnection to Redis
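In application terms, "handled gracefully" means a Redis connection error is treated like a cache miss and the request falls back to the primary store. The pattern is sketched below; the client setup and function names are illustrative, not VoiceAssist's actual code.

# Illustrative cache-fallback pattern; names are hypothetical.
import json
import redis

cache = redis.Redis(host="redis", port=6379, socket_connect_timeout=1)

def load_session_from_db(session_id):
    """Placeholder for the primary-store lookup used when the cache is unavailable."""
    raise NotImplementedError

def get_session(session_id):
    """Read a session from Redis, treating any Redis failure as a cache miss."""
    try:
        cached = cache.get(f"session:{session_id}")
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        # Redis is down or unreachable: fall back to the database instead of raising.
        pass
    return load_session_from_db(session_id)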
3. Network Latency (network-latency.yaml)
What it tests: High network latency (500ms)
Expected behavior:
- API responds within timeout limits
- No connection timeouts
- Increased response times acceptable
- Monitoring reflects slow responses
Prerequisites:
# Requires Toxiproxy for network chaos
docker compose up -d toxiproxy
Run:
chaos run chaos/experiments/network-latency.yaml
What happens:
- Measures baseline response time
- Injects 500ms latency via Toxiproxy
- Verifies API still responds
- Checks no timeouts occur
- Removes latency
- Verifies performance restored
Success criteria:
- Timeouts are appropriately configured
- Circuit breakers don't trigger unnecessarily
- Metrics show increased latency
- No request failures
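Toxiproxy can also be driven directly from a Python action or probe instead of the YAML/curl examples elsewhere in this guide. The sketch below assumes Toxiproxy on localhost:8474 and a proxy named voiceassist-api (as configured in the Troubleshooting section); the toxic name is arbitrary.

# Add and remove a 500ms latency toxic via the Toxiproxy HTTP API.
import requests

TOXIPROXY = "http://localhost:8474"
PROXY = "voiceassist-api"

def add_latency(latency_ms=500):
    """Attach a downstream latency toxic to the proxy."""
    resp = requests.post(
        f"{TOXIPROXY}/proxies/{PROXY}/toxics",
        json={
            "name": "chaos_latency",
            "type": "latency",
            "stream": "downstream",
            "attributes": {"latency": latency_ms, "jitter": 0},
        },
        timeout=5,
    )
    resp.raise_for_status()

def remove_latency():
    """Remove the toxic so the rollback restores normal latency."""
    resp = requests.delete(f"{TOXIPROXY}/proxies/{PROXY}/toxics/chaos_latency", timeout=5)
    resp.raise_for_status()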
4. Resource Exhaustion (resource-exhaustion.yaml)
What it tests: High CPU usage and memory pressure
Expected behavior:
- API slows down but remains stable
- No out-of-memory crashes
- Graceful degradation under load
- Recovery after stress removed
Run:
chaos run chaos/experiments/resource-exhaustion.yaml
What happens:
- Verifies container is healthy
- Applies CPU stress (stress-ng)
- Tests API under load
- Checks metrics endpoint
- Waits for stress to complete
- Verifies no memory leaks
Success criteria:
- No container restarts
- API remains responsive (slower OK)
- Memory usage returns to baseline
- No resource leaks
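One way to verify "memory usage returns to baseline" is to snapshot the container's memory before the experiment and compare after rollback. This is a rough sketch using docker stats; the container name and the 10% tolerance are assumptions, so adjust them to whatever docker compose ps actually shows.

# Compare container memory usage against a pre-experiment baseline via `docker stats`.
import subprocess

def memory_usage_mib(container="voiceassist-server"):
    """Return current memory usage of a container in MiB (container name may need a compose prefix)."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Output looks like "312.4MiB / 3.84GiB"; keep the first value and normalise the unit.
    value = out.split("/")[0].strip()
    if value.endswith("GiB"):
        return float(value[:-3]) * 1024
    if value.endswith("KiB"):
        return float(value[:-3]) / 1024
    return float(value[:-3])  # MiB

def memory_back_to_baseline(baseline_mib, tolerance=0.10):
    """True if memory is within `tolerance` (10% by default) of the pre-experiment baseline."""
    return memory_usage_mib() <= baseline_mib * (1 + tolerance)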
Running Experiments
Run Single Experiment
# Using convenience script
./scripts/run-chaos-tests.sh database-failure

# Or directly with chaos toolkit
chaos run chaos/experiments/database-failure.yaml
Run All Experiments
# Runs all experiments sequentially
./scripts/run-chaos-tests.sh

# Expected output:
# ========================================
# Running: database-failure
# ========================================
# ✓ database-failure PASSED
# ...
# Passed: 4
# Failed: 0
Run with Custom Configuration
# Override API URL
chaos run chaos/experiments/database-failure.yaml \
  --var api_url=http://production:8000

# Save detailed journal
chaos run chaos/experiments/redis-unavailable.yaml \
  --journal-path=./reports/redis-test.json
Dry Run (No Actions)
# Show what would happen without executing
chaos run chaos/experiments/database-failure.yaml \
  --dry
Interpreting Results
Successful Experiment
[2025-11-21 12:00:00 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:05 INFO] Action: stop-postgres-container succeeded
[2025-11-21 12:00:10 INFO] Probe: verify-graceful-degradation succeeded
[2025-11-21 12:00:15 INFO] Rollback: restart-postgres-container succeeded
[2025-11-21 12:00:25 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:25 INFO] Experiment ended with status: completed
Interpretation: System behaved as expected under chaos.
Failed Experiment
[2025-11-21 12:00:00 INFO] Steady state hypothesis is met!
[2025-11-21 12:00:05 INFO] Action: stop-postgres-container succeeded
[2025-11-21 12:00:10 ERROR] Probe: verify-graceful-degradation failed
[2025-11-21 12:00:10 ERROR] Expected status [503, 500] but got 200
[2025-11-21 12:00:15 INFO] Rollback: restart-postgres-container succeeded
[2025-11-21 12:00:25 ERROR] Experiment ended with status: failed
Interpretation: API didn't respond appropriately to database failure. Need to improve error handling.
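The usual fix for this failure mode is to map database connection errors to an explicit 503 at the API boundary instead of letting the request appear to succeed. A minimal sketch of the idea, assuming a FastAPI application and SQLAlchemy; the actual VoiceAssist error handling may be structured differently.

# Translate database connectivity failures into an explicit 503 (hypothetical example).
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import OperationalError

app = FastAPI()
logger = logging.getLogger("voiceassist")

@app.exception_handler(OperationalError)
async def database_unavailable(request: Request, exc: OperationalError) -> JSONResponse:
    # Record the failure, then return 503 so clients (and chaos probes) see a
    # dependency outage rather than a 500 with a stack trace.
    logger.error("database unavailable: %s %s", request.method, request.url.path)
    return JSONResponse(status_code=503, content={"detail": "database unavailable"})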
Reading Journal Reports
Experiment results are saved as JSON in chaos/reports/:
# View latest report
cat chaos/reports/database-failure-20251121-120000.json | jq .

# Check if experiment passed
cat chaos/reports/database-failure-20251121-120000.json | jq '.status'
# "completed" = passed, "failed" = failed

# See which probes failed
cat chaos/reports/database-failure-20251121-120000.json | jq '.run[].status'
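For quick triage across many runs, the same checks can be scripted. This sketch assumes the standard Chaos Toolkit journal layout (a top-level status plus a run list of activity results); adjust if your journals differ.

# Summarise a Chaos Toolkit journal: overall status plus any failed activities.
import json
import sys

def summarise(journal_path):
    with open(journal_path) as fh:
        journal = json.load(fh)
    print(f"status: {journal.get('status')}  deviated: {journal.get('deviated')}")
    for entry in journal.get("run", []):
        if entry.get("status") != "succeeded":
            name = entry.get("activity", {}).get("name", "<unknown>")
            print(f"  failed activity: {name} ({entry.get('status')})")

if __name__ == "__main__":
    summarise(sys.argv[1])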
Creating New Experiments
Experiment Template
# chaos/experiments/my-experiment.yaml
version: 1.0.0
title: "Short Title"
description: "What are we testing?"
tags:
  - category
  - component

configuration:
  api_url: "http://localhost:8000"

steady-state-hypothesis:
  title: "System is healthy"
  probes:
    - name: "api-responds"
      type: probe
      provider:
        type: http
        url: "${api_url}/health"
      expect:
        - status: 200

method:
  - name: "inject-chaos"
    type: action
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "stop", "service-name"]
    pauses:
      after: 5

  - name: "verify-behavior"
    type: probe
    provider:
      type: http
      url: "${api_url}/endpoint"
    expect:
      - status: [200, 503]

rollbacks:
  - name: "restore-system"
    type: action
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "start", "service-name"]
    pauses:
      after: 10

  - name: "verify-recovery"
    type: probe
    provider:
      type: http
      url: "${api_url}/health"
    expect:
      - status: 200
Common Chaos Patterns
Stop a Container
- name: "stop-container" type: action provider: type: process path: "docker" arguments: ["compose", "stop", "container-name"]
Kill a Process
- name: "kill-process" type: action provider: type: process path: "pkill" arguments: ["-9", "python"]
Fill Disk Space
- name: "fill-disk" type: action provider: type: process path: "docker" arguments: - "compose" - "exec" - "voiceassist-server" - "dd" - "if=/dev/zero" - "of=/tmp/fill" - "bs=1M" - "count=1000"
Inject Network Packet Loss
- name: "add-packet-loss" type: action provider: type: http url: "http://localhost:8474/proxies/api/toxics" method: POST body: type: "loss_downstream" attributes: probability: 0.3 # 30% packet loss
Best Practices
1. Always Start Small
Bad: Test all failures simultaneously
Good: Test one failure mode at a time
Start with:
- Single container failure
- Brief network issues
- Light resource pressure
Then progress to:
- Multiple simultaneous failures
- Extended outages
- Severe resource exhaustion
2. Run in Isolated Environment First
Never run chaos experiments in production without:
- Testing in development environment
- Understanding potential impact
- Having rollback procedures
- Notifying team members
Progression:
1. Local development → 2. Staging → 3. Production (controlled)
3. Validate Monitoring
Every experiment should verify:
- Metrics reflect the chaos (latency spikes, error rates)
- Alerts fire appropriately
- Logs contain useful information
- name: "verify-alert-fired" type: probe provider: type: http url: "http://localhost:9093/api/v2/alerts" expect: - json: "$.length" operator: gt value: 0
4. Document Learnings
After each experiment:
- Document what broke
- Identify improvements
- Update runbooks
- Fix issues found
- Re-run to verify fix
5. Automate in CI/CD
# .github/workflows/chaos.yml
name: Chaos Tests

on:
  schedule:
    - cron: "0 2 * * *"  # Run daily at 2 AM

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Start services
        run: docker compose up -d

      - name: Run chaos tests
        run: ./scripts/run-chaos-tests.sh

      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: chaos-reports
          path: chaos/reports/
Troubleshooting
Issue: Rollback Doesn't Restore System
Symptoms: System remains broken after experiment
Solutions:
# Manually restart all services
docker compose restart

# Check service status
docker compose ps

# View logs
docker compose logs voiceassist-server --tail=50
Issue: Experiment Hangs
Symptoms: Chaos toolkit stops responding
Solutions:
# Kill chaos process
pkill -f "chaos run"

# Ensure services are running
docker compose up -d

# Check for stuck containers
docker compose ps -a
Issue: False Positive Failures
Symptoms: Experiment fails but system is actually fine
Root Causes:
- Timeouts too aggressive
- Probes check wrong condition
- Timing issues (race conditions)
Fix:
# Increase timeouts
provider:
  type: http
  url: "${api_url}/health"
  timeout: 30  # Longer timeout

# Add delays
pauses:
  after: 10  # Wait longer for changes to propagate
Issue: Toxiproxy Not Available
Symptoms: network-latency.yaml fails with connection refused
Solutions:
# Start Toxiproxy
docker compose up -d toxiproxy

# Verify running
curl http://localhost:8474/version

# Configure proxy for API
curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "voiceassist-api",
    "listen": "0.0.0.0:8001",
    "upstream": "voiceassist-server:8000"
  }'
Advanced Topics
Multi-Target Experiments
Test multiple failures simultaneously:
method:
  - name: "stop-postgres-and-redis"
    type: action
    background: true  # Run in parallel
    provider:
      type: process
      path: "docker"
      arguments: ["compose", "stop", "postgres", "redis"]
Gradual Chaos Injection
Slowly increase chaos intensity:
method:
  - name: "inject-10-percent-packet-loss"
    type: action
    provider:
      type: http
      url: "${toxiproxy_url}/proxies/api/toxics"
      body:
        attributes:
          probability: 0.1
    pauses:
      after: 30

  - name: "increase-to-30-percent"
    type: action
    provider:
      type: http
      url: "${toxiproxy_url}/proxies/api/toxics/packet_loss"
      method: POST
      body:
        attributes:
          probability: 0.3
Production Chaos
Running in production requires:
- Blast Radius Limits: Affect only a small percentage of traffic
- Business Hours Only: Run while the team is on hand to respond, and prefer low-traffic windows
- Automated Rollback: Stop immediately if SLOs breached
- Notifications: Alert team before/during/after
configuration:
  blast_radius: 0.01  # 1% of traffic
  max_latency_ms: 500
  max_error_rate: 0.05

steady-state-hypothesis:
  title: "SLOs are met"
  probes:
    - name: "error-rate-acceptable"
      type: probe
      provider:
        type: python
        module: custom_probes
        func: check_error_rate
        arguments:
          threshold: ${max_error_rate}
      tolerance: true
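The check_error_rate probe referenced above would live in a custom_probes.py module on the PYTHONPATH. A hedged sketch is shown below; the Prometheus URL and the http_requests_total metric names are placeholders for whatever VoiceAssist actually exposes.

# custom_probes.py -- sketch of the Python probe referenced by the production config.
# The Prometheus URL and metric names are illustrative placeholders.
import requests

PROMETHEUS_URL = "http://localhost:9090"

def check_error_rate(threshold=0.05):
    """Return True if the 5-minute HTTP error rate is below `threshold`."""
    query = (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    )
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    error_rate = float(results[0]["value"][1]) if results else 0.0
    return error_rate < threshold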
Chaos Engineering Culture
Principles
- Build a hypothesis: What do you expect to happen?
- Vary real-world events: Mimic actual failure modes
- Run experiments in production: Eventually (safely)
- Automate experiments: Continuous chaos
- Minimize blast radius: Affect smallest scope possible
GameDays
Regular chaos engineering exercises:
Monthly GameDay Schedule:
- Week 1: Database failures
- Week 2: Network issues
- Week 3: Resource exhaustion
- Week 4: Multi-component failures
GameDay Checklist:
- Schedule 2-hour block
- Notify all team members
- Prepare rollback procedures
- Set up monitoring dashboard
- Document observations
- Create action items for issues found
Metrics and Success
Track Resilience Over Time
# Mean Time To Recovery (MTTR)
avg(chaos_experiment_recovery_duration_seconds)

# Experiment Success Rate
sum(chaos_experiment_success_total) / sum(chaos_experiment_total)

# Issues Found per Month
sum(increase(chaos_issues_discovered_total[30d]))
Goals
- MTTR < 5 minutes: System recovers quickly from failures
- Success Rate > 90%: Most experiments pass
- Zero Production Incidents: From tested failure modes
Related Documentation
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist SRE Team
Review Cycle: Monthly or after major system changes