Monitoring Runbook
Last Updated: 2025-11-27
Purpose: Comprehensive guide for monitoring and observability in VoiceAssist V2
Monitoring Architecture
```
Application Metrics
        ↓
Prometheus (Metrics Collection)
        ↓
Grafana (Visualization)
        ↓
AlertManager (Alerting)
        ↓
PagerDuty / Slack / Email
```
Key Monitoring Components
| Component | Purpose | Port | URL |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://localhost:9090 |
| Grafana | Metrics visualization | 3000 | http://localhost:3000 |
| AlertManager | Alert routing & management | 9093 | http://localhost:9093 |
| Application Metrics | Custom app metrics | 8000 | http://localhost:8000/metrics |
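As a quick sanity check after setup, each of these endpoints can be probed from the host. The sketch below assumes the default local ports listed in the table and uses the standard health endpoints exposed by Prometheus, Grafana, and AlertManager (`/-/healthy`, `/api/health`); adjust hosts and ports if the stack is exposed differently.

```bash
# Quick reachability check for the monitoring endpoints listed above
# (assumes default local ports; adjust if the stack runs elsewhere).
for url in \
  "http://localhost:9090/-/healthy" \
  "http://localhost:3000/api/health" \
  "http://localhost:9093/-/healthy" \
  "http://localhost:8000/metrics"; do
  if curl -fsS --max-time 5 "$url" > /dev/null; then
    echo "OK      $url"
  else
    echo "FAILED  $url"
  fi
done
```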
Set Up the Monitoring Stack
Docker Compose Configuration
```yaml
# Add to docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--web.enable-lifecycle"  # allows config reload via /-/reload (used in the troubleshooting section)

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    ports:
      - "9187:9187"
    environment:
      DATA_SOURCE_NAME: "postgresql://voiceassist:${POSTGRES_PASSWORD}@postgres:5432/voiceassist?sslmode=disable"
    depends_on:
      - postgres

  redis-exporter:
    image: oliver006/redis_exporter:latest
    ports:
      - "9121:9121"
    environment:
      REDIS_ADDR: "redis:6379"
    depends_on:
      - redis

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```
Prometheus Configuration
```yaml
# Create monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "voiceassist-prod"
    environment: "production"

# Load alerting rules
rule_files:
  - "/etc/prometheus/alerts.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Scrape configurations
scrape_configs:
  # VoiceAssist Application
  - job_name: "voiceassist-app"
    static_configs:
      - targets: ["voiceassist-server:8000"]
    metrics_path: "/metrics"
    scrape_interval: 10s

  # PostgreSQL
  - job_name: "postgresql"
    static_configs:
      - targets: ["postgres-exporter:9187"]

  # Redis
  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]

  # Node metrics
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Grafana
  - job_name: "grafana"
    static_configs:
      - targets: ["grafana:3000"]
```
Alert Rules
```yaml
# Create monitoring/alerts.yml
groups:
  - name: voiceassist_alerts
    interval: 30s
    rules:
      # Application availability
      - alert: ApplicationDown
        expr: up{job="voiceassist-app"} == 0
        for: 1m
        labels:
          severity: critical
          component: application
        annotations:
          summary: "VoiceAssist application is down"
          description: "Application {{ $labels.instance }} is not responding"

      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
          component: application
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over last 5 minutes"

      # Slow response times
      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
          component: application
        annotations:
          summary: "Slow API response times"
          description: "95th percentile response time is {{ $value }}s"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"

      # Database connection pool exhaustion
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          pg_stat_database_numbackends / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
          component: database
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Database connections at {{ $value | humanizePercentage }} of maximum"

      # Database down
      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
          component: database
        annotations:
          summary: "PostgreSQL database is down"
          description: "Database {{ $labels.instance }} is not responding"

      # Redis down
      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
          component: cache
        annotations:
          summary: "Redis is down"
          description: "Redis {{ $labels.instance }} is not responding"

      # High Redis memory usage
      - alert: HighRedisMemory
        expr: |
          redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
          component: cache
        annotations:
          summary: "Redis memory usage high"
          description: "Redis memory usage at {{ $value | humanizePercentage }}"

      # Disk space low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}"

      # Certificate expiration
      - alert: SSLCertificateExpiring
        expr: |
          (ssl_certificate_expiry_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "SSL certificate expiring soon"
          description: "SSL certificate expires in {{ $value }} days"
```
AlertManager Configuration
```yaml
# Create monitoring/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "${SLACK_WEBHOOK_URL}"

# Default route
route:
  receiver: "default"
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Critical alerts -> PagerDuty + Slack
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      continue: true
    - match:
        severity: critical
      receiver: "slack-critical"
    # Warning alerts -> Slack only
    - match:
        severity: warning
      receiver: "slack-warnings"

# Receivers
receivers:
  - name: "default"
    slack_configs:
      - channel: "#voiceassist-alerts"
        title: "VoiceAssist Alert"
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}'

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"

  - name: "slack-critical"
    slack_configs:
      - channel: "#voiceassist-critical"
        username: "AlertManager"
        color: "danger"
        title: "🔴 CRITICAL: {{ .GroupLabels.alertname }}"
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Severity:* {{ .GroupLabels.severity }}
          *Component:* {{ .GroupLabels.component }}

  - name: "slack-warnings"
    slack_configs:
      - channel: "#voiceassist-alerts"
        username: "AlertManager"
        color: "warning"
        title: "⚠️ WARNING: {{ .GroupLabels.alertname }}"
        text: |
          *Summary:* {{ .CommonAnnotations.summary }}
          *Description:* {{ .CommonAnnotations.description }}
          *Severity:* {{ .GroupLabels.severity }}
          *Component:* {{ .GroupLabels.component }}

  - name: "email-ops"
    email_configs:
      - to: "ops-team@voiceassist.local"
        from: "alertmanager@voiceassist.local"
        smarthost: "smtp.gmail.com:587"
        auth_username: "${SMTP_USERNAME}"
        auth_password: "${SMTP_PASSWORD}"
        headers:
          Subject: "[VoiceAssist] {{ .GroupLabels.alertname }}"
```
Deploy Monitoring Stack
```bash
# Create monitoring directory
mkdir -p /Users/mohammednazmy/VoiceAssist/monitoring/grafana/{provisioning,dashboards}

# Start monitoring stack
docker compose up -d prometheus grafana alertmanager node-exporter postgres-exporter redis-exporter

# Verify services
docker compose ps | grep -E "(prometheus|grafana|alertmanager)"

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Access Grafana
echo "Grafana: http://localhost:3000 (admin/admin)"
echo "Prometheus: http://localhost:9090"
echo "AlertManager: http://localhost:9093"
```
Grafana Dashboards
Provision Datasource
```yaml
# Create monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```
Provision Dashboards
```yaml
# Create monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: "VoiceAssist"
    orgId: 1
    folder: "VoiceAssist V2"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
```
Application Overview Dashboard
Create `monitoring/grafana/dashboards/application-overview.json`:

```json
{
  "dashboard": {
    "title": "VoiceAssist - Application Overview",
    "tags": ["voiceassist", "application"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"voiceassist-app\"}[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      },
      {
        "title": "Active Instances",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=\"voiceassist-app\"} == 1)"
          }
        ]
      }
    ]
  }
}
```
Database Dashboard
Create `monitoring/grafana/dashboards/database.json`:

```json
{
  "dashboard": {
    "title": "VoiceAssist - Database",
    "tags": ["voiceassist", "database", "postgresql"],
    "panels": [
      {
        "title": "Database Connections",
        "type": "graph",
        "targets": [
          {
            "expr": "pg_stat_database_numbackends",
            "legendFormat": "Active connections"
          }
        ]
      },
      {
        "title": "Query Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(pg_stat_database_tup_fetched[5m])",
            "legendFormat": "Rows fetched/sec"
          }
        ]
      },
      {
        "title": "Database Size",
        "type": "graph",
        "targets": [
          {
            "expr": "pg_database_size_bytes",
            "legendFormat": "Database size"
          }
        ]
      },
      {
        "title": "Cache Hit Ratio",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m]))"
          }
        ]
      }
    ]
  }
}
```
Import Pre-built Dashboards
```bash
# Import Node Exporter dashboard
curl -X POST http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "dashboard": { "id": null, "uid": null, "title": "Node Exporter Full", "gnetId": 1860 },
    "overwrite": false,
    "inputs": [
      { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" }
    ]
  }'

# Import PostgreSQL dashboard
curl -X POST http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "dashboard": { "id": null, "uid": null, "title": "PostgreSQL Database", "gnetId": 9628 },
    "overwrite": false,
    "inputs": [
      { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" }
    ]
  }'

# Import Redis dashboard
curl -X POST http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "dashboard": { "id": null, "uid": null, "title": "Redis Dashboard", "gnetId": 11835 },
    "overwrite": false,
    "inputs": [
      { "name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus", "value": "Prometheus" }
    ]
  }'
```
Application Metrics
Instrument Application Code
```python
# Add to application code (e.g., app/monitoring.py)
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI, Response
import time

app = FastAPI()

# Metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Number of active HTTP requests',
    ['method', 'endpoint']
)

DB_CONNECTION_POOL = Gauge(
    'db_connection_pool_size',
    'Database connection pool size',
    ['state']  # active, idle
)

CACHE_OPERATIONS = Counter(
    'cache_operations_total',
    'Total cache operations',
    ['operation', 'status']  # get/set, hit/miss
)

# Middleware to track metrics
@app.middleware("http")
async def track_metrics(request, call_next):
    method = request.method
    endpoint = request.url.path

    ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).inc()
    start_time = time.time()

    try:
        response = await call_next(request)
        status = response.status_code
    except Exception:
        status = 500
        raise
    finally:
        duration = time.time() - start_time
        REQUEST_COUNT.labels(
            method=method,
            endpoint=endpoint,
            status=status
        ).inc()
        REQUEST_DURATION.labels(
            method=method,
            endpoint=endpoint
        ).observe(duration)
        ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).dec()

    return response

# Metrics endpoint
@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(),
        media_type="text/plain"
    )

# Custom metric tracking
def track_cache_operation(operation: str, hit: bool):
    """Track cache hit/miss"""
    status = "hit" if hit else "miss"
    CACHE_OPERATIONS.labels(operation=operation, status=status).inc()

def update_connection_pool_metrics(active: int, idle: int):
    """Update database connection pool metrics"""
    DB_CONNECTION_POOL.labels(state="active").set(active)
    DB_CONNECTION_POOL.labels(state="idle").set(idle)
```
Custom Business Metrics
```python
# Track business-specific metrics
import time

from prometheus_client import Counter, Gauge, Histogram

# User metrics
USER_REGISTRATIONS = Counter(
    'user_registrations_total',
    'Total user registrations'
)

ACTIVE_USERS = Gauge(
    'active_users',
    'Number of currently active users'
)

# Conversation metrics
CONVERSATIONS_CREATED = Counter(
    'conversations_created_total',
    'Total conversations created'
)

MESSAGES_SENT = Counter(
    'messages_sent_total',
    'Total messages sent',
    ['conversation_type']
)

# Voice processing metrics
VOICE_PROCESSING_DURATION = Histogram(
    'voice_processing_duration_seconds',
    'Voice processing duration in seconds'
)

VOICE_PROCESSING_ERRORS = Counter(
    'voice_processing_errors_total',
    'Total voice processing errors',
    ['error_type']
)

# Usage in application
def create_conversation(user_id: int):
    CONVERSATIONS_CREATED.inc()
    # ... rest of the logic

def send_message(conversation_id: int, message: str):
    MESSAGES_SENT.labels(conversation_type="text").inc()
    # ... rest of the logic

def process_voice(audio_data: bytes):
    start_time = time.time()
    try:
        result = process_audio(audio_data)
        VOICE_PROCESSING_DURATION.observe(time.time() - start_time)
        return result
    except Exception as e:
        VOICE_PROCESSING_ERRORS.labels(error_type=type(e).__name__).inc()
        raise
```
Log Aggregation
Structured Logging
```python
# Configure structured logging
import logging
import json
from datetime import datetime

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }

        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)

        if hasattr(record, 'user_id'):
            log_data['user_id'] = record.user_id

        if hasattr(record, 'request_id'):
            log_data['request_id'] = record.request_id

        return json.dumps(log_data)

# Configure logger
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())

logger = logging.getLogger('voiceassist')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={'user_id': 123})
logger.error("Database connection failed", exc_info=True)
```
Centralized Logging with Loki
```yaml
# Add to docker-compose.yml
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./monitoring/loki-config.yml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./monitoring/promtail-config.yml:/etc/promtail/config.yml
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
    command: -config.file=/etc/promtail/config.yml
    depends_on:
      - loki

volumes:
  loki_data:
```
```yaml
# Create monitoring/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
```
```yaml
# Create monitoring/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: "container"
      - source_labels: ["__meta_docker_container_log_stream"]
        target_label: "stream"
```
```bash
# Add Loki datasource to Grafana
curl -X POST http://localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d '{
    "name": "Loki",
    "type": "loki",
    "url": "http://loki:3100",
    "access": "proxy",
    "isDefault": false
  }'
```
Health Checks
Application Health Endpoints
```python
# Comprehensive health check endpoints
from datetime import datetime
from typing import Dict

from fastapi import APIRouter

# `db`, `redis_client`, and `http_client` are the application's existing
# database, Redis, and HTTP clients, imported from wherever they are defined.

router = APIRouter()

@router.get("/health")
async def health_check() -> Dict:
    """Basic health check - always returns 200 if app is running"""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": "2.0.0"
    }

@router.get("/ready")
async def readiness_check() -> Dict:
    """Readiness check - verifies all dependencies"""
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "qdrant": await check_qdrant()
    }

    all_healthy = all(checks.values())

    return {
        "status": "ready" if all_healthy else "not_ready",
        "timestamp": datetime.utcnow().isoformat(),
        "checks": checks
    }

async def check_database() -> bool:
    """Check database connectivity"""
    try:
        await db.execute("SELECT 1")
        return True
    except Exception:
        return False

async def check_redis() -> bool:
    """Check Redis connectivity"""
    try:
        redis_client.ping()
        return True
    except Exception:
        return False

async def check_qdrant() -> bool:
    """Check Qdrant connectivity"""
    try:
        response = await http_client.get("http://qdrant:6333/healthz")
        return response.status_code == 200
    except Exception:
        return False

@router.get("/live")
async def liveness_check() -> Dict:
    """Liveness check - for Kubernetes/Docker"""
    return {"status": "alive"}
```
Docker Health Checks
```yaml
# Update docker-compose.yml with health checks
services:
  voiceassist-server:
    # ... existing config ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  postgres:
    # ... existing config ...
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U voiceassist"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    # ... existing config ...
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3

  qdrant:
    # ... existing config ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
```
Monitoring Operations
Daily Monitoring Routine
```bash
#!/bin/bash
# Save as: /usr/local/bin/va-monitoring-daily

echo "VoiceAssist Daily Monitoring Report - $(date)"
echo "=============================================="
echo ""

# 1. Check all services are up
echo "1. Service Health:"
docker compose ps | grep -E "(Up|healthy)" | wc -l
docker compose ps
echo ""

# 2. Check Prometheus targets
echo "2. Prometheus Targets:"
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
echo ""

# 3. Check for active alerts
echo "3. Active Alerts:"
curl -s http://localhost:9093/api/v1/alerts | \
  jq '.data[] | select(.status.state=="active") | {name: .labels.alertname, severity: .labels.severity}'
echo ""

# 4. Resource usage summary
echo "4. Resource Usage:"
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}" | head -10
echo ""

# 5. Error rate (last 24 hours)
echo "5. Error Rate (24h):"
docker compose logs --since 24h voiceassist-server | grep -i error | wc -l
echo ""

# 6. Database health
echo "6. Database Health:"
docker compose exec -T postgres psql -U voiceassist -d voiceassist <<EOF
SELECT 'Connections' as metric, count(*)::text as value FROM pg_stat_activity
UNION ALL
SELECT 'Database Size', pg_size_pretty(pg_database_size('voiceassist'))
UNION ALL
SELECT 'Cache Hit Ratio',
       round((sum(blks_hit) * 100.0 / NULLIF(sum(blks_hit) + sum(blks_read), 0))::numeric, 2)::text || '%'
FROM pg_stat_database;
EOF
echo ""

# 7. Backup status
echo "7. Last Backup:"
ls -lh /backups/postgres/daily/*.dump.gz 2>/dev/null | tail -1
echo ""

echo "=============================================="
echo "Report completed"
```
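To run the report automatically, a cron entry along the following lines could be used. This is a sketch that assumes the script is saved to /usr/local/bin/va-monitoring-daily as noted above; the schedule and log path are illustrative.

```bash
# Make the script executable
chmod +x /usr/local/bin/va-monitoring-daily

# Example crontab entry (add via `crontab -e`): run daily at 08:00, append output to a log
0 8 * * * /usr/local/bin/va-monitoring-daily >> /var/log/va-monitoring-daily.log 2>&1
```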
Troubleshooting Monitoring Issues
Prometheus Not Scraping Targets
```bash
# Check Prometheus logs
docker compose logs prometheus | tail -50

# Check target configuration
curl -s http://localhost:9090/api/v1/targets | jq '.'

# Verify network connectivity
docker compose exec prometheus wget -O- http://voiceassist-server:8000/metrics

# Reload Prometheus configuration (requires --web.enable-lifecycle, set in the compose file above)
curl -X POST http://localhost:9090/-/reload
```
Grafana Dashboards Not Loading
```bash
# Check Grafana logs
docker compose logs grafana | tail -50

# Verify datasource connection
curl -s http://localhost:3000/api/datasources \
  -u admin:admin | jq '.'

# Test Prometheus connection from Grafana
curl -s http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up \
  -u admin:admin | jq '.'

# Restart Grafana
docker compose restart grafana
```
Alerts Not Firing
```bash
# Check AlertManager status
curl -s http://localhost:9093/api/v1/status | jq '.'

# Check alert rules in Prometheus
curl -s http://localhost:9090/api/v1/rules | jq '.'

# Check specific alert state
curl -s 'http://localhost:9090/api/v1/query?query=ALERTS{alertname="HighErrorRate"}' | jq '.'

# Verify AlertManager configuration
docker compose exec alertmanager amtool config show

# Check AlertManager logs
docker compose logs alertmanager | tail -50
```
Monitoring Best Practices
1. Define SLOs (Service Level Objectives)
```yaml
# Document SLOs
SLOs:
  - name: Availability
    target: 99.9%
    measurement: uptime over 30 days

  - name: Response Time
    target: p95 < 500ms
    measurement: 95th percentile of all API requests

  - name: Error Rate
    target: < 0.1%
    measurement: 5xx errors / total requests

  - name: Data Durability
    target: 99.999%
    measurement: no data loss events
```
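To check these objectives against live data, the definitions can be translated into PromQL and queried ad hoc. The queries below are a sketch for the error-rate and response-time targets, assuming the `http_requests_total` and `http_request_duration_seconds` metrics exported by the application instrumentation earlier in this runbook.

```bash
# Error rate over the last 5 minutes (SLO target: < 0.1%)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{job="voiceassist-app",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="voiceassist-app"}[5m]))' \
  | jq -r '.data.result[0].value[1]'

# p95 response time over the last 5 minutes (SLO target: < 500ms)
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="voiceassist-app"}[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'
```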
2. Alert Fatigue Prevention
```yaml
# Guidelines for creating alerts:
# - Alert on symptoms, not causes
# - Make alerts actionable
# - Include runbook links
# - Set appropriate thresholds
# - Use proper severity levels
# - Group related alerts

# Good alert example:
- alert: UserFacingErrorRate
  expr: rate(http_requests_total{status="500"}[5m]) > 0.05
  for: 5m
  annotations:
    summary: "High user-facing error rate"
    description: "More than 5% of requests failing"
    runbook_url: "https://docs.voiceassist.local/runbooks/troubleshooting#high-error-rate"

# Bad alert example (too noisy):
- alert: SingleError
  expr: increase(http_requests_total{status="500"}[1m]) > 0
  for: 0s
```
3. Dashboard Organization
Dashboard structure:

```
├── Executive Dashboard (high-level KPIs)
├── Application Overview (request rate, errors, latency)
├── Infrastructure (CPU, memory, disk, network)
├── Database Performance (connections, queries, cache hit ratio)
├── Cache Performance (Redis operations, memory, hit rate)
├── Business Metrics (users, conversations, messages)
└── On-Call Dashboard (active alerts, recent incidents)
```
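To verify that provisioned dashboards ended up in the intended folders, the Grafana search API can be queried; this is a sketch using the admin credentials from the setup steps above.

```bash
# List dashboards with their folders (folderTitle is empty for the General folder)
curl -s "http://localhost:3000/api/search?type=dash-db" -u admin:admin \
  | jq -r '.[] | "\(.folderTitle // "General")\t\(.title)"' | sort
```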
Related Documentation
- Incident Response Runbook
- Troubleshooting Runbook
- Deployment Runbook
- Scaling Runbook
- UNIFIED_ARCHITECTURE.md
Document Version: 1.0
Last Updated: 2025-11-21
Maintained By: VoiceAssist DevOps Team
Review Cycle: Quarterly
Next Review: 2026-02-21