VoiceAssist V2 Observability
Purpose: This document defines observability patterns for monitoring, logging, and alerting across all VoiceAssist services.
Last Updated: 2025-11-20
Overview
VoiceAssist V2 uses a three-pillar observability approach:
- Metrics - Prometheus for time-series metrics
- Logs - Structured logging with trace IDs
- Traces - Distributed tracing (optional in Phase 11-14)
Standard Service Endpoints
Every service must expose these endpoints:
Health Check (Liveness)
Endpoint: GET /health
Purpose: Kubernetes liveness probe - is the service process running?
Response:
{ "status": "healthy", "timestamp": "2025-11-20T12:34:56.789Z", "service": "kb-service", "version": "2.0.0" }
FastAPI Example:
```python
from fastapi import APIRouter
from datetime import datetime

router = APIRouter(tags=["observability"])


@router.get("/health")
async def health_check():
    """Liveness probe - is service running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "service": "kb-service",
        "version": "2.0.0",
    }
```
Readiness Check (Dependencies)
Endpoint: GET /ready
Purpose: Kubernetes readiness probe - are dependencies available?
Checks:
- Database connection (PostgreSQL)
- Redis connection
- Qdrant connection (if KB service)
- Nextcloud API (if applicable)
Response (Healthy):
{ "status": "ready", "timestamp": "2025-11-20T12:34:56.789Z", "dependencies": { "postgres": "healthy", "redis": "healthy", "qdrant": "healthy" } }
Response (Degraded):
{ "status": "degraded", "timestamp": "2025-11-20T12:34:56.789Z", "dependencies": { "postgres": "healthy", "redis": "unhealthy", "qdrant": "healthy" } }
FastAPI Example:
```python
from datetime import datetime

from fastapi import Depends, status
from fastapi.responses import JSONResponse
from redis.asyncio import Redis
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

# get_db, get_redis, qdrant_client, settings, and logger are provided by the
# application's own modules (import paths omitted here).


@router.get("/ready")
async def readiness_check(
    db: AsyncSession = Depends(get_db),
    redis: Redis = Depends(get_redis),
):
    """Readiness probe - are dependencies healthy?"""
    dependencies = {}
    all_healthy = True

    # Check PostgreSQL
    try:
        await db.execute(text("SELECT 1"))
        dependencies["postgres"] = "healthy"
    except Exception as e:
        dependencies["postgres"] = "unhealthy"
        all_healthy = False
        logger.error(f"PostgreSQL health check failed: {e}")

    # Check Redis
    try:
        await redis.ping()
        dependencies["redis"] = "healthy"
    except Exception as e:
        dependencies["redis"] = "unhealthy"
        all_healthy = False
        logger.error(f"Redis health check failed: {e}")

    # Check Qdrant (if KB service)
    if settings.SERVICE_NAME == "kb-service":
        try:
            await qdrant_client.health_check()
            dependencies["qdrant"] = "healthy"
        except Exception as e:
            dependencies["qdrant"] = "unhealthy"
            all_healthy = False
            logger.error(f"Qdrant health check failed: {e}")

    status_code = status.HTTP_200_OK if all_healthy else status.HTTP_503_SERVICE_UNAVAILABLE
    return JSONResponse(
        status_code=status_code,
        content={
            "status": "ready" if all_healthy else "degraded",
            "timestamp": datetime.utcnow().isoformat(),
            "dependencies": dependencies,
        },
    )
```
Prometheus Metrics
Endpoint: GET /metrics
Purpose: Export metrics in Prometheus format
Response: Plain text Prometheus metrics
FastAPI Setup:
```python
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from fastapi import Response

# Define metrics
chat_requests_total = Counter(
    'chat_requests_total',
    'Total chat requests',
    ['intent', 'phi_detected']
)

kb_search_duration_seconds = Histogram(
    'kb_search_duration_seconds',
    'KB search duration',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

tool_failure_total = Counter(
    'tool_failure_total',
    'External tool failures',
    ['tool', 'error_type']
)

phi_redacted_total = Counter(
    'phi_redacted_total',
    'PHI redaction events'
)

indexing_jobs_active = Gauge(
    'indexing_jobs_active',
    'Currently running indexing jobs'
)


@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )
```
Key Metrics
Chat & Query Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| chat_requests_total | Counter | intent, phi_detected | Total chat requests by intent |
| chat_duration_seconds | Histogram | intent | End-to-end chat latency |
| streaming_messages_total | Counter | completed | Streaming message count |
| phi_detected_total | Counter | - | PHI detection events |
| phi_redacted_total | Counter | - | PHI redaction events |
Usage in Code:
```python
async def process_chat(request: ChatRequest):
    phi_detected = await phi_detector.detect(request.message)

    # Increment counter
    chat_requests_total.labels(
        intent=request.intent,
        phi_detected=str(phi_detected.contains_phi)
    ).inc()

    # Time the request
    with chat_duration_seconds.labels(intent=request.intent).time():
        response = await conductor.process_query(request)

    if phi_detected.contains_phi:
        phi_detected_total.inc()

    return response
```
KB & Search Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| kb_search_duration_seconds | Histogram | source_type | KB search latency |
| kb_search_results_total | Histogram | - | Number of results returned |
| kb_cache_hits_total | Counter | - | Redis cache hits |
| kb_cache_misses_total | Counter | - | Redis cache misses |
| embedding_generation_duration_seconds | Histogram | - | Embedding generation time |
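A rough sketch of how a KB search handler might record these metrics (illustrative only; `redis_client` and `run_vector_search` are hypothetical stand-ins, and the cache-key scheme and TTL are assumptions):

```python
import hashlib
import json

from prometheus_client import Counter, Histogram

kb_search_duration_seconds = Histogram(
    'kb_search_duration_seconds', 'KB search latency', ['source_type'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)
kb_cache_hits_total = Counter('kb_cache_hits_total', 'Redis cache hits')
kb_cache_misses_total = Counter('kb_cache_misses_total', 'Redis cache misses')


async def search_kb(query: str, source_type: str, redis_client, run_vector_search):
    """Search the KB, checking the Redis cache first and recording metrics."""
    # Hypothetical key scheme: hash the query so PHI never appears in Redis keys
    query_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
    cache_key = f"kb:search:{source_type}:{query_hash}"

    cached = await redis_client.get(cache_key)
    if cached is not None:
        kb_cache_hits_total.inc()
        return json.loads(cached)

    kb_cache_misses_total.inc()
    # Time only the actual vector search, labelled by source type
    with kb_search_duration_seconds.labels(source_type=source_type).time():
        results = await run_vector_search(query, source_type)

    await redis_client.set(cache_key, json.dumps(results), ex=300)  # 5 min TTL (assumed)
    return results
```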
Indexing Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| indexing_jobs_active | Gauge | - | Currently running jobs |
| indexing_jobs_total | Counter | state | Total jobs by final state |
| indexing_duration_seconds | Histogram | - | Time to index document |
| chunks_created_total | Counter | source_type | Total chunks created |
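A minimal sketch of instrumenting an indexing job with these metrics (the `chunk_document` and `index_chunks` helpers are hypothetical; only the metric names come from the table above):

```python
import time

from prometheus_client import Counter, Gauge, Histogram

indexing_jobs_active = Gauge('indexing_jobs_active', 'Currently running indexing jobs')
indexing_jobs_total = Counter('indexing_jobs_total', 'Total jobs by final state', ['state'])
indexing_duration_seconds = Histogram('indexing_duration_seconds', 'Time to index document')
chunks_created_total = Counter('chunks_created_total', 'Total chunks created', ['source_type'])


async def index_document(doc, source_type: str, chunk_document, index_chunks):
    """Index a single document, tracking job state, duration, and chunk counts."""
    indexing_jobs_active.inc()
    start = time.time()
    state = "failed"  # assume failure until the job completes
    try:
        chunks = await chunk_document(doc)
        await index_chunks(chunks)
        chunks_created_total.labels(source_type=source_type).inc(len(chunks))
        state = "completed"
    finally:
        # Always decrement the gauge and record the outcome, even on exceptions
        indexing_jobs_active.dec()
        indexing_duration_seconds.observe(time.time() - start)
        indexing_jobs_total.labels(state=state).inc()
```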
Tool Invocation Metrics
VoiceAssist uses a comprehensive tools system (see TOOLS_AND_INTEGRATIONS.md) that requires detailed observability.
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| voiceassist_tool_calls_total | Counter | tool_name, status | Total tool calls by status (completed, failed, timeout, cancelled) |
| voiceassist_tool_execution_duration_seconds | Histogram | tool_name | Tool execution duration (p50, p95, p99) |
| voiceassist_tool_confirmation_required_total | Counter | tool_name, confirmed | Tool calls requiring user confirmation |
| voiceassist_tool_phi_detected_total | Counter | tool_name | Tool calls with PHI detected |
| voiceassist_tool_errors_total | Counter | tool_name, error_code | Tool execution errors by code |
| voiceassist_tool_timeouts_total | Counter | tool_name | Tool execution timeouts |
| voiceassist_tool_active_calls | Gauge | tool_name | Currently executing tool calls |
Status Label Values:
- `completed` - Tool executed successfully
- `failed` - Tool execution failed with error
- `timeout` - Tool execution exceeded timeout
- `cancelled` - User cancelled tool execution
Common Error Codes:
- `VALIDATION_ERROR` - Invalid arguments
- `PERMISSION_DENIED` - User lacks permission
- `EXTERNAL_API_ERROR` - External service failure
- `TIMEOUT` - Execution timeout
- `PHI_VIOLATION` - PHI sent to non-PHI tool
Usage in Tool Execution:
```python
# server/app/services/orchestration/tool_executor.py
import asyncio
import time
from contextvars import ContextVar

from prometheus_client import Counter, Histogram, Gauge

# Metrics
tool_calls_total = Counter(
    'voiceassist_tool_calls_total',
    'Total tool invocations',
    ['tool_name', 'status']
)

tool_execution_duration = Histogram(
    'voiceassist_tool_execution_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

tool_confirmation_required = Counter(
    'voiceassist_tool_confirmation_required_total',
    'Tool calls requiring confirmation',
    ['tool_name', 'confirmed']
)

tool_phi_detected = Counter(
    'voiceassist_tool_phi_detected_total',
    'Tool calls with PHI detected',
    ['tool_name']
)

tool_errors = Counter(
    'voiceassist_tool_errors_total',
    'Tool execution errors',
    ['tool_name', 'error_code']
)

tool_timeouts = Counter(
    'voiceassist_tool_timeouts_total',
    'Tool execution timeouts',
    ['tool_name']
)

tool_active_calls = Gauge(
    'voiceassist_tool_active_calls',
    'Currently executing tool calls',
    ['tool_name']
)


async def execute_tool(
    tool_name: str,
    args: dict,
    user: UserContext,
    trace_id: str,
) -> ToolResult:
    """
    Execute a tool with comprehensive metrics tracking.

    See: docs/TOOLS_AND_INTEGRATIONS.md
    See: docs/ORCHESTRATION_DESIGN.md#tool-execution-engine
    """
    start_time = time.time()
    status = "failed"  # Default to failed

    # Increment active calls
    tool_active_calls.labels(tool_name=tool_name).inc()

    try:
        # Get tool definition
        tool_def = TOOL_REGISTRY.get(tool_name)
        if not tool_def:
            tool_errors.labels(tool_name=tool_name, error_code="TOOL_NOT_FOUND").inc()
            raise ToolNotFoundError(f"Tool {tool_name} not found")

        # Check for PHI in arguments
        phi_result = await phi_detector.detect_in_dict(args)
        if phi_result.contains_phi:
            tool_phi_detected.labels(tool_name=tool_name).inc()

            # Ensure tool allows PHI
            if not tool_def.allows_phi:
                tool_errors.labels(tool_name=tool_name, error_code="PHI_VIOLATION").inc()
                raise ToolPHIViolationError(
                    f"Tool {tool_name} cannot process PHI"
                )

        # Check if confirmation required
        if tool_def.requires_confirmation:
            confirmed = await request_user_confirmation(tool_name, args, user, trace_id)
            tool_confirmation_required.labels(
                tool_name=tool_name,
                confirmed=str(confirmed).lower()
            ).inc()

            if not confirmed:
                status = "cancelled"
                return ToolResult(
                    success=False,
                    error_code="USER_CANCELLED",
                    error_message="User cancelled tool execution"
                )

        # Execute tool with timeout
        timeout_seconds = tool_def.timeout_seconds
        try:
            async with asyncio.timeout(timeout_seconds):
                result = await tool_def.execute(args, user, trace_id)
                status = "completed"
                return result
        except asyncio.TimeoutError:
            status = "timeout"
            tool_timeouts.labels(tool_name=tool_name).inc()
            raise ToolTimeoutError(
                f"Tool {tool_name} exceeded timeout ({timeout_seconds}s)"
            )

    except ToolError as e:
        status = "failed"
        tool_errors.labels(tool_name=tool_name, error_code=e.error_code).inc()
        raise
    except Exception:
        status = "failed"
        tool_errors.labels(tool_name=tool_name, error_code="UNKNOWN_ERROR").inc()
        raise
    finally:
        # Record metrics
        duration = time.time() - start_time
        tool_execution_duration.labels(tool_name=tool_name).observe(duration)
        tool_calls_total.labels(tool_name=tool_name, status=status).inc()
        tool_active_calls.labels(tool_name=tool_name).dec()

        # Structured logging
        logger.info(
            "Tool execution completed",
            extra={
                "tool_name": tool_name,
                "status": status,
                "duration_ms": int(duration * 1000),
                "phi_detected": phi_result.contains_phi if 'phi_result' in locals() else False,
                "trace_id": trace_id,
                "user_id": user.id,
            },
        )
```
External Tool Metrics (Legacy)
For backward compatibility, external API calls also emit these metrics:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| tool_requests_total | Counter | tool | Total external API requests (legacy) |
| tool_failure_total | Counter | tool, error_type | External tool failures (legacy) |
| tool_duration_seconds | Histogram | tool | External tool latency (legacy) |
Note: New code should use voiceassist_tool_* metrics above. These legacy metrics are maintained for backward compatibility with Phase 5 implementations.
Logging Conventions
Log Structure
Every log line must include:
- `timestamp` (ISO 8601 UTC)
- `level` (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `service` (service name)
- `trace_id` (from request)
- `message` (log message)
- `session_id` (if applicable)
- `user_id` (if applicable, never with PHI)
JSON Format:
{ "timestamp": "2025-11-20T12:34:56.789Z", "level": "INFO", "service": "kb-service", "trace_id": "550e8400-e29b-41d4-a716-446655440000", "session_id": "abc123", "user_id": "user_456", "message": "KB search completed", "duration_ms": 1234, "results_count": 5 }
Python Logging Setup
```python
import logging
import json
from datetime import datetime
from contextvars import ContextVar

# Context var for trace_id
trace_id_var: ContextVar[str] = ContextVar('trace_id', default='')


class JSONFormatter(logging.Formatter):
    """Format logs as JSON."""

    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "service": settings.SERVICE_NAME,
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        }

        # Add extra fields
        if hasattr(record, 'session_id'):
            log_data['session_id'] = record.session_id
        if hasattr(record, 'user_id'):
            log_data['user_id'] = record.user_id
        if hasattr(record, 'duration_ms'):
            log_data['duration_ms'] = record.duration_ms

        # Add exception info
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_data)


# Configure logger
logger = logging.getLogger("voiceassist")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
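The formatter above reads `trace_id_var`, but something must set it per request. One way to do that, sketched below, is a Starlette middleware; the `X-Trace-ID` header name and echoing the ID back in the response are assumptions rather than part of this spec.

```python
import uuid

from starlette.middleware.base import BaseHTTPMiddleware


class TraceIDMiddleware(BaseHTTPMiddleware):
    """Populate trace_id_var for every request so log lines carry a trace ID."""

    async def dispatch(self, request, call_next):
        # Reuse an upstream trace ID if present, otherwise generate one
        trace_id = request.headers.get("X-Trace-ID", str(uuid.uuid4()))
        token = trace_id_var.set(trace_id)
        try:
            response = await call_next(request)
            response.headers["X-Trace-ID"] = trace_id  # echo back for client-side correlation
            return response
        finally:
            trace_id_var.reset(token)


# app.add_middleware(TraceIDMiddleware)
```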
PHI Logging Rules
CRITICAL: PHI must NEVER be logged directly.
Allowed:
- Session IDs (UUIDs)
- User IDs (UUIDs)
- Document IDs
- Trace IDs
- Intent types
- Error codes
- Counts and aggregates
FORBIDDEN:
- Patient names
- Patient dates of birth
- Medical record numbers
- Actual query text (if contains PHI)
- Clinical context details
- Document content
Instead of logging query text:
```python
from hashlib import sha256

# Bad - may contain PHI
logger.info(f"Processing query: {query}")

# Good - log query hash or length
logger.info(
    "Processing query",
    extra={
        "query_length": len(query),
        "query_hash": sha256(query.encode()).hexdigest()[:8],
        "phi_detected": phi_result.contains_phi,
    },
)
```
Alerting Rules
Critical Alerts (Page On-Call)
| Alert | Condition | Action |
|---|---|---|
| Service Down | Health check failing > 2 minutes | Page on-call engineer |
| Database Unavailable | PostgreSQL readiness check failing | Page DBA + engineer |
| High Error Rate | Error rate > 5% for 5 minutes | Page on-call engineer |
| PHI Leak Detected | PHI in logs or external API call | Page security team immediately |
Warning Alerts (Slack Notification)
| Alert | Condition | Action |
|---|---|---|
| High Latency | p95 latency > 5s for 10 minutes | Notify #engineering |
| KB Search Timeouts | > 10% timeout rate for 5 minutes | Notify #engineering |
| External Tool Failures | > 20% failure rate for 10 minutes | Notify #engineering |
| Indexing Job Failures | > 3 failed jobs in 1 hour | Notify #admin |
Example Prometheus Alert Rules
```yaml
# alerts.yml
groups:
  - name: voiceassist
    rules:
      - alert: HighChatLatency
        expr: histogram_quantile(0.95, chat_duration_seconds_bucket) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High chat latency detected"
          description: "95th percentile chat latency is {{ $value }}s"

      - alert: HighErrorRate
        expr: rate(chat_requests_total{status="error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ExternalToolFailures
        expr: rate(tool_failure_total[5m]) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High external tool failure rate"
          description: "Tool {{ $labels.tool }} failing at {{ $value | humanizePercentage }}"
```
Grafana Dashboards
Suggested Dashboards
- System Overview
  - Request rate (requests/sec)
  - Error rate (%)
  - Latency (p50, p95, p99)
  - Active sessions
- Chat Service
  - Chat requests by intent
  - Streaming vs non-streaming
  - PHI detection rate
  - Citations per response
- Knowledge Base
  - KB search latency
  - Cache hit rate
  - Indexing job status
  - Document count by source type
- External Tools
  - Tool request rate
  - Tool failure rate
  - Tool latency by tool
  - Cost tracking (API usage)
Distributed Tracing (Phase 11-14)
For microservices deployment, add distributed tracing:
Tools: Jaeger or OpenTelemetry
Trace Spans:
- Chat request (root span)
- PHI detection
- KB search
- External tool calls (parallel)
- LLM generation
- Safety filters
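A rough OpenTelemetry sketch of emitting these spans from the conductor (the tracer name and the `phi_detector`, `kb_client`, `llm`, and `safety` objects are illustrative assumptions, not part of the current design):

```python
from opentelemetry import trace

tracer = trace.get_tracer("voiceassist.conductor")


async def handle_chat(request):
    # Root span for the whole chat request
    with tracer.start_as_current_span("chat_request") as root:
        root.set_attribute("intent", request.intent)

        with tracer.start_as_current_span("phi_detection"):
            phi_result = await phi_detector.detect(request.message)

        with tracer.start_as_current_span("kb_search"):
            docs = await kb_client.search(request.message)

        with tracer.start_as_current_span("llm_generation"):
            answer = await llm.generate(request.message, docs)

        with tracer.start_as_current_span("safety_filters"):
            return await safety.filter(answer)
```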
Benefits:
- Visualize request flow across services
- Identify bottlenecks
- Debug distributed failures
Related Documentation
- ARCHITECTURE_V2.md - System architecture
- SECURITY_COMPLIANCE.md - HIPAA logging requirements
- ADMIN_PANEL_SPECS.md - Admin metrics dashboard
- server/README.md - API implementation
Summary
- All services expose `/health`, `/ready`, and `/metrics`
- Metrics use Prometheus format
- Logs use structured JSON with trace IDs
- PHI must NEVER be logged
- Critical alerts page on-call
- Grafana dashboards for monitoring
VoiceAssist Performance Benchmarks
Overview
This document provides comprehensive performance benchmarks for VoiceAssist Phase 10, including baseline metrics, load test results, and performance targets. Use these benchmarks to:
- Evaluate system performance under various load conditions
- Identify performance regressions
- Set realistic SLOs (Service Level Objectives)
- Plan capacity and scaling strategies
Table of Contents
- Testing Environment
- Baseline Performance
- Load Test Results
- Response Time Targets
- Throughput Targets
- Resource Utilization
- Cache Performance
- Database Performance
- Autoscaling Behavior
- Before vs After Optimization
- Performance SLOs
Testing Environment
Infrastructure
- Kubernetes Version: 1.28+
- Node Configuration:
- 3 worker nodes
- 4 vCPU, 16GB RAM per node
- SSD storage
- Database: PostgreSQL 15
- 2 vCPU, 8GB RAM
- Connection pool: 20-50 connections
- Cache: Redis 7
- 2 vCPU, 4GB RAM
- Max memory: 2GB
Application Configuration
- API Gateway: 2-10 replicas (HPA enabled)
- Worker Service: 2-8 replicas (HPA enabled)
- Resource Limits:
- CPU: 500m-2000m
- Memory: 512Mi-2Gi
- HPA Thresholds:
- CPU: 70%
- Memory: 80%
- Custom: 50 req/s per pod
Baseline Performance
No Load Conditions
Metrics collected with zero active users:
| Metric | Value | Notes |
|---|---|---|
| Idle CPU Usage | 5-10% | Background tasks only |
| Idle Memory Usage | 200-300 MB | Per pod |
| Pod Count | 2 (min replicas) | API Gateway + Worker |
| DB Connections | 5-10 active | Connection pool idle |
| Cache Memory | 50-100 MB | Warm cache |
| Health Check Response | 10-20ms | P95 |
Single User Performance
Metrics collected with 1 active user:
| Endpoint | P50 (ms) | P95 (ms) | P99 (ms) | Notes |
|---|---|---|---|---|
| /health | 5 | 10 | 15 | Basic health check |
| /api/auth/login | 50 | 80 | 100 | Includes password hash |
| /api/chat (simple) | 150 | 250 | 350 | Simple query, cache hit |
| /api/chat (complex) | 800 | 1200 | 1500 | Complex query, RAG |
| /api/documents/upload | 500 | 800 | 1200 | 1MB document |
| /api/admin/dashboard | 100 | 180 | 250 | Dashboard metrics |
Load Test Results
Test Methodology
- Tool: Locust (primary), k6 (validation)
- User Distribution:
- 70% Regular Users (simple queries)
- 20% Power Users (complex queries)
- 10% Admin Users (document management)
- Ramp-up: Linear, 10 users/minute
- Duration: 30 minutes steady state
- Think Time: 3-10 seconds between requests
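A minimal Locust sketch matching this user mix might look like the following; the endpoint paths and request payloads are assumptions based on the endpoint table above, not the actual test suite:

```python
from locust import HttpUser, task, between


class RegularUser(HttpUser):
    """Simple-query user: ~70% of the simulated population."""
    weight = 7
    wait_time = between(3, 10)  # think time between requests

    @task
    def simple_chat(self):
        self.client.post("/api/chat", json={"message": "What are the visiting hours?"})


class PowerUser(HttpUser):
    """Complex-query user: ~20% of the population."""
    weight = 2
    wait_time = between(3, 10)

    @task
    def complex_chat(self):
        self.client.post("/api/chat", json={"message": "Summarize the latest hypertension guidelines"})


class AdminUser(HttpUser):
    """Document-management user: ~10% of the population."""
    weight = 1
    wait_time = between(3, 10)

    @task
    def dashboard(self):
        self.client.get("/api/admin/dashboard")
```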
50 Virtual Users
Target: Baseline performance validation
| Metric | Value | Target | Status |
|---|---|---|---|
| Throughput | 45 req/s | 40+ req/s | PASS |
| P50 Response Time | 120ms | <200ms | PASS |
| P95 Response Time | 380ms | <500ms | PASS |
| P99 Response Time | 650ms | <1000ms | PASS |
| Error Rate | 0.1% | <1% | PASS |
| CPU Utilization | 35-45% | <60% | PASS |
| Memory Utilization | 40-50% | <70% | PASS |
| Pod Count | 2-3 | - | - |
| DB Connections | 15-20 | <40 | PASS |
| Cache Hit Rate (L1) | 85% | >80% | PASS |
| Cache Hit Rate (L2) | 70% | >60% | PASS |
| Cache Hit Rate (RAG) | 55% | >50% | PASS |
Key Findings:
- System handles 50 users comfortably with minimal scaling
- Response times well within targets
- Cache performing as expected
- No database bottlenecks
100 Virtual Users
Target: Production load simulation
| Metric | Value | Target | Status |
|---|---|---|---|
| Throughput | 90 req/s | 80+ req/s | PASS |
| P50 Response Time | 180ms | <250ms | PASS |
| P95 Response Time | 520ms | <800ms | PASS |
| P99 Response Time | 950ms | <1500ms | PASS |
| Error Rate | 0.3% | <1% | PASS |
| CPU Utilization | 55-65% | <70% | PASS |
| Memory Utilization | 55-65% | <75% | PASS |
| Pod Count | 4-5 | - | - |
| DB Connections | 25-35 | <45 | PASS |
| Cache Hit Rate (L1) | 83% | >75% | PASS |
| Cache Hit Rate (L2) | 68% | >55% | PASS |
| Cache Hit Rate (RAG) | 52% | >45% | PASS |
Key Findings:
- HPA triggered at ~70 users (CPU threshold)
- Scaled to 4-5 pods
- Response times increased but within targets
- Cache efficiency remains high
- DB connection pool sufficient
200 Virtual Users
Target: Peak load handling
| Metric | Value | Target | Status |
|---|---|---|---|
| Throughput | 175 req/s | 150+ req/s | PASS |
| P50 Response Time | 280ms | <400ms | PASS |
| P95 Response Time | 850ms | <1200ms | PASS |
| P99 Response Time | 1450ms | <2000ms | PASS |
| Error Rate | 0.8% | <2% | PASS |
| CPU Utilization | 68-78% | <80% | PASS |
| Memory Utilization | 65-75% | <80% | PASS |
| Pod Count | 7-8 | - | - |
| DB Connections | 35-45 | <50 | PASS |
| Cache Hit Rate (L1) | 80% | >70% | PASS |
| Cache Hit Rate (L2) | 65% | >50% | PASS |
| Cache Hit Rate (RAG) | 48% | >40% | PASS |
Key Findings:
- Aggressive scaling to 7-8 pods
- Response times degrading but acceptable
- CPU approaching threshold
- DB connection pool near capacity
- Cache still providing value
500 Virtual Users
Target: Stress test / Breaking point
| Metric | Value | Target | Status |
|---|---|---|---|
| Throughput | 380 req/s | 300+ req/s | PASS |
| P50 Response Time | 520ms | <800ms | PASS |
| P95 Response Time | 1850ms | <3000ms | PASS |
| P99 Response Time | 3200ms | <5000ms | PASS |
| Error Rate | 2.5% | <5% | PASS |
| CPU Utilization | 75-85% | <90% | PASS |
| Memory Utilization | 70-80% | <85% | PASS |
| Pod Count | 10 (max) | - | - |
| DB Connections | 45-50 | <50 | MARGINAL |
| Cache Hit Rate (L1) | 75% | >65% | PASS |
| Cache Hit Rate (L2) | 60% | >45% | PASS |
| Cache Hit Rate (RAG) | 42% | >35% | PASS |
Key Findings:
- System at maximum capacity (10 pods)
- Response times significantly degraded
- DB connection pool saturated
- Error rate increasing but acceptable
- Cache hit rates dropping due to churn
- Recommendation: 500 users is the practical operational limit
Breaking Point Analysis:
- At 600+ users: Error rate >5%, P99 >8000ms
- Primary bottleneck: Database connection pool
- Secondary bottleneck: CPU at peak load
- Mitigation: Scale database vertically or add read replicas
Response Time Targets
SLO Definitions
| Percentile | Target | Critical Threshold | Notes |
|---|---|---|---|
| P50 | <200ms | <500ms | Median user experience |
| P95 | <500ms | <1000ms | 95% of requests |
| P99 | <1000ms | <2000ms | Edge cases |
| P99.9 | <2000ms | <5000ms | Rare outliers |
By Endpoint Category
Fast Endpoints (<100ms P95)
- Health checks
- Static content
- Cache hits
- Simple queries
Medium Endpoints (100-500ms P95)
- Authentication
- Simple chat queries
- Profile operations
- Dashboard views
Slow Endpoints (500-1500ms P95)
- Complex chat queries (RAG)
- Document uploads
- Batch operations
- Report generation
Acceptable Outliers (>1500ms)
- Large document processing
- Complex analytics
- Historical data exports
- AI model inference (cold start)
Throughput Targets
Overall System
| Load Level | Target (req/s) | Measured (req/s) | Status |
|---|---|---|---|
| Light (50 users) | 40+ | 45 | PASS |
| Normal (100 users) | 80+ | 90 | PASS |
| Heavy (200 users) | 150+ | 175 | PASS |
| Peak (500 users) | 300+ | 380 | PASS |
By Service
| Service | Target (req/s) | Peak (req/s) | Notes |
|---|---|---|---|
| API Gateway | 400+ | 380 | Primary entry point |
| Auth Service | 50+ | 45 | Login/logout operations |
| Chat Service | 300+ | 280 | Main workload |
| Document Service | 20+ | 25 | Upload/download |
| Admin Service | 10+ | 15 | Management operations |
Resource Utilization
At Different Load Levels
CPU Utilization
| Load | Avg CPU | Peak CPU | Pod Count | Notes |
|---|---|---|---|---|
| 50 users | 40% | 55% | 2-3 | Minimal scaling |
| 100 users | 60% | 75% | 4-5 | Active scaling |
| 200 users | 73% | 85% | 7-8 | Frequent scaling |
| 500 users | 80% | 95% | 10 | Max capacity |
Memory Utilization
| Load | Avg Memory | Peak Memory | Pod Count | Notes |
|---|---|---|---|---|
| 50 users | 45% | 60% | 2-3 | Stable |
| 100 users | 60% | 72% | 4-5 | Gradual increase |
| 200 users | 70% | 82% | 7-8 | High utilization |
| 500 users | 75% | 88% | 10 | Near limit |
Network I/O
| Load | Ingress (MB/s) | Egress (MB/s) | Notes |
|---|---|---|---|
| 50 users | 2.5 | 3.5 | Low bandwidth |
| 100 users | 5.0 | 7.0 | Moderate |
| 200 users | 10.0 | 14.0 | High |
| 500 users | 22.0 | 30.0 | Very high |
Disk I/O
| Load | Read (IOPS) | Write (IOPS) | Notes |
|---|---|---|---|
| 50 users | 150 | 80 | Minimal disk usage |
| 100 users | 300 | 150 | Moderate |
| 200 users | 550 | 280 | High |
| 500 users | 1200 | 600 | Very high |
Cache Performance
L1 Cache (In-Memory)
| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|---|---|---|---|---|---|
| Hit Rate | 85% | 83% | 80% | 75% | >70% |
| Miss Rate | 15% | 17% | 20% | 25% | <30% |
| Avg Latency | 0.5ms | 0.6ms | 0.8ms | 1.2ms | <2ms |
| P95 Latency | 1.0ms | 1.2ms | 1.5ms | 2.5ms | <5ms |
| Eviction Rate | 2/min | 5/min | 12/min | 35/min | - |
L2 Cache (Redis)
| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|---|---|---|---|---|---|
| Hit Rate | 70% | 68% | 65% | 60% | >55% |
| Miss Rate | 30% | 32% | 35% | 40% | <45% |
| Avg Latency | 2.5ms | 3.0ms | 3.8ms | 5.5ms | <10ms |
| P95 Latency | 5.0ms | 6.0ms | 8.0ms | 12.0ms | <20ms |
| Eviction Rate | 5/min | 10/min | 25/min | 80/min | - |
RAG Cache (Vector/Semantic)
| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|---|---|---|---|---|---|
| Hit Rate | 55% | 52% | 48% | 42% | >40% |
| Miss Rate | 45% | 48% | 52% | 58% | <60% |
| Avg Latency | 15ms | 18ms | 22ms | 35ms | <50ms |
| P95 Latency | 35ms | 42ms | 55ms | 85ms | <100ms |
| Eviction Rate | 3/min | 8/min | 20/min | 60/min | - |
Key Findings:
- L1 cache most effective, even at high load
- L2 cache provides good fallback
- RAG cache hit rate lower but still valuable
- Cache eviction increases with load (expected)
- Overall cache strategy working well
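The L1/L2 portion of the lookup path behind these numbers can be sketched roughly as follows (an illustration under assumed interfaces; the in-process dict, Redis client, and `compute` callback are hypothetical, and the semantic RAG tier is omitted):

```python
import json


async def cached_get(key: str, l1: dict, redis_client, compute):
    """Check L1 (in-process) then L2 (Redis), falling back to compute on a miss."""
    # L1: in-memory dict, sub-millisecond
    if key in l1:
        return l1[key]

    # L2: Redis, a few milliseconds
    cached = await redis_client.get(key)
    if cached is not None:
        value = json.loads(cached)
        l1[key] = value  # promote to L1
        return value

    # Miss on both tiers: compute the value (e.g. run the RAG pipeline) and populate caches
    value = await compute()
    await redis_client.set(key, json.dumps(value), ex=600)  # TTL is an assumed example
    l1[key] = value
    return value
```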
Database Performance
Query Performance
| Query Type | P50 (ms) | P95 (ms) | P99 (ms) | Target P95 | Status |
|---|---|---|---|---|---|
| Simple SELECT | 5 | 12 | 18 | <20ms | PASS |
| JOIN (2 tables) | 15 | 35 | 55 | <50ms | PASS |
| JOIN (3+ tables) | 35 | 85 | 150 | <100ms | MARGINAL |
| INSERT | 8 | 18 | 28 | <25ms | PASS |
| UPDATE | 10 | 22 | 35 | <30ms | PASS |
| DELETE | 8 | 20 | 32 | <25ms | PASS |
| Aggregate | 25 | 65 | 120 | <80ms | MARGINAL |
| Full-text Search | 45 | 120 | 200 | <150ms | MARGINAL |
Connection Pool
| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Notes |
|---|---|---|---|---|---|
| Active Connections | 15-20 | 25-35 | 35-45 | 45-50 | Max: 50 |
| Idle Connections | 5-10 | 5-10 | 3-5 | 0-2 | - |
| Wait Time | 0ms | 0ms | 0-5ms | 5-20ms | Queueing at peak |
| Checkout Time | 0.5ms | 0.8ms | 1.2ms | 2.5ms | - |
| Utilization | 35% | 65% | 85% | 98% | Near capacity |
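This pool behaviour corresponds roughly to a SQLAlchemy async engine configured along the lines below; the DSN is a placeholder and the exact numbers are illustrative, chosen to match the 20-50 connection range described in the testing environment:

```python
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:***@postgres:5432/voiceassist",  # placeholder DSN
    pool_size=20,        # steady-state connections
    max_overflow=30,     # burst headroom, giving the ~50 connection ceiling observed above
    pool_timeout=30,     # seconds to wait for a free connection before erroring
    pool_pre_ping=True,  # detect stale connections before handing them out
)
```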
Slow Queries
Queries exceeding 100ms threshold:
| Load | Slow Queries/min | Most Common | Notes |
|---|---|---|---|
| 50 users | 2-5 | Complex JOINs | Acceptable |
| 100 users | 8-15 | Aggregates, Full-text | Within limits |
| 200 users | 25-40 | Unoptimized queries | Needs attention |
| 500 users | 80-120 | All complex queries | Critical |
Recommendations:
- Add indexes for common query patterns
- Optimize 3+ table JOINs
- Consider read replicas for 200+ users
- Review and optimize aggregate queries
- Implement query result caching
Autoscaling Behavior
HPA Metrics
| Metric | Configuration | Observed Behavior |
|---|---|---|
| Min Replicas | 2 | Maintained during idle |
| Max Replicas | 10 | Reached at 500 users |
| Target CPU | 70% | Triggers scale-up reliably |
| Target Memory | 80% | Rarely triggers (CPU first) |
| Custom Metric | 50 req/s | Works well for API Gateway |
| Scale-up Speed | 1 pod/30s | Conservative, prevents flapping |
| Scale-down Speed | 1 pod/5min | Gradual, allows warmup |
| Stabilization | 3min | Prevents rapid oscillation |
Scaling Events Timeline
0-100 Users (Ramp-up Phase)
| User Count | Event | Pod Count | Reason |
|---|---|---|---|
| 0 | Start | 2 | Min replicas |
| 50 | - | 2 | Below threshold |
| 70 | Scale up | 3 | CPU >70% |
| 85 | Scale up | 4 | CPU >70% |
| 100 | Stable | 4-5 | Fluctuating |
100-200 Users (Growth Phase)
| User Count | Event | Pod Count | Reason |
|---|---|---|---|
| 120 | Scale up | 5 | CPU >70% |
| 140 | Scale up | 6 | CPU >70% |
| 170 | Scale up | 7 | CPU >70% |
| 200 | Stable | 7-8 | Fluctuating |
200-500 Users (Peak Phase)
| User Count | Event | Pod Count | Reason |
|---|---|---|---|
| 250 | Scale up | 8 | CPU >70% |
| 320 | Scale up | 9 | CPU >70% |
| 400 | Scale up | 10 | CPU >70% |
| 500 | Max | 10 | Max replicas |
VPA Recommendations
VPA observed resource usage and made the following recommendations:
Before Optimization
| Resource | Requested | Recommended | Actual Usage | Notes |
|---|---|---|---|---|
| CPU | 500m | 800m | 600-700m avg | Under-provisioned |
| Memory | 512Mi | 768Mi | 650-750Mi avg | Under-provisioned |
After Tuning
| Resource | Requested | Recommended | Actual Usage | Notes |
|---|---|---|---|---|
| CPU | 1000m | 1000m | 700-900m avg | Well-provisioned |
| Memory | 1Gi | 1Gi | 700-900Mi avg | Well-provisioned |
Result: VPA recommendations now align with actual usage, indicating proper resource allocation.
Before vs After Optimization
Optimization Focus Areas
- Database Query Optimization
  - Added missing indexes
  - Optimized N+1 queries (see the sketch after this list)
  - Implemented query result caching
- Cache Strategy Enhancement
  - Implemented 3-tier cache (L1, L2, RAG)
  - Optimized TTL values
  - Added cache warming
- Resource Tuning
  - Adjusted CPU/Memory limits based on VPA
  - Optimized connection pool sizing
  - Fine-tuned HPA thresholds
- Code Optimization
  - Reduced middleware overhead
  - Optimized serialization
  - Implemented async processing
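As an example of the N+1 work mentioned above (a hedged sketch; the `Conversation` and `Message` models and the `db` session are hypothetical), eager loading collapses the per-row lazy loads into a single batched query:

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload


async def load_conversations(db):
    """Fetch conversations with their messages in two queries instead of N+1."""
    # Before: select(Conversation) followed by a lazy load of conv.messages per row,
    # i.e. one extra query for every conversation returned (the N+1 pattern).
    # After: selectinload batches all related messages into one additional query.
    stmt = select(Conversation).options(selectinload(Conversation.messages))
    result = await db.execute(stmt)
    return result.scalars().all()
```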
Performance Comparison (100 Users)
| Metric | Before | After | Improvement |
|---|---|---|---|
| P50 Response Time | 320ms | 180ms | 44% faster |
| P95 Response Time | 980ms | 520ms | 47% faster |
| P99 Response Time | 1850ms | 950ms | 49% faster |
| Throughput | 65 req/s | 90 req/s | 38% increase |
| Error Rate | 1.2% | 0.3% | 75% reduction |
| CPU Utilization | 75% | 60% | 20% reduction |
| Memory Utilization | 70% | 60% | 14% reduction |
| DB Queries | 150/s | 90/s | 40% reduction |
| Cache Hit Rate (L1) | 65% | 83% | 28% increase |
| Pod Count | 5-6 | 4-5 | 1 fewer pod |
Cost Implications
| Metric | Before | After | Savings |
|---|---|---|---|
| Avg Pod Count | 5.5 | 4.5 | 18% |
| CPU Hours/Day | 132 | 108 | 18% |
| Memory GB-Hours/Day | 132 | 108 | 18% |
| Estimated Monthly Cost | $450 | $370 | $80 (18%) |
Performance SLOs
Production SLOs (100-200 Users)
| Metric | Target | Critical | Current | Status |
|---|---|---|---|---|
| Availability | 99.9% | 99.5% | 99.95% | PASS |
| P50 Response Time | <250ms | <500ms | 180-280ms | PASS |
| P95 Response Time | <800ms | <1500ms | 520-850ms | PASS |
| P99 Response Time | <1500ms | <3000ms | 950-1450ms | PASS |
| Error Rate | <1% | <3% | 0.3-0.8% | PASS |
| Throughput | >100 req/s | >50 req/s | 90-175 req/s | PASS |
Performance Budget
Maximum acceptable degradation:
| Metric | Baseline | Budget | Alert Threshold |
|---|---|---|---|
| P95 Response Time | 520ms | +30% | >675ms |
| Throughput | 90 req/s | -20% | <72 req/s |
| Error Rate | 0.3% | +200% | >0.9% |
| Cache Hit Rate | 83% | -10% | <75% |
Alerting Rules
Critical Alerts (Page on-call):
- P95 response time >1500ms for 5 minutes
- Error rate >5% for 5 minutes
- Availability <99.5% over 1 hour
- Database connection pool >95% for 10 minutes
Warning Alerts (Notify team):
- P95 response time >800ms for 10 minutes
- Error rate >1% for 10 minutes
- CPU utilization >80% for 15 minutes
- Memory utilization >85% for 15 minutes
- Cache hit rate <70% for 15 minutes
Info Alerts (Log only):
- P95 response time >500ms for 15 minutes
- CPU utilization >70% for 20 minutes
- Autoscaling events
Continuous Monitoring
Key Metrics to Track
- Golden Signals
  - Latency (P50, P95, P99)
  - Traffic (req/s)
  - Errors (rate, count)
  - Saturation (CPU, memory, DB connections)
- Performance Indicators
  - Cache hit rates (all tiers)
  - Database query performance
  - Autoscaling behavior
  - Resource utilization
- Business Metrics
  - User satisfaction (survey data)
  - Feature usage
  - Peak load patterns
  - Cost per request
Dashboards
- Load Testing Overview: `/dashboards/load-testing-overview.json`
- Autoscaling Monitoring: `/dashboards/autoscaling-monitoring.json`
- System Performance: `/dashboards/system-performance.json`
Review Cadence
- Daily: Review overnight metrics, check for anomalies
- Weekly: Analyze trends, update capacity plans
- Monthly: Review SLOs, update benchmarks
- Quarterly: Performance audit, optimization sprint
Conclusion
VoiceAssist Phase 10 demonstrates strong performance characteristics:
Strengths:
- Handles 100-200 concurrent users comfortably
- Response times well within targets
- Effective caching strategy
- Reliable autoscaling
- Significant improvements post-optimization
Areas for Improvement:
- Database connection pool at capacity during peak load (500+ users)
- Some complex queries need optimization
- Cache eviction rate high at extreme load
Recommendations:
- Plan for database scaling (read replicas) before 300+ users
- Continue query optimization efforts
- Monitor cache efficiency and adjust TTLs
- Consider implementing rate limiting for burst traffic
- Review and update benchmarks quarterly
Next Steps:
- See `LOAD_TESTING_GUIDE.md` for testing procedures
- See `PERFORMANCE_TUNING_GUIDE.md` for optimization techniques
- Use Grafana dashboards for ongoing monitoring