VoiceAssist Docs

Analytics & Metrics

Platform analytics, usage metrics, performance monitoring, and observability


VoiceAssist V2 Observability

Purpose: This document defines observability patterns for monitoring, logging, and alerting across all VoiceAssist services.

Last Updated: 2025-11-20


Overview

VoiceAssist V2 uses a three-pillar observability approach:

  1. Metrics - Prometheus for time-series metrics
  2. Logs - Structured logging with trace IDs
  3. Traces - Distributed tracing (optional in Phase 11-14)

Standard Service Endpoints

Every service must expose these endpoints:

Health Check (Liveness)

Endpoint: GET /health

Purpose: Kubernetes liveness probe - is the service process running?

Response:

{ "status": "healthy", "timestamp": "2025-11-20T12:34:56.789Z", "service": "kb-service", "version": "2.0.0" }

FastAPI Example:

```python
from datetime import datetime

from fastapi import APIRouter

router = APIRouter(tags=["observability"])


@router.get("/health")
async def health_check():
    """Liveness probe - is service running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "service": "kb-service",
        "version": "2.0.0",
    }
```

Readiness Check (Dependencies)

Endpoint: GET /ready

Purpose: Kubernetes readiness probe - are dependencies available?

Checks:

  • Database connection (PostgreSQL)
  • Redis connection
  • Qdrant connection (if KB service)
  • Nextcloud API (if applicable)

Response (Healthy):

{ "status": "ready", "timestamp": "2025-11-20T12:34:56.789Z", "dependencies": { "postgres": "healthy", "redis": "healthy", "qdrant": "healthy" } }

Response (Degraded):

{ "status": "degraded", "timestamp": "2025-11-20T12:34:56.789Z", "dependencies": { "postgres": "healthy", "redis": "unhealthy", "qdrant": "healthy" } }

FastAPI Example:

```python
from datetime import datetime

from fastapi import APIRouter, Depends, status
from fastapi.responses import JSONResponse


@router.get("/ready")
async def readiness_check(
    db: Session = Depends(get_db),
    redis: Redis = Depends(get_redis),
):
    """Readiness probe - are dependencies healthy?"""
    dependencies = {}
    all_healthy = True

    # Check PostgreSQL
    try:
        await db.execute("SELECT 1")
        dependencies["postgres"] = "healthy"
    except Exception as e:
        dependencies["postgres"] = "unhealthy"
        all_healthy = False
        logger.error(f"PostgreSQL health check failed: {e}")

    # Check Redis
    try:
        await redis.ping()
        dependencies["redis"] = "healthy"
    except Exception as e:
        dependencies["redis"] = "unhealthy"
        all_healthy = False
        logger.error(f"Redis health check failed: {e}")

    # Check Qdrant (if KB service)
    if settings.SERVICE_NAME == "kb-service":
        try:
            await qdrant_client.health_check()
            dependencies["qdrant"] = "healthy"
        except Exception as e:
            dependencies["qdrant"] = "unhealthy"
            all_healthy = False
            logger.error(f"Qdrant health check failed: {e}")

    status_code = status.HTTP_200_OK if all_healthy else status.HTTP_503_SERVICE_UNAVAILABLE
    return JSONResponse(
        status_code=status_code,
        content={
            "status": "ready" if all_healthy else "degraded",
            "timestamp": datetime.utcnow().isoformat(),
            "dependencies": dependencies,
        },
    )
```

Prometheus Metrics

Endpoint: GET /metrics

Purpose: Export metrics in Prometheus format

Response: Plain text Prometheus metrics

FastAPI Setup:

```python
from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest

# Define metrics
chat_requests_total = Counter(
    'chat_requests_total',
    'Total chat requests',
    ['intent', 'phi_detected']
)

kb_search_duration_seconds = Histogram(
    'kb_search_duration_seconds',
    'KB search duration',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

tool_failure_total = Counter(
    'tool_failure_total',
    'External tool failures',
    ['tool', 'error_type']
)

phi_redacted_total = Counter(
    'phi_redacted_total',
    'PHI redaction events'
)

indexing_jobs_active = Gauge(
    'indexing_jobs_active',
    'Currently running indexing jobs'
)


@router.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )
```

Key Metrics

Chat & Query Metrics

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| chat_requests_total | Counter | intent, phi_detected | Total chat requests by intent |
| chat_duration_seconds | Histogram | intent | End-to-end chat latency |
| streaming_messages_total | Counter | completed | Streaming message count |
| phi_detected_total | Counter | - | PHI detection events |
| phi_redacted_total | Counter | - | PHI redaction events |

Usage in Code:

```python
async def process_chat(request: ChatRequest):
    phi_detected = await phi_detector.detect(request.message)

    # Increment counter
    chat_requests_total.labels(
        intent=request.intent,
        phi_detected=str(phi_detected.contains_phi)
    ).inc()

    # Time the request
    with chat_duration_seconds.labels(intent=request.intent).time():
        response = await conductor.process_query(request)

    if phi_detected.contains_phi:
        phi_detected_total.inc()

    return response
```

KB & Search Metrics

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| kb_search_duration_seconds | Histogram | source_type | KB search latency |
| kb_search_results_total | Histogram | - | Number of results returned |
| kb_cache_hits_total | Counter | - | Redis cache hits |
| kb_cache_misses_total | Counter | - | Redis cache misses |
| embedding_generation_duration_seconds | Histogram | - | Embedding generation time |
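A minimal usage sketch (not the actual service code), assuming the metrics above are defined as module-level prometheus_client objects with the labels listed, and that `cache_key()`, `search_qdrant()`, and the async `redis` client are placeholders for the service's real helpers:

```python
# Illustrative only: metric objects and helpers are assumed, not the real KB service API.
import json


async def search_kb(query: str, source_type: str) -> list:
    key = cache_key(query)

    cached = await redis.get(key)
    if cached is not None:
        kb_cache_hits_total.inc()
        return json.loads(cached)

    kb_cache_misses_total.inc()

    # Time only the vector search itself
    with kb_search_duration_seconds.labels(source_type=source_type).time():
        results = await search_qdrant(query)

    kb_search_results_total.observe(len(results))
    await redis.set(key, json.dumps(results), ex=300)
    return results
```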

Indexing Metrics

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| indexing_jobs_active | Gauge | - | Currently running jobs |
| indexing_jobs_total | Counter | state | Total jobs by final state |
| indexing_duration_seconds | Histogram | - | Time to index document |
| chunks_created_total | Counter | source_type | Total chunks created |
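A hedged sketch of how an indexing worker might record these metrics; `index_document()` and the `document` object are illustrative placeholders, not the real worker API:

```python
# Illustrative only: assumes the indexing_* metrics are defined as in the table above.
async def run_indexing_job(document) -> None:
    indexing_jobs_active.inc()
    state = "failed"
    try:
        with indexing_duration_seconds.time():
            chunks = await index_document(document)
        chunks_created_total.labels(source_type=document.source_type).inc(len(chunks))
        state = "completed"
    finally:
        indexing_jobs_active.dec()
        indexing_jobs_total.labels(state=state).inc()
```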

Tool Invocation Metrics

VoiceAssist uses a comprehensive tools system (see TOOLS_AND_INTEGRATIONS.md) that requires detailed observability.

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| voiceassist_tool_calls_total | Counter | tool_name, status | Total tool calls by status (completed, failed, timeout, cancelled) |
| voiceassist_tool_execution_duration_seconds | Histogram | tool_name | Tool execution duration (p50, p95, p99) |
| voiceassist_tool_confirmation_required_total | Counter | tool_name, confirmed | Tool calls requiring user confirmation |
| voiceassist_tool_phi_detected_total | Counter | tool_name | Tool calls with PHI detected |
| voiceassist_tool_errors_total | Counter | tool_name, error_code | Tool execution errors by code |
| voiceassist_tool_timeouts_total | Counter | tool_name | Tool execution timeouts |
| voiceassist_tool_active_calls | Gauge | tool_name | Currently executing tool calls |

Status Label Values:

  • completed - Tool executed successfully
  • failed - Tool execution failed with error
  • timeout - Tool execution exceeded timeout
  • cancelled - User cancelled tool execution

Common Error Codes:

  • VALIDATION_ERROR - Invalid arguments
  • PERMISSION_DENIED - User lacks permission
  • EXTERNAL_API_ERROR - External service failure
  • TIMEOUT - Execution timeout
  • PHI_VIOLATION - PHI sent to non-PHI tool

Usage in Tool Execution:

```python
# server/app/services/orchestration/tool_executor.py
import asyncio
import time
from contextvars import ContextVar

from prometheus_client import Counter, Gauge, Histogram

# Metrics
tool_calls_total = Counter(
    'voiceassist_tool_calls_total',
    'Total tool invocations',
    ['tool_name', 'status']
)

tool_execution_duration = Histogram(
    'voiceassist_tool_execution_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

tool_confirmation_required = Counter(
    'voiceassist_tool_confirmation_required_total',
    'Tool calls requiring confirmation',
    ['tool_name', 'confirmed']
)

tool_phi_detected = Counter(
    'voiceassist_tool_phi_detected_total',
    'Tool calls with PHI detected',
    ['tool_name']
)

tool_errors = Counter(
    'voiceassist_tool_errors_total',
    'Tool execution errors',
    ['tool_name', 'error_code']
)

tool_timeouts = Counter(
    'voiceassist_tool_timeouts_total',
    'Tool execution timeouts',
    ['tool_name']
)

tool_active_calls = Gauge(
    'voiceassist_tool_active_calls',
    'Currently executing tool calls',
    ['tool_name']
)


async def execute_tool(
    tool_name: str,
    args: dict,
    user: UserContext,
    trace_id: str,
) -> ToolResult:
    """
    Execute a tool with comprehensive metrics tracking.

    See: docs/TOOLS_AND_INTEGRATIONS.md
    See: docs/ORCHESTRATION_DESIGN.md#tool-execution-engine
    """
    start_time = time.time()
    status = "failed"  # Default to failed

    # Increment active calls
    tool_active_calls.labels(tool_name=tool_name).inc()

    try:
        # Get tool definition
        tool_def = TOOL_REGISTRY.get(tool_name)
        if not tool_def:
            tool_errors.labels(tool_name=tool_name, error_code="TOOL_NOT_FOUND").inc()
            raise ToolNotFoundError(f"Tool {tool_name} not found")

        # Check for PHI in arguments
        phi_result = await phi_detector.detect_in_dict(args)
        if phi_result.contains_phi:
            tool_phi_detected.labels(tool_name=tool_name).inc()

            # Ensure tool allows PHI
            if not tool_def.allows_phi:
                tool_errors.labels(tool_name=tool_name, error_code="PHI_VIOLATION").inc()
                raise ToolPHIViolationError(
                    f"Tool {tool_name} cannot process PHI"
                )

        # Check if confirmation required
        if tool_def.requires_confirmation:
            confirmed = await request_user_confirmation(tool_name, args, user, trace_id)
            tool_confirmation_required.labels(
                tool_name=tool_name,
                confirmed=str(confirmed).lower()
            ).inc()

            if not confirmed:
                status = "cancelled"
                return ToolResult(
                    success=False,
                    error_code="USER_CANCELLED",
                    error_message="User cancelled tool execution"
                )

        # Execute tool with timeout
        timeout_seconds = tool_def.timeout_seconds
        try:
            async with asyncio.timeout(timeout_seconds):
                result = await tool_def.execute(args, user, trace_id)
                status = "completed"
                return result
        except asyncio.TimeoutError:
            status = "timeout"
            tool_timeouts.labels(tool_name=tool_name).inc()
            raise ToolTimeoutError(
                f"Tool {tool_name} exceeded timeout ({timeout_seconds}s)"
            )

    except ToolError as e:
        status = "failed"
        tool_errors.labels(tool_name=tool_name, error_code=e.error_code).inc()
        raise
    except Exception as e:
        status = "failed"
        tool_errors.labels(tool_name=tool_name, error_code="UNKNOWN_ERROR").inc()
        raise
    finally:
        # Record metrics
        duration = time.time() - start_time
        tool_execution_duration.labels(tool_name=tool_name).observe(duration)
        tool_calls_total.labels(tool_name=tool_name, status=status).inc()
        tool_active_calls.labels(tool_name=tool_name).dec()

        # Structured logging
        logger.info(
            "Tool execution completed",
            extra={
                "tool_name": tool_name,
                "status": status,
                "duration_ms": int(duration * 1000),
                "phi_detected": phi_result.contains_phi if 'phi_result' in locals() else False,
                "trace_id": trace_id,
                "user_id": user.id,
            }
        )
```

External Tool Metrics (Legacy)

For backward compatibility, external API calls also emit these metrics:

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| tool_requests_total | Counter | tool | Total external API requests (legacy) |
| tool_failure_total | Counter | tool, error_type | External tool failures (legacy) |
| tool_duration_seconds | Histogram | tool | External tool latency (legacy) |

Note: New code should use voiceassist_tool_* metrics above. These legacy metrics are maintained for backward compatibility with Phase 5 implementations.


Logging Conventions

Log Structure

Every log line must include:

  • timestamp (ISO 8601 UTC)
  • level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • service (service name)
  • trace_id (from request)
  • message (log message)
  • session_id (if applicable)
  • user_id (if applicable, never with PHI)

JSON Format:

{ "timestamp": "2025-11-20T12:34:56.789Z", "level": "INFO", "service": "kb-service", "trace_id": "550e8400-e29b-41d4-a716-446655440000", "session_id": "abc123", "user_id": "user_456", "message": "KB search completed", "duration_ms": 1234, "results_count": 5 }

Python Logging Setup

```python
import json
import logging
from contextvars import ContextVar
from datetime import datetime

# Context var for trace_id
trace_id_var: ContextVar[str] = ContextVar('trace_id', default='')


class JSONFormatter(logging.Formatter):
    """Format logs as JSON."""

    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "service": settings.SERVICE_NAME,
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        }

        # Add extra fields
        if hasattr(record, 'session_id'):
            log_data['session_id'] = record.session_id
        if hasattr(record, 'user_id'):
            log_data['user_id'] = record.user_id
        if hasattr(record, 'duration_ms'):
            log_data['duration_ms'] = record.duration_ms

        # Add exception info
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_data)


# Configure logger
logger = logging.getLogger("voiceassist")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
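The formatter above reads trace_id_var, but something has to populate it for each request. One possible wiring is a small ASGI middleware like the sketch below; the X-Trace-ID header name and the `app` object are assumptions, not the confirmed VoiceAssist setup:

```python
# Sketch only: header name, middleware placement, and `app` are assumed.
import uuid

from starlette.middleware.base import BaseHTTPMiddleware


class TraceIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse an incoming trace ID if the caller supplied one, else mint a new one
        trace_id = request.headers.get("X-Trace-ID", str(uuid.uuid4()))
        token = trace_id_var.set(trace_id)
        try:
            response = await call_next(request)
            # Echo the trace ID back so clients and logs can be correlated
            response.headers["X-Trace-ID"] = trace_id
            return response
        finally:
            trace_id_var.reset(token)


# `app` is the FastAPI application instance (assumed)
app.add_middleware(TraceIDMiddleware)
```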

PHI Logging Rules

CRITICAL: PHI must NEVER be logged directly.

Allowed:

  • Session IDs (UUIDs)
  • User IDs (UUIDs)
  • Document IDs
  • Trace IDs
  • Intent types
  • Error codes
  • Counts and aggregates

FORBIDDEN:

  • Patient names
  • Patient dates of birth
  • Medical record numbers
  • Actual query text (if contains PHI)
  • Clinical context details
  • Document content

Instead of logging query text:

```python
# Bad - may contain PHI
logger.info(f"Processing query: {query}")

# Good - log query hash or length
logger.info(
    "Processing query",
    extra={
        "query_length": len(query),
        "query_hash": sha256(query.encode()).hexdigest()[:8],
        "phi_detected": phi_result.contains_phi,
    }
)
```

Alerting Rules

Critical Alerts (Page On-Call)

| Alert | Condition | Action |
|-------|-----------|--------|
| Service Down | Health check failing > 2 minutes | Page on-call engineer |
| Database Unavailable | PostgreSQL readiness check failing | Page DBA + engineer |
| High Error Rate | Error rate > 5% for 5 minutes | Page on-call engineer |
| PHI Leak Detected | PHI in logs or external API call | Page security team immediately |

Warning Alerts (Slack Notification)

| Alert | Condition | Action |
|-------|-----------|--------|
| High Latency | p95 latency > 5s for 10 minutes | Notify #engineering |
| KB Search Timeouts | > 10% timeout rate for 5 minutes | Notify #engineering |
| External Tool Failures | > 20% failure rate for 10 minutes | Notify #engineering |
| Indexing Job Failures | > 3 failed jobs in 1 hour | Notify #admin |

Example Prometheus Alert Rules

```yaml
# alerts.yml
groups:
  - name: voiceassist
    rules:
      - alert: HighChatLatency
        expr: histogram_quantile(0.95, chat_duration_seconds_bucket) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High chat latency detected"
          description: "95th percentile chat latency is {{ $value }}s"

      - alert: HighErrorRate
        expr: rate(chat_requests_total{status="error"}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ExternalToolFailures
        expr: rate(tool_failure_total[5m]) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High external tool failure rate"
          description: "Tool {{ $labels.tool }} failing at {{ $value | humanizePercentage }}"
```

Grafana Dashboards

Suggested Dashboards

  1. System Overview

    • Request rate (requests/sec)
    • Error rate (%)
    • Latency (p50, p95, p99)
    • Active sessions
  2. Chat Service

    • Chat requests by intent
    • Streaming vs non-streaming
    • PHI detection rate
    • Citations per response
  3. Knowledge Base

    • KB search latency
    • Cache hit rate
    • Indexing job status
    • Document count by source type
  4. External Tools

    • Tool request rate
    • Tool failure rate
    • Tool latency by tool
    • Cost tracking (API usage)

Distributed Tracing (Phase 11-14)

For microservices deployment, add distributed tracing:

Tools: Jaeger or OpenTelemetry

Trace Spans:

  • Chat request (root span)
    • PHI detection
    • KB search
    • External tool calls (parallel)
    • LLM generation
    • Safety filters
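As a sketch of what this span hierarchy could look like with OpenTelemetry, the snippet below nests child spans for the pipeline stages above; `phi_detector`, `kb_service`, and `llm` are placeholders, and exporter/provider setup is omitted:

```python
# Illustrative only: tracer name, attributes, and service objects are assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("voiceassist.conductor")


async def process_chat(request):
    # Root span for the chat request; child spans mirror the pipeline stages
    with tracer.start_as_current_span("chat.request") as span:
        span.set_attribute("chat.intent", request.intent)

        with tracer.start_as_current_span("phi.detection"):
            phi_result = await phi_detector.detect(request.message)

        with tracer.start_as_current_span("kb.search"):
            passages = await kb_service.search(request.message)

        with tracer.start_as_current_span("llm.generation"):
            answer = await llm.generate(request.message, passages)

        return answer
```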

Benefits:

  • Visualize request flow across services
  • Identify bottlenecks
  • Debug distributed failures


Summary

  • All services expose /health, /ready, /metrics
  • Metrics use Prometheus format
  • Logs use structured JSON with trace IDs
  • PHI must NEVER be logged
  • Critical alerts page on-call
  • Grafana dashboards for monitoring

VoiceAssist Performance Benchmarks

Overview

This document provides comprehensive performance benchmarks for VoiceAssist Phase 10, including baseline metrics, load test results, and performance targets. Use these benchmarks to:

  • Evaluate system performance under various load conditions
  • Identify performance regressions
  • Set realistic SLOs (Service Level Objectives)
  • Plan capacity and scaling strategies



Testing Environment

Infrastructure

  • Kubernetes Version: 1.28+
  • Node Configuration:
    • 3 worker nodes
    • 4 vCPU, 16GB RAM per node
    • SSD storage
  • Database: PostgreSQL 15
    • 2 vCPU, 8GB RAM
    • Connection pool: 20-50 connections
  • Cache: Redis 7
    • 2 vCPU, 4GB RAM
    • Max memory: 2GB

Application Configuration

  • API Gateway: 2-10 replicas (HPA enabled)
  • Worker Service: 2-8 replicas (HPA enabled)
  • Resource Limits:
    • CPU: 500m-2000m
    • Memory: 512Mi-2Gi
  • HPA Thresholds:
    • CPU: 70%
    • Memory: 80%
    • Custom: 50 req/s per pod

Baseline Performance

No Load Conditions

Metrics collected with zero active users:

| Metric | Value | Notes |
|--------|-------|-------|
| Idle CPU Usage | 5-10% | Background tasks only |
| Idle Memory Usage | 200-300 MB | Per pod |
| Pod Count | 2 (min replicas) | API Gateway + Worker |
| DB Connections | 5-10 active | Connection pool idle |
| Cache Memory | 50-100 MB | Warm cache |
| Health Check Response | 10-20 ms | P95 |

Single User Performance

Metrics collected with 1 active user:

| Endpoint | P50 (ms) | P95 (ms) | P99 (ms) | Notes |
|----------|----------|----------|----------|-------|
| /health | 5 | 10 | 15 | Basic health check |
| /api/auth/login | 50 | 80 | 100 | Includes password hash |
| /api/chat (simple) | 150 | 250 | 350 | Simple query, cache hit |
| /api/chat (complex) | 800 | 1200 | 1500 | Complex query, RAG |
| /api/documents/upload | 500 | 800 | 1200 | 1MB document |
| /api/admin/dashboard | 100 | 180 | 250 | Dashboard metrics |

Load Test Results

Test Methodology

  • Tool: Locust (primary), k6 (validation)
  • User Distribution:
    • 70% Regular Users (simple queries)
    • 20% Power Users (complex queries)
    • 10% Admin Users (document management)
  • Ramp-up: Linear, 10 users/minute
  • Duration: 30 minutes steady state
  • Think Time: 3-10 seconds between requests
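A minimal Locust sketch matching this methodology (weights 7/2/1 approximate the 70/20/10 split, and wait_time matches the 3-10 second think time); the endpoints and payloads are illustrative, not the exact test plan:

```python
from locust import HttpUser, between, task


class RegularUser(HttpUser):
    weight = 7                  # ~70% of simulated users
    wait_time = between(3, 10)

    @task
    def simple_chat(self):
        self.client.post("/api/chat", json={"message": "What are the visiting hours?"})


class PowerUser(HttpUser):
    weight = 2                  # ~20% of simulated users
    wait_time = between(3, 10)

    @task
    def complex_chat(self):
        # Payload shape is an assumption; complex queries exercise the RAG path
        self.client.post("/api/chat", json={"message": "Summarize recent hypertension guidelines"})


class AdminUser(HttpUser):
    weight = 1                  # ~10% of simulated users
    wait_time = between(3, 10)

    @task
    def dashboard(self):
        self.client.get("/api/admin/dashboard")
```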

50 Virtual Users

Target: Baseline performance validation

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Throughput | 45 req/s | 40+ req/s | PASS |
| P50 Response Time | 120ms | <200ms | PASS |
| P95 Response Time | 380ms | <500ms | PASS |
| P99 Response Time | 650ms | <1000ms | PASS |
| Error Rate | 0.1% | <1% | PASS |
| CPU Utilization | 35-45% | <60% | PASS |
| Memory Utilization | 40-50% | <70% | PASS |
| Pod Count | 2-3 | - | - |
| DB Connections | 15-20 | <40 | PASS |
| Cache Hit Rate (L1) | 85% | >80% | PASS |
| Cache Hit Rate (L2) | 70% | >60% | PASS |
| Cache Hit Rate (RAG) | 55% | >50% | PASS |

Key Findings:

  • System handles 50 users comfortably with minimal scaling
  • Response times well within targets
  • Cache performing as expected
  • No database bottlenecks

100 Virtual Users

Target: Production load simulation

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Throughput | 90 req/s | 80+ req/s | PASS |
| P50 Response Time | 180ms | <250ms | PASS |
| P95 Response Time | 520ms | <800ms | PASS |
| P99 Response Time | 950ms | <1500ms | PASS |
| Error Rate | 0.3% | <1% | PASS |
| CPU Utilization | 55-65% | <70% | PASS |
| Memory Utilization | 55-65% | <75% | PASS |
| Pod Count | 4-5 | - | - |
| DB Connections | 25-35 | <45 | PASS |
| Cache Hit Rate (L1) | 83% | >75% | PASS |
| Cache Hit Rate (L2) | 68% | >55% | PASS |
| Cache Hit Rate (RAG) | 52% | >45% | PASS |

Key Findings:

  • HPA triggered at ~70 users (CPU threshold)
  • Scaled to 4-5 pods
  • Response times increased but within targets
  • Cache efficiency remains high
  • DB connection pool sufficient

200 Virtual Users

Target: Peak load handling

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Throughput | 175 req/s | 150+ req/s | PASS |
| P50 Response Time | 280ms | <400ms | PASS |
| P95 Response Time | 850ms | <1200ms | PASS |
| P99 Response Time | 1450ms | <2000ms | PASS |
| Error Rate | 0.8% | <2% | PASS |
| CPU Utilization | 68-78% | <80% | PASS |
| Memory Utilization | 65-75% | <80% | PASS |
| Pod Count | 7-8 | - | - |
| DB Connections | 35-45 | <50 | PASS |
| Cache Hit Rate (L1) | 80% | >70% | PASS |
| Cache Hit Rate (L2) | 65% | >50% | PASS |
| Cache Hit Rate (RAG) | 48% | >40% | PASS |

Key Findings:

  • Aggressive scaling to 7-8 pods
  • Response times degrading but acceptable
  • CPU approaching threshold
  • DB connection pool near capacity
  • Cache still providing value

500 Virtual Users

Target: Stress test / Breaking point

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Throughput | 380 req/s | 300+ req/s | PASS |
| P50 Response Time | 520ms | <800ms | PASS |
| P95 Response Time | 1850ms | <3000ms | PASS |
| P99 Response Time | 3200ms | <5000ms | PASS |
| Error Rate | 2.5% | <5% | PASS |
| CPU Utilization | 75-85% | <90% | PASS |
| Memory Utilization | 70-80% | <85% | PASS |
| Pod Count | 10 (max) | - | - |
| DB Connections | 45-50 | <50 | MARGINAL |
| Cache Hit Rate (L1) | 75% | >65% | PASS |
| Cache Hit Rate (L2) | 60% | >45% | PASS |
| Cache Hit Rate (RAG) | 42% | >35% | PASS |

Key Findings:

  • System at maximum capacity (10 pods)
  • Response times significantly degraded
  • DB connection pool saturated
  • Error rate increasing but acceptable
  • Cache hit rates dropping due to churn
  • Recommendation: 500 users is the operational limit

Breaking Point Analysis:

  • At 600+ users: Error rate >5%, P99 >8000ms
  • Primary bottleneck: Database connection pool
  • Secondary bottleneck: CPU at peak load
  • Mitigation: Scale database vertically or add read replicas

Response Time Targets

SLO Definitions

| Percentile | Target | Critical Threshold | Notes |
|------------|--------|--------------------|-------|
| P50 | <200ms | <500ms | Median user experience |
| P95 | <500ms | <1000ms | 95% of requests |
| P99 | <1000ms | <2000ms | Edge cases |
| P99.9 | <2000ms | <5000ms | Rare outliers |

By Endpoint Category

Fast Endpoints (<100ms P95)

  • Health checks
  • Static content
  • Cache hits
  • Simple queries

Medium Endpoints (100-500ms P95)

  • Authentication
  • Simple chat queries
  • Profile operations
  • Dashboard views

Slow Endpoints (500-1500ms P95)

  • Complex chat queries (RAG)
  • Document uploads
  • Batch operations
  • Report generation

Acceptable Outliers (>1500ms)

  • Large document processing
  • Complex analytics
  • Historical data exports
  • AI model inference (cold start)

Throughput Targets

Overall System

| Load Level | Target (req/s) | Measured (req/s) | Status |
|------------|----------------|------------------|--------|
| Light (50 users) | 40+ | 45 | PASS |
| Normal (100 users) | 80+ | 90 | PASS |
| Heavy (200 users) | 150+ | 175 | PASS |
| Peak (500 users) | 300+ | 380 | PASS |

By Service

| Service | Target (req/s) | Peak (req/s) | Notes |
|---------|----------------|--------------|-------|
| API Gateway | 400+ | 380 | Primary entry point |
| Auth Service | 50+ | 45 | Login/logout operations |
| Chat Service | 300+ | 280 | Main workload |
| Document Service | 20+ | 25 | Upload/download |
| Admin Service | 10+ | 15 | Management operations |

Resource Utilization

At Different Load Levels

CPU Utilization

| Load | Avg CPU | Peak CPU | Pod Count | Notes |
|------|---------|----------|-----------|-------|
| 50 users | 40% | 55% | 2-3 | Minimal scaling |
| 100 users | 60% | 75% | 4-5 | Active scaling |
| 200 users | 73% | 85% | 7-8 | Frequent scaling |
| 500 users | 80% | 95% | 10 | Max capacity |

Memory Utilization

| Load | Avg Memory | Peak Memory | Pod Count | Notes |
|------|------------|-------------|-----------|-------|
| 50 users | 45% | 60% | 2-3 | Stable |
| 100 users | 60% | 72% | 4-5 | Gradual increase |
| 200 users | 70% | 82% | 7-8 | High utilization |
| 500 users | 75% | 88% | 10 | Near limit |

Network I/O

| Load | Ingress (MB/s) | Egress (MB/s) | Notes |
|------|----------------|---------------|-------|
| 50 users | 2.5 | 3.5 | Low bandwidth |
| 100 users | 5.0 | 7.0 | Moderate |
| 200 users | 10.0 | 14.0 | High |
| 500 users | 22.0 | 30.0 | Very high |

Disk I/O

| Load | Read (IOPS) | Write (IOPS) | Notes |
|------|-------------|--------------|-------|
| 50 users | 150 | 80 | Minimal disk usage |
| 100 users | 300 | 150 | Moderate |
| 200 users | 550 | 280 | High |
| 500 users | 1200 | 600 | Very high |

Cache Performance

L1 Cache (In-Memory)

| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|--------|----------|-----------|-----------|-----------|--------|
| Hit Rate | 85% | 83% | 80% | 75% | >70% |
| Miss Rate | 15% | 17% | 20% | 25% | <30% |
| Avg Latency | 0.5ms | 0.6ms | 0.8ms | 1.2ms | <2ms |
| P95 Latency | 1.0ms | 1.2ms | 1.5ms | 2.5ms | <5ms |
| Eviction Rate | 2/min | 5/min | 12/min | 35/min | - |

L2 Cache (Redis)

| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|--------|----------|-----------|-----------|-----------|--------|
| Hit Rate | 70% | 68% | 65% | 60% | >55% |
| Miss Rate | 30% | 32% | 35% | 40% | <45% |
| Avg Latency | 2.5ms | 3.0ms | 3.8ms | 5.5ms | <10ms |
| P95 Latency | 5.0ms | 6.0ms | 8.0ms | 12.0ms | <20ms |
| Eviction Rate | 5/min | 10/min | 25/min | 80/min | - |

RAG Cache (Vector/Semantic)

| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Target |
|--------|----------|-----------|-----------|-----------|--------|
| Hit Rate | 55% | 52% | 48% | 42% | >40% |
| Miss Rate | 45% | 48% | 52% | 58% | <60% |
| Avg Latency | 15ms | 18ms | 22ms | 35ms | <50ms |
| P95 Latency | 35ms | 42ms | 55ms | 85ms | <100ms |
| Eviction Rate | 3/min | 8/min | 20/min | 60/min | - |

Key Findings:

  • L1 cache most effective, even at high load
  • L2 cache provides good fallback
  • RAG cache hit rate lower but still valuable
  • Cache eviction increases with load (expected)
  • Overall cache strategy working well
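For reference, a simplified sketch of the 3-tier lookup described above; the `l1_cache` dict, `vector_cache` helper, `run_rag_pipeline()`, and the TTL and similarity threshold are illustrative assumptions, not the production implementation:

```python
# Illustrative only: key derivation, TTLs, and helpers are assumed.
import json


async def cached_retrieve(query: str):
    key = f"rag:{hash(query)}"

    # L1: in-process dict (fastest tier, evicted most aggressively)
    if key in l1_cache:
        return l1_cache[key]

    # L2: Redis (shared across pods)
    cached = await redis.get(key)
    if cached is not None:
        value = json.loads(cached)
        l1_cache[key] = value
        return value

    # RAG cache: reuse results for semantically similar queries
    similar = await vector_cache.lookup(query, min_score=0.92)
    if similar is not None:
        return similar

    # Full retrieval on a miss, then populate the upper tiers
    value = await run_rag_pipeline(query)
    await redis.set(key, json.dumps(value), ex=300)
    l1_cache[key] = value
    return value
```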

Database Performance

Query Performance

| Query Type | P50 (ms) | P95 (ms) | P99 (ms) | Target P95 | Status |
|------------|----------|----------|----------|------------|--------|
| Simple SELECT | 5 | 12 | 18 | <20ms | PASS |
| JOIN (2 tables) | 15 | 35 | 55 | <50ms | PASS |
| JOIN (3+ tables) | 35 | 85 | 150 | <100ms | MARGINAL |
| INSERT | 8 | 18 | 28 | <25ms | PASS |
| UPDATE | 10 | 22 | 35 | <30ms | PASS |
| DELETE | 8 | 20 | 32 | <25ms | PASS |
| Aggregate | 25 | 65 | 120 | <80ms | MARGINAL |
| Full-text Search | 45 | 120 | 200 | <150ms | MARGINAL |

Connection Pool

| Metric | 50 Users | 100 Users | 200 Users | 500 Users | Notes |
|--------|----------|-----------|-----------|-----------|-------|
| Active Connections | 15-20 | 25-35 | 35-45 | 45-50 | Max: 50 |
| Idle Connections | 5-10 | 5-10 | 3-5 | 0-2 | - |
| Wait Time | 0ms | 0ms | 0-5ms | 5-20ms | Queueing at peak |
| Checkout Time | 0.5ms | 0.8ms | 1.2ms | 2.5ms | - |
| Utilization | 35% | 65% | 85% | 98% | Near capacity |
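One way the 20-50 connection range could be expressed in application configuration is a SQLAlchemy async engine like the sketch below; the DSN and exact numbers are deployment-specific assumptions, and per-pod pools must be sized so that all replicas together stay under the database limit:

```python
# Hypothetical pool configuration; values here are illustrative, not the production settings.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://voiceassist:***@postgres:5432/voiceassist",
    pool_size=20,        # steady-state connections
    max_overflow=30,     # burst headroom up to the observed 50-connection ceiling
    pool_timeout=5,      # fail fast instead of queueing indefinitely at peak
    pool_pre_ping=True,  # drop dead connections before handing them out
)
```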

Slow Queries

Queries exceeding 100ms threshold:

| Load | Slow Queries/min | Most Common | Notes |
|------|------------------|-------------|-------|
| 50 users | 2-5 | Complex JOINs | Acceptable |
| 100 users | 8-15 | Aggregates, Full-text | Within limits |
| 200 users | 25-40 | Unoptimized queries | Needs attention |
| 500 users | 80-120 | All complex queries | Critical |

Recommendations:

  • Add indexes for common query patterns
  • Optimize 3+ table JOINs
  • Consider read replicas for 200+ users
  • Review and optimize aggregate queries
  • Implement query result caching
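The first recommendation (adding indexes for common query patterns) might look like the following hypothetical Alembic migration; the table and column names are examples, not the actual VoiceAssist schema:

```python
# Hypothetical migration sketch; index targets are assumptions for illustration.
from alembic import op


def upgrade() -> None:
    op.create_index("ix_messages_session_id", "messages", ["session_id"])
    op.create_index("ix_documents_source_type", "documents", ["source_type"])


def downgrade() -> None:
    op.drop_index("ix_messages_session_id", table_name="messages")
    op.drop_index("ix_documents_source_type", table_name="documents")
```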

Autoscaling Behavior

HPA Metrics

| Metric | Configuration | Observed Behavior |
|--------|---------------|-------------------|
| Min Replicas | 2 | Maintained during idle |
| Max Replicas | 10 | Reached at 500 users |
| Target CPU | 70% | Triggers scale-up reliably |
| Target Memory | 80% | Rarely triggers (CPU first) |
| Custom Metric | 50 req/s | Works well for API Gateway |
| Scale-up Speed | 1 pod/30s | Conservative, prevents flapping |
| Scale-down Speed | 1 pod/5min | Gradual, allows warmup |
| Stabilization | 3min | Prevents rapid oscillation |

Scaling Events Timeline

0-100 Users (Ramp-up Phase)

| User Count | Event | Pod Count | Reason |
|------------|-------|-----------|--------|
| 0 | Start | 2 | Min replicas |
| 50 | - | 2 | Below threshold |
| 70 | Scale up | 3 | CPU >70% |
| 85 | Scale up | 4 | CPU >70% |
| 100 | Stable | 4-5 | Fluctuating |

100-200 Users (Growth Phase)

| User Count | Event | Pod Count | Reason |
|------------|-------|-----------|--------|
| 120 | Scale up | 5 | CPU >70% |
| 140 | Scale up | 6 | CPU >70% |
| 170 | Scale up | 7 | CPU >70% |
| 200 | Stable | 7-8 | Fluctuating |

200-500 Users (Peak Phase)

| User Count | Event | Pod Count | Reason |
|------------|-------|-----------|--------|
| 250 | Scale up | 8 | CPU >70% |
| 320 | Scale up | 9 | CPU >70% |
| 400 | Scale up | 10 | CPU >70% |
| 500 | Max | 10 | Max replicas |

VPA Recommendations

VPA observed resource usage and made the following recommendations:

Before Optimization

| Resource | Requested | Recommended | Actual Usage | Notes |
|----------|-----------|-------------|--------------|-------|
| CPU | 500m | 800m | 600-700m avg | Under-provisioned |
| Memory | 512Mi | 768Mi | 650-750Mi avg | Under-provisioned |

After Tuning

| Resource | Requested | Recommended | Actual Usage | Notes |
|----------|-----------|-------------|--------------|-------|
| CPU | 1000m | 1000m | 700-900m avg | Well-provisioned |
| Memory | 1Gi | 1Gi | 700-900Mi avg | Well-provisioned |

Result: VPA recommendations now align with actual usage, indicating proper resource allocation.


Before vs After Optimization

Optimization Focus Areas

  1. Database Query Optimization

    • Added missing indexes
    • Optimized N+1 queries
    • Implemented query result caching
  2. Cache Strategy Enhancement

    • Implemented 3-tier cache (L1, L2, RAG)
    • Optimized TTL values
    • Added cache warming
  3. Resource Tuning

    • Adjusted CPU/Memory limits based on VPA
    • Optimized connection pool sizing
    • Fine-tuned HPA thresholds
  4. Code Optimization

    • Reduced middleware overhead
    • Optimized serialization
    • Implemented async processing
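As an example of the N+1 fix mentioned under Database Query Optimization, a minimal SQLAlchemy sketch using eager loading; the ChatSession/messages models and relationship names are assumptions for illustration:

```python
# Illustrative only: models and session wiring are assumed, not the real schema.
from sqlalchemy import select
from sqlalchemy.orm import selectinload


async def load_sessions_with_messages(db):
    # Before: one query for sessions, then one query per session for messages (N+1)
    # After: two queries total, messages fetched in a single batched IN (...) query
    result = await db.execute(
        select(ChatSession).options(selectinload(ChatSession.messages))
    )
    return result.scalars().all()
```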

Performance Comparison (100 Users)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| P50 Response Time | 320ms | 180ms | 44% faster |
| P95 Response Time | 980ms | 520ms | 47% faster |
| P99 Response Time | 1850ms | 950ms | 49% faster |
| Throughput | 65 req/s | 90 req/s | 38% increase |
| Error Rate | 1.2% | 0.3% | 75% reduction |
| CPU Utilization | 75% | 60% | 20% reduction |
| Memory Utilization | 70% | 60% | 14% reduction |
| DB Queries | 150/s | 90/s | 40% reduction |
| Cache Hit Rate (L1) | 65% | 83% | 28% increase |
| Pod Count | 5-6 | 4-5 | 1 fewer pod |

Cost Implications

| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Avg Pod Count | 5.5 | 4.5 | 18% |
| CPU Hours/Day | 132 | 108 | 18% |
| Memory GB-Hours/Day | 132 | 108 | 18% |
| Estimated Monthly Cost | $450 | $370 | $80 (18%) |

Performance SLOs

Production SLOs (100-200 Users)

| Metric | Target | Critical | Current | Status |
|--------|--------|----------|---------|--------|
| Availability | 99.9% | 99.5% | 99.95% | PASS |
| P50 Response Time | <250ms | <500ms | 180-280ms | PASS |
| P95 Response Time | <800ms | <1500ms | 520-850ms | PASS |
| P99 Response Time | <1500ms | <3000ms | 950-1450ms | PASS |
| Error Rate | <1% | <3% | 0.3-0.8% | PASS |
| Throughput | >100 req/s | >50 req/s | 90-175 req/s | PASS |

Performance Budget

Maximum acceptable degradation:

| Metric | Baseline | Budget | Alert Threshold |
|--------|----------|--------|-----------------|
| P95 Response Time | 520ms | +30% | >675ms |
| Throughput | 90 req/s | -20% | <72 req/s |
| Error Rate | 0.3% | +200% | >0.9% |
| Cache Hit Rate | 83% | -10% | <75% |
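The alert thresholds follow directly from the baseline and budget columns; a small helper makes the arithmetic explicit (for example, 520ms * 1.30 = 676ms, which rounds to the >675ms alert line):

```python
# Simple arithmetic helper; metric names and values are taken from the table above.
def alert_threshold(baseline: float, budget_pct: float) -> float:
    return baseline * (1 + budget_pct / 100)


print(alert_threshold(520, 30))    # ~676 ms ceiling for P95 response time
print(alert_threshold(90, -20))    # 72 req/s floor for throughput
print(alert_threshold(0.3, 200))   # 0.9% ceiling for error rate
```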

Alerting Rules

Critical Alerts (Page on-call):

  • P95 response time >1500ms for 5 minutes
  • Error rate >5% for 5 minutes
  • Availability <99.5% over 1 hour
  • Database connection pool >95% for 10 minutes

Warning Alerts (Notify team):

  • P95 response time >800ms for 10 minutes
  • Error rate >1% for 10 minutes
  • CPU utilization >80% for 15 minutes
  • Memory utilization >85% for 15 minutes
  • Cache hit rate <70% for 15 minutes

Info Alerts (Log only):

  • P95 response time >500ms for 15 minutes
  • CPU utilization >70% for 20 minutes
  • Autoscaling events

Continuous Monitoring

Key Metrics to Track

  1. Golden Signals

    • Latency (P50, P95, P99)
    • Traffic (req/s)
    • Errors (rate, count)
    • Saturation (CPU, memory, DB connections)
  2. Performance Indicators

    • Cache hit rates (all tiers)
    • Database query performance
    • Autoscaling behavior
    • Resource utilization
  3. Business Metrics

    • User satisfaction (survey data)
    • Feature usage
    • Peak load patterns
    • Cost per request

Dashboards

  • Load Testing Overview: /dashboards/load-testing-overview.json
  • Autoscaling Monitoring: /dashboards/autoscaling-monitoring.json
  • System Performance: /dashboards/system-performance.json

Review Cadence

  • Daily: Review overnight metrics, check for anomalies
  • Weekly: Analyze trends, update capacity plans
  • Monthly: Review SLOs, update benchmarks
  • Quarterly: Performance audit, optimization sprint

Conclusion

VoiceAssist Phase 10 demonstrates strong performance characteristics:

Strengths:

  • Handles 100-200 concurrent users comfortably
  • Response times well within targets
  • Effective caching strategy
  • Reliable autoscaling
  • Significant improvements post-optimization

Areas for Improvement:

  • Database connection pool at capacity during peak load (500+ users)
  • Some complex queries need optimization
  • Cache eviction rate high at extreme load

Recommendations:

  1. Plan for database scaling (read replicas) before 300+ users
  2. Continue query optimization efforts
  3. Monitor cache efficiency and adjust TTLs
  4. Consider implementing rate limiting for burst traffic
  5. Review and update benchmarks quarterly

Next Steps:

  • See LOAD_TESTING_GUIDE.md for testing procedures
  • See PERFORMANCE_TUNING_GUIDE.md for optimization techniques
  • Use Grafana dashboards for ongoing monitoring