Docs / Raw

Performance Benchmarks

Sourced from docs/PERFORMANCE_BENCHMARKS.md

Edit on GitHub

VoiceAssist Performance Benchmarks

Overview

This document provides comprehensive performance benchmarks for VoiceAssist Phase 10, including baseline metrics, load test results, and performance targets. Use these benchmarks to:

  • Evaluate system performance under various load conditions
  • Identify performance regressions
  • Set realistic SLOs (Service Level Objectives)
  • Plan capacity and scaling strategies

Table of Contents


Testing Environment

Infrastructure

  • Kubernetes Version: 1.28+
  • Node Configuration:
    • 3 worker nodes
    • 4 vCPU, 16GB RAM per node
    • SSD storage
  • Database: PostgreSQL 15
    • 2 vCPU, 8GB RAM
    • Connection pool: 20-50 connections
  • Cache: Redis 7
    • 2 vCPU, 4GB RAM
    • Max memory: 2GB

Application Configuration

  • API Gateway: 2-10 replicas (HPA enabled)
  • Worker Service: 2-8 replicas (HPA enabled)
  • Resource Limits:
    • CPU: 500m-2000m
    • Memory: 512Mi-2Gi
  • HPA Thresholds:
    • CPU: 70%
    • Memory: 80%
    • Custom: 50 req/s per pod

Baseline Performance

No Load Conditions

Metrics collected with zero active users:

MetricValueNotes
Idle CPU Usage5-10%Background tasks only
Idle Memory Usage200-300 MBPer pod
Pod Count2 (min replicas)API Gateway + Worker
DB Connections5-10 activeConnection pool idle
Cache Memory50-100 MBWarm cache
Health Check Response10-20msP95

Single User Performance

Metrics collected with 1 active user:

EndpointP50 (ms)P95 (ms)P99 (ms)Notes
/health51015Basic health check
/api/auth/login5080100Includes password hash
/api/chat (simple)150250350Simple query, cache hit
/api/chat (complex)80012001500Complex query, RAG
/api/documents/upload50080012001MB document
/api/admin/dashboard100180250Dashboard metrics

Load Test Results

Test Methodology

  • Tool: Locust (primary), k6 (validation)
  • User Distribution:
    • 70% Regular Users (simple queries)
    • 20% Power Users (complex queries)
    • 10% Admin Users (document management)
  • Ramp-up: Linear, 10 users/minute
  • Duration: 30 minutes steady state
  • Think Time: 3-10 seconds between requests

50 Virtual Users

Target: Baseline performance validation

MetricValueTargetStatus
Throughput45 req/s40+ req/sPASS
P50 Response Time120ms<200msPASS
P95 Response Time380ms<500msPASS
P99 Response Time650ms<1000msPASS
Error Rate0.1%<1%PASS
CPU Utilization35-45%<60%PASS
Memory Utilization40-50%<70%PASS
Pod Count2-3--
DB Connections15-20<40PASS
Cache Hit Rate (L1)85%>80%PASS
Cache Hit Rate (L2)70%>60%PASS
Cache Hit Rate (RAG)55%>50%PASS

Key Findings:

  • System handles 50 users comfortably with minimal scaling
  • Response times well within targets
  • Cache performing as expected
  • No database bottlenecks

100 Virtual Users

Target: Production load simulation

MetricValueTargetStatus
Throughput90 req/s80+ req/sPASS
P50 Response Time180ms<250msPASS
P95 Response Time520ms<800msPASS
P99 Response Time950ms<1500msPASS
Error Rate0.3%<1%PASS
CPU Utilization55-65%<70%PASS
Memory Utilization55-65%<75%PASS
Pod Count4-5--
DB Connections25-35<45PASS
Cache Hit Rate (L1)83%>75%PASS
Cache Hit Rate (L2)68%>55%PASS
Cache Hit Rate (RAG)52%>45%PASS

Key Findings:

  • HPA triggered at ~70 users (CPU threshold)
  • Scaled to 4-5 pods
  • Response times increased but within targets
  • Cache efficiency remains high
  • DB connection pool sufficient

200 Virtual Users

Target: Peak load handling

MetricValueTargetStatus
Throughput175 req/s150+ req/sPASS
P50 Response Time280ms<400msPASS
P95 Response Time850ms<1200msPASS
P99 Response Time1450ms<2000msPASS
Error Rate0.8%<2%PASS
CPU Utilization68-78%<80%PASS
Memory Utilization65-75%<80%PASS
Pod Count7-8--
DB Connections35-45<50PASS
Cache Hit Rate (L1)80%>70%PASS
Cache Hit Rate (L2)65%>50%PASS
Cache Hit Rate (RAG)48%>40%PASS

Key Findings:

  • Aggressive scaling to 7-8 pods
  • Response times degrading but acceptable
  • CPU approaching threshold
  • DB connection pool near capacity
  • Cache still providing value

500 Virtual Users

Target: Stress test / Breaking point

MetricValueTargetStatus
Throughput380 req/s300+ req/sPASS
P50 Response Time520ms<800msPASS
P95 Response Time1850ms<3000msPASS
P99 Response Time3200ms<5000msPASS
Error Rate2.5%<5%PASS
CPU Utilization75-85%<90%PASS
Memory Utilization70-80%<85%PASS
Pod Count10 (max)--
DB Connections45-50<50MARGINAL
Cache Hit Rate (L1)75%>65%PASS
Cache Hit Rate (L2)60%>45%PASS
Cache Hit Rate (RAG)42%>35%PASS

Key Findings:

  • System at maximum capacity (10 pods)
  • Response times significantly degraded
  • DB connection pool saturated
  • Error rate increasing but acceptable
  • Cache hit rates dropping due to churn
  • Recommendation: 500 users is operational limit

Breaking Point Analysis:

  • At 600+ users: Error rate >5%, P99 >8000ms
  • Primary bottleneck: Database connection pool
  • Secondary bottleneck: CPU at peak load
  • Mitigation: Scale database vertically or add read replicas

Response Time Targets

SLO Definitions

PercentileTargetCritical ThresholdNotes
P50<200ms<500msMedian user experience
P95<500ms<1000ms95% of requests
P99<1000ms<2000msEdge cases
P99.9<2000ms<5000msRare outliers

By Endpoint Category

Fast Endpoints (<100ms P95)

  • Health checks
  • Static content
  • Cache hits
  • Simple queries

Medium Endpoints (100-500ms P95)

  • Authentication
  • Simple chat queries
  • Profile operations
  • Dashboard views

Slow Endpoints (500-1500ms P95)

  • Complex chat queries (RAG)
  • Document uploads
  • Batch operations
  • Report generation

Acceptable Outliers (>1500ms)

  • Large document processing
  • Complex analytics
  • Historical data exports
  • AI model inference (cold start)

Throughput Targets

Overall System

Load LevelTarget (req/s)Measured (req/s)Status
Light (50 users)40+45PASS
Normal (100 users)80+90PASS
Heavy (200 users)150+175PASS
Peak (500 users)300+380PASS

By Service

ServiceTarget (req/s)Peak (req/s)Notes
API Gateway400+380Primary entry point
Auth Service50+45Login/logout operations
Chat Service300+280Main workload
Document Service20+25Upload/download
Admin Service10+15Management operations

Resource Utilization

At Different Load Levels

CPU Utilization

LoadAvg CPUPeak CPUPod CountNotes
50 users40%55%2-3Minimal scaling
100 users60%75%4-5Active scaling
200 users73%85%7-8Frequent scaling
500 users80%95%10Max capacity

Memory Utilization

LoadAvg MemoryPeak MemoryPod CountNotes
50 users45%60%2-3Stable
100 users60%72%4-5Gradual increase
200 users70%82%7-8High utilization
500 users75%88%10Near limit

Network I/O

LoadIngress (MB/s)Egress (MB/s)Notes
50 users2.53.5Low bandwidth
100 users5.07.0Moderate
200 users10.014.0High
500 users22.030.0Very high

Disk I/O

LoadRead (IOPS)Write (IOPS)Notes
50 users15080Minimal disk usage
100 users300150Moderate
200 users550280High
500 users1200600Very high

Cache Performance

L1 Cache (In-Memory)

Metric50 Users100 Users200 Users500 UsersTarget
Hit Rate85%83%80%75%>70%
Miss Rate15%17%20%25%<30%
Avg Latency0.5ms0.6ms0.8ms1.2ms<2ms
P95 Latency1.0ms1.2ms1.5ms2.5ms<5ms
Eviction Rate2/min5/min12/min35/min-

L2 Cache (Redis)

Metric50 Users100 Users200 Users500 UsersTarget
Hit Rate70%68%65%60%>55%
Miss Rate30%32%35%40%<45%
Avg Latency2.5ms3.0ms3.8ms5.5ms<10ms
P95 Latency5.0ms6.0ms8.0ms12.0ms<20ms
Eviction Rate5/min10/min25/min80/min-

RAG Cache (Vector/Semantic)

Metric50 Users100 Users200 Users500 UsersTarget
Hit Rate55%52%48%42%>40%
Miss Rate45%48%52%58%<60%
Avg Latency15ms18ms22ms35ms<50ms
P95 Latency35ms42ms55ms85ms<100ms
Eviction Rate3/min8/min20/min60/min-

Key Findings:

  • L1 cache most effective, even at high load
  • L2 cache provides good fallback
  • RAG cache hit rate lower but still valuable
  • Cache eviction increases with load (expected)
  • Overall cache strategy working well

Database Performance

Query Performance

Query TypeP50 (ms)P95 (ms)P99 (ms)Target P95Status
Simple SELECT51218<20msPASS
JOIN (2 tables)153555<50msPASS
JOIN (3+ tables)3585150<100msMARGINAL
INSERT81828<25msPASS
UPDATE102235<30msPASS
DELETE82032<25msPASS
Aggregate2565120<80msMARGINAL
Full-text Search45120200<150msMARGINAL

Connection Pool

Metric50 Users100 Users200 Users500 UsersNotes
Active Connections15-2025-3535-4545-50Max: 50
Idle Connections5-105-103-50-2-
Wait Time0ms0ms0-5ms5-20msQueueing at peak
Checkout Time0.5ms0.8ms1.2ms2.5ms-
Utilization35%65%85%98%Near capacity

Slow Queries

Queries exceeding 100ms threshold:

LoadSlow Queries/minMost CommonNotes
50 users2-5Complex JOINsAcceptable
100 users8-15Aggregates, Full-textWithin limits
200 users25-40Unoptimized queriesNeeds attention
500 users80-120All complex queriesCritical

Recommendations:

  • Add indexes for common query patterns
  • Optimize 3+ table JOINs
  • Consider read replicas for 200+ users
  • Review and optimize aggregate queries
  • Implement query result caching

Autoscaling Behavior

HPA Metrics

MetricConfigurationObserved Behavior
Min Replicas2Maintained during idle
Max Replicas10Reached at 500 users
Target CPU70%Triggers scale-up reliably
Target Memory80%Rarely triggers (CPU first)
Custom Metric50 req/sWorks well for API Gateway
Scale-up Speed1 pod/30sConservative, prevents flapping
Scale-down Speed1 pod/5minGradual, allows warmup
Stabilization3minPrevents rapid oscillation

Scaling Events Timeline

0-100 Users (Ramp-up Phase)

User CountEventPod CountReason
0Start2Min replicas
50-2Below threshold
70Scale up3CPU >70%
85Scale up4CPU >70%
100Stable4-5Fluctuating

100-200 Users (Growth Phase)

User CountEventPod CountReason
120Scale up5CPU >70%
140Scale up6CPU >70%
170Scale up7CPU >70%
200Stable7-8Fluctuating

200-500 Users (Peak Phase)

User CountEventPod CountReason
250Scale up8CPU >70%
320Scale up9CPU >70%
400Scale up10CPU >70%
500Max10Max replicas

VPA Recommendations

VPA observed resource usage and made the following recommendations:

Before Optimization

ResourceRequestedRecommendedActual UsageNotes
CPU500m800m600-700m avgUnder-provisioned
Memory512Mi768Mi650-750Mi avgUnder-provisioned

After Tuning

ResourceRequestedRecommendedActual UsageNotes
CPU1000m1000m700-900m avgWell-provisioned
Memory1Gi1Gi700-900Mi avgWell-provisioned

Result: VPA recommendations now align with actual usage, indicating proper resource allocation.


Before vs After Optimization

Optimization Focus Areas

  1. Database Query Optimization

    • Added missing indexes
    • Optimized N+1 queries
    • Implemented query result caching
  2. Cache Strategy Enhancement

    • Implemented 3-tier cache (L1, L2, RAG)
    • Optimized TTL values
    • Added cache warming
  3. Resource Tuning

    • Adjusted CPU/Memory limits based on VPA
    • Optimized connection pool sizing
    • Fine-tuned HPA thresholds
  4. Code Optimization

    • Reduced middleware overhead
    • Optimized serialization
    • Implemented async processing

Performance Comparison (100 Users)

MetricBeforeAfterImprovement
P50 Response Time320ms180ms44% faster
P95 Response Time980ms520ms47% faster
P99 Response Time1850ms950ms49% faster
Throughput65 req/s90 req/s38% increase
Error Rate1.2%0.3%75% reduction
CPU Utilization75%60%20% reduction
Memory Utilization70%60%14% reduction
DB Queries150/s90/s40% reduction
Cache Hit Rate (L1)65%83%28% increase
Pod Count5-64-51 fewer pod

Cost Implications

MetricBeforeAfterSavings
Avg Pod Count5.54.518%
CPU Hours/Day13210818%
Memory GB-Hours/Day13210818%
Estimated Monthly Cost$450$370$80 (18%)

Performance SLOs

Production SLOs (100-200 Users)

MetricTargetCriticalCurrentStatus
Availability99.9%99.5%99.95%PASS
P50 Response Time<250ms<500ms180-280msPASS
P95 Response Time<800ms<1500ms520-850msPASS
P99 Response Time<1500ms<3000ms950-1450msPASS
Error Rate<1%<3%0.3-0.8%PASS
Throughput>100 req/s>50 req/s90-175 req/sPASS

Performance Budget

Maximum acceptable degradation:

MetricBaselineBudgetAlert Threshold
P95 Response Time520ms+30%>675ms
Throughput90 req/s-20%<72 req/s
Error Rate0.3%+200%>0.9%
Cache Hit Rate83%-10%<75%

Alerting Rules

Critical Alerts (Page oncall):

  • P95 response time >1500ms for 5 minutes
  • Error rate >5% for 5 minutes
  • Availability <99.5% over 1 hour
  • Database connection pool >95% for 10 minutes

Warning Alerts (Notify team):

  • P95 response time >800ms for 10 minutes
  • Error rate >1% for 10 minutes
  • CPU utilization >80% for 15 minutes
  • Memory utilization >85% for 15 minutes
  • Cache hit rate <70% for 15 minutes

Info Alerts (Log only):

  • P95 response time >500ms for 15 minutes
  • CPU utilization >70% for 20 minutes
  • Autoscaling events

Continuous Monitoring

Key Metrics to Track

  1. Golden Signals

    • Latency (P50, P95, P99)
    • Traffic (req/s)
    • Errors (rate, count)
    • Saturation (CPU, memory, DB connections)
  2. Performance Indicators

    • Cache hit rates (all tiers)
    • Database query performance
    • Autoscaling behavior
    • Resource utilization
  3. Business Metrics

    • User satisfaction (survey data)
    • Feature usage
    • Peak load patterns
    • Cost per request

Dashboards

  • Load Testing Overview: /dashboards/load-testing-overview.json
  • Autoscaling Monitoring: /dashboards/autoscaling-monitoring.json
  • System Performance: /dashboards/system-performance.json

Review Cadence

  • Daily: Review overnight metrics, check for anomalies
  • Weekly: Analyze trends, update capacity plans
  • Monthly: Review SLOs, update benchmarks
  • Quarterly: Performance audit, optimization sprint

Conclusion

VoiceAssist Phase 10 demonstrates strong performance characteristics:

Strengths:

  • Handles 100-200 concurrent users comfortably
  • Response times well within targets
  • Effective caching strategy
  • Reliable autoscaling
  • Significant improvements post-optimization

Areas for Improvement:

  • Database connection pool at capacity during peak load (500+ users)
  • Some complex queries need optimization
  • Cache eviction rate high at extreme load

Recommendations:

  1. Plan for database scaling (read replicas) before 300+ users
  2. Continue query optimization efforts
  3. Monitor cache efficiency and adjust TTLs
  4. Consider implementing rate limiting for burst traffic
  5. Review and update benchmarks quarterly

Next Steps:

  • See LOAD_TESTING_GUIDE.md for testing procedures
  • See PERFORMANCE_TUNING_GUIDE.md for optimization techniques
  • Use Grafana dashboards for ongoing monitoring
Beginning of guide
End of guide