# VoiceAssist Performance Tuning Guide

## Overview

This comprehensive guide provides strategies, techniques, and best practices for optimizing VoiceAssist performance. Use this guide to identify bottlenecks, implement optimizations, and maintain peak system performance.

## Table of Contents

- [Performance Philosophy](#performance-philosophy)
- [Database Optimization](#database-optimization)
- [Caching Strategy](#caching-strategy)
- [Kubernetes Resource Tuning](#kubernetes-resource-tuning)
- [HPA Threshold Tuning](#hpa-threshold-tuning)
- [Application-Level Optimizations](#application-level-optimizations)
- [Monitoring and Alerting](#monitoring-and-alerting)
- [Common Bottlenecks](#common-bottlenecks)

---

## Performance Philosophy

### Principles

1. **Measure First, Optimize Second**
   - Never optimize without data
   - Establish baselines before changes
   - Use profiling tools
   - A/B test optimizations

2. **Focus on Bottlenecks**
   - Identify the slowest component
   - 80/20 rule: focus on the biggest impact
   - Don't optimize prematurely
   - Avoid micro-optimizations

3. **Balance Trade-offs**
   - Performance vs complexity
   - Cost vs speed
   - Consistency vs availability
   - Developer time vs runtime performance

4. **Iterate and Validate**
   - Make one change at a time
   - Test thoroughly
   - Monitor impact
   - Roll back if needed

### Performance Optimization Workflow

```
1. Identify Issue
   ├─> Monitor metrics
   ├─> User reports
   └─> Load test results

2. Measure & Profile
   ├─> Collect baseline data
   ├─> Identify bottleneck
   └─> Understand root cause

3. Hypothesis
   ├─> Propose solution
   ├─> Estimate impact
   └─> Consider alternatives

4. Implement
   ├─> Make targeted change
   ├─> Keep it simple
   └─> Document reasoning

5. Validate
   ├─> Run load tests
   ├─> Compare metrics
   └─> Verify improvement

6. Deploy & Monitor
   ├─> Gradual rollout
   ├─> Watch dashboards
   └─> Gather feedback
```

---

## Database Optimization

### Query Optimization Checklist
#### 1. Identify Slow Queries

**Tools**:

```sql
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slowest queries (column names are for PostgreSQL 13+;
-- older versions use total_time, mean_time, etc.)
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time,
  stddev_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100 -- over 100ms average
ORDER BY mean_exec_time DESC
LIMIT 20;

-- Find most frequent queries
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
```

**VoiceAssist-Specific**:

```bash
# Check slow query log
kubectl exec -it postgres-0 -n voiceassist -- \
  tail -100 /var/log/postgresql/postgresql-slow.log

# Query Prometheus for slow queries
curl -g 'http://prometheus:9090/api/v1/query?query=voiceassist_db_slow_queries_total'
```

#### 2. Add Missing Indexes

**Common Patterns**:

```sql
-- Foreign key columns (if not already indexed)
CREATE INDEX idx_conversations_user_id ON conversations(user_id);
CREATE INDEX idx_messages_conversation_id ON messages(conversation_id);
CREATE INDEX idx_documents_uploaded_by ON documents(uploaded_by);

-- Frequently filtered columns
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_conversations_created_at ON conversations(created_at);
CREATE INDEX idx_messages_timestamp ON messages(timestamp);

-- Composite indexes for common query patterns
CREATE INDEX idx_messages_conversation_timestamp
  ON messages(conversation_id, timestamp);
CREATE INDEX idx_conversations_user_created
  ON conversations(user_id, created_at);

-- Partial indexes for filtered queries. Note: the predicate must be
-- immutable, so use a fixed cutoff date (the one below is an example)
-- rather than NOW(), which PostgreSQL rejects in index predicates.
CREATE INDEX idx_active_users ON users(id) WHERE is_active = true;
CREATE INDEX idx_recent_conversations ON conversations(id, created_at)
  WHERE created_at > '2024-01-01';

-- Full-text search indexes
CREATE INDEX idx_documents_fts ON documents
  USING gin(to_tsvector('english', content));
```

**Index Analysis**:

```sql
-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Find unused indexes (candidates for removal)
SELECT schemaname, tablename, indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexname NOT LIKE '%_pkey';

-- Check index size
SELECT
  tablename,
  indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
```
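After creating an index, confirm the planner actually uses it. A minimal check with `EXPLAIN ANALYZE`, assuming the `messages` table and the composite index created above (the conversation id is an illustrative value):

```sql
-- Inspect the plan for a typical conversation-history query
EXPLAIN ANALYZE
SELECT id, content
FROM messages
WHERE conversation_id = 42 -- illustrative value
ORDER BY timestamp DESC
LIMIT 20;

-- A healthy plan reports "Index Scan using idx_messages_conversation_timestamp";
-- a "Seq Scan" on messages means the index is not being used.
```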
#### 3. Optimize Query Structure

**N+1 Query Problem**:

**Bad** (N+1 queries):

```python
# Fetches users, then queries conversations for each user
users = session.query(User).all()
for user in users:
    # N additional queries
    conversations = session.query(Conversation)\
        .filter_by(user_id=user.id)\
        .all()
```

**Good** (Single query with JOIN):

```python
from sqlalchemy.orm import joinedload

# Single query with eager loading
users = session.query(User)\
    .options(joinedload(User.conversations))\
    .all()

# Or using an explicit JOIN
results = session.query(User, Conversation)\
    .join(Conversation, User.id == Conversation.user_id)\
    .all()
```

**VoiceAssist Implementation**:

```python
# In server/models/user.py
class User(Base):
    __tablename__ = "users"

    # Enable relationship eager loading
    conversations = relationship(
        "Conversation",
        back_populates="user",
        lazy="selectin"  # or "joined" to eager-load via a LEFT OUTER JOIN
    )

# In API endpoint
@router.get("/users/{user_id}/conversations")
async def get_user_conversations(user_id: int, db: Session = Depends(get_db)):
    user = db.query(User)\
        .options(joinedload(User.conversations))\
        .filter(User.id == user_id)\
        .first()
    return user.conversations
```

**Pagination**:

**Bad** (Loads all results):

```python
conversations = session.query(Conversation)\
    .filter(Conversation.user_id == user_id)\
    .all()
```

**Good** (Paginated):

```python
def get_conversations_paginated(user_id: int, page: int = 1, per_page: int = 20):
    offset = (page - 1) * per_page

    conversations = session.query(Conversation)\
        .filter(Conversation.user_id == user_id)\
        .order_by(Conversation.created_at.desc())\
        .limit(per_page)\
        .offset(offset)\
        .all()

    total = session.query(func.count(Conversation.id))\
        .filter(Conversation.user_id == user_id)\
        .scalar()

    return {
        "items": conversations,
        "page": page,
        "per_page": per_page,
        "total": total,
        "pages": (total + per_page - 1) // per_page,
    }
```

**Query Result Caching**:

```python
import hashlib
import json

def cache_key(user_id: int, filters: dict) -> str:
    """Generate cache key from query parameters."""
    key_data = {"user_id": user_id, "filters": filters}
    return f"query:{hashlib.md5(json.dumps(key_data, sort_keys=True).encode()).hexdigest()}"

async def get_conversations_cached(
    user_id: int,
    filters: dict,
    db: Session,
    cache: Redis
):
    """Get conversations with Redis caching."""
    key = cache_key(user_id, filters)

    # Check cache
    cached = await cache.get(key)
    if cached:
        return json.loads(cached)

    # Query database
    conversations = db.query(Conversation)\
        .filter(Conversation.user_id == user_id)\
        .all()

    result = [c.to_dict() for c in conversations]

    # Cache result (5 minutes)
    await cache.setex(key, 300, json.dumps(result))

    return result
```
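For completeness, a sketch of an endpoint wired to the cached helper above. `get_db` appears elsewhere in this guide; the `get_redis` dependency name is an assumption about how the Redis client is injected:

```python
from fastapi import APIRouter, Depends

router = APIRouter()

@router.get("/users/{user_id}/conversations")
async def list_conversations_cached(
    user_id: int,
    db: Session = Depends(get_db),      # dependency used elsewhere in this guide
    cache: Redis = Depends(get_redis),  # hypothetical Redis dependency
):
    # Filters omitted here; the cache key still varies by user_id
    return await get_conversations_cached(user_id, {}, db, cache)
```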
#### 4. Connection Pool Tuning

**Current Configuration** (`server/database.py`):

```python
from sqlalchemy.pool import QueuePool

engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,        # Core connections
    max_overflow=10,     # Additional connections (total: 30)
    pool_timeout=30,     # Wait time for a connection (seconds)
    pool_recycle=3600,   # Recycle connections after 1 hour
    pool_pre_ping=True,  # Verify connections before use
    echo=False,          # Don't log SQL (performance)
)
```

**Tuning Guidelines**:

| Scenario                       | pool_size | max_overflow | Total | Notes             |
| ------------------------------ | --------- | ------------ | ----- | ----------------- |
| **Light Load** (<50 users)     | 10        | 5            | 15    | Minimal resources |
| **Normal Load** (50-100 users) | 20        | 10           | 30    | Current setting   |
| **Heavy Load** (100-200 users) | 30        | 20           | 50    | Increase pool     |
| **Peak Load** (200+ users)     | 40        | 30           | 70    | May need replicas |

**Monitoring**:

```python
# Add to server/monitoring/metrics.py
from prometheus_client import Gauge

db_pool_size = Gauge('voiceassist_db_pool_size', 'Database pool size')
db_pool_checked_out = Gauge('voiceassist_db_pool_checked_out', 'Checked out connections')
db_pool_overflow = Gauge('voiceassist_db_pool_overflow', 'Overflow connections')
db_pool_utilization = Gauge('voiceassist_db_pool_utilization_percent', 'Pool utilization %')

def update_pool_metrics():
    """Update connection pool metrics."""
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_checked_out.set(pool.checkedout())
    db_pool_overflow.set(pool.overflow())

    total = pool.size() + pool.overflow()
    used = pool.checkedout()
    db_pool_utilization.set((used / total * 100) if total > 0 else 0)
```
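`update_pool_metrics()` only records a point-in-time snapshot, so it needs to run on a schedule. A minimal sketch using an asyncio background task started at application startup; the 15-second interval is an arbitrary choice:

```python
# Sketch: refresh the pool gauges periodically (e.g. in server/main.py)
import asyncio

async def pool_metrics_loop(interval_seconds: int = 15):
    """Refresh the connection pool gauges on a fixed interval."""
    while True:
        update_pool_metrics()
        await asyncio.sleep(interval_seconds)

@app.on_event("startup")
async def start_pool_metrics():
    asyncio.create_task(pool_metrics_loop())
```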
#### 5. Database Maintenance

**Regular Tasks**:

```sql
-- Analyze tables (update statistics)
ANALYZE users;
ANALYZE conversations;
ANALYZE messages;
ANALYZE documents;

-- Vacuum (reclaim space)
VACUUM ANALYZE users;
VACUUM ANALYZE conversations;

-- Reindex (rebuild indexes)
REINDEX TABLE users;
REINDEX TABLE conversations;

-- Check bloat
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) -
                 pg_relation_size(schemaname||'.'||tablename)) AS external_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Automated Script** (`scripts/db-maintenance.sh`):

```bash
#!/bin/bash
# Daily database maintenance

PGHOST="${PGHOST:-localhost}"
PGPORT="${PGPORT:-5432}"
PGDATABASE="${PGDATABASE:-voiceassist}"
PGUSER="${PGUSER:-postgres}"

echo "=== Starting database maintenance ==="
echo "Date: $(date)"

# Analyze all tables
echo "Running ANALYZE..."
psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -c "ANALYZE;"

# Vacuum (non-blocking)
echo "Running VACUUM..."
psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -c "VACUUM (ANALYZE, VERBOSE);"

# Check for bloat
echo "Checking for bloat..."
psql -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" << EOF
SELECT
  schemaname || '.' || tablename AS table_name,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
EOF

echo "=== Maintenance complete ==="
```

**CronJob** (`k8s/cronjobs/db-maintenance.yaml`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-maintenance
  namespace: voiceassist
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: maintenance
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  echo "Running VACUUM ANALYZE..."
                  psql "$DATABASE_URL" -c "VACUUM ANALYZE;"
                  echo "Complete"
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: voiceassist-secrets
                      key: database-url
          restartPolicy: OnFailure
```

---

## Caching Strategy

### Three-Tier Cache Architecture

```
┌─────────────────────────────────────────────┐
│               Client Request                │
└─────────────┬───────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────┐
│  L1 Cache: In-Memory (Python dict/LRU)      │
│  - Fastest (< 1ms)                          │
│  - Per-process                              │
│  - TTL: 60s                                 │
│  - Size: 1000 items                         │
└─────────────┬───────────────────────────────┘
              │ Cache Miss
              ▼
┌─────────────────────────────────────────────┐
│  L2 Cache: Redis                            │
│  - Fast (< 5ms)                             │
│  - Shared across pods                       │
│  - TTL: 300s (5 min)                        │
│  - Size: 10GB                               │
└─────────────┬───────────────────────────────┘
              │ Cache Miss
              ▼
┌─────────────────────────────────────────────┐
│  L3 Cache: RAG/Semantic Cache               │
│  - Moderate (< 50ms)                        │
│  - Similar query matching                   │
│  - TTL: 3600s (1 hour)                      │
│  - Vector similarity                        │
└─────────────┬───────────────────────────────┘
              │ Cache Miss
              ▼
┌─────────────────────────────────────────────┐
│               Database Query                │
└─────────────────────────────────────────────┘
```

### L1 Cache: In-Memory

**Implementation**:

```python
# server/cache/l1_cache.py
import threading
import time
from typing import Any, Optional

class L1Cache:
    """In-memory cache with TTL support."""

    def __init__(self, max_size: int = 1000, ttl: int = 60):
        self.max_size = max_size
        self.ttl = ttl
        self._cache = {}
        self._timestamps = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[Any]:
        """Get value from cache if not expired."""
        with self._lock:
            if key not in self._cache:
                return None

            # Check TTL
            if time.time() - self._timestamps[key] > self.ttl:
                del self._cache[key]
                del self._timestamps[key]
                return None

            return self._cache[key]

    def set(self, key: str, value: Any):
        """Set value in cache with TTL."""
        with self._lock:
            # Evict oldest if at capacity
            if len(self._cache) >= self.max_size:
                oldest_key = min(self._timestamps, key=self._timestamps.get)
                del self._cache[oldest_key]
                del self._timestamps[oldest_key]

            self._cache[key] = value
            self._timestamps[key] = time.time()

    def invalidate(self, key: str):
        """Remove key from cache."""
        with self._lock:
            self._cache.pop(key, None)
            self._timestamps.pop(key, None)

    def clear(self):
        """Clear all cached items."""
        with self._lock:
            self._cache.clear()
            self._timestamps.clear()

# Global instance
l1_cache = L1Cache(max_size=1000, ttl=60)
```

**Usage**:

```python
from server.cache.l1_cache import l1_cache

async def get_user(user_id: int, db: Session) -> User:
    """Get user with L1 caching."""
    cache_key = f"user:{user_id}"

    # Check L1 cache
    cached = l1_cache.get(cache_key)
    if cached:
        return cached

    # Query database
    user = db.query(User).filter(User.id == user_id).first()

    # Cache result
    if user:
        l1_cache.set(cache_key, user)

    return user
```
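The same get/check/set pattern generalizes to a small decorator, so lookups don't repeat the boilerplate. A sketch built on the `l1_cache` instance above; the key template mechanism (positional arguments only) is an illustration, not an existing helper in this codebase:

```python
import functools

def l1_cached(key_template: str):
    """Cache a function's result in the in-process L1 cache."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = key_template.format(*args)  # positional args only
            cached = l1_cache.get(key)
            if cached is not None:
                return cached
            result = func(*args, **kwargs)
            if result is not None:
                l1_cache.set(key, result)
            return result
        return wrapper
    return decorator

# Usage (sketch): second call within the TTL returns the cached value
@l1_cached("square:{0}")
def slow_square(n: int) -> int:
    return n * n  # stand-in for an expensive computation
```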
### L2 Cache: Redis

**Configuration**:

```python
# server/cache/redis_cache.py
import json
import pickle
from typing import Any, Optional

import redis.asyncio as redis

class RedisCache:
    """Redis cache with serialization support."""

    def __init__(self, url: str):
        self.redis = redis.from_url(url, decode_responses=False)

    async def get(self, key: str, deserialize: str = "json") -> Optional[Any]:
        """Get value from Redis."""
        value = await self.redis.get(key)
        if value is None:
            return None

        if deserialize == "json":
            return json.loads(value)
        elif deserialize == "pickle":
            return pickle.loads(value)
        else:
            return value

    async def set(
        self,
        key: str,
        value: Any,
        ttl: int = 300,
        serialize: str = "json"
    ):
        """Set value in Redis with TTL."""
        if serialize == "json":
            serialized = json.dumps(value)
        elif serialize == "pickle":
            serialized = pickle.dumps(value)
        else:
            serialized = value

        await self.redis.setex(key, ttl, serialized)

    async def delete(self, key: str):
        """Delete key from Redis."""
        await self.redis.delete(key)

    async def invalidate_pattern(self, pattern: str):
        """Invalidate all keys matching pattern.

        Note: KEYS scans the whole keyspace and blocks Redis;
        prefer SCAN-based iteration on large production datasets.
        """
        keys = await self.redis.keys(pattern)
        if keys:
            await self.redis.delete(*keys)

# Global instance
redis_cache = RedisCache(REDIS_URL)
```

**Usage**:

```python
from server.cache.redis_cache import redis_cache

async def get_conversation(conversation_id: int, db: Session) -> Conversation:
    """Get conversation with Redis caching."""
    cache_key = f"conversation:{conversation_id}"

    # Check Redis
    cached = await redis_cache.get(cache_key)
    if cached:
        return Conversation(**cached)

    # Query database
    conversation = db.query(Conversation)\
        .filter(Conversation.id == conversation_id)\
        .first()

    # Cache result (5 minutes)
    if conversation:
        await redis_cache.set(cache_key, conversation.to_dict(), ttl=300)

    return conversation
```

### L3 Cache: RAG/Semantic

**Implementation**:

```python
# server/cache/semantic_cache.py
from typing import Optional

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Semantic cache using vector similarity."""

    def __init__(self, threshold: float = 0.85):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}  # {query_hash: (query, response, embedding)}
        self.threshold = threshold

    def _encode(self, text: str) -> np.ndarray:
        """Encode text to vector."""
        return self.model.encode(text, convert_to_numpy=True)

    def _similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity."""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    async def get(self, query: str) -> Optional[str]:
        """Get similar cached response."""
        query_embedding = self._encode(query)

        best_similarity = 0
        best_response = None

        # Linear scan; fine at this size, use a vector index at scale
        for cached_query, response, cached_embedding in self.cache.values():
            similarity = self._similarity(query_embedding, cached_embedding)
            if similarity > best_similarity and similarity >= self.threshold:
                best_similarity = similarity
                best_response = response

        return best_response

    async def set(self, query: str, response: str):
        """Cache query-response pair."""
        embedding = self._encode(query)
        cache_key = hash(query)
        self.cache[cache_key] = (query, response, embedding)

        # Limit cache size
        if len(self.cache) > 1000:
            # Remove oldest insertion (a real LRU would be better)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]

# Global instance
semantic_cache = SemanticCache(threshold=0.85)
```
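In the request pipeline, the semantic cache sits directly in front of the model call. A sketch of that wiring; `generate_llm_response` is a hypothetical stand-in for the actual LLM call in this codebase:

```python
async def answer_query(query: str) -> str:
    """Serve semantically similar queries from cache before calling the LLM."""
    cached = await semantic_cache.get(query)
    if cached is not None:
        return cached  # a prior, sufficiently similar query answered this

    response = await generate_llm_response(query)  # hypothetical LLM call
    await semantic_cache.set(query, response)
    return response
```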
### Cache Invalidation Strategies

**Time-Based (TTL)**:

```python
# Set with TTL
await redis_cache.set(key, value, ttl=300)  # 5 minutes
```

**Event-Based**:

```python
# Invalidate on update
@router.put("/users/{user_id}")
async def update_user(user_id: int, data: UserUpdate, db: Session):
    # Update database
    user = db.query(User).filter(User.id == user_id).first()
    user.name = data.name
    db.commit()

    # Invalidate caches
    l1_cache.invalidate(f"user:{user_id}")
    await redis_cache.delete(f"user:{user_id}")
    await redis_cache.invalidate_pattern(f"user:{user_id}:*")

    return user
```

**Write-Through**:

```python
# Update cache on write
@router.post("/conversations")
async def create_conversation(data: ConversationCreate, db: Session):
    # Create in database
    conversation = Conversation(**data.dict())
    db.add(conversation)
    db.commit()
    db.refresh(conversation)

    # Update cache immediately
    cache_key = f"conversation:{conversation.id}"
    await redis_cache.set(cache_key, conversation.to_dict(), ttl=300)

    return conversation
```

### Cache Warming

**On Application Startup**:

```python
# server/cache/warming.py
from datetime import datetime, timedelta

async def warm_cache():
    """Warm cache with frequently accessed data."""
    db = SessionLocal()
    try:
        # Cache active users
        active_users = db.query(User)\
            .filter(User.is_active == True)\
            .limit(100)\
            .all()

        for user in active_users:
            cache_key = f"user:{user.id}"
            await redis_cache.set(cache_key, user.to_dict(), ttl=600)

        # Cache recent conversations
        recent_conversations = db.query(Conversation)\
            .filter(Conversation.created_at > datetime.now() - timedelta(days=7))\
            .limit(500)\
            .all()

        for conv in recent_conversations:
            cache_key = f"conversation:{conv.id}"
            await redis_cache.set(cache_key, conv.to_dict(), ttl=300)

        logger.info(f"Cache warmed: {len(active_users)} users, {len(recent_conversations)} conversations")
    finally:
        db.close()

# In server/main.py
@app.on_event("startup")
async def startup_event():
    await warm_cache()
```

---

## Kubernetes Resource Tuning

### Resource Requests and Limits

**Current Configuration**:

```yaml
# k8s/deployments/api-gateway.yaml
resources:
  requests:
    cpu: 1000m   # 1 CPU core
    memory: 1Gi  # 1 GB
  limits:
    cpu: 2000m   # 2 CPU cores
    memory: 2Gi  # 2 GB
```

**Tuning Process**:

1. **Monitor Actual Usage** (use VPA or metrics):

   ```bash
   # Get VPA recommendations
   kubectl get vpa voiceassist-api -n voiceassist -o yaml

   # Monitor actual usage
   kubectl top pods -n voiceassist --containers
   ```

2. **Adjust Based on Observations**:

   | Scenario              | CPU Request  | CPU Limit    | Memory Request  | Memory Limit    |
   | --------------------- | ------------ | ------------ | --------------- | --------------- |
   | **Under-provisioned** | Increase 50% | Increase 50% | Increase 50%    | Increase 50%    |
   | **Over-provisioned**  | Decrease 25% | Decrease 25% | Decrease 25%    | Decrease 25%    |
   | **CPU-bound**         | Increase CPU | Increase CPU | Keep same       | Keep same       |
   | **Memory-bound**      | Keep same    | Keep same    | Increase Memory | Increase Memory |

3. **Quality of Service (QoS)** (a verification command follows the recommended settings below):

   ```yaml
   # Guaranteed QoS (requests == limits)
   resources:
     requests:
       cpu: 1000m
       memory: 1Gi
     limits:
       cpu: 1000m   # Same as request
       memory: 1Gi  # Same as request

   # Burstable QoS (requests < limits)
   resources:
     requests:
       cpu: 500m
       memory: 512Mi
     limits:
       cpu: 2000m   # Higher than request
       memory: 2Gi  # Higher than request

   # BestEffort QoS (no requests/limits)
   # Not recommended for production
   ```

**Recommended Settings** (Post-Optimization):

```yaml
# API Gateway
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

# Worker Service
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1500m
    memory: 1536Mi

# Background Jobs
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```
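To verify which QoS class (step 3 above) Kubernetes actually assigned after changing requests and limits (the pod name is a placeholder):

```bash
# Inspect the assigned QoS class
kubectl get pod <api-gateway-pod> -n voiceassist \
  -o jsonpath='{.status.qosClass}'
# "Guaranteed" when requests == limits; otherwise "Burstable"
```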
---

## HPA Threshold Tuning

### Current HPA Configuration

```yaml
# k8s/performance/api-gateway-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voiceassist-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voiceassist-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

### Tuning Guidelines

#### CPU Threshold

| Load Pattern              | Target % | Reasoning                                  |
| ------------------------- | -------- | ------------------------------------------ |
| **Steady, predictable**   | 70%      | Default, balances utilization and headroom |
| **Bursty, unpredictable** | 60%      | More headroom for spikes                   |
| **Cost-sensitive**        | 80%      | Higher utilization, less headroom          |
| **Critical workload**     | 50%      | Maximum headroom for reliability           |

#### Memory Threshold

| Memory Characteristics        | Target % | Reasoning       |
| ----------------------------- | -------- | --------------- |
| **Stable usage**              | 80%      | Default         |
| **Growing over time (leak?)** | 70%      | Trigger earlier |
| **Highly variable**           | 75%      | Balance         |

#### Scale-Up Speed

```yaml
scaleUp:
  stabilizationWindowSeconds: 0 # No delay
  policies:
    - type: Percent
      value: 100        # Double pods
      periodSeconds: 15 # Every 15 seconds
```

**Use Cases**:

- **Aggressive** (above): Flash crowds, rapid traffic increase
- **Moderate** (default): Normal production usage
- **Conservative**: Development, cost-conscious

#### Scale-Down Speed

```yaml
scaleDown:
  stabilizationWindowSeconds: 600 # 10 minute delay
  policies:
    - type: Percent
      value: 5           # Remove 5% of pods
      periodSeconds: 120 # Every 2 minutes
```

**Use Cases**:

- **Slow** (above): Avoid flapping, warm pods are valuable
- **Moderate** (default): Balance responsiveness and stability
- **Fast**: Development environments, cost optimization

### Custom Metrics

**Add Request Rate Metric**:

```yaml
# k8s/performance/api-gateway-hpa.yaml
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50" # 50 req/s per pod
```

**Prometheus Adapter Configuration**:

```yaml
# k8s/performance/prometheus-adapter-config.yaml
rules:
  - seriesQuery: 'http_requests_total{namespace="voiceassist"}'
    resources:
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      matches: "^http_requests_total"
      as: "http_requests_per_second"
    metricsQuery: "rate(http_requests_total{<<.LabelMatchers>>}[1m])"
```
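Before pointing the HPA at the new metric, confirm the adapter actually exposes it through the custom metrics API (the response shape varies by adapter version):

```bash
# List the metric for pods in the voiceassist namespace
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/voiceassist/pods/*/http_requests_per_second"

# Then watch the HPA pick it up
kubectl get hpa voiceassist-api-hpa -n voiceassist --watch
```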
---

## Application-Level Optimizations

### Async/Await Patterns

**Before** (Synchronous):

```python
@router.get("/dashboard")
def get_dashboard(user_id: int, db: Session):
    # Sequential, blocking
    user = get_user(user_id, db)                    # 50ms
    conversations = get_conversations(user_id, db)  # 100ms
    documents = get_documents(user_id, db)          # 80ms
    stats = calculate_stats(user_id, db)            # 120ms
    # Total: 350ms
    return DashboardResponse(user, conversations, documents, stats)
```

**After** (Asynchronous):

```python
@router.get("/dashboard")
async def get_dashboard(user_id: int, db: Session):
    # Parallel, non-blocking
    user_task = get_user_async(user_id, db)
    conversations_task = get_conversations_async(user_id, db)
    documents_task = get_documents_async(user_id, db)
    stats_task = calculate_stats_async(user_id, db)

    # Wait for all concurrently
    user, conversations, documents, stats = await asyncio.gather(
        user_task, conversations_task, documents_task, stats_task
    )
    # Total: ~120ms (longest operation)
    return DashboardResponse(user, conversations, documents, stats)
```

### Response Compression

```python
# server/middleware/compression.py
from fastapi.middleware.gzip import GZipMiddleware

app.add_middleware(GZipMiddleware, minimum_size=1000)
```

**Benchmark**:

- Uncompressed: 150KB response, 50ms transfer
- Compressed: 15KB response, 10ms transfer
- **Savings**: 90% size, 80% transfer time

### Connection Pooling

```python
# server/http_client.py
import httpx

# Reuse HTTP client with connection pooling
http_client = httpx.AsyncClient(
    timeout=30.0,
    limits=httpx.Limits(
        max_keepalive_connections=20,
        max_connections=100,
        keepalive_expiry=300
    )
)

# Use in requests
async def call_external_api(url: str):
    response = await http_client.get(url)
    return response.json()
```

### Batch Processing

**Before** (Individual operations):

```python
# Process documents one by one
for document_id in document_ids:
    process_document(document_id)  # 100ms each
# Total: 100ms × 50 = 5000ms (5 seconds)
```

**After** (Batch):

```python
# Process documents in a single batch
process_documents_batch(document_ids)  # 800ms total for all 50
# Speedup: 6.25x faster
```

---

## Monitoring and Alerting

### Key Metrics to Monitor

```yaml
# k8s/monitoring/prometheus-rules.yaml
groups:
  - name: performance
    interval: 30s
    rules:
      # Response time alerts
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 response time"
          description: "P95 response time is {{ $value }}s"

      # Error rate alerts
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      # Cache performance
      - alert: LowCacheHitRate
        expr: rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) < 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
```

---

## Common Bottlenecks

### 1. Database Connection Pool Exhaustion

**Symptoms**:

- Timeouts waiting for connections
- "Connection pool is full" errors
- Increasing response times

**Solutions**:

- Increase pool size
- Add read replicas
- Implement connection pooling best practices
- Review long-running queries

### 2. Memory Leaks

**Symptoms**:

- Memory usage growing over time
- Pods being OOMKilled
- Performance degrading over time

**Solutions**:

- Profile memory usage
- Fix unclosed connections/files
- Implement proper cleanup in finally blocks
- Use context managers (see the sketch below)
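A minimal sketch of the context-manager pattern for the leak sources above, using the `SessionLocal` factory referenced elsewhere in this guide; the session is closed and its connection returned to the pool even when the block raises:

```python
from contextlib import contextmanager

@contextmanager
def db_session():
    """Yield a database session and guarantee cleanup."""
    session = SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()  # always runs: no leaked connections

# Usage: cleanup happens even if the query raises
with db_session() as session:
    total_users = session.query(func.count(User.id)).scalar()
```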
### 3. Slow Queries

**Symptoms**:

- High P95/P99 response times
- Database CPU high
- Slow query logs filling up

**Solutions**:

- Add missing indexes
- Optimize query structure
- Implement query result caching
- Use EXPLAIN ANALYZE

### 4. Cache Misses

**Symptoms**:

- Low cache hit rates
- High database load
- Inconsistent response times

**Solutions**:

- Warm cache on startup
- Adjust TTL values
- Implement cache hierarchies
- Review invalidation strategy

---

## Conclusion

Performance tuning is an ongoing process. Follow these principles:

1. **Measure before optimizing**
2. **Focus on bottlenecks**
3. **Make incremental changes**
4. **Validate improvements**
5. **Monitor continuously**

For ongoing support:

- Review dashboards daily
- Run load tests weekly
- Conduct performance reviews monthly
- Update this guide as you learn

**Related Documentation**:

- Performance Benchmarks: `/docs/PERFORMANCE_BENCHMARKS.md`
- Load Testing Guide: `/docs/LOAD_TESTING_GUIDE.md`
- Dashboards: `/dashboards/`