Service Level Objectives (SLOs) - VoiceAssist V2

Version: 1.0 | Last Updated: 2025-11-27 | Owner: Platform Engineering Team

Overview

This document defines the Service Level Objectives (SLOs) for VoiceAssist V2. SLOs are reliability targets that balance user expectations with engineering effort.

SLO Framework

  • SLI (Service Level Indicator): Quantitative measure of service behavior (e.g., request latency, error rate)
  • SLO (Service Level Objective): Target value or range for an SLI (e.g., 99.9% availability)
  • SLA (Service Level Agreement): Customer-facing commitment with consequences (not defined yet for internal MVP)

Error Budget

An error budget is the maximum allowed unreliability before violating an SLO. For a 99.9% availability target over 30 days:

  • Allowed downtime: 43.2 minutes/month
  • Allowed errors: 0.1% of requests
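
For illustration, a 30-day window contains 30 × 24 × 60 = 43,200 minutes, so a 99.9% target leaves 0.1% × 43,200 ≈ 43.2 minutes of allowed downtime.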

Core SLOs

1. API Availability SLO

Objective: API endpoints should be available and responsive

Metric       | Target | Measurement Window | Error Budget
Availability | 99.9%  | 30 days            | 43.2 min/month
Success Rate | 99.5%  | 30 days            | 0.5% errors

SLI Definition:

# Availability: Percentage of requests returning 2xx/3xx status
sum(rate(http_requests_total{status_code=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Success Rate: Percentage of requests not returning 5xx errors
1 - (
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)

Rationale:

  • 99.9% availability is an industry-standard target for non-critical services
  • Allows for planned maintenance and incident recovery
  • Balances reliability with development velocity

Exclusions:

  • Planned maintenance windows (announced 48h in advance)
  • User errors (4xx responses except 429 rate limiting)
  • External service failures (OpenAI, Nextcloud) beyond our control
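
As a rough sketch of how these exclusions translate into the error-rate SLI, the query below counts only 5xx responses and 429s against the budget. It assumes the status_code label used throughout this document; planned-maintenance windows would still need to be excluded outside PromQL (for example via alert silences).

# Sketch: error rate counting only 5xx and 429 responses against the SLO
1 - (
  sum(rate(http_requests_total{status_code=~"5..|429"}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)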

2. API Latency SLO

Objective: API requests should complete quickly

Percentile   | Target   | Measurement Window
P50 (median) | < 200ms  | 5 minutes
P95          | < 500ms  | 5 minutes
P99          | < 1000ms | 5 minutes

SLI Definition:

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

Rationale:

  • P50 target ensures fast response for majority of requests
  • P95/P99 targets catch tail latency issues
  • Targets aligned with user patience thresholds (< 1s for interactivity)

Critical Endpoints:

  • /api/auth/login: P95 < 300ms (authentication is time-sensitive)
  • /api/realtime/query: P95 < 2000ms (RAG queries are more complex)
  • /health: P95 < 100ms (health checks must be fast)
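
These per-endpoint targets can be checked with a filtered variant of the latency SLI. The query below is only a sketch; it assumes the endpoint label exposed by the HTTP metrics and uses /api/auth/login as the example value.

# Sketch: P95 latency for a single critical endpoint
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{endpoint="/api/auth/login"}[5m])) by (le)
)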

3. RAG Query Quality SLO

Objective: RAG queries should return relevant, accurate results

Metric                 | Target      | Measurement Window
Query Success Rate     | 99%         | 30 days
Cache Hit Rate         | > 30%       | 24 hours
Average Search Results | > 2 results | 24 hours

SLI Definition:

# Query Success Rate
sum(rate(rag_query_duration_seconds_count{stage="total"}[5m]))
/
sum(rate(rag_query_attempts_total[5m]))

# Cache Hit Rate
sum(rate(cache_hits_total{cache_key_prefix="search_results"}[1h]))
/
sum(rate(cache_hits_total{cache_key_prefix="search_results"}[1h])
    + rate(cache_misses_total{cache_key_prefix="search_results"}[1h]))

# Average Search Results
avg(rag_search_results_total)

Rationale:

  • 99% success rate allows for edge cases and system issues
  • 30% cache hit rate indicates effective caching strategy
  • 2+ results ensure users get actionable information
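
One way to watch the cache-hit target is a warning alert built from the same metrics as the SLI above. The rule below is only a sketch (the alert and group names are illustrative), not part of the shipped slo_alerts.yml.

# Sketch: warn when the RAG search-results cache hit rate drops below the 30% target
groups:
  - name: rag_slo_alerts_sketch
    rules:
      - alert: RAGCacheHitRateLow
        expr: |
          sum(rate(cache_hits_total{cache_key_prefix="search_results"}[1h]))
          /
          (sum(rate(cache_hits_total{cache_key_prefix="search_results"}[1h]))
           + sum(rate(cache_misses_total{cache_key_prefix="search_results"}[1h])))
          < 0.30
        for: 1h
        labels:
          severity: warning
          slo: rag_cache_hit_rate
        annotations:
          summary: "RAG cache hit rate below 30% target"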

4. Database Performance SLO

Objective: Database operations should be fast and reliable

Metric                      | Target  | Measurement Window
Query P95 Latency           | < 100ms | 5 minutes
Connection Success Rate     | 99.9%   | 30 days
Connection Pool Utilization | < 80%   | 5 minutes

SLI Definition:

# Query Latency P95
histogram_quantile(0.95,
  sum(rate(db_query_duration_seconds_bucket[5m])) by (le)
)

# Connection Success Rate
1 - (
  sum(rate(db_connection_errors_total[5m]))
  /
  sum(rate(db_query_duration_seconds_count[5m]))
)

# Pool Utilization
sum(db_connections_total{state="in_use"})
/
(sum(db_connections_total{state="in_use"}) + sum(db_connections_total{state="idle"}))

Rationale:

  • 100ms P95 ensures responsive API layer
  • High connection success rate prevents cascading failures
  • 80% pool utilization threshold leaves headroom for spikes
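
The pool-utilization SLI also lends itself to a pre-computed recording rule so the 80% threshold can be graphed and alerted on cheaply. This is a sketch in the style of slo_recording_rules.yml; the rule and group names are illustrative.

# Sketch: database connection pool utilization (in_use / total)
groups:
  - name: db_slo_recording_rules_sketch
    interval: 30s
    rules:
      - record: slo:db_pool_utilization:ratio
        expr: |
          sum(db_connections_total{state="in_use"})
          /
          (sum(db_connections_total{state="in_use"}) + sum(db_connections_total{state="idle"}))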

5. Cache Performance SLO

Objective: Cache should provide performance improvements

Metric                  | Target          | Measurement Window
Overall Hit Rate        | > 40%           | 24 hours
L1 Cache Hit Rate       | > 60% (of hits) | 24 hours
Cache Operation Latency | P95 < 10ms      | 5 minutes

SLI Definition:

# Overall Cache Hit Rate
sum(rate(cache_hits_total[1h]))
/
(sum(rate(cache_hits_total[1h])) + sum(rate(cache_misses_total[1h])))

# L1 Hit Rate (of all cache hits)
sum(rate(cache_hits_total{cache_layer="l1"}[1h]))
/
sum(rate(cache_hits_total[1h]))

# Cache Operation Latency P95
histogram_quantile(0.95,
  sum(rate(cache_latency_seconds_bucket[5m])) by (le, cache_layer)
)

Rationale:

  • 40% overall hit rate demonstrates effective caching
  • 60% L1 hit rate shows hot data staying in fast cache
  • Sub-10ms latency ensures cache doesn't become bottleneck

6. Document Processing SLO

Objective: Document uploads should complete reliably

Metric              | Target      | Measurement Window
Job Success Rate    | 95%         | 7 days
Processing Time P95 | < 2 minutes | 7 days
Queue Depth         | < 100 jobs  | 5 minutes

SLI Definition:

# Job Success Rate
sum(rate(document_processing_jobs_total{status="completed"}[1h]))
/
sum(rate(document_processing_jobs_total[1h]))

# Processing Time P95
histogram_quantile(0.95,
  sum(rate(document_processing_duration_seconds_bucket[1h])) by (le)
)

# Queue Depth
sum(arq_queue_depth)

Rationale:

  • 95% success rate accounts for malformed documents and external API failures
  • 2-minute P95 ensures users don't wait excessively
  • Queue depth threshold prevents backlog accumulation
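
Because queue depth is a gauge rather than a rate, its threshold can be alerted on directly. The rule below is a sketch using the arq_queue_depth metric from the SLI above; the alert and group names are illustrative.

# Sketch: warn before the processing queue backs up past the 100-job threshold
groups:
  - name: document_processing_slo_alerts_sketch
    rules:
      - alert: DocumentQueueDepthHigh
        expr: sum(arq_queue_depth) > 100
        for: 10m
        labels:
          severity: warning
          slo: document_processing
        annotations:
          summary: "Document processing queue depth above 100 jobs"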

SLO Monitoring Strategy

Recording Rules

Prometheus recording rules pre-compute complex SLI queries:

# /infrastructure/observability/prometheus/rules/slo_recording_rules.yml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # API Availability
      - record: slo:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"2..|3.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # API Latency P95
      - record: slo:api_latency_p95:seconds
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          )

Alerting Rules

Alerts fire when SLOs are at risk or violated:

# /infrastructure/observability/prometheus/rules/slo_alerts.yml
groups:
  - name: slo_alerts
    rules:
      # Critical: SLO violated
      - alert: APIAvailabilitySLOViolation
        expr: slo:api_availability:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "API availability below SLO (99.9%)"
          description: "Current availability: {{ $value | humanizePercentage }}"

      # Warning: Error budget at risk (50% consumed)
      - alert: ErrorBudgetAtRisk
        expr: |
          (1 - slo:api_availability:ratio_rate30d) > 0.0005
        for: 15m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Error budget consumption > 50%"

Dashboard Panels

Grafana dashboard panels for SLO tracking:

  1. Availability Overview: Current vs target availability
  2. Error Budget: Remaining budget and burn rate
  3. Latency Distribution: P50/P95/P99 over time
  4. SLO Compliance: Per-service SLO status (green/yellow/red)
  5. Error Budget Timeline: 30-day error budget consumption

SLO Review Process

Weekly Review

  • Owner: On-call engineer
  • Review: Current SLO status, recent violations
  • Action: Update runbooks if patterns emerge

Monthly Review

  • Owner: Engineering Manager
  • Review:
    • SLO compliance trends
    • Error budget consumption patterns
    • SLO target appropriateness
  • Action: Adjust targets if consistently over/under performing

Quarterly Review

  • Owner: Platform Team
  • Review:
    • SLO framework effectiveness
    • New SLOs needed
    • Retired SLOs
    • Target adjustments based on user feedback
  • Action: Update SLO document and targets

Error Budget Policy

When Error Budget is Healthy (> 50% remaining)

  • ✅ Deploy new features freely
  • ✅ Experiment with new technologies
  • ✅ Planned maintenance allowed
  • ✅ Refactoring and tech debt work

When Error Budget is At Risk (25-50% remaining)

  • ⚠️ Increase review rigor for deployments
  • ⚠️ Prioritize reliability improvements
  • ⚠️ Defer non-critical features
  • ⚠️ Increase monitoring and alerting

When Error Budget is Nearly Depleted (< 25% remaining)

  • 🛑 Feature freeze - only reliability improvements
  • 🛑 Increase on-call staffing
  • 🛑 Daily SLO status reviews
  • 🛑 Defer all non-critical work until budget recovers
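
These thresholds can be tied directly to the error-budget query in the appendix. As a sketch, and assuming the 30-day recording rule slo:api_availability:ratio_rate30d exists, an alert like the following would page when less than 25% of the budget remains; the alert and group names are illustrative.

# Sketch: fire when less than 25% of the 30-day error budget remains
groups:
  - name: error_budget_policy_sketch
    rules:
      - alert: ErrorBudgetNearlyDepleted
        expr: |
          (1 - ((1 - slo:api_availability:ratio_rate30d) / 0.001)) < 0.25
        for: 1h
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Less than 25% of the 30-day error budget remains"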

SLO Exceptions

Planned Maintenance

  • Must be announced 48 hours in advance
  • Limited to 4 hours/month
  • Excluded from availability SLO
  • User-facing status page updated

External Dependencies

Failures of external services are tracked separately:

  • OpenAI API failures: Tracked but excluded from API availability SLO
  • Nextcloud unavailability: Tracked separately
  • DNS/network issues: Excluded if beyond our control

Known Limitations

  • Cold starts: First request after deployment may be slow
  • Cache warming: Cache hit rate temporarily lower after cache clear
  • Large documents: Processing time varies with document size

Appendix: Prometheus Queries

Quick SLO Status Check

# All SLOs at a glance
{__name__=~"slo:.*"}

Error Budget Remaining

# 30-day error budget remaining (as a fraction of the total budget; 1.0 = fully intact)
1 - (
  (1 - slo:api_availability:ratio_rate30d) / 0.001
)

SLO Burn Rate

# How fast are we consuming error budget?
# Burn rate = short-window error rate / error budget (0.1% for the 99.9% SLO)
# > 1.0 means consuming faster than sustainable
(1 - slo:api_availability:ratio_rate1h) / 0.001
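
Burn rate is most useful when alerted on over two windows at once, so short spikes and slow leaks both surface. The rule below is a common multi-window pattern, sketched on the assumption that the 5-minute recording rule defined above and a matching 1-hour rule both exist; 14.4x is the conventional fast-burn threshold for a 30-day budget.

# Sketch: fast-burn alert (at 14.4x, the 30-day budget would be gone in about 2 days)
groups:
  - name: slo_burn_rate_alerts_sketch
    rules:
      - alert: APIAvailabilityFastBurn
        expr: |
          (1 - slo:api_availability:ratio_rate1h) > (14.4 * 0.001)
          and
          (1 - slo:api_availability:ratio_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "API availability error budget burning 14.4x faster than sustainable"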

Contact

For SLO-related questions:

  • Slack: #platform-sre
  • On-call: PagerDuty escalation
  • Documentation: This file and /docs/operations/RUNBOOKS.md