SLO-Driven Observability & Error Budgets
Production SLOs with error budgets, advanced monitoring, distributed tracing, and proactive alerting with noise reduction.
Business Value
Provides objective Go/No-Go deployment decisions and prevents 80% of error budget violations through SLO-driven alerting and error budget policies.
DORA Impact
- Mean Time to Recover
- Change Failure Rate
Key Features
- Distributed Tracing
- SLO Error Budget Management
- ML-Based Anomaly Detection
- Business KPI Monitoring
- Advanced Log Analysis
Who
When
Acceleration (90-180 days)
Capabilities in This Epic
Distributed Tracing
>= 85% of services instrumented for distributed tracing (Jaeger, Tempo) with trace sampling >= 10% of requests.
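One way to hold a sampling floor such as the 10% above is to decide per trace ID, so every service that sees the same trace makes the same keep-or-drop decision. The sketch below is illustrative only: `should_sample` and `SAMPLE_RATE` are hypothetical names, and it does not reproduce the sampler configuration of Jaeger, Tempo, or any tracing SDK.

```python
import hashlib

# Assumption: 0.10 mirrors the ">= 10% of requests" target above.
SAMPLE_RATE = 0.10


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space to avoid float rounding at the extremes.
    return bucket < int(rate * 2**64)


if __name__ == "__main__":
    kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
```

Because the decision depends only on the trace ID, a trace is either kept everywhere or dropped everywhere, which avoids partial traces.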
SLO Error Budget Management
>= 80% of services track error budgets monthly, with alerts when 50% of the budget is consumed and deployment freezes at 90%.
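The 50% alert and 90% freeze thresholds can be expressed as a small policy function. This is a minimal sketch assuming a request-based SLO, where the budget is the number of failed requests the target allows in the period. The names `error_budget_status`, `ALERT_AT`, and `FREEZE_AT` are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the capability text above.
ALERT_AT = 0.50   # alert once half the budget is consumed
FREEZE_AT = 0.90  # freeze deployments at 90% consumption


@dataclass(frozen=True)
class BudgetStatus:
    consumed: float  # fraction of the error budget used (may exceed 1.0)
    action: str      # "ok", "alert", or "freeze"


def error_budget_status(slo_target: float, total_requests: int,
                        failed_requests: int) -> BudgetStatus:
    """Return budget consumption and the policy action for one period."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return BudgetStatus(0.0, "ok")  # no traffic, nothing consumed
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    if consumed >= FREEZE_AT:
        action = "freeze"
    elif consumed >= ALERT_AT:
        action = "alert"
    else:
        action = "ok"
    return BudgetStatus(consumed, action)


if __name__ == "__main__":
    # A 99% SLO over 1,000,000 requests allows about 10,000 failures.
    print(error_budget_status(0.99, 1_000_000, 6_000))
```

A deployment pipeline could call a function like this before each release and require approval whenever the action is not "ok".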
ML-Based Anomaly Detection
>= 60% of critical metrics use ML anomaly detection (DeepAR, ARIMA) for dynamic thresholds instead of static alerts.
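To illustrate the difference between a dynamic and a static threshold, the sketch below derives alert bounds from recent history instead of a fixed number. It deliberately swaps in a much simpler technique (mean plus or minus k standard deviations) for the DeepAR or ARIMA models named above, so it shows only the shape of the idea. `dynamic_bounds` and `is_anomalous` are hypothetical names.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Compute (low, high) alert bounds from recent metric values."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Flag a value that falls outside the bounds derived from history."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high


if __name__ == "__main__":
    recent = [100, 102, 98, 101, 99, 100, 103, 97]  # made-up sample values
    print(dynamic_bounds(recent))
    print(is_anomalous(120, recent))
```

A forecasting model would replace `dynamic_bounds` with predicted intervals that account for seasonality, which this simple version ignores.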
Business KPI Monitoring
>= 70% of services expose business KPIs (orders/min, revenue, conversions) in the observability platform alongside technical metrics.
Advanced Log Analysis
>= 80% of log queries use structured log fields with indexed tags, with query responses under 3 seconds on 30 days of data.
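Structured fields only help queries if services emit them consistently. The following sketch uses Python's standard `logging` module to write one JSON object per log line. The `JsonFormatter` class, the `make_logger` helper, the `fields` key, and the example field names are assumptions for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with queryable fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"fields": {...}}` become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buf = io.StringIO()
    log = make_logger(buf)
    log.info("order placed",
             extra={"fields": {"service": "checkout", "latency_ms": 42}})
    print(buf.getvalue().strip())
```

Keeping fields such as the service name and latency as top-level keys lets the log backend index them as tags rather than parsing free text at query time.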
Implementation Journey
Prerequisites
Complete these before starting:
- Observability monitoring epic complete (basic monitoring)
- Key user journeys and critical services identified
- SLI/SLO framework selected (Prometheus, Datadog, etc.)
Typical Timeline
4 weeks
Effort Estimate
Breakdown by role:
Team Composition
Cross-functional team including: SRE, platform, product
Applicable Environments
Success Metrics
Entry Criteria
Prerequisites to start implementing this epic:
Exit Criteria
Criteria defined at the Acceleration milestone level:
DORA Metrics Impact
Resources
Implementation Kit
Step-by-step guide, templates, and tools for this epic
View SLO-Driven Observability & Error Budgets Implementation Kit
Common Pitfalls
Mitigation: Base SLOs on current performance + margin. Start conservative (99% vs 99.99%). Review quarterly and adjust.
Mitigation: Link deployments to error budget. Require approval when budget low. Report budget status in standups.
Mitigation: Standardize SLI definitions (request-based). Use common instrumentation library. Validate SLI queries in CI.
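A standardized request-based SLI can be modelled as a pair of queries (good requests and total requests) plus a check that could run in CI. This is a sketch under assumptions: `RequestSLI`, `problems`, and `sli_ratio` are hypothetical names, the query strings are placeholders, and the check covers only basic shape, not whether a query is valid in a particular backend.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequestSLI:
    """A request-based SLI: good requests divided by total requests."""
    name: str
    good_query: str
    total_query: str

    def problems(self) -> List[str]:
        """Return a list of definition errors; empty means it passes."""
        issues = []
        good, total = self.good_query.strip(), self.total_query.strip()
        if not self.name.strip():
            issues.append("name is empty")
        if not good:
            issues.append("good query is empty")
        if not total:
            issues.append("total query is empty")
        if good and good == total:
            issues.append("good and total queries are identical")
        return issues


def sli_ratio(good: int, total: int) -> float:
    """Compute the SLI value from request counts."""
    if good < 0 or total < 0 or good > total:
        raise ValueError("counts must satisfy 0 <= good <= total")
    return 1.0 if total == 0 else good / total


if __name__ == "__main__":
    sli = RequestSLI("availability", "placeholder good query",
                     "placeholder total query")
    print(sli.problems(), sli_ratio(990, 1000))
```

A CI job could load every SLI definition in the repository and fail the build when `problems()` returns anything.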
Next Steps
After Completing This Epic
Once you've met all exit criteria, consider these next steps:
- Review metrics to validate DORA improvements
- Document lessons learned and update team playbooks
- Share success stories with other teams
Alternative Paths
Other epics that can be tackled in parallel: