    DevOps
    Way of Working

    ML-Based Anomaly Detection

    Acceleration
    Phase: monitor
    MTTR
    CFR

    Quick Reference

    Phase
    monitor
    Epic
    SLO-Driven Observability & Error Budgets
    Milestone
    Acceleration
    Target
    >= 60% of critical metrics have anomaly detection
    Implementation Time
    Part of the SLO-Driven Observability & Error Budgets epic: 4 weeks (34 hours per capability on average)

    What & Why

    Definition

    >= 60% of critical metrics use ML anomaly detection (DeepAR, ARIMA) for dynamic thresholds instead of static alerts.

    Business Value

    Provides objective Go/No-Go deployment decisions and prevents 80% of error budget violations through SLO-driven alerting and error budget policies. Reaching anomaly detection on >= 60% of critical metrics is a key milestone toward this goal.

    Context

    This capability is part of the Acceleration milestone, which focuses on scaling automation, embedding compliance, and improving speed and reliability. It is essential for teams targeting MTTR and CFR improvements.

    Success Criteria

    Target

    >= 60% of critical metrics have anomaly detection

    Measurement

    Anomaly detection coverage across metrics
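
    The coverage measurement can be sketched as a simple ratio over a metric inventory. This is a minimal illustration, not a prescribed method: the Metric record, its fields, and the sample metric names are hypothetical, and it assumes each metric has already been tagged as critical or not.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    # Hypothetical inventory record; field names are assumptions for this sketch.
    name: str
    critical: bool
    has_anomaly_detection: bool


def anomaly_coverage(metrics):
    """Fraction of critical metrics that have anomaly detection enabled."""
    critical = [m for m in metrics if m.critical]
    if not critical:
        return 0.0  # no critical metrics defined, so nothing is covered
    covered = sum(1 for m in critical if m.has_anomaly_detection)
    return covered / len(critical)


def meets_target(metrics, target=0.60):
    """True when coverage reaches the >= 60% target."""
    return anomaly_coverage(metrics) >= target


# Illustrative inventory: 5 critical metrics, 3 of them covered.
inventory = [
    Metric("checkout_request_rate", True, True),
    Metric("checkout_error_rate", True, True),
    Metric("checkout_p99_latency", True, True),
    Metric("search_request_rate", True, False),
    Metric("search_error_rate", True, False),
    Metric("debug_log_volume", False, False),  # non-critical, ignored
]
coverage = anomaly_coverage(inventory)
print(f"coverage: {coverage:.0%}, target met: {meets_target(inventory)}")
```

    Non-critical metrics are excluded from both the numerator and the denominator, so adding detection to low-value metrics does not inflate the figure.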

    Evidence

    • Anomaly detection model configs
    • Anomaly alert examples
    • False positive rate < 10%
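
    The false positive evidence item can be checked from a log of reviewed alerts. The sketch below assumes "false positive rate" means the share of fired alerts that reviewers judged not to be real anomalies; the source does not define the term, and the review data shown is made up for illustration.

```python
def false_positive_rate(alert_labels):
    """alert_labels: list of bools, True when a fired alert was a real anomaly.

    Returns the share of fired alerts judged to be false alarms.
    """
    if not alert_labels:
        return 0.0  # nothing fired, so nothing was a false alarm
    false_alerts = sum(1 for is_real in alert_labels if not is_real)
    return false_alerts / len(alert_labels)


# Hypothetical review outcome: 50 alerts fired, 4 judged to be false alarms.
reviewed = [True] * 46 + [False] * 4
rate = false_positive_rate(reviewed)
print(f"false positive rate: {rate:.1%}, within 10% limit: {rate < 0.10}")
```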

    In Practice

    Real-World Implementation

    Teams train ML models on 7-30 days of metric history so that the models learn seasonal patterns and normal variance, then alert on deviations greater than 3 sigma from the expected value.
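
    The idea of a seasonal baseline with a 3 sigma threshold can be sketched without an ML library. The code below is a deliberately simple stand-in, not DeepAR or ARIMA: it buckets history by hour of the week and stores a mean and standard deviation per bucket. All function names and the synthetic 28-day history are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def hour_of_week(ts):
    """Seasonal bucket: 0..167, Monday 00:00 is bucket 0."""
    return ts.weekday() * 24 + ts.hour


def fit_baseline(history):
    """history: iterable of (datetime, value). Returns {bucket: (mean, std)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[hour_of_week(ts)].append(value)
    # A sample standard deviation needs at least two observations per bucket.
    return {b: (mean(v), stdev(v)) for b, v in buckets.items() if len(v) >= 2}


def sigma_deviation(baseline, ts, value):
    """Signed distance from the expected value, in standard deviations."""
    stats = baseline.get(hour_of_week(ts))
    if stats is None or stats[1] == 0:
        return None  # bucket unseen or no variance: cannot score this point
    mu, sigma = stats
    return (value - mu) / sigma


def is_anomaly(baseline, ts, value, threshold=3.0):
    z = sigma_deviation(baseline, ts, value)
    return z is not None and abs(z) > threshold


# Synthetic 28-day hourly history (2024-01-01 is a Monday).
start = datetime(2024, 1, 1)
history = []
for h in range(28 * 24):
    ts = start + timedelta(hours=h)
    business = ts.weekday() < 5 and 9 <= ts.hour < 17
    base = 1500 if business else 300
    jitter = (-100, -50, 50, 100)[h // (7 * 24)]  # one offset per week
    history.append((ts, base + jitter))

baseline = fit_baseline(history)
wednesday_2pm = datetime(2024, 1, 31, 14)
print(is_anomaly(baseline, wednesday_2pm, 500))
print(is_anomaly(baseline, wednesday_2pm, 1450))
```

    Bucketing by hour of the week is what lets the same 500 req/min reading count as normal at night and anomalous during business hours. A production model would also need to handle trend, holidays, and missing data, which this sketch ignores.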

    Concrete Example

    Normal request rate: 1000-2000 req/min on weekdays, 9am-5pm. On Wednesday at 2pm the model observes 500 req/min, which is 3.5 sigma below the expected value, so the alert fires.
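
    The arithmetic behind this example can be made explicit. The source gives the observed value and the 3.5 sigma result but not the model's expected value or standard deviation, so the sketch assumes an expected rate of 1500 req/min (the midpoint of the normal range) and backs out the sigma that would yield 3.5; both numbers are assumptions.

```python
def z_score(observed, expected, sigma):
    """Signed deviation from the expected value, in standard deviations."""
    return (observed - expected) / sigma


expected = 1500.0      # assumed: midpoint of the 1000-2000 req/min range
sigma = 1000.0 / 3.5   # assumed: about 286 req/min, implied by the 3.5 sigma figure
observed = 500.0       # Wednesday 2pm reading from the example

z = z_score(observed, expected, sigma)
alert = abs(z) > 3     # 3 sigma alert threshold
print(f"z = {z:.2f}, alert fires: {alert}")
```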

    Implementation Guide

    Prerequisites

    Application Metrics
    >= 80% of services expose RED metrics

    Implementation Steps

    Follow the measurement approach: Anomaly detection coverage across metrics

    For detailed step-by-step guidance, refer to the SLO-Driven Observability & Error Budgets Implementation Kit.

    Resources

    Implementation Kit

    SLO-Driven Observability & Error Budgets Kit

    Templates

    Browse all templates

    Related Resources

    View learning paths

    Related Capabilities

    Prerequisites

    Implement these first

    Application Metrics

    Enables

    What this unlocks

    Predictive Rollback Detection
    Predictive Incident Detection
    Adaptive Monitoring Thresholds

    Complementary

    Often adopted together, from the SLO-Driven Observability & Error Budgets epic

    Distributed Tracing
    SLO Error Budget Management
    Business KPI Monitoring
    Advanced Log Analysis

    Troubleshooting & FAQs

    Common Issues

    Issue: Target metric not improving

    Solution: Verify that the measurement is accurate, check that prerequisites are fully implemented, and review evidence artifacts for completeness.

    Issue: Team resistance to adoption

    Solution: Start with a pilot team, demonstrate value with metrics, and provide training and support during the transition.

    Issue: Inconsistent implementation across teams

    Solution: Create shared templates and guidelines, establish regular sync meetings, and use automation to enforce standards.

    Frequently Asked Questions

    Can we implement this before completing prerequisites?

    While possible, it's not recommended. Prerequisites ensure foundational practices are in place, making this capability more effective and easier to adopt.

    How long does implementation typically take?

    Most capabilities can be implemented within 90 days when tackled as part of the Acceleration milestone. Individual timelines vary based on team size and existing practices.


    DevOps practices for the entire delivery lifecycle

    © 2019-2026 devopswow.com. Created by Burhan Öcüt
