ML-Based Anomaly Detection

Acceleration

Phase: monitor

MTTR

CFR

Quick Reference

Phase

monitor

Epic

SLO-Driven Observability & Error Budgets

Milestone

Acceleration

Target

>= 60% of critical metrics have anomaly detection

Implementation Time

Part of SLO-Driven Observability & Error Budgets epic: 4 weeks (34 hours per capability avg)

What & Why

Definition

>= 60% of critical metrics use ML anomaly detection (DeepAR, ARIMA) for dynamic thresholds instead of static alerts.

Business Value

Provides objective Go/No-Go deployment decisions and prevents 80% of error budget violations through SLO-driven alerting and error budget policies. Achieving anomaly detection on >= 60% of critical metrics is a key milestone toward this goal.

Context

This capability is part of the Acceleration milestone's focus on scaling automation, embedding compliance, and improving speed and reliability. It is essential for teams targeting MTTR and CFR improvements.

Success Criteria

Target

>= 60% of critical metrics have anomaly detection

Measurement

Anomaly detection coverage across metrics
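
One way to make the coverage measurement concrete is to compute the share of critical metrics that have an anomaly detection model attached. The sketch below is illustrative only: the `Metric` record, its fields, and the function names are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    critical: bool
    has_anomaly_detection: bool


def anomaly_coverage(metrics):
    """Fraction of critical metrics that have anomaly detection configured."""
    critical = [m for m in metrics if m.critical]
    if not critical:
        return 0.0  # no critical metrics defined, so nothing is covered
    covered = sum(1 for m in critical if m.has_anomaly_detection)
    return covered / len(critical)


def meets_target(metrics, target=0.60):
    """True when coverage reaches the >= 60% target from this capability."""
    return anomaly_coverage(metrics) >= target
```

Non-critical metrics are ignored on purpose, since the target is stated only in terms of critical metrics.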

Evidence

Anomaly detection model configs
Anomaly alert examples
False positive rate < 10%
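
The false positive evidence item can be checked with a small calculation over reviewed alerts. The source does not define how the rate is measured, so this sketch assumes it is the share of fired alerts that reviewers judged not to be real anomalies. The function names and input shape are hypothetical.

```python
def false_positive_rate(alert_outcomes):
    """Share of fired alerts that were not real anomalies.

    alert_outcomes: list of bools, True when a reviewed alert
    corresponded to a real anomaly (assumed review process).
    """
    if not alert_outcomes:
        return 0.0  # no alerts fired, so no false positives
    false_alerts = sum(1 for real in alert_outcomes if not real)
    return false_alerts / len(alert_outcomes)


def within_evidence_limit(alert_outcomes, limit=0.10):
    """True when the rate is strictly below the < 10% evidence threshold."""
    return false_positive_rate(alert_outcomes) < limit
```

For example, 1 false alert out of 20 reviewed alerts gives a rate of 5%, which satisfies the threshold.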

In Practice

Real-World Implementation

Teams train ML models on 7-30 days of metric history, detect anomalies relative to seasonal patterns and normal variance, and alert on deviations greater than 3 sigma.
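
The approach above can be sketched with a deliberately simple stand-in for a trained model: a per-weekday, per-hour baseline of mean and standard deviation, with a 3 sigma alert rule. This is not DeepAR or ARIMA. It only illustrates the idea of seasonal, dynamic thresholds, and the function names and data layout are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_seasonal_baseline(history):
    """Build a {(weekday, hour): (mean, stdev)} baseline.

    history: iterable of (weekday, hour, value) samples, for example
    7-30 days of per-minute request rates (assumed layout).
    """
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    # stdev needs at least two samples per bucket
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) >= 2}


def is_anomaly(baseline, weekday, hour, value, threshold=3.0):
    """Flag values more than `threshold` sigma from the seasonal mean."""
    if (weekday, hour) not in baseline:
        return False  # no history for this time slot, so do not alert
    mu, sigma = baseline[(weekday, hour)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold
```

Because the threshold is derived from each time slot's own history, the same absolute value can be normal at one hour and anomalous at another, which is the main difference from a static alert.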

Concrete Example

Normal request rate: 1000-2000 req/min (weekdays, 9am-5pm). The model detects an anomaly: 500 req/min on Wednesday at 2pm (3.5 sigma below expected). The alert fires.
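
The arithmetic behind this example can be worked through as a z-score. The source does not state the expected value at that time, so the sketch assumes 1500 req/min (the midpoint of the normal range) purely for illustration. The standard deviation is then backed out from the reported 3.5 sigma.

```python
def z_score(observed, expected, sigma):
    """Number of standard deviations between observed and expected."""
    return (observed - expected) / sigma


expected = 1500.0   # ASSUMPTION: midpoint of the 1000-2000 req/min range
observed = 500.0    # from the example
reported_z = -3.5   # from the example: 3.5 sigma below expected

# Standard deviation implied by the assumed expected value (about 285.7 req/min)
implied_sigma = (observed - expected) / reported_z

# The alert rule from this section: fire when the deviation exceeds 3 sigma
alert_fires = abs(z_score(observed, expected, implied_sigma)) > 3.0
```

Under that assumption the alert fires, because the deviation of 3.5 sigma exceeds the 3 sigma threshold.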

Implementation Guide

Prerequisites

Application Metrics

>= 80% of services expose RED metrics

Implementation Steps

Follow the measurement approach: Anomaly detection coverage across metrics

For detailed step-by-step guidance, refer to the SLO-Driven Observability & Error Budgets Implementation Kit.

Resources

Related Capabilities

Prerequisites

Implement these first

Application Metrics

Enables

What this unlocks

Predictive Rollback Detection

Predictive Incident Detection

Adaptive Monitoring Thresholds

Complementary

Often adopted together, from the SLO-Driven Observability & Error Budgets epic

Distributed Tracing

SLO Error Budget Management

Business KPI Monitoring

Advanced Log Analysis

Troubleshooting & FAQs

Common Issues

Issue: Target metric not improving

Solution: Verify measurement is accurate, check if prerequisites are fully implemented, review evidence artifacts for completeness

Issue: Team resistance to adoption

Solution: Start with pilot team, demonstrate value with metrics, provide training and support during transition

Issue: Inconsistent implementation across teams

Solution: Create shared templates and guidelines, establish regular sync meetings, use automation to enforce standards

Frequently Asked Questions

Can we implement this before completing prerequisites?

While possible, it's not recommended. Prerequisites ensure foundational practices are in place, making this capability more effective and easier to adopt.

How long does implementation typically take?

Most capabilities can be implemented within 90 days when tackled as part of the Acceleration milestone. Individual timelines vary based on team size and existing practices.
