Observability & Monitoring Foundations

Foundation Milestone

Phase: monitor

MTTR

CFR

Instrumentation for logs, metrics, and traces. Golden-signal dashboards, health checks, SLO drafts, and incident response runbooks.

Business Value

Reduces mean time to detection (MTTD) from 4 hours to 15 minutes and enables proactive resolution of 60% of incidents before users are affected.

DORA Impact

Mean Time to Recover
Change Failure Rate

Key Features

Centralized Logging
Application Metrics
Health Check Endpoints
Alerting Rules
Service Level Objectives
Observability Dashboards

Who

SRE

Platform

Engineer

When

Foundation (0-90 days)

Capabilities in This Epic

Centralized Logging

>= 90% of services send structured logs to a centralized platform (such as ELK, Loki, or CloudWatch) with retention >= 30 days.

Target: >= 90% services send structured logs
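As an illustration of what "structured logs" can mean in practice, here is a minimal sketch using only the Python standard library. It assumes one JSON object per line and the standard fields named under Common Pitfalls (timestamp, level, service, correlation ID). The class name JsonFormatter, the helper build_logger, and the service name "checkout" are hypothetical, not part of any prescribed tooling.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical: name of the emitting service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Attached by callers via `extra=`; None when not supplied.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler()  # a log shipper would collect this stream
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as build_logger("checkout").info("order placed", extra={"correlation_id": "req-123"}) would emit a single JSON line that a platform like ELK or Loki can index by field.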

Application Metrics

>= 80% of services expose RED metrics (Rate, Errors, Duration) in Prometheus/StatsD format.

Target: >= 80% services expose RED metrics
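To make RED concrete, the sketch below keeps Rate, Errors, and Duration counters in memory and renders them in the Prometheus text exposition format. It is a standard-library illustration only; a real service would more likely use an official client library. The metric names (http_requests_total and so on) and the RedMetrics class are assumptions chosen for the example.

```python
from collections import defaultdict


class RedMetrics:
    """In-memory RED counters keyed by route."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)        # Rate: request count
        self.errors = defaultdict(int)          # Errors: 5xx responses
        self.duration_sum = defaultdict(float)  # Duration: total seconds

    def observe(self, route: str, status: int, seconds: float) -> None:
        """Record one finished request."""
        self.requests[route] += 1
        if status >= 500:
            self.errors[route] += 1
        self.duration_sum[route] += seconds

    def render(self) -> str:
        """Return the counters as Prometheus text exposition lines."""
        routes = sorted(self.requests)
        lines = ["# TYPE http_requests_total counter"]
        for r in routes:
            lines.append(f'http_requests_total{{route="{r}"}} {self.requests[r]}')
        lines.append("# TYPE http_request_errors_total counter")
        for r in routes:
            count = self.errors.get(r, 0)
            lines.append(f'http_request_errors_total{{route="{r}"}} {count}')
        lines.append("# TYPE http_request_duration_seconds summary")
        for r in routes:
            total = self.duration_sum[r]
            lines.append(f'http_request_duration_seconds_sum{{route="{r}"}} {total}')
            lines.append(
                f'http_request_duration_seconds_count{{route="{r}"}} {self.requests[r]}'
            )
        return "\n".join(lines) + "\n"
```

Request rate and error ratio are then derived at query time from the counters, and average duration from the sum divided by the count. Percentile latency (such as p95) would need a histogram, which this sketch leaves out.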

Health Check Endpoints

100% of services expose /health and /ready endpoints for liveness and readiness probes.

Target: 100% services have health endpoints
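The difference between the two endpoints can be sketched as follows: /health answers "is the process alive?", while /ready answers "can it take traffic right now?". This is a minimal standard-library example. The STATE flag, the probe_response helper, the JSON bodies, and port 8080 are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical readiness flag; a real service would set it to True once its
# dependencies (database, cache, downstream APIs) are reachable.
STATE = {"ready": False}


def probe_response(path: str, ready: bool) -> Tuple[int, Dict[str, object]]:
    """Map a probe path to an HTTP status code and JSON body."""
    if path == "/health":
        return 200, {"status": "ok"}  # liveness: the process can answer
    if path == "/ready":
        # readiness: 503 tells the orchestrator to withhold traffic
        return (200, {"ready": True}) if ready else (503, {"ready": False})
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_response(self.path, STATE["ready"])
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep probe traffic out of the access log


def serve(port: int = 8080) -> None:
    """Blocking call; run it in its own thread or process."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the decision logic in probe_response, separate from the HTTP handler, makes the probe behaviour testable without opening a socket.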

Alerting Rules

>= 80% of services have alerting for high error rate (>= 5% 5xx), high latency (p95 >= 1s), and down status.

Target: >= 80% services have critical alerts
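The three alert conditions above can be expressed as a small evaluation function. The sketch assumes the thresholds stated in this capability (5xx rate >= 5%, p95 latency >= 1s, service down) and a nearest-rank definition of p95. The WindowStats type and the alert names are hypothetical, and a production setup would normally define these rules in the alerting system rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Measurements for one service over one evaluation window."""
    total_requests: int
    server_errors: int        # responses with a 5xx status
    latencies_s: List[float]  # per-request latency in seconds
    up: bool                  # result of the most recent health check


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile; 0.0 for an empty window."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def firing_alerts(stats: WindowStats) -> List[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    if not stats.up:
        alerts.append("ServiceDown")
    if stats.total_requests and stats.server_errors / stats.total_requests >= 0.05:
        alerts.append("HighErrorRate")
    if p95(stats.latencies_s) >= 1.0:
        alerts.append("HighLatency")
    return alerts
```

For example, a window with 100 requests, 5 of them 5xx, sits exactly on the 5% threshold and would fire HighErrorRate.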

Service Level Objectives

>= 60% of user-facing services have defined SLOs with >= 99% availability target and <= 500ms latency target.

Target: >= 60% services have documented SLOs
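An availability target implies an error budget, which is often easier to reason about than the percentage itself. The sketch below assumes a 30-day rolling window and a request-based availability definition (good requests divided by total requests). Neither assumption comes from this epic, and both function names are illustrative.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the target allows within the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - target) * window_days * 24 * 60


def budget_consumed(good: int, total: int, target: float) -> float:
    """Fraction of the error budget spent; above 1.0 the SLO is missed."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, nothing spent
    allowed_bad = (1.0 - target) * total
    return (total - good) / allowed_bad
```

With a 99% target over 30 days, error_budget_minutes gives roughly 432 minutes (about 7.2 hours). A service that served 9,950 good requests out of 10,000 has spent about half of its budget.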

Observability Dashboards

>= 80% of services have Grafana/Datadog dashboards showing RED metrics, resource usage, and business KPIs.

Target: >= 80% services have dashboards
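Each capability in this epic is tracked as a percentage of services, so progress can be computed from a simple service inventory. The following sketch assumes the inventory is a list of dictionaries with one boolean per capability. The key names are hypothetical. The SLO target is left out because it applies only to user-facing services and would need a filtered inventory.

```python
from typing import Dict, List

# Targets taken from the capability descriptions above (percent of services).
TARGETS_PCT = {
    "structured_logs": 90,
    "red_metrics": 80,
    "health_endpoints": 100,
    "critical_alerts": 80,
    "dashboards": 80,
}


def coverage_pct(services: List[Dict[str, bool]], capability: str) -> float:
    """Percentage of services that have the given capability."""
    if not services:
        return 0.0
    have = sum(1 for s in services if s.get(capability, False))
    return 100.0 * have / len(services)


def gaps(services: List[Dict[str, bool]]) -> Dict[str, float]:
    """Capabilities below target, mapped to their current coverage."""
    result = {}
    for capability, target in TARGETS_PCT.items():
        current = coverage_pct(services, capability)
        if current < target:
            result[capability] = current
    return result
```

Running gaps against the inventory on a regular cadence gives a short list of capabilities that still need work before the epic's targets are met.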

Implementation Journey

Prerequisites

Complete these before starting:

Services deployed to at least one environment
Log aggregation needs identified
Metrics collection requirements defined

Typical Timeline

3.5 weeks

Effort Estimate

140 hours

≈ 18 days

Breakdown by role:

SRE: 80 hours

Platform: 40 hours

Engineering: 20 hours
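The totals above can be cross-checked with a few lines of arithmetic. The conversion factors are assumptions, not stated in the estimate: an 8-hour working day and a 5-day working week. Under those assumptions the role breakdown, the day count, and the 3.5-week timeline agree with each other.

```python
HOURS_BY_ROLE = {"SRE": 80, "Platform": 40, "Engineering": 20}
HOURS_PER_DAY = 8  # assumption: one person-day is 8 working hours
DAYS_PER_WEEK = 5  # assumption: 5 working days per week

total_hours = sum(HOURS_BY_ROLE.values())
person_days = total_hours / HOURS_PER_DAY
weeks = person_days / DAYS_PER_WEEK

print(total_hours, person_days, weeks)  # prints: 140 17.5 3.5
```

The 17.5 person-days round up to the "≈ 18 days" figure.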

Team Composition

Cross-functional team including SRE, platform, and engineering.

Applicable Environments

regulated

non-regulated

Success Metrics

Entry Criteria

Prerequisites to start implementing this epic:

Services deployed to at least one environment

Log aggregation needs identified

Metrics collection requirements defined

Exit Criteria

Criteria defined at the Foundation milestone level:

Deployment frequency: >= weekly (staging)

Lead time: <= 7 days (commit to staging)

Change failure rate: <= 20%

MTTR: <= 4h (staging)

Observability coverage: >= 80% services instrumented

CI success: >= 90%

Flaky tests: < 5%

SBOM coverage: >= 90% services

Secrets policy: Approved secrets manager only

PR cycle time: p50 <= 24h

Build success: main >= 95%, PR >= 90%

Ownership coverage: >= 90% services
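The numeric exit criteria lend themselves to an automated check. The sketch below encodes a subset of them as comparison operator and threshold pairs and reports which ones a set of measurements fails. The dictionary keys are hypothetical names, and criteria that are not a single number (deployment frequency, secrets policy, the two-part build success rule) are deliberately omitted.

```python
import operator
from typing import Dict, List

# (comparison, threshold) pairs taken from the exit criteria above.
EXIT_CRITERIA = {
    "change_failure_rate_pct": (operator.le, 20),
    "mttr_hours": (operator.le, 4),
    "observability_coverage_pct": (operator.ge, 80),
    "ci_success_pct": (operator.ge, 90),
    "flaky_tests_pct": (operator.lt, 5),
    "sbom_coverage_pct": (operator.ge, 90),
    "ownership_coverage_pct": (operator.ge, 90),
}


def unmet_criteria(measured: Dict[str, float]) -> List[str]:
    """Names of criteria that are missing or fail their threshold."""
    unmet = []
    for name, (compare, threshold) in EXIT_CRITERIA.items():
        value = measured.get(name)
        if value is None or not compare(value, threshold):
            unmet.append(name)
    return sorted(unmet)
```

A missing measurement is treated as unmet, so a criterion cannot pass simply because nobody collected the number.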

DORA Metrics Impact

MTTR: 24 hours to 4 hours (83% reduction)

CFR: 30% to 20% (33% relative reduction)
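The two percentages are relative reductions from the starting value, not percentage-point differences. A short check, with relative_reduction_pct as an illustrative helper name:

```python
def relative_reduction_pct(before: float, after: float) -> float:
    """Reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


print(round(relative_reduction_pct(24, 4)))   # MTTR, hours: prints 83
print(round(relative_reduction_pct(30, 20)))  # CFR, percent: prints 33
```

For CFR, the drop from 30% to 20% is 10 percentage points, which is one third of the starting value.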

Resources

Implementation Kit

Step-by-step guide, templates, and tools for this epic

View Observability & Monitoring Foundations Implementation Kit

Templates

Ready-to-use templates for implementing capabilities

Browse All Templates

Learn More

Tutorials & Learning Paths
Case Studies & Examples

Common Pitfalls

Too many alerts, causing alert fatigue and missed real issues
Mitigation: Consolidate related alerts. Set meaningful thresholds rather than accepting defaults. Route alerts to the appropriate teams. Review alert noise weekly.

Logs not structured, difficult to query
Mitigation: Use JSON logging format. Include correlation IDs. Define standard fields (timestamp, level, service). Validate format in tests.

Metrics collected but never reviewed or acted upon
Mitigation: Create dashboards for key metrics. Set up regular review meetings. Link metrics to SLOs. Alert on SLO violations.
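The "validate format in tests" mitigation for unstructured logs can be as small as a helper that a unit test runs against captured log output. This sketch assumes JSON-lines logging and treats the standard fields named in the mitigation, plus the correlation ID, as required. The field key names and the function name are illustrative.

```python
import json
from typing import List

# Standard fields from the logging mitigation; key names are illustrative.
REQUIRED_FIELDS = ("timestamp", "level", "service", "correlation_id")


def missing_fields(line: str) -> List[str]:
    """Required fields absent from one log line.

    A line that is not a JSON object is reported as missing every field.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return list(REQUIRED_FIELDS)
    if not isinstance(record, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if field not in record]
```

A test can then assert that missing_fields returns an empty list for every line a service emits during a representative request.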

Next Steps

After Completing This Epic

Once you've met all exit criteria, consider these next steps:

Review metrics to validate DORA improvements
Document lessons learned and update team playbooks
Share success stories with other teams

Continue To

The natural next epic in the roadmap sequence:

Continuous Planning & Compliance Integration

Alternative Paths

Other epics that can be tackled in parallel:

Backlog Quality & Planning Enablement
Code Quality & Review Standards
CI/CD & Build Automation
Testing Strategy & Quality Gates
