Resilient Operations & Chaos Engineering

Acceleration Milestone

Phase: operate

MTTR

CFR

Chaos engineering experiments, disaster recovery automation, advanced IaC with policy enforcement, and resilience testing.

Business Value

Improves system uptime from 99.5% to 99.9% and reduces the blast radius of failures by 70% through circuit breakers and chaos engineering validation.
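To put the uptime figures in concrete terms, the sketch below converts each availability percentage into an annual downtime budget. It assumes a 365-day year and counts all downtime equally; the function name is illustrative and not part of the roadmap.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


before = annual_downtime_hours(99.5)
after = annual_downtime_hours(99.9)
print(f"99.5% availability: {before:.1f} hours of downtime per year")
print(f"99.9% availability: {after:.2f} hours of downtime per year")
```

Under these assumptions, moving from 99.5% to 99.9% shrinks the allowed downtime from about 43.8 hours to about 8.76 hours per year.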

DORA Impact

Mean Time to Recover
Change Failure Rate

Key Features

Chaos Engineering Practices
Automated Disaster Recovery
Circuit Breaker Patterns
Adaptive Rate Limiting
Graceful Degradation Strategies

Who

SRE

Platform

When

Acceleration (90-180 days)

Capabilities in This Epic

Chaos Engineering Practices

>= 60% of critical services undergo monthly chaos experiments (pod failures, network latency, resource exhaustion).

Target: >= 60% services chaos tested monthly
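One way to track this target is to record the date of the most recent experiment per critical service and compute the share tested within the last 30 days. The sketch below is a minimal illustration; the service names, dates, and the `chaos_coverage` helper are hypothetical, and the 30-day window is an assumed reading of "monthly".

```python
from datetime import date, timedelta
from typing import Dict, Optional


def chaos_coverage(
    last_experiment: Dict[str, Optional[date]],
    today: date,
    window_days: int = 30,
) -> float:
    """Fraction of services with a chaos experiment in the last window_days."""
    if not last_experiment:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    tested = sum(
        1
        for ran_on in last_experiment.values()
        if ran_on is not None and ran_on >= cutoff
    )
    return tested / len(last_experiment)


# Hypothetical inventory: service -> date of most recent experiment (None = never).
services = {
    "checkout": date(2024, 5, 20),
    "payments": date(2024, 5, 28),
    "search": date(2024, 3, 2),
    "auth": date(2024, 5, 10),
    "catalog": None,
}

coverage = chaos_coverage(services, today=date(2024, 6, 1))
print(f"{coverage:.0%} tested in the last 30 days; target met: {coverage >= 0.60}")
```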

Automated Disaster Recovery

>= 80% of critical services have automated DR failover tested quarterly with RTO < 1hr and RPO < 15min.

Target: RTO < 1hr and RPO < 15min for >= 80% of critical services
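The RTO and RPO thresholds can be checked mechanically against the results of each quarterly drill. The sketch below assumes drill results are recorded as a measured recovery time and a measured data-loss window per service; the `DrillResult` type and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

RTO_LIMIT = timedelta(hours=1)      # recovery time objective from the target
RPO_LIMIT = timedelta(minutes=15)   # recovery point objective from the target


@dataclass
class DrillResult:
    service: str
    recovery_time: timedelta      # time from failure to restored service
    data_loss_window: timedelta   # age of the newest data that was lost


def meets_objectives(result: DrillResult) -> bool:
    """True when the drill stayed strictly under both RTO and RPO."""
    return (
        result.recovery_time < RTO_LIMIT
        and result.data_loss_window < RPO_LIMIT
    )


def dr_compliance(results: List[DrillResult]) -> float:
    """Fraction of drilled services that met both objectives."""
    if not results:
        return 0.0
    return sum(meets_objectives(r) for r in results) / len(results)


# Hypothetical results from one quarterly drill.
drills = [
    DrillResult("checkout", timedelta(minutes=35), timedelta(minutes=5)),
    DrillResult("payments", timedelta(minutes=50), timedelta(minutes=10)),
    DrillResult("search", timedelta(minutes=70), timedelta(minutes=5)),
    DrillResult("auth", timedelta(minutes=20), timedelta(minutes=2)),
    DrillResult("catalog", timedelta(minutes=45), timedelta(minutes=14)),
]

compliance = dr_compliance(drills)
print(f"{compliance:.0%} of services met RTO/RPO; target met: {compliance >= 0.80}")
```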

Circuit Breaker Patterns

>= 75% of service-to-service calls protected by circuit breakers (Istio, Envoy, Resilience4j) preventing cascade failures.

Target: >= 75% calls have circuit breakers
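To illustrate the pattern itself, here is a minimal, single-threaded circuit breaker sketch with closed, open, and half-open states. It is not how Istio, Envoy, or Resilience4j implement the feature; the class name, thresholds, and injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker; not thread-safe and not production-ready."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed,
            # (re)opens the circuit and restarts the timeout.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls immediately while the circuit is open is what limits cascade failures: callers fail fast instead of queueing behind a dependency that is already struggling.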

Adaptive Rate Limiting

>= 80% of public APIs have adaptive rate limiting protecting against traffic spikes and abuse.

Target: >= 80% APIs have rate limiting
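The page does not define "adaptive", so the sketch below takes one possible reading: a token bucket whose refill rate is raised additively while the backend is healthy and halved when it reports trouble (additive increase, multiplicative decrease). The class name, parameters, and health signal are all assumptions.

```python
import time
from typing import Callable


class AdaptiveTokenBucket:
    """Token bucket with an AIMD-adjusted refill rate (illustrative only)."""

    def __init__(
        self,
        rate: float,       # tokens added per second
        burst: float,      # maximum tokens the bucket can hold
        min_rate: float,
        max_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.burst = burst
        self.min_rate = min_rate
        self.max_rate = max_rate
        self._clock = clock
        self._tokens = burst
        self._last = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def report(self, healthy: bool) -> None:
        """Adapt the rate: +1 token/s when healthy, halve when unhealthy."""
        if healthy:
            self.rate = min(self.max_rate, self.rate + 1.0)
        else:
            self.rate = max(self.min_rate, self.rate / 2)
```

In practice the health signal might come from backend latency or error-rate metrics; a caller would invoke `allow()` per request and `report()` on a periodic health check.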

Graceful Degradation Strategies

>= 70% of services implement degraded mode (serve cached data, disable non-critical features) during dependency failures.

Target: >= 70% services support degraded mode
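Serving cached data during a dependency failure can be sketched as a small wrapper around the dependency call. The example below assumes an in-memory dictionary as the cache and a caller-supplied fetch function; the `fetch_with_fallback` helper and the keys are hypothetical.

```python
from typing import Callable, Dict, Tuple


def fetch_with_fallback(
    key: str,
    fetch: Callable[[str], str],
    cache: Dict[str, str],
) -> Tuple[str, bool]:
    """Return (value, degraded).

    On success the fresh value is cached and degraded is False.
    If the dependency fails and a cached value exists, that (possibly
    stale) value is returned with degraded set to True.
    If there is nothing cached, the original error propagates.
    """
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key], True
        raise
    cache[key] = value
    return value, False


def unavailable(key: str) -> str:
    """Stand-in for a dependency that is currently down."""
    raise ConnectionError("dependency unavailable")


cache: Dict[str, str] = {}
print(fetch_with_fallback("user:1", lambda k: "fresh profile", cache))
print(fetch_with_fallback("user:1", unavailable, cache))
```

Returning the `degraded` flag lets the caller disable non-critical features or mark the response as stale rather than silently serving old data.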

Implementation Journey

Prerequisites

Complete these before starting:

Observability monitoring epic complete (metrics, logs)
SLO-observability epic complete (SLOs, error budgets)
Incident response process established

Typical Timeline

4.5 weeks

Effort Estimate

180 hours

≈ 23 days

Breakdown by role:

SRE: 100 hours

Platform: 50 hours

Engineering: 30 hours
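The timeline and effort figures above agree with one another if a working day is 8 hours and a working week is 5 days. Both conversion factors are assumptions, since the page does not state them; the short check below makes the arithmetic explicit.

```python
import math

HOURS_PER_DAY = 8   # assumption: length of one working day
DAYS_PER_WEEK = 5   # assumption: working days per week

role_hours = {"SRE": 100, "Platform": 50, "Engineering": 30}

total_hours = sum(role_hours.values())          # 100 + 50 + 30
working_days = total_hours / HOURS_PER_DAY      # 180 / 8
working_weeks = working_days / DAYS_PER_WEEK    # 22.5 / 5

print(f"Total effort: {total_hours} hours")
print(f"Working days: {working_days} (rounded up: {math.ceil(working_days)})")
print(f"Working weeks: {working_weeks}")
```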

Team Composition

Cross-functional team including: SRE, Platform

Applicable Environments

regulated

non-regulated

Success Metrics

Entry Criteria

Prerequisites to start implementing this epic:

Observability monitoring epic complete (metrics, logs)

SLO-observability epic complete (SLOs, error budgets)

Incident response process established

Exit Criteria

Criteria defined at the Acceleration milestone level:

Deployment frequency: >= daily (non-critical prod)

Lead time: <= 24h (commit to prod, non-critical)

Change failure rate: <= 10%

MTTR: <= 1h

SLO coverage: >= 95% of services with SLOs

Policy coverage: >= 70% of changes pass automated checks

Progressive delivery: >= 80% of rollouts

Error budget policy: enforced on all SLOs

SLSA level: >= 2

DR drills: quarterly (RTO/RPO met)

PR cycle time: p50 <= 8h

Artifact verification: signatures verified at deploy

DORA Metrics Impact

MTTR: 4 hours to 1 hour (75% reduction)

CFR: 20% to 10% (50% reduction)
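The 75% and 50% figures are relative reductions from the starting value to the target value. A minimal check of that arithmetic, with an illustrative helper name:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


mttr_reduction = percent_reduction(before=4, after=1)    # hours
cfr_reduction = percent_reduction(before=20, after=10)   # percent of changes

print(f"MTTR reduction: {mttr_reduction:.0f}%")
print(f"CFR reduction: {cfr_reduction:.0f}%")
```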

Resources

Implementation Kit

Step-by-step guide, templates, and tools for this epic

View Resilient Operations & Chaos Engineering Implementation Kit

Templates

Ready-to-use templates for implementing capabilities

Browse All Templates

Learn More

Tutorials & Learning Paths
Case Studies & Examples

Common Pitfalls

Chaos experiments cause real outages and the team loses trust in the practice
Mitigation: Start in non-production environments. Run GameDays in planned windows. Keep a rollback ready. Announce experiments in advance.

Circuit breakers trip too frequently and the service is always degraded
Mitigation: Tune thresholds based on actual traffic patterns. Implement exponential backoff. Monitor the false positive rate.
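The mitigation does not say where exponential backoff should be applied. One common reading is to space out retries (or breaker re-probes) so that a recovering dependency is not hit again immediately. The sketch below shows capped exponential backoff with full jitter; the base delay, cap, and function names are assumptions.

```python
import random
from typing import List, Optional


def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based).

    Doubles with each attempt and is capped so waits do not grow unbounded.
    """
    return min(cap, base * 2 ** attempt)


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Jittered delays: each is drawn uniformly from [0, ceiling].

    Randomizing the wait spreads retries from many clients over time
    instead of letting them arrive in synchronized bursts.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(0, backoff_ceiling(attempt, base, cap))
        for attempt in range(attempts)
    ]


for attempt, delay in enumerate(backoff_delays(attempts=6)):
    ceiling = backoff_ceiling(attempt)
    print(f"attempt {attempt}: wait {delay:.2f}s (ceiling {ceiling:.1f}s)")
```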

Incident runbooks are outdated and their steps no longer work
Mitigation: Test runbooks during chaos experiments. Update runbooks after each incident. Version runbooks alongside infrastructure.

Next Steps

After Completing This Epic

Once you've met all exit criteria, consider these next steps:

Review metrics to validate DORA improvements
Document lessons learned and update team playbooks
Share success stories with other teams

Continue To

The natural next epic in the roadmap sequence:

SLO-Driven Observability & Error Budgets

Alternative Paths

Other epics that can be tackled in parallel:

Continuous Planning & Compliance Integration
Secure Code & Advanced Review
Secure & Performant Build Pipelines
Advanced Testing & Performance Validation
