Reduce MTTR

Mean Time to Resolution (MTTR) measures how quickly systems are restored after an incident begins.
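
As a rough illustration of the metric, the sketch below averages the time from incident start to resolution over a small incident log. The incident timestamps and the function name `mean_time_to_resolution` are hypothetical and are not part of Cortex; in practice your incident tool supplies these values.

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Return the mean of (resolved - started) across incidents as a timedelta."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical incident log: (started, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20)),   # 20 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 0)),    # 60 minutes
    (datetime(2024, 1, 20, 2, 30), datetime(2024, 1, 20, 3, 10)),  # 40 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr.total_seconds() / 60)  # 40.0 (minutes)
```

Because the mean is sensitive to outliers, a single long incident can dominate the figure over a short lookback window.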

Reducing MTTR depends on:

  • Operational readiness: Clear ownership, actionable runbooks, and reliable on-call escalations ensure the right people are reachable at all times.

  • Repair effectiveness: Tracking and improving how fast incidents are resolved, and ensuring Cortex Workflows are configured to roll back, restart, scale up pods, and run any other processes you use during incidents.

A well-designed Scorecard shouldn't just track MTTR as a number; it should validate the conditions that enable low MTTR, such as up-to-date runbooks, ownership, and on-call coverage with escalation depth.

Best practices

When creating a Scorecard aimed at reducing MTTR, follow these best practices:

  • Group rules by functional area (e.g., Incident Response, Observability, Reliability) to make gaps obvious.

  • Keep evaluation windows aligned so that related signals trend together.

  • Enable Cortex notifications for drops in overall Scorecard scores, so teams are prompted to review and act.
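
To show why grouping by functional area makes gaps obvious, here is a minimal sketch that collects failing rules under their area. The rule results, the assignment of rules to areas, and the function name `failing_rules_by_area` are all assumptions made for illustration; Cortex evaluates and groups Scorecard rules for you.

```python
from collections import defaultdict

# Hypothetical rule results for one service: (functional area, rule name, passed).
results = [
    ("Incident Response", "On-call configuration", True),
    ("Incident Response", "Runbooks configured", False),
    ("Observability", "Monitoring coverage", True),
    ("Observability", "SLO tracking", True),
    ("Reliability", "Pipeline success rate", False),
]

def failing_rules_by_area(results):
    """Map each functional area to the names of its failing rules."""
    gaps = defaultdict(list)
    for area, rule, passed in results:
        if not passed:
            gaps[area].append(rule)
    return dict(gaps)

gaps = failing_rules_by_area(results)
for area, rules in gaps.items():
    print(f"{area}: {', '.join(rules)}")
```

Areas with no failing rules are left out of the result, so the output lists only where work is needed.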

Rules that focus on reducing MTTR

| Category | Purpose | Example CQL expression |
| --- | --- | --- |
| On-call configuration | Ensure service has an active PagerDuty schedule | oncall != null |
| Contact reliability | Verify responders can be reached through multiple channels | actMethods(allowed=["EMAIL","PHONE","PUSH_NOTIFICATION","SMS"]) == 0 |
| Ownership coverage | Require that every service has a defined owning team | ownership != null |
| Monitoring coverage | Ensure critical services have active alerting rules | datadog.monitors().length > 0 |
| SLO tracking | Ensure SLOs exist for latency | slos().filter((slo) => slo.name.matchesIn("latency") and slo.sliValue >= 0.9999).length > 0 |
| Alerting channel configured | Require that a Slack or Microsoft Teams channel is configured for each service | slack != null |
| Runbooks configured | Require that a runbook is linked so responders have clear steps to follow | links("runbook").length > 0 |
| MTTR benchmarks | Target benchmarks for MTTR: In lower levels of a Scorecard, you might target <90 minutes, and in higher levels you might target <30 minutes | oncall.analysis(lookback = duration("P30D")).meanSecondsToResolve < 1800 |
| CI/CD pipeline configured | Require a CI/CD pipeline to exist so deployments can be automated and repeatable. | git.fileExists(".gitlab-ci.yml") |
| Pipeline success rate | Ensure builds are passing successfully, showing stable automation and tests. You might target 85-95%. | git.percentBuildSuccess() >= 0.95 |
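
The MTTR benchmarks rule above compares against seconds, so each per-level target in minutes has to be converted before it goes into the expression. The sketch below does that arithmetic and fills in the same CQL expression for two levels. The level names "Bronze" and "Gold" and the helper `mttr_rule` are hypothetical; only the expression itself comes from the rule above.

```python
# CQL template copied from the MTTR benchmarks rule; only the threshold varies.
MTTR_RULE = 'oncall.analysis(lookback = duration("P30D")).meanSecondsToResolve < {seconds}'

def mttr_rule(target_minutes):
    """Return the benchmark rule text for a target given in minutes."""
    return MTTR_RULE.format(seconds=target_minutes * 60)

# Hypothetical Scorecard levels mapped to MTTR targets in minutes.
levels = {"Bronze": 90, "Gold": 30}
rules = {level: mttr_rule(minutes) for level, minutes in levels.items()}

for level, rule in rules.items():
    print(f"{level}: {rule}")
```

With these targets, the lower level compares against 5400 seconds and the higher level against 1800 seconds, which matches the example expression in the rule above.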

Launch an Incident Preparedness Scorecard

Cortex offers a pre-built Scorecard for Incident Preparedness. You can launch this Scorecard to improve your incident processes, enabling a reduced MTTR.

Learn more about the template, and how to handle a broader Incident Management & Response use case, in the Solutions docs.

Example: LetsGetChecked

Cortex customer LetsGetChecked used Cortex to automatically sync service and resource catalogs, enabling their team to quickly find accurate service information. They used Scorecards to drive operational excellence for onboarding, service maturity, and deployment frequency.

Impact

They reduced MTTR by 67% and doubled their deployment frequency.

Learn more

Learn more in the case study: How LetsGetChecked doubled deployment frequency and slashed MTTR by 67% with Cortex.

Example: H&R Block

H&R Block used Cortex to automate manual, repetitive tasks that were draining velocity and morale.

Impact

They reduced MTTR from up to 24 hours to less than one hour.

Learn more

Learn more in the Cortex blog: How H&R Block automated the toil out of its developer experience.
