Reduce MTTR
Mean Time to Resolution (MTTR) measures the average time from the start of an incident until systems are restored.
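To make the definition concrete, here is a minimal sketch of the calculation, assuming each incident is recorded as a pair of start and resolution timestamps. The function name and the data layout are hypothetical and are not part of Cortex; open incidents are excluded from the average.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - started) over resolved incidents.

    `incidents` is a list of (started, resolved) datetime pairs; a
    `resolved` value of None marks an incident that is still open.
    Returns None when no incident has been resolved yet.
    """
    durations = [
        resolved - started
        for started, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: two resolved incidents and one still open.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),   # 90 minutes
    (datetime(2024, 1, 3, 14, 0), None),                          # open, ignored
]

print(mean_time_to_resolution(incidents))  # 1:00:00
```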
Reducing MTTR depends on:
Operational readiness: Clear ownership, actionable runbooks, and reliable on-call escalations ensure the right people are reachable at all times.
Repair effectiveness: Track and improve how quickly incidents are resolved, and make sure Cortex Workflows are configured for the actions you rely on during incidents, such as rolling back a service, restarting it, or scaling up pods.
See an example Workflow: Rollback a service during an incident
A well-designed Scorecard shouldn't just track MTTR as a number; it should validate the conditions that enable low MTTR, such as up-to-date runbooks, ownership, and on-call coverage with escalation depth.
Best practices
When creating a Scorecard aimed at reducing MTTR, follow these best practices:
Group rules by functional area (e.g., Incident Response, Observability, Reliability) to make gaps obvious.
Keep evaluation windows aligned so that related signals trend together.
Enable Cortex notifications for drops in overall Scorecard scores so that teams are prompted to review and act.
Rules that focus on reducing MTTR
On-call configuration
Ensure the service has an active PagerDuty schedule
oncall != null
Contact reliability
Verify responders can be reached through multiple channels
actMethods(allowed=["EMAIL","PHONE","PUSH_NOTIFICATION","SMS"]) == 0
Ownership coverage
Require that every service has a defined owning team
ownership != null
Monitoring coverage
Ensure critical services have active alerting rules
datadog.monitors().length > 0
SLO tracking
Ensure at least one latency SLO exists and that its SLI value is at least 99.99%
slos().filter((slo) => slo.name.matchesIn("latency") and slo.sliValue >= 0.9999).length > 0
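The logic of the expression above can be sketched outside Cortex as follows. This is an illustration only: the `Slo` class and `has_healthy_latency_slo` function are hypothetical, and the sketch assumes that `matchesIn` behaves like a case-sensitive regular-expression search within the SLO name.

```python
import re
from dataclasses import dataclass


@dataclass
class Slo:
    # Hypothetical stand-in for an SLO record with a name and an SLI value.
    name: str
    sli_value: float


def has_healthy_latency_slo(slos, min_sli=0.9999):
    """True if any SLO whose name contains "latency" meets the SLI floor."""
    return any(
        re.search("latency", slo.name) is not None and slo.sli_value >= min_sli
        for slo in slos
    )


# Hypothetical sample data.
slos = [
    Slo(name="checkout availability", sli_value=0.9995),
    Slo(name="checkout p99 latency", sli_value=0.99995),
]

print(has_healthy_latency_slo(slos))  # True
```

Note that a service with a latency SLO that is below the 99.99% floor fails this check, just as a service with no latency SLO does.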
Alerting channel configured
Require that a Slack or Microsoft Teams channel is configured for each service. The example expression checks for Slack.
slack != null
Runbooks configured
Require that a runbook is linked so responders have clear steps to follow
links("runbook").length > 0
MTTR benchmarks
Set target benchmarks for MTTR: in lower levels of a Scorecard you might target under 90 minutes, and in higher levels under 30 minutes. The example expression uses the 30-minute target (1800 seconds) over a 30-day lookback.
oncall.analysis(lookback = duration("P30D")).meanSecondsToResolve < 1800
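The rule compares a value in seconds, so minute-based targets need converting: 30 minutes is 1800 seconds and 90 minutes is 5400 seconds. The sketch below shows that arithmetic and how tiered targets could be evaluated. The level names `lower` and `higher` and both function names are hypothetical; only the 90-minute and 30-minute figures come from the text.

```python
# Hypothetical level names mapped to the MTTR targets mentioned above (minutes).
LEVEL_TARGETS_MINUTES = {"lower": 90, "higher": 30}


def threshold_seconds(minutes):
    """Convert a target in minutes to the seconds value a rule compares against."""
    return minutes * 60


def levels_met(mean_seconds_to_resolve):
    """Return the levels whose MTTR target is satisfied (strictly below)."""
    return [
        level
        for level, minutes in LEVEL_TARGETS_MINUTES.items()
        if mean_seconds_to_resolve < threshold_seconds(minutes)
    ]


print(threshold_seconds(30))  # 1800
print(threshold_seconds(90))  # 5400
print(levels_met(45 * 60))    # ['lower']
```

A service with a 45-minute MTTR would pass the lower-level rule but not the higher-level one, which is the gradient that multiple Scorecard levels are meant to express.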
CI/CD pipeline configured
Require a CI/CD pipeline to exist so deployments can be automated and repeatable. The example expression checks for a GitLab CI configuration file.
git.fileExists(".gitlab-ci.yml")
Pipeline success rate
Ensure builds pass consistently, which indicates stable automation and tests. You might target a success rate of 85-95%; the example expression uses 95%.
git.percentBuildSuccess() >= 0.95
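As an illustration of what a success-rate threshold means in practice, the sketch below computes the fraction of passing builds from a list of build outcomes. The function name and the `"success"` / `"failed"` status strings are assumptions made for the example, not values taken from Cortex or any CI provider.

```python
def build_success_rate(results):
    """Fraction of builds that succeeded, or None when there are no builds."""
    if not results:
        return None
    passed = sum(1 for status in results if status == "success")
    return passed / len(results)


# Hypothetical history: 18 passing builds and 2 failures.
recent_builds = ["success"] * 18 + ["failed"] * 2
rate = build_success_rate(recent_builds)

print(rate)          # 0.9
print(rate >= 0.85)  # True
print(rate >= 0.95)  # False
```

With two failures in twenty builds, the service clears an 85% target but not a 95% one, so the threshold you choose determines how much flakiness the rule tolerates.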
Launch an Incident Preparedness Scorecard
Cortex offers a pre-built Scorecard for Incident Preparedness. You can launch this Scorecard to improve your incident processes, enabling a reduced MTTR.
Learn more about the template, and how to handle a broader Incident Management & Response use case, in the Solutions docs.
Example: LetsGetChecked
Cortex customer LetsGetChecked used Cortex to automatically sync service and resource catalogs, enabling their team to quickly find accurate service information. They used Scorecards to drive operational excellence for onboarding, service maturity, and deployment frequency.
Impact
They reduced MTTR by 67% and doubled their deployment frequency.
Learn more
Learn more in the case study: How LetsGetChecked doubled deployment frequency and slashed MTTR by 67% with Cortex.
Example: H&R Block
H&R Block used Cortex to automate manual, repetitive tasks that were draining velocity and morale.
Impact
They reduced MTTR from up to 24 hours to less than one hour.
Learn more
Learn more in the Cortex blog: How H&R Block automated the toil out of its developer experience.