Incident Management Scorecards: Reduce Mean Time to Acknowledge (MTTA)

Mean Time to Acknowledge (MTTA) measures how quickly teams acknowledge an incident after it is triggered.
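
Because it is a mean, MTTA for a given evaluation window is the total acknowledgment delay divided by the number of incidents:

MTTA = (sum of time from trigger to acknowledgment) / (number of incidents)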

Reducing MTTA depends on:

  • Operational readiness: Ensuring the right people are reachable at all times.

  • Response behavior: Tracking and improving how fast incidents are acknowledged once triggered.

A well-designed Scorecard shouldn't just track MTTA as a number; it should validate the conditions that enable low MTTA, such as on-call setup, contact methods, and escalation policy depth.
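
These enabling conditions can also be checked together. The following is a minimal sketch that combines expressions from the rules table below into one composite rule; adjust the allowed contact methods and the required escalation depth to match your own policy:

oncall != null
and oncall.usersWithoutContactMethods(allowed=["EMAIL","PHONE","PUSH_NOTIFICATION","SMS"]) == 0
and oncall.numOfEscalations() >= 2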

Best practices

When creating a Scorecard aimed at reducing MTTA, follow these best practices:

  • Group rules by functional area (e.g., Incident Response, Monitoring, Reliability) to simplify assessment.

  • Keep evaluation windows aligned so that related signals trend together.

  • Enable Cortex notifications so that teams are alerted when overall Scorecard scores drop, prompting them to review and act.

Rules that focus on reducing MTTA

Category | Purpose | Example CQL expression
--- | --- | ---
On-call configuration | Ensure the service has an active PagerDuty schedule | oncall != null
Contact reliability | Verify responders can be reached through multiple channels | oncall.usersWithoutContactMethods(allowed=["EMAIL","PHONE","PUSH_NOTIFICATION","SMS"]) == 0
Escalation depth | Require at least two escalation tiers | oncall.numOfEscalations() >= 2
Acknowledgment time | Track MTTA against defined thresholds (see the sketch after this table) | jq(oncall.analysis(...), '.meanSecondsToFirstAck <= 300')
Monitoring coverage | Ensure critical services have active alerting rules | datadog.monitors().length > 0
SLO tracking | Validate that a service has defined SLOs and error budgets | slos().any((slo) => slo.name.matchesIn(".*Uptime.*"))
Ownership coverage | Require that every service has a defined owning team | ownership != null
Communication channel is set | Require that every service has a defined communication channel | slack != null and slack.numOfMembers() > 0
Entity was verified in the last 90 days | Require entity information, including ownership, on-call, and Slack, to be verified in the last 90 days | verifications().lastVerifiedAt() != null and verifications().lastVerifiedAt().fromNow() > duration("P-90D")
Entity does not have pending verifications | Ensure the entity does not have any pending verifications | verifications().verifications().any(verification => verification.status == "PENDING") == false
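
If your Scorecard uses levels, the acknowledgment-time rule can be repeated with progressively stricter thresholds. A minimal sketch, assuming a 5-minute target at a lower level and a 2-minute target at a higher one (the oncall.analysis(...) arguments are elided, as in the table above):

jq(oncall.analysis(...), '.meanSecondsToFirstAck <= 300')

jq(oncall.analysis(...), '.meanSecondsToFirstAck <= 120')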

Examples from real Cortex users

The following anonymized examples come from real use cases that our customers are solving with Cortex.

Event Readiness Scorecard

Companies with a busy season (e.g., companies that see a surge around Black Friday) might create a seasonal readiness Scorecard in Cortex. The following strategy ties performance metrics directly to readiness controls:

  • Track MTTA < 120 seconds for P1 and P2 incidents (see the sketch after this list)

  • Require two-tier escalation policies in PagerDuty: oncall.numOfEscalations() >= 2

  • Validate that on-call users have valid contact methods configured: oncall.usersWithoutContactMethods(...) == 0

  • Combine outcome metrics (MTTA) with configuration checks to ensure teams can meet targets consistently.
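
The MTTA target in the first bullet can be expressed with the same jq pattern used in the rules table. This is a minimal sketch; scoping the rule to P1 and P2 incidents depends on which fields your on-call integration exposes through oncall.analysis(...), so no priority filter is shown here:

jq(oncall.analysis(...), '.meanSecondsToFirstAck <= 120')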

On-call Configuration Scorecard

The following strategy ensures every service can be reached before an incident occurs, eliminating the common MTTA outliers caused by misconfigured alerts:

  • Verify that on-call rotations exist: oncall != null

  • Validate that on-call users have valid contact methods configured: oncall.usersWithoutContactMethods(...) == 0

  • Flag services without assigned responders

This CQL applies to all on-call integrations: Opsgenie, PagerDuty, Splunk On-Call (formerly VictorOps), and xMatters.

Operational Maturity Scorecard

The previous examples apply to services and other entities. Some organizations prefer to track team-level operational maturity, including incident management as one area of assessment.

The following strategy treats low MTTA as part of broader operational maturity rather than as an isolated performance goal:

  • Measure operational behaviors like post-incident reviews, ownership clarity, and alert hygiene (see the sketch after this list).

  • Focus on consistent response patterns rather than single-point metrics.
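
As a starting point, such a team-level Scorecard can reuse expressions already shown above. This is a minimal sketch; whether each rule applies depends on which entity types the Scorecard evaluates:

  • Ownership clarity: ownership != null

  • Alert hygiene: datadog.monitors().length > 0

  • Verification freshness: verifications().lastVerifiedAt() != null and verifications().lastVerifiedAt().fromNow() > duration("P-90D")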
