Incident Response in action

You've prepared your Cortex workspace for incidents — what can you do now to stay ahead of incidents? And what can you do when an incident does occur?

While Scorecards help you prevent incidents by ensuring standards are met, incidents can still occur in other ways. Read on to learn how you can promote the health of your services, trigger and handle incidents, and work through a Root Cause Analysis with Cortex.

Cortex gives you the context to take action and quickly resolve incidents.

Incident prevention in action

Review tasks on your engineering homepage

Engineers use the Engineering homepage as their personalized daily starting point. It provides a centralized view of key signals and tasks relevant to incident management, including:

  • Active work: See your open PRs, PRs assigned for review, active Jira tickets, and action items from Scorecards and Initiatives, ensuring you are aware of your responsibilities and can quickly address incident follow-ups and reliability improvements.

  • On-call visibility: See current and upcoming on-call shifts, ensuring engineers always know who is responsible for incident response at any given time, reducing confusion and response delays. This information is pulled live from integrations like PagerDuty and Opsgenie.

  • Scorecards and Initiatives: See Scorecard compliance and real-time progress toward incident-related standards. This helps teams proactively address gaps before incidents occur.

  • Centralized access to runbooks and service health: The homepage provides quick links to your owned services, where you can quickly see their health status and associated runbooks. This enables rapid access to critical information needed during an incident, streamlining triage and resolution.

Review and act on failing Scorecard rules

Review failing rules

When a Scorecard rule is failing, this information is surfaced to you in multiple ways:

Remediate failing rules

When a service falls short of standards, there are different ways you can approach remediation:

Gain visibility with Engineering Intelligence

View metrics in Eng Intelligence to understand how well teams are performing during and after incidents.

Review metrics and drive action
  • Use Metrics Explorer to view incident frequency and time to resolution metrics pulled from PagerDuty.

  • Review the DORA Dashboard for clear, real-time visibility into key metrics that reflect your team's ability to respond to and recover from incidents, including:

    • Time to Resolution, showing how quickly your team resolves production failures — a direct measure of incident response effectiveness.

    • Change failure rate, indicating the percentage of deployments that result in incidents or require remediation, helping you identify trends in deployment stability and areas for improvement.

    • Deployment frequency and cycle time, helping you understand how quickly fixes and improvements can be shipped in response to incidents.

  • Drive action: When you identify an issue that affects your incident preparedness, create a targeted Initiative to fix it.

    • For example, if you notice a spike in incidents or a high MTTR, you can turn the remediation gaps into action items and track progress with an Initiative.

    • Initiatives send notifications to users asking them to complete tasks by the deadline you configured. Learn more in Initiatives and Action items.
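The DORA metrics above follow the standard definitions. As a minimal sketch of how they are computed, the functions below derive each metric from illustrative incident and deploy records (the field names here are assumptions for the example, not a real Cortex or PagerDuty schema):

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_time_to_resolution(incidents: list[dict]) -> float:
    """Average hours from detection to resolution."""
    return mean((i["resolved_at"] - i["detected_at"]).total_seconds() / 3600
                for i in incidents)

def change_failure_rate(deploys: list[dict]) -> float:
    """Share of deployments that caused an incident or needed remediation."""
    failed = sum(1 for d in deploys if d["caused_incident"])
    return failed / len(deploys)

def deployment_frequency(deploys: list[dict], days: int) -> float:
    """Deployments per day over the observation window."""
    return len(deploys) / days

# Illustrative data: two incidents, ten deploys over one week
t0 = datetime(2024, 5, 1, 9, 0)
incidents = [
    {"detected_at": t0, "resolved_at": t0 + timedelta(hours=2)},
    {"detected_at": t0, "resolved_at": t0 + timedelta(hours=4)},
]
deploys = [{"caused_incident": False}] * 9 + [{"caused_incident": True}]

print(mean_time_to_resolution(incidents))  # 3.0
print(change_failure_rate(deploys))        # 0.1
```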

Eng Intelligence Scorecards

You can also create a Scorecard to set benchmarks and measure team progress for Eng Intelligence metrics, starting with Cortex's built-in Scorecard template called Eng Intelligence:


Incident Response in action

Cortex provides full context of your services, allowing you to take action, quickly mitigate incidents, and work through Root Cause Analysis.

Trigger an incident

After integrating with an incident management tool, you can trigger an incident directly from Cortex while viewing an entity's details page:

This is supported for PagerDuty, incident.io, FireHydrant, Rootly, and xMatters.
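Behind the scenes, triggering an incident means creating one in the integrated tool. As a hedged illustration using PagerDuty's public REST API, the helper below builds the request body for `POST /incidents` (the service ID and title are placeholders; the actual send is omitted):

```python
def build_pagerduty_incident(title: str, service_id: str,
                             urgency: str = "high") -> dict:
    """Request body for PagerDuty's POST /incidents endpoint.

    The shape follows PagerDuty's public REST API; service_id is a
    placeholder, not a real service reference.
    """
    return {
        "incident": {
            "type": "incident",
            "title": title,
            "service": {"id": service_id, "type": "service_reference"},
            "urgency": urgency,
        }
    }

payload = build_pagerduty_incident(
    "notifications-service: elevated error rate", "PXXXXXX")
# The payload would be POSTed to https://api.pagerduty.com/incidents
# with an authorization header; network calls are omitted in this sketch.
```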

View entities with active incidents

While viewing a catalog, quickly see which entities have active incidents:

Example Incident Response approaches with Cortex

The following examples demonstrate how Cortex can help you navigate an efficient incident response.

Incident Response with Cortex MCP

1: You detect a critical issue.

A monitoring system detects a critical issue affecting a service. You get an alert from the monitoring service.

  • Your team has just started implementing Cortex at your organization. You have owners listed on entities, runbooks linked to entities, Slack channels connected to entities, and Cortex MCP configured.

  • You haven't yet configured your on-call or incident tools. While that's not an ideal situation, it's fortunately still possible to start an investigation with Cortex.

2: You ask Cortex MCP for more information.

You open your MCP client that is already configured with Cortex MCP, and type in a prompt to gather information: There is an active incident on notifications-service, give information to help investigate.

3: You investigate using context provided by Cortex MCP.

The MCP provides context to help you investigate:

  • Service details

  • Key investigation points

    • When it detects no dependencies on the affected entity, it accounts for the possibility that you haven't mapped your dependencies yet.

    • It notes that on-call is not configured yet, and recommends a team that you could contact.

  • A list of immediate action items
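The context-building behavior described above can be sketched as a small function that assembles investigation notes from entity metadata, flagging the same gaps (unmapped dependencies, unconfigured on-call). All field names here are illustrative assumptions, not a real Cortex or MCP schema:

```python
def build_incident_context(entity: dict) -> dict:
    """Assemble investigation context for an entity, flagging gaps
    the way the walkthrough describes. Illustrative only."""
    key_points, actions = [], []
    if not entity.get("dependencies"):
        key_points.append("No dependencies found — they may not be mapped yet.")
    if not entity.get("oncall"):
        key_points.append("On-call is not configured; contact the owning team.")
        actions.append(f"Reach owners via {entity['slack_channel']}")
    actions.append(f"Follow runbook: {entity['runbook']}")
    return {
        "service": entity["name"],
        "owners": entity["owners"],
        "key_points": key_points,
        "immediate_actions": actions,
    }

# Illustrative entity record matching the scenario in this walkthrough
ctx = build_incident_context({
    "name": "notifications-service",
    "owners": ["team-notifications"],
    "slack_channel": "#team-notifications",
    "runbook": "https://example.com/runbooks/notifications",
    "dependencies": [],
    "oncall": None,
})
```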

4: You mitigate the incident.

Using the context provided, you contact the team who owns the entity. Together with the owning team, you review the MCP's suggested action items and you're able to quickly narrow down the cause of the incident.

5: You analyze the incident to make future improvements.

While working through the Root Cause Analysis, in addition to narrowing down the cause of the incident, you also determine a few ways you can make organizational improvements:

Incident Response with Cortex On-Call Assistant

1: You detect a critical issue.

A monitoring system detects a critical issue affecting a service. Your Security Engineering team lead gets an alert from the monitoring service.

  • Fortunately, the organization prepared for this moment by using an Incident Preparedness Scorecard.

  • They already had on-call and incident handling tools integrated with Cortex (PagerDuty and On-Call Assistant enabled), owners listed on entities, runbooks and docs linked to entities, and Slack channels linked to entities.

2: You trigger an incident from Cortex.

While viewing the service in Cortex, the security engineer triggers an incident directly from the entity page in Cortex.

  1. An incident is automatically created in PagerDuty, which notifies the on-call engineer.

  2. The active incident appears on the affected entity details pages:

    Active incidents appear at the top of an entity page.
  3. On-Call Assistant automatically sends a Slack notification to the on-call engineers.

    • The notification includes the most critical information needed while handling an incident: The affected entity, the on-call rotation, the last deploy and commit, monitoring and metric information, the owner and Slack channel for the entity, and links to view the entity and its dependencies in Cortex:
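As a hedged sketch of what such a notification might contain, the helper below assembles a Slack Block Kit payload with the fields listed above (the values and URLs are placeholders; with `slack_sdk`, this list would be passed as `blocks=` to `chat_postMessage`):

```python
def oncall_notification_blocks(incident: dict) -> list[dict]:
    """Slack Block Kit payload mirroring the fields the walkthrough
    lists: affected entity, on-call, last deploy, owner, and a link
    to the entity. Values are illustrative placeholders."""
    fields = [
        f"*Entity:* {incident['entity']}",
        f"*On-call:* {incident['oncall']}",
        f"*Last deploy:* {incident['last_deploy']}",
        f"*Owner:* {incident['owner']}",
    ]
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": "Active incident"}},
        {"type": "section",
         "fields": [{"type": "mrkdwn", "text": f} for f in fields]},
        {"type": "actions", "elements": [{
            "type": "button",
            "text": {"type": "plain_text", "text": "View entity"},
            "url": incident["entity_url"],
        }]},
    ]

blocks = oncall_notification_blocks({
    "entity": "notifications-service",
    "oncall": "Ada",
    "last_deploy": "v2024.05.01 (abc1234)",
    "owner": "team-notifications",
    "entity_url": "https://app.example.com/catalog/notifications-service",
})
```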

3: You investigate using context provided by Cortex.

With the context provided by On-Call Assistant, the primary on-call engineer starts their investigation:

  1. They navigate to the Slack channel provided in the On-Call Assistant message to begin communicating with the right team:

  2. They open the link provided by On-Call Assistant to view the entity in Cortex, immediately gaining visibility into dependencies, recent deploys and commits, and more.

  3. They review the event timeline to understand what changed before and during the incident, allowing them to narrow down potential causes.

  4. They review the relationship graph to better understand the upstream and downstream services that could be affected or could have caused the incident.

  5. From the entity page, they can access the entity's linked runbooks and logs. Their runbooks contain common failure modes, diagnostic commands, and remediation steps.

    Relevant links and docs appear on the entity page.

4: You mitigate the incident, using a Cortex Workflow.

The on-call engineer quickly determines that the incident was caused by a recent deploy, and decides to roll it back:

  • To prepare for incidents, the organization has already configured a Rollback Workflow to use in case of incidents. The on-call engineer navigates to the Workflow in Cortex and runs it.

  • After the Workflow run completes, the initial incident is mitigated.
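The core decision in a rollback like this is choosing which version to return to. As a simplified stand-in for what a rollback Workflow step might compute (the deploy records and statuses are illustrative, not a Cortex API):

```python
def rollback_target(deploys: list[dict]) -> str:
    """Given deploy history newest-first, return the version to roll
    back to: the most recent deploy before the suspect one."""
    if len(deploys) < 2:
        raise ValueError("No earlier deploy to roll back to")
    return deploys[1]["version"]

# Illustrative history: the newest deploy is the suspected cause
history = [
    {"version": "v1.4.2", "status": "suspect"},
    {"version": "v1.4.1", "status": "healthy"},
    {"version": "v1.4.0", "status": "healthy"},
]
print(rollback_target(history))  # v1.4.1
```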

Working through Root Cause Analysis (RCA) with Cortex

The examples above demonstrate how engineering teams can use Cortex to quickly investigate and mitigate an incident. Teams can also leverage Cortex's unified view of service metadata while working through the Root Cause Analysis (RCA) after an incident:

  • Reconstruct what happened without digging across fragmented tools:

    • Review the event timeline on the entity details page to understand the deploys that occurred before and during the incident and any other changes that may have occurred.

    • Review dependencies for the affected entity, giving insight into impact and potential causes of the incident.

  • Review the service's reliability posture:

    • Is there a Scorecard in place to measure and enforce reliability? If not, you can implement one across all of your services, helping prevent similar incidents in the future.
