Prepare for and prevent incidents

To configure your Cortex workspace to handle Incident Management & Response, we recommend the following actions:

Use Cortex features to prepare for incidents

The steps below describe how to configure Cortex features so that you stay prepared for incidents.

Step 1: Ingest data and solve ownership 🔌

Before getting started on any use case, it is crucial to import your services, resources, infrastructure, and other entities, and to have clear visibility into the ownership of your entities.

Connecting your entities to Cortex establishes a single source of truth across your engineering organization. It lets you track progress via Scorecards, automate Workflows, and gain insights from Eng Intelligence.

Setting ownership of entities ensures that every service and system is clearly linked to accountable teams or individuals, enabling faster incident response, reducing handoff friction, and making it possible to enforce standards consistently.
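As a rough illustration of why ownership gaps matter, the sketch below scans a small in-memory catalog for entities that have no owner. The `Entity` class, its fields, and the sample tags are hypothetical and are not Cortex's entity schema or API; the point is only that an unowned entity is easy to detect once the data lives in one place.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical catalog record; not Cortex's entity schema."""
    tag: str
    kind: str  # e.g. "service" or "resource"
    owners: list[str] = field(default_factory=list)


def find_unowned(entities: list[Entity]) -> list[str]:
    """Return the tags of entities with no owning team or individual."""
    return [e.tag for e in entities if not e.owners]


# Sample data, invented for this example.
catalog = [
    Entity("backend-server", "service", ["team-platform"]),
    Entity("parser-service", "service"),
    Entity("orders-db", "resource", ["team-data"]),
]

print(find_unowned(catalog))
```

In this sample, `parser-service` is the only entity reported, so it is the one that would have no clear responder during an incident.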

The more data you have available, the more actionable and insightful your Scorecards can be.

Relevant integrations

To focus on Incident Management & Response, Cortex recommends integrating with tools that help automate alerting, manage on-call schedules, trigger and track incidents, and facilitate post-incident analysis. Make sure you have configured an integration for each of these categories.

Cortex also recommends linking to runbooks and documentation for your entities, ensuring your users have access to critical information.

With your data in Cortex, you have a jumping-off point to start driving a successful Incident Management process.

Step 2: Configure a Scorecard for Incident Preparedness 📋

Scorecards automate the process of checking whether services meet criteria such as ownership, on-call coverage, runbooks, monitoring, and security requirements.

Cortex's incident templates include predefined rules that you can customize based on your organization's requirements, infrastructure, and goals. The templates are structured into three levels (Bronze, Silver, and Gold), with each level representing a higher degree of incident readiness than the one before it.
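One way to picture leveled rules is as a ladder: an entity holds a level only if it passes every rule at that level and at all levels below it. The sketch below models that idea under this assumption. The rule contents, the `Metadata` shape, and the `highest_level` helper are hypothetical and do not reproduce the rules in Cortex's templates.

```python
from typing import Callable, Optional

Metadata = dict  # hypothetical shape: plain key/value data about one entity
Rule = Callable[[Metadata], bool]

# Ordered from lowest to highest level. Rule contents are illustrative only.
LEVELS: list[tuple[str, list[Rule]]] = [
    ("Bronze", [lambda e: bool(e.get("owners"))]),
    ("Silver", [lambda e: bool(e.get("oncall")),
                lambda e: bool(e.get("runbook_url"))]),
    ("Gold", [lambda e: len(e.get("monitors", [])) > 0]),
]


def highest_level(entity: Metadata) -> Optional[str]:
    """Walk the ladder and stop at the first level with a failing rule."""
    achieved = None
    for name, rules in LEVELS:
        if not all(rule(entity) for rule in rules):
            break
        achieved = name
    return achieved


# Passes Bronze and the Gold rule, but has no runbook, so Silver fails.
service = {
    "owners": ["team-platform"],
    "oncall": "platform-primary",
    "monitors": ["latency"],
}
print(highest_level(service))
```

The sample service stays at Bronze even though it has monitors, because under this ladder model a missing Silver requirement blocks every level above it.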

Step 2.1: Create the Scorecard and configure the basics

  1. On the Scorecards page in your workspace, click Create Scorecard.

  2. There are two incident-related Scorecard templates available: Incident Preparedness and Incident Response Performance. On the template you want to use, click Use.

  3. Configure basic settings, including the Scorecard's name, unique identifier, description, and more.

Step 2.2: Review and modify the rules

The Scorecard template contains rules that prepare your organization for incidents and enforce industry best practices, such as:

  • Enforce ownership, linked docs, and linked Slack channels to enable quick action during incidents.

  • Enforce having monitors documented to accurately identify issues and reduce mean time to resolution (MTTR).

  • Enforce incoming and outgoing dependencies being documented, to allow responders to assess the full impact and prioritize remediation efforts effectively.
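To illustrate why documented dependencies help responders assess impact, the sketch below walks a small dependency map to find every service that relies, directly or transitively, on a failing one. The service names and the `OUTGOING` map are invented for the example, and `impacted_by` is a hypothetical helper rather than anything Cortex provides.

```python
from collections import deque

# Hypothetical outgoing dependencies: service -> services it calls.
OUTGOING = {
    "web-app": ["backend-server"],
    "backend-server": ["parser-service", "orders-db"],
    "reporting": ["orders-db"],
    "parser-service": [],
    "orders-db": [],
}


def impacted_by(failing: str, outgoing: dict[str, list[str]]) -> set[str]:
    """Return every service that depends, directly or transitively, on `failing`."""
    # Invert the map to get incoming dependencies (the callers of each service).
    incoming: dict[str, list[str]] = {}
    for caller, callees in outgoing.items():
        for callee in callees:
            incoming.setdefault(callee, []).append(caller)

    # Breadth-first search upstream from the failing service.
    impacted: set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in incoming.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_by("orders-db", OUTGOING)))
```

With this sample map, an outage in `orders-db` reaches `backend-server` and `reporting` directly and `web-app` through `backend-server`. If a dependency is missing from the map, the affected caller silently drops out of the result, which is the gap the rule is meant to close.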

While Cortex's template is based on common industry best practices, you may need to adjust the rules based on which tools you use and how your organization prioritizes requirements and metrics. You can reorder, delete, and edit rules, and you can add more rules to a level.

When adding or changing the template rules, you can select from a list of available pre-built rules. Behind each rule is a Cortex Query Language (CQL) query; you can also write your own queries to further refine your rules.

Step 3: Enable On-Call Assistant 🔔

Cortex's On-Call Assistant simplifies the incident response process and reduces MTTR. It leverages the PagerDuty integration to automatically surface the most vital information about an entity when an incident has been triggered. During an incident, it notifies the responsible users via Slack, providing incident details, deploy and monitoring information, the entity's owner and related Slack channel, and links to more information.
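The value of such a notification comes from joining incident data with catalog data in one message. The sketch below shows that join in its simplest form. It is not Cortex's implementation or message format: `build_incident_message`, the dictionary keys, and the sample values are all hypothetical.

```python
def build_incident_message(incident: dict, entity: dict) -> str:
    """Format a plain-text summary from hypothetical incident and catalog records."""
    lines = [
        f"Incident: {incident['title']} ({incident['urgency']})",
        f"Entity: {entity['tag']}",
        f"Owner: {', '.join(entity.get('owners', [])) or 'unowned'}",
        f"Slack channel: {entity.get('slack_channel', 'not set')}",
        f"Last deploy: {entity.get('last_deploy', 'unknown')}",
    ]
    # Append one line per linked document, such as a runbook.
    lines += [f"More info: {url}" for url in entity.get("links", [])]
    return "\n".join(lines)


# Sample records, invented for this example.
incident = {"title": "Elevated 5xx rate", "urgency": "high"}
entity = {
    "tag": "backend-server",
    "owners": ["team-platform"],
    "slack_channel": "#backend-alerts",
    "links": ["https://example.invalid/runbooks/backend-server"],
}

print(build_incident_message(incident, entity))
```

Missing fields fall back to placeholders such as `unowned` and `not set`, which makes gaps in the catalog visible at exactly the moment they cost the most.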

Step 5: Configure Cortex MCP 🤖

Cortex MCP can significantly help during an incident by providing instant, conversational access to critical service and team information directly from your MCP client. It supports incident response by providing:

  • Real-time, structured answers: Ask questions like "Who is on call for backend-server?" or "Give me all the details for parser-service." MCP fetches the data in real time from Cortex's API, ensuring accurate and up-to-date information about service health, ownership, and operational readiness.

  • Actionable recommendations: MCP can suggest next steps or remediation ideas based on Scorecard and Initiative data, helping you identify and address gaps in incident response.

  • Reduced context switching: It meets engineers where they work, such as in an IDE or MCP chat client, eliminating the need to switch between tools during a high-pressure incident.

Step 6: Review and act on Eng Intelligence 📈

Use Eng Intelligence features (the DORA dashboard, Velocity Dashboard, and Metrics Explorer) to understand how well teams are performing during and after incidents.


Review trends in areas such as incident frequency and time to resolution.
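For reference, the two trends named above reduce to simple arithmetic over incident records. The sketch below computes mean time to resolution and a per-week incident count from hypothetical `(opened, resolved)` timestamps. It shows how the metrics are defined, not how Eng Intelligence calculates or sources them.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical incident records: (opened, resolved) timestamps.
INCIDENTS = [
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 9, 45)),
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 16, 0)),
    (datetime(2024, 5, 15, 22, 30), datetime(2024, 5, 15, 23, 0)),
]


def mean_time_to_resolution(incidents) -> timedelta:
    """Average the time between opening and resolving each incident."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def incidents_per_week(incidents) -> dict:
    """Count incidents by (ISO year, ISO week) of their open time."""
    counts = Counter(tuple(opened.isocalendar())[:2] for opened, _ in incidents)
    return dict(counts)


print(mean_time_to_resolution(INCIDENTS))  # 1:05:00
print(incidents_per_week(INCIDENTS))
```

The sample incidents last 45, 120, and 30 minutes, giving a mean of 65 minutes. Two of them fall in ISO week 19 of 2024 and one in week 20.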

Incident Preparedness in action

After you have configured your workspace for incident preparedness, your teams have the ownership data, Scorecards, and tooling they need to respond when incidents arise.

Learn more about preventing and handling active incidents in Incident Response in action.
