Prepare for and prevent incidents

To configure your Cortex workspace to handle Incident Management & Response, we recommend the following actions:

Connect Data: Ingest data, ensure ownership is assigned to your entities, and configure integrations for incident management, on-call, and other tools your organization uses.
Standardize: Configure a Scorecard to enforce Incident Management best practices and measure relevant metrics.
Streamline: Enable On-Call Assistant in Cortex for automated notifications during incidents, which reduces time to resolution. Configure Workflows to streamline incident-related steps, and configure Cortex MCP for quick answers during incidents.
Improve: Review Eng Intelligence metrics and take action to improve incident preparedness.

Use Cortex features to prepare for incidents

The steps below explain how to configure Cortex features to stay prepared in case of incidents.

Step 1: Ingest data and solve ownership 🔌

Action Items:

Before getting started on any use case, it is crucial to import your services, resources, infrastructure, and other entities, and to have clear visibility into the ownership of your entities.

Connecting your entities to Cortex establishes a single source of truth across your engineering organization. It lets you track progress via Scorecards, automate Workflows, and gain insights from Eng Intelligence.

Setting ownership of entities ensures that every service and system is clearly linked to accountable teams or individuals, enabling faster incident response, reducing handoff friction, and making it possible to enforce standards consistently.
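
As a rough illustration of why ownership visibility matters, the sketch below scans a local inventory for entities that have no owner. The Entity record, its fields, and the sample tags are hypothetical stand-ins for data you might export from your catalog; they are not Cortex API types.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical local record of a catalog entity (not a Cortex API type)."""
    tag: str
    kind: str  # for example "service" or "resource"
    owners: list[str] = field(default_factory=list)


def unowned(entities: list[Entity]) -> list[str]:
    """Return the tags of entities with no owning team or individual."""
    return [e.tag for e in entities if not e.owners]


def ownership_coverage(entities: list[Entity]) -> float:
    """Fraction of entities with at least one owner (1.0 for an empty list)."""
    if not entities:
        return 1.0
    return 1 - len(unowned(entities)) / len(entities)


# Sample data, invented for this example.
inventory = [
    Entity("backend-server", "service", ["team-platform"]),
    Entity("parser-service", "service"),
    Entity("orders-db", "resource", ["team-data"]),
    Entity("legacy-cron", "service"),
]

print(unowned(inventory))             # ['parser-service', 'legacy-cron']
print(ownership_coverage(inventory))  # 0.5
```

A list like this gives you a concrete backlog of entities to assign before an incident forces the question.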

The more data you have available, the more actionable and insightful your Scorecards can be.

Relevant integrations

To focus on Incident Management & Response, Cortex recommends integrating with tools that help automate alerting, manage on-call schedules, trigger and track incidents, and facilitate post-incident analysis. Make sure you have configured integrations for the following categories:

Incident management: FireHydrant, Incident.io, PagerDuty, Rootly
- Trigger incidents, route alerts, and view incident data on entity pages
On-call: PagerDuty, Opsgenie, Splunk On-Call (formerly VictorOps), xMatters
- Track on-call responsibilities to confirm that support teams are always assigned
Monitoring and observability: Coralogix, Datadog, Dynatrace, Google Observability Cloud, Instana, New Relic, Prometheus, ServiceNow Cloud Observability (formerly Lightstep), Splunk Observability Cloud (formerly SignalFX), Sumo Logic
- Detect issues faster and improve visibility
Project management: GitHub, Jira, Azure DevOps, ClickUp
- Track incidents, bugs, and compliance issues
Code quality and security: Checkmarx, Codecov, Mend, Snyk, SonarQube, Veracode, Wiz
- Enforce code coverage, vulnerability scanning, and other quality measures
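
To check your own setup against these categories, a small script along the following lines may help. It is only a sketch: the tool names are copied from the list above, while the configured set is a hypothetical example that you would fill in by hand from your workspace's integration settings.

```python
# Tool names per category, copied from the list above.
RECOMMENDED = {
    "Incident management": {"FireHydrant", "Incident.io", "PagerDuty", "Rootly"},
    "On-call": {"PagerDuty", "Opsgenie", "Splunk On-Call", "xMatters"},
    "Monitoring and observability": {
        "Coralogix", "Datadog", "Dynatrace", "Google Observability Cloud",
        "Instana", "New Relic", "Prometheus", "ServiceNow Cloud Observability",
        "Splunk Observability Cloud", "Sumo Logic",
    },
    "Project management": {"GitHub", "Jira", "Azure DevOps", "ClickUp"},
    "Code quality and security": {
        "Checkmarx", "Codecov", "Mend", "Snyk", "SonarQube", "Veracode", "Wiz",
    },
}


def missing_categories(configured: set[str]) -> list[str]:
    """Return categories for which none of the listed tools is configured."""
    return [
        category
        for category, tools in RECOMMENDED.items()
        if not tools & configured  # empty intersection means no coverage
    ]


# Hypothetical example: a workspace with three integrations configured.
configured = {"PagerDuty", "Datadog", "Jira"}
print(missing_categories(configured))  # ['Code quality and security']
```

Note that one tool can cover more than one category; here PagerDuty satisfies both incident management and on-call.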

Cortex also recommends linking to runbooks and documentation for your entities, ensuring your users have access to critical information.

With your data in Cortex, you have a jumping-off point to start driving a successful Incident Management process.

Step 2: Configure a Scorecard for Incident Preparedness 📋

Action Item: Create a Scorecard

Scorecards automate the process of checking whether services meet criteria such as ownership, on-call coverage, runbooks, monitoring, and security requirements.

Cortex's incident templates include predefined rules that you can customize based on your organization's requirements, infrastructure, and goals. The templates are structured into three levels (Bronze, Silver, and Gold), with each level representing a greater degree of success.
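
The level structure can be pictured as a ladder. The sketch below assumes an entity reaches a level only when it passes every rule in that level and in all lower levels; treat that as an assumption to confirm against the Scorecard documentation. The rule names are illustrative and are not the template's actual rules.

```python
from typing import Optional

LEVELS = ["Bronze", "Silver", "Gold"]  # ordered from lowest to highest


def achieved_level(
    passed_rules: set[str], rules_by_level: dict[str, set[str]]
) -> Optional[str]:
    """Return the highest level whose rules, and all lower levels' rules, pass."""
    achieved = None
    for level in LEVELS:
        if rules_by_level[level] <= passed_rules:  # all rules in this level pass
            achieved = level
        else:
            break  # a failed level blocks every level above it
    return achieved


# Illustrative rule names, not taken from the Cortex template.
rules_by_level = {
    "Bronze": {"has-owner", "has-on-call"},
    "Silver": {"has-runbook", "has-monitors"},
    "Gold": {"dependencies-documented"},
}

passed = {"has-owner", "has-on-call", "has-runbook", "dependencies-documented"}
print(achieved_level(passed, rules_by_level))  # Bronze
```

In this example the entity passes the Gold rule but stays at Bronze, because the missing monitors rule blocks Silver.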

Step 2.1: Create the Scorecard and configure the basics

On the Scorecards page in your workspace, click Create Scorecard.
There are two incident-related Scorecard templates available: Incident Preparedness and Incident Response Performance. On the template you want to use, click Use.
Configure basic settings, including the Scorecard's name, unique identifier, description, and more.
- Learn about configuring the basic settings in the Creating a Scorecard documentation.

Step 2.2: Review and modify the rules

The Scorecard template contains rules that prepare your organization for incidents and enforce industry best practices, such as:

Require ownership, linked docs, and linked Slack channels to enable quick action during incidents.
Require documented monitors so that issues are identified accurately and mean time to resolution (MTTR) is reduced.
Require documented incoming and outgoing dependencies so that responders can assess the full impact and prioritize remediation efforts effectively.
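
Conceptually, each rule above is a yes-or-no check against an entity. The following sketch models them as plain Python predicates over a dictionary. The field names (owners, docs, slack_channels, monitors, dependencies) and the sample entity are assumptions made for this example; this is not how Cortex stores entities or evaluates rules internally.

```python
from typing import Callable


def dependencies_documented(entity: dict) -> bool:
    """Pass when both directions are declared, even if a list is empty."""
    deps = entity.get("dependencies", {})
    return "incoming" in deps and "outgoing" in deps


# Rule name -> predicate. Field names are hypothetical.
RULES: dict[str, Callable[[dict], bool]] = {
    "Has an owner": lambda e: bool(e.get("owners")),
    "Has linked docs": lambda e: bool(e.get("docs")),
    "Has a linked Slack channel": lambda e: bool(e.get("slack_channels")),
    "Has documented monitors": lambda e: bool(e.get("monitors")),
    "Has documented dependencies": dependencies_documented,
}


def failing(entity: dict) -> list[str]:
    """Return the names of the rules this entity does not pass."""
    return [name for name, check in RULES.items() if not check(entity)]


# Sample entity, invented for this example.
parser = {
    "tag": "parser-service",
    "owners": ["team-data"],
    "docs": ["https://example.com/runbook"],
    "slack_channels": [],
    "monitors": ["p99-latency"],
    "dependencies": {"incoming": ["backend-server"], "outgoing": []},
}

print(failing(parser))  # ['Has a linked Slack channel']
```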

While Cortex's template is based on common industry best practices, you may need to adjust the rules based on which tools you use and how your organization prioritizes requirements and metrics. You can reorder, delete, and edit rules, and you can add more rules to a level.

When adding or changing the template rules, you can select from a list of available pre-built rules. Behind each rule is a Cortex Query Language (CQL) query; you can also write your own queries to further refine your rules.

Step 3: Enable On-Call Assistant 🔔

Action Item: Enable On-Call Assistant

Cortex's On-Call Assistant simplifies the incident response process and reduces MTTR. It leverages the PagerDuty integration to automatically surface the most vital information about an entity when an incident has been triggered. During an incident, it notifies the responsible users via Slack, providing incident details, deploy and monitoring information, the entity's owner and related Slack channel, and links to more information.
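
To make the shape of such a notification concrete, here is a minimal sketch that assembles a plain-text message from the kinds of details listed above. The dictionary keys, the sample values, and the message layout are all hypothetical; the actual On-Call Assistant message is produced by Cortex and may look different.

```python
def format_incident_notice(incident: dict, entity: dict) -> str:
    """Build a plain-text notice from incident and entity details."""
    owners = ", ".join(entity.get("owners", [])) or "unassigned"
    lines = [
        f"{incident['title']} ({incident['urgency']} urgency)",
        f"Entity: {entity['name']}",
        f"Owner: {owners}",
        f"Slack channel: {entity.get('slack_channel', 'none linked')}",
        f"Last deploy: {entity.get('last_deploy', 'unknown')}",
        f"More information: {incident['url']}",
    ]
    return "\n".join(lines)


# Sample data, invented for this example.
incident = {
    "title": "Checkout latency high",
    "urgency": "high",
    "url": "https://example.com/incidents/123",
}
entity = {
    "name": "backend-server",
    "owners": ["team-platform"],
    "slack_channel": "#backend-alerts",
}

print(format_incident_notice(incident, entity))
```

Missing details fall back to explicit placeholders such as "unknown", so a gap in the catalog is visible to responders instead of silently omitted.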

Step 5: Configure Cortex MCP 🤖

Action Item: Configure Cortex MCP

Cortex MCP can significantly help during an incident by providing instant, conversational access to critical service and team information directly from your MCP client. It supports incident response by providing:

Real-time, structured answers: Ask questions like "Who is on call for backend-server?" or "Give me all the details for parser-service." MCP fetches the data in real time from Cortex's API, ensuring accurate and up-to-date information about service health, ownership, and operational readiness.

Actionable recommendations: MCP can suggest next steps or remediation ideas based on Scorecard and Initiative data, helping you identify and address gaps in incident response.
Reduced context switching: It meets engineers where they work, such as in an IDE or MCP chat client, eliminating the need to switch between tools during a high-pressure incident.
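
Under the hood, an MCP client sends tool invocations to an MCP server as JSON-RPC 2.0 messages using the tools/call method. The sketch below builds such a request for the on-call question above. The tool name get_on_call and its entity argument are hypothetical; the tools that Cortex MCP actually exposes are listed by the server itself, and in normal use your MCP client constructs these messages for you.

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests need unique ids


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# "get_on_call" and "entity" are hypothetical names used for illustration.
request = tool_call_request("get_on_call", {"entity": "backend-server"})
print(request)
```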

Step 6: Review and act on Eng Intelligence 📈

Action Item: Review Eng Intelligence metrics

Use Eng Intelligence features — DORA dashboard, Velocity Dashboard, and Metrics Explorer — to understand how well teams are performing during and after incidents.

Review trends in Eng Intelligence graphs and metrics, in areas such as incident frequency and time to resolution.
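
For intuition about what these two metrics measure, the sketch below computes mean time to resolution and weekly incident counts from a small list of incident records. The records and field names are invented for this example; Eng Intelligence computes its own metrics, and its exact definitions may differ.

```python
from collections import Counter
from datetime import datetime, timedelta

# Sample incident records, invented for this example.
incidents = [
    {"opened": datetime(2024, 1, 1, 9, 0), "resolved": datetime(2024, 1, 1, 10, 30)},
    {"opened": datetime(2024, 1, 3, 14, 0), "resolved": datetime(2024, 1, 3, 14, 30)},
    {"opened": datetime(2024, 1, 10, 8, 0), "resolved": datetime(2024, 1, 10, 10, 0)},
    {"opened": datetime(2024, 1, 11, 12, 0), "resolved": None},  # still open
]


def mean_time_to_resolution(records: list[dict]) -> timedelta:
    """Average of (resolved - opened) over resolved incidents only."""
    durations = [r["resolved"] - r["opened"] for r in records if r["resolved"]]
    if not durations:
        raise ValueError("no resolved incidents")
    return sum(durations, timedelta()) / len(durations)


def incidents_per_week(records: list[dict]) -> Counter:
    """Count incidents by the ISO (year, week) in which they were opened."""
    counts: Counter = Counter()
    for r in records:
        year, week, _ = r["opened"].isocalendar()
        counts[(year, week)] += 1
    return counts


print(mean_time_to_resolution(incidents))  # 1:20:00
print(dict(incidents_per_week(incidents)))  # {(2024, 1): 2, (2024, 2): 2}
```

Open incidents are excluded from the resolution average but still count toward frequency, which is one of several reasonable conventions.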

Looking for additional resources on enforcing Incident Management best practices in Cortex? Check out the Cortex Academy "Incident Management & Response" course, available to all Cortex customers and POVs.

Incident Preparedness in action

After you have configured your workspace for incident preparedness, you are ready to handle incidents when they arise.

Learn more about preventing and handling active incidents in Incident Response in action.

