PagerDuty
PagerDuty is an incident response platform that allows developers to manage alerts, schedule rotations, define escalation policies, and more. By integrating PagerDuty with Cortex, you can track dozens of key on-call metrics and help teams enforce adoption of on-call best practices.
In this guide, you'll learn how to set up and use the PagerDuty integration in Cortex, enhancing incident response and reporting. The PagerDuty integration unlocks several powerful features:
- View on-call information directly in catalogs
- Trigger incidents directly
- Enforce adoption of on-call best practices for entities and teams
- Link to escalation policies
The PagerDuty integration also allows you to set up the on-call assistant. You can read more about the on-call assistant in this guide.
Setup and configuration
Getting started
In order to connect Cortex to your PagerDuty instance, you’ll need to create a PagerDuty API key.
When adding the API key, you have the option to set read or write permissions.
- Read-only key: Enables Cortex to read any and all data from PagerDuty
- Write key: Allows users to trigger incidents from an entity page in Cortex, and enables On-Call Assistant
You can use a read-only key if you do not wish to trigger incidents directly from the catalog.
Configuration
Once you've created an API key in PagerDuty, you'll add it in PagerDuty settings in Cortex.
If you do not see the settings page you're looking for, you likely don't have the proper permissions and need to contact your admin.
You can specify a read-only API key by toggling on read-only API key. If this option is toggled off, Cortex will assume the provided API key has write permissions.
At this stage, you can also enable or disable On-Call Assistant, which notifies users in Slack when an incident is triggered in PagerDuty. On-Call Assistant requires that you set up a webhook subscription in your PagerDuty account. Note that On-Call Assistant will only work for service-level PagerDuty registrations since these notifications are related to affected services. You can read more about On-Call Assistant in this walkthrough.
The read-only API key option must be toggled off in order for On-Call Assistant to be enabled.
If you’ve set everything up correctly, you’ll see the option to Remove Integration in settings.
You can also use the Test configuration button to confirm that the configuration was successful. If your configuration is valid, you’ll see a banner that says “Configuration is valid. If you see issues, please see documentation or reach out to Cortex support.”
Service-level vs. team-level configuration
Cortex recommends setting up PagerDuty at the service level by registering service entities with PagerDuty services, rather than configuring team entities with a PagerDuty schedule.
If PagerDuty is set up on a service level, you'll be able to see current on-call information listed within a given service's page. If PagerDuty is set up on the team level, you will only be able to view on-call rotation info from a team page.
There are several long-term benefits to setting up PagerDuty on a service level:
- Structuring PagerDuty 1-1 with services enables better alert routing and analytics, something that organizations struggle more with when PagerDuty is set up on a team level.
- With a service-level setup, it’s also easier to enforce that all services have a compliant on-call policy enacted in PagerDuty, especially when making use of Scorecards.
- The service-level setup is less reliant on team members tagging incidents with service info because services and incidents are already linked.
- By setting up PagerDuty on a service level, you also gain the ability to get data from your Cortex catalog into PagerDuty, such as tier/criticality. By tying the service entities in the catalog with those in PagerDuty, you can automate processes and streamline severity protocols.
Registration
Discovery
By default, Cortex will use the entity tag (e.g. my-entity) as the "best guess" for the PagerDuty service. For example, if your entity tag is my-entity, then the corresponding service in PagerDuty should also be my-entity.
If your PagerDuty service doesn't cleanly match the Cortex entity tag, you can override this in the Cortex entity descriptor.
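The lookup order described above can be pictured with a minimal sketch. It assumes an explicit registration in the entity descriptor always wins, and that the fallback is an exact match between the entity tag and a PagerDuty service name. The function name, return shape, and exact-match rule are illustrative assumptions, not Cortex's actual implementation.

```python
from typing import Optional


def resolve_pagerduty_target(
    entity_tag: str, descriptor: dict, service_names: list[str]
) -> Optional[dict]:
    """Hypothetical helper: pick the PagerDuty target for an entity."""
    # 1. An explicit x-cortex-oncall registration overrides discovery.
    override = (descriptor.get("x-cortex-oncall") or {}).get("pagerduty")
    if override:
        return {"source": "descriptor", "id": override["id"], "type": override["type"]}
    # 2. Otherwise fall back to the "best guess": a service named like the tag.
    #    (Exact, case-sensitive matching is an assumption made for this sketch.)
    if entity_tag in service_names:
        return {"source": "tag-match", "name": entity_tag, "type": "SERVICE"}
    # 3. Nothing registered and nothing discovered.
    return None
```

With this sketch, an entity tagged my-entity and no descriptor override resolves to the PagerDuty service named my-entity, while an entity with an override ignores the tag entirely.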
Entity descriptor
For a given entity, you can define the PagerDuty service, schedules, or escalation policy within the entity’s YAML. You can only set up one of these three options per entity.
Each of these has the same field definitions.
| Field | Description | Required |
|---|---|---|
| id | PagerDuty ID for service, schedule, or escalation policy | ✓ |
| type | SERVICE, SCHEDULE, or ESCALATION_POLICY | ✓ |
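As a rough illustration of the field rules in the table, the sketch below checks an already-parsed entity descriptor (a plain dictionary) for a well-formed PagerDuty block. The function name and error messages are hypothetical; Cortex performs its own validation, and this is only a local pre-check you might run before committing a descriptor.

```python
ALLOWED_TYPES = {"SERVICE", "SCHEDULE", "ESCALATION_POLICY"}


def validate_pagerduty_block(descriptor: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks well-formed."""
    oncall = descriptor.get("x-cortex-oncall") or {}
    block = oncall.get("pagerduty")
    if not isinstance(block, dict):
        return ["x-cortex-oncall.pagerduty is missing or not a mapping"]

    problems = []
    # `id` is required: the PagerDuty ID of the service, schedule, or policy.
    pd_id = block.get("id")
    if not isinstance(pd_id, str) or not pd_id.strip():
        problems.append("id is required and must be a non-empty string")
    # `type` is required and must be one of the three documented values.
    if block.get("type") not in ALLOWED_TYPES:
        problems.append("type must be one of " + ", ".join(sorted(ALLOWED_TYPES)))
    return problems
```

Because the block holds a single id and type pair, a descriptor that passes this check also satisfies the one-registration-per-entity rule.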
PagerDuty service
You can find the service ID value by visiting PagerDuty → Configuration → Services. The URL for the service will contain the ID, for example: https://cortexapp.pagerduty.com/services/<ID>
x-cortex-oncall:
  pagerduty:
    id: ASDF1234
    type: SERVICE
Schedules
You can find the schedule ID by visiting PagerDuty → People → On-call schedules and clicking on the desired schedule. The ID is found in the URL, for example https://cortexapp.pagerduty.com/schedules#<ID>.
x-cortex-oncall:
  pagerduty:
    id: ASDF1234
    type: SCHEDULE
Escalation policy
You can find the escalation policy ID by visiting PagerDuty → People → Escalation Policies and clicking on the desired policy. The ID is found in the URL, for example https://cortexapp.pagerduty.com/escalation_policies#<ID>.
x-cortex-oncall:
  pagerduty:
    id: ASDF1234
    type: ESCALATION_POLICY
You can only set up one of the three options above per entity.
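If you are registering many entities, copying IDs out of URLs by hand gets tedious. The sketch below derives the registration block from a PagerDuty URL, assuming the three URL shapes shown above: the service ID in the path, and the schedule and escalation policy IDs in the fragment after `#`. The function name is hypothetical, and other PagerDuty URL formats are not handled.

```python
from urllib.parse import urlparse

# Maps the first URL path segment to the descriptor `type` value.
_PATH_TO_TYPE = {
    "services": "SERVICE",
    "schedules": "SCHEDULE",
    "escalation_policies": "ESCALATION_POLICY",
}


def oncall_block_from_url(url: str) -> dict:
    """Hypothetical helper: build an x-cortex-oncall block from a PagerDuty URL."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or segments[0] not in _PATH_TO_TYPE:
        raise ValueError(f"unrecognized PagerDuty URL: {url}")

    kind = _PATH_TO_TYPE[segments[0]]
    if kind == "SERVICE":
        # Service URLs look like /services/<ID>.
        pd_id = segments[1] if len(segments) > 1 else ""
    else:
        # Schedule and escalation policy URLs look like /schedules#<ID>.
        pd_id = parsed.fragment
    if not pd_id:
        raise ValueError(f"no ID found in URL: {url}")
    return {"x-cortex-oncall": {"pagerduty": {"id": pd_id, "type": kind}}}
```

The returned dictionary mirrors the YAML structure above, so it can be serialized with any YAML library and merged into an entity descriptor.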
Identity mappings
Cortex maps email addresses in your PagerDuty instance to email addresses that belong to team members in Cortex. When identity mapping is set up, users will be able to see their personal on-call status from the developer homepage.
Expected results
Entity pages
Once the PagerDuty integration is set up, you’ll be able to view on-call information on entity pages:
- Current on-call for an entity
- Escalation policy
- Service
The escalation policy and PagerDuty service details are hyperlinked to the corresponding pages in your PagerDuty instance.
Scorecards and CQL
With the integration, you can create Scorecard rules and write CQL queries based on PagerDuty data.
Check if on-call is set
Check if an entity has a registered service, schedule, or escalation policy. If the entity does not have any registrations in its entity descriptor, Cortex searches for PagerDuty services matching the tag defined in the entity's x-cortex-tag field.
Definition: oncall (==/!=) null
Example
For a Scorecard focused on production readiness, you can use this expression to make sure on-call is defined for entities:
oncall != null
This rule will pass if an entity has a service, schedule, or escalation policy set.
Forbidden contact methods
Number of users in each entity's escalation policy with missing or forbidden contact methods.
Allowed contact methods:
- "SMS"
- "PHONE"
- "EMAIL"
- "PUSH_NOTIFICATION"
Definition: oncall.usersWithoutContactMethods(allowed=<allowed>).length
Examples
For a Scorecard focused on ownership, you can use this expression to make sure users have required contact methods enabled:
oncall.usersWithoutContactMethods(allowed=["SMS", "PHONE"]).length == 0
This rule will pass if every user in an associated escalation policy has either SMS or phone calls enabled as their contact method.
You can also use this expression in the Query builder to find entities with users that lack the required contact method:
oncall.usersWithoutContactMethods(allowed=["EMAIL"]).length > 0
This query will surface entities whose escalation policy includes users without email enabled as a contact method.
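To make the semantics of the `allowed` parameter concrete, here is a small sketch of the check as the examples above describe it: a user is flagged when none of their contact methods is in the allowed list. The data shape, user names, and function name are made up for illustration, and this is not how Cortex implements the expression.

```python
def users_without_contact_methods(users: list[dict], allowed: list[str]) -> list[str]:
    """Return the names of users with no contact method from `allowed`."""
    allowed_set = set(allowed)
    # A user is flagged when their methods and the allowed set do not overlap,
    # which also covers users with no contact methods at all.
    return [
        user["name"]
        for user in users
        if allowed_set.isdisjoint(user.get("contact_methods", []))
    ]


# Hypothetical escalation policy members.
policy_users = [
    {"name": "ana", "contact_methods": ["SMS", "EMAIL"]},
    {"name": "ben", "contact_methods": ["EMAIL"]},
    {"name": "chi", "contact_methods": []},
]

flagged = users_without_contact_methods(policy_users, ["SMS", "PHONE"])
rule_passes = len(flagged) == 0
```

In this sample, ben and chi are flagged because neither has SMS or phone enabled, so a rule requiring a length of 0 would fail for the entity.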
Incident response analysis
Get detailed on-call analysis stats for each entity:
- Mean assignment count
- Mean engaged seconds
- Mean engaged user count
- Mean seconds to engage
- Mean seconds to first ack
- Mean seconds to mobilize
- Mean seconds to resolve
- Total business-hour interruptions
- Total engaged seconds
- Total escalation count
- Total off-hour interruptions
- Total sleep-hour interruptions
- Total snoozed seconds
- Total incident count
- Up time percent
PagerDuty updates its analytics data once per day, and it can take up to 24 hours before new incidents appear in the analytics API.
This only works if the entity has a registered PagerDuty service ID or if the PagerDuty service name matches the entity tag.
Definition: oncall.analysis(lookback = <duration>)
Examples
PagerDuty analytics can easily be used to craft rules for a DORA metrics Scorecard.
For mean time to acknowledge, you can use the meanSecondsToFirstAck schema definition:
oncall.analysis(lookback = duration("P7D")).meanSecondsToFirstAck <= 300
Entities will pass this rule if the mean time to first acknowledgment for incidents in the last week was 5 minutes (300 seconds) or less.
For mean time to resolve, you can use meanSecondsToResolve to make sure that incidents were handled within an hour:
oncall.analysis(lookback = duration("P7D")).meanSecondsToResolve < 3600
You can also use this expression to write a rule checking entities' change failure rate:
oncall.analysis(lookback = duration("P7D")).totalIncidentCount == 0
This rule will pass if there weren't any incidents in the last week.
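The thresholds in these rules are plain second counts, so it can help to derive them from readable durations. The sketch below evaluates the three rules against a sample analysis result. The field names come from the expressions above; the sample values, function name, and result keys are invented for illustration.

```python
from datetime import timedelta

ACK_THRESHOLD = timedelta(minutes=5)     # 300 seconds
RESOLVE_THRESHOLD = timedelta(hours=1)   # 3600 seconds


def evaluate_dora_rules(analysis: dict) -> dict:
    """Apply the same comparisons as the three CQL rules above."""
    return {
        "ack_within_5_min": analysis["meanSecondsToFirstAck"] <= ACK_THRESHOLD.total_seconds(),
        "resolved_within_1_hour": analysis["meanSecondsToResolve"] < RESOLVE_THRESHOLD.total_seconds(),
        "no_incidents": analysis["totalIncidentCount"] == 0,
    }


# Made-up analysis values for a 7-day lookback.
sample_analysis = {
    "meanSecondsToFirstAck": 240,
    "meanSecondsToResolve": 3600,
    "totalIncidentCount": 2,
}
results = evaluate_dora_rules(sample_analysis)
```

Note the difference between `<=` and `<`: a mean of exactly 300 seconds passes the acknowledgment rule, while a mean of exactly 3600 seconds fails the resolution rule.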
Incidents
Get incident data for each entity:
- Assignee ID
- Created at
- Incident ID
- Last updated
- Resolved at
- Service ID
- Status
This only works if the entity has a registered PagerDuty service ID or if the PagerDuty service name matches the entity tag.
Definition: oncall.incidents(lookback = <duration>)
Examples
For a Scorecard focused on service maturity or quality, you can use this expression to check the number of incidents opened in the last month:
oncall.incidents(lookback = duration("P1M")).length < 15
Entities will pass this rule if they have fewer than 15 incidents opened in the last month.
You can also use this expression to make sure that no incidents opened in the last month are still open:
oncall.incidents(lookback=duration("P1M")).filter((incident) => incident.status.matches("TRIGGERED|ACKNOWLEDGED")).length < 1
Or you can check for incidents that took a certain amount of time to resolve:
oncall.incidents(lookback=duration("P1M")).filter((incident) => incident.createdAt.until(incident.resolvedAt) > duration("P-2D")).length < 2
Entities will pass this rule if there were 0 or 1 incidents in the last month that took more than 2 days to resolve.
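The two filters above can be read as ordinary list filtering. The following sketch applies the same logic to a few made-up incident records: one filter keeps incidents that are still open, the other keeps resolved incidents that took more than 2 days from creation to resolution. The record shape, the "RESOLVED" status value, and the function names are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone

OPEN_STATUSES = {"TRIGGERED", "ACKNOWLEDGED"}


def open_incidents(incidents: list[dict]) -> list[dict]:
    """Incidents that have not been resolved yet."""
    return [i for i in incidents if i["status"] in OPEN_STATUSES]


def slow_incidents(incidents: list[dict], limit: timedelta = timedelta(days=2)) -> list[dict]:
    """Resolved incidents whose time to resolve exceeded `limit`."""
    return [
        i for i in incidents
        if i.get("resolvedAt") is not None and i["resolvedAt"] - i["createdAt"] > limit
    ]


# Made-up incidents within a one-month lookback window.
now = datetime(2024, 5, 31, tzinfo=timezone.utc)
sample_incidents = [
    {"id": "P1", "status": "RESOLVED", "createdAt": now - timedelta(days=10),
     "resolvedAt": now - timedelta(days=7)},                      # took 3 days
    {"id": "P2", "status": "RESOLVED", "createdAt": now - timedelta(days=5),
     "resolvedAt": now - timedelta(days=5) + timedelta(hours=1)},  # took 1 hour
    {"id": "P3", "status": "ACKNOWLEDGED", "createdAt": now - timedelta(days=1),
     "resolvedAt": None},                                          # still open
]

no_open_rule = len(open_incidents(sample_incidents)) < 1
slow_rule = len(slow_incidents(sample_incidents)) < 2
```

For this sample, the open-incident rule fails because P3 is still acknowledged but unresolved, while the slow-incident rule passes because only P1 took longer than 2 days.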
Number of escalations
Number of escalation tiers in escalation policy.
Definition: oncall.numOfEscalations()
Example
This expression could be used in a Scorecard focused on production readiness or service maturity:
oncall.numOfEscalations() >= 2
This rule checks that there are at least two tiers in an escalation policy for a given entity, so that if the first on-call does not ack, there is a backup.
While making sure an on-call policy is set is a rule that would be defined in a Scorecard's first level, a rule focused on escalation tiers would make more sense in a higher level.
On-call metadata
On-call metadata, including type, id, and name.
Definition: oncall.details()
Examples
To find all entities with a schedule-type on-call registration, you can use this expression in the Query builder:
oncall.details().type == "schedule"
If you're migrating on-call policies, you could use this rule to check for outdated policies. Let's say, for example, all outdated PagerDuty policies start with "Legacy" in their names.
oncall.details().name.matches("Legacy.*") == false
Entities with on-call policies whose names start with "Legacy" will fail, while those with other policy names will pass.
Dev homepage
The PagerDuty integration enables Cortex to pull on-call information into the on-call block on the Dev homepage. On-call data from PagerDuty is refreshed every 60 minutes.
Eng Intelligence
Cortex also pulls in metrics from PagerDuty for Eng Intelligence. This tool will display MTTR, incidents opened, and incidents opened per week.
Notifications
If you have a Slack integration set up, you can also use the /cortex oncall <tag> command to retrieve current on-call information. This feature works for both services and teams with registered PagerDuty schedules or escalation policies.
Triggering incidents
If you used a write token to set up the integration, you’ll also see the ability to trigger incidents from the PagerDuty tab in an entity’s home page. This will open a modal where you can enter information about the incident: title, details, urgency, and associated email address. The incident will then be triggered directly in PagerDuty.
The Trigger Incident feature in the catalog only works with PagerDuty services.
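For context on why a write key and a PagerDuty service are both needed, the sketch below assembles a roughly equivalent request to PagerDuty's REST API incident creation endpoint from the same four modal fields. It only builds the request and does not send it. The function name and placeholder values are hypothetical, and this is an illustration of the PagerDuty API rather than a description of what Cortex sends internally.

```python
import json

PAGERDUTY_INCIDENTS_URL = "https://api.pagerduty.com/incidents"


def build_trigger_request(
    api_key: str, service_id: str, title: str, details: str, urgency: str, from_email: str
) -> dict:
    """Assemble (but do not send) a create-incident request."""
    if urgency not in ("high", "low"):
        raise ValueError("urgency must be 'high' or 'low'")
    headers = {
        "Authorization": f"Token token={api_key}",  # must be a key with write access
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": from_email,  # email of the PagerDuty user creating the incident
    }
    payload = {
        "incident": {
            "type": "incident",
            "title": title,
            # Incidents are always attached to a service.
            "service": {"id": service_id, "type": "service_reference"},
            "urgency": urgency,
            "body": {"type": "incident_body", "details": details},
        }
    }
    return {
        "method": "POST",
        "url": PAGERDUTY_INCIDENTS_URL,
        "headers": headers,
        "body": json.dumps(payload),
    }
```

The required service reference in the payload is consistent with the restriction that triggering incidents only works for entities registered with a PagerDuty service.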
Background sync
Cortex runs several background jobs for the PagerDuty integration:
- On-call: On-call information displayed on the developer homepage is refreshed every 60 minutes.
- Services and incidents: Services used for automapping and active incidents viewable in the catalog are fetched approximately every 5 minutes, or however long the refresh takes.
- Users: User data for identity mapping is synced daily at 10 a.m. UTC.
Still need help?
The following are all the ways to get assistance from our customer engineering team. Please use the option that is best for your users:
- Email: help@cortex.io, or open a support ticket in the in-app Resource Center
- Chat: Available in the Resource Center
- Slack: Users with a connected Slack channel will have a workflow added to their account. From here, you can either @CortexTechnicalSupport or add a :ticket: reaction to a question in Slack, and the team will respond directly.
Don’t have a Slack channel? Talk with your customer success manager.