Scorecard examples
The following Scorecard use cases and examples are based on engineering teams across a wide spectrum of sizes and maturity levels.
Scorecard use cases
Cortex users commonly define Scorecards across several categories:
- Development Maturity: Ensure services and resources conform to basic development best practices, such as established code coverage, checking in lockfiles, READMEs, package versions, and ownership.
- Operational Readiness: Determine whether services and resources are ready to be deployed to production, checking for runbooks, dashboards, logs, on-call escalation policies, monitoring/alerting, and accountable owners.
- Operational Maturity: Monitor whether services are meeting SLOs, on-call metrics look healthy, and post-mortem tickets are closed promptly, gauging whether there are too many customer-facing incidents.
- Security: Mitigate security vulnerabilities, achieve security compliance across services, and measure code coverage.
- Migrations: Track ad hoc projects like migrations between language versions, platforms, or deployment strategies, or perform security audits, such as PCI DSS or SOC 2 compliance.
- Best Practices: Define organization-wide best practices in areas such as infrastructure and platform, SRE, and security. For example, the Scorecard might help you ensure the correct platform library version is being used.
Scorecards are often aspirational. For example, an SRE team may define a readiness Scorecard with 20+ criteria that they think their services or resources should meet in order to be considered "ready" for SRE support. The reality may be that the engineering team is not resourced to actually meet those goals, but setting objective targets helps drive organization-wide cultural shifts and sets a baseline for conversations around tech debt, infrastructure investment, and service quality.
Scorecard examples
Development maturity
Developers should be checking in lockfiles to ensure repeatable builds.
sonarqube.metric("coverage") > 80.0
Set a threshold that’s achievable, so there’s an incentive to actually try. This also serves as a secondary check that the service is hooked up to SonarQube and reporting frequently.
git.lastCommit().freshness < duration("P30D")
As counterintuitive as it may seem, services that receive commits too infrequently are actually at greater risk. This is because people who are familiar with the service may leave a team, tribal knowledge accumulates, and from a technical standpoint, the service may be running outdated versions of your platform tooling.
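As a rough illustration of what the freshness rule above measures, the following Python sketch compares the age of a last commit against a 30-day window (`P30D` is the ISO 8601 duration for 30 days). The function name `is_fresh` and the sample dates are assumptions made for this example; this is not how Cortex evaluates the rule internally.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_commit: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last commit is strictly newer than max_age."""
    return (now - last_commit) < max_age

# "P30D" is the ISO 8601 duration for 30 days.
THIRTY_DAYS = timedelta(days=30)

# Sample timestamps, chosen only for illustration.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
recent = datetime(2024, 6, 10, tzinfo=timezone.utc)  # 20 days old
stale = datetime(2024, 3, 1, tzinfo=timezone.utc)    # about four months old

print(is_fresh(recent, now, THIRTY_DAYS))  # True
print(is_fresh(stale, now, THIRTY_DAYS))   # False
```

Because the comparison is strict, a commit that is exactly 30 days old would not count as fresh in this sketch.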
Use a wildcard search to make sure there are unit tests enabled.
git.numOfRequiredApprovals() >= 1
Ensure that a rigorous PR process is in place for the repo, and PRs must be approved by at least one user before merging.
git.fileContents(".circleci/config.yml").matches(".*npm test.*")
Enforce that a CI pipeline exists, and that there is a testing step defined in the pipeline.
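To make the intent of the pattern concrete, here is a minimal Python sketch that looks for an `npm test` step in a CI config. The config contents and the helper `has_test_step` are hypothetical, and the line-by-line search is only an approximation of the idea; it does not claim to reproduce the matching semantics Cortex applies to `matches`.

```python
import re

# Hypothetical CI config contents, used only for illustration.
CI_CONFIG = """\
version: 2.1
jobs:
  build:
    steps:
      - checkout
      - run: npm ci
      - run: npm test
"""

def has_test_step(config_text: str, pattern: str = r"npm test") -> bool:
    """Return True if any line of the config mentions the test command."""
    return any(re.search(pattern, line) for line in config_text.splitlines())

print(has_test_step(CI_CONFIG))                   # True
print(has_test_step("steps:\n  - run: npm ci"))   # False
```

Running a check like this locally against your own config can help confirm that a pattern is neither too strict nor too loose before you turn it into a Scorecard rule.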
Operational readiness
ownership.allOwners().length > 2
Incident response requires crystal-clear accountability, so make sure there are owners defined for each service or resource.
oncall.numOfEscalations() > 1
Check that there are at least 2 levels in the escalation policy, so that if the first on-call does not acknowledge, there is an established backup.
links("runbooks").length >= 1
Create a culture of preparation by requiring runbooks to be established for the services or resources.
links("logs").length > 1
When there is an incident, responders should be able to find the right logs easily. Usually, this means load balancer logs and application logs.
embeds().length >= 1
Responders should have standard dashboards readily accessible for every service or resource in order to speed up triage.
custom("pre-prod-enabled") == true
Use an asynchronous process to check whether there is a live pre-production environment for the service or resource, and send a true/false flag to Cortex using the custom metadata API.
sonarqube.metric("vulnerabilities") < 3
Ensure that production services are not deployed with a high number of security vulnerabilities.
Operational maturity
oncall.analysis().meanSecondsToResolve < 3600
Make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
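For a sense of the arithmetic behind a one-hour threshold, the Python sketch below averages time to resolve over a list of incidents. The incident timestamps and the function `mean_seconds_to_resolve` are assumptions made for this example; the real value comes from your on-call integration.

```python
from datetime import datetime
from statistics import mean

def mean_seconds_to_resolve(incidents):
    """Average the open-to-resolve time, in seconds, over (opened, resolved) pairs."""
    return mean((resolved - opened).total_seconds() for opened, resolved in incidents)

# Sample incidents, chosen only for illustration.
incidents = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 30)),    # 1800 s
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 15, 30)),  # 5400 s
]

avg = mean_seconds_to_resolve(incidents)
print(avg)         # 3600.0
print(avg < 3600)  # False
```

In this sample, one slow incident pulls the mean up to exactly one hour, so a strict `< 3600` threshold is not met even though the other incident was resolved in 30 minutes.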
oncall.analysis().offHourInterruptions < 3
If engineers are being paged off hours, it will lead to alert fatigue and low morale. By catching services and resources that are causing high numbers of off-hour interruptions, you can improve developer happiness.
Jira: post-mortem tickets opened in the last 6 months that are still open
Developers creating action items for services without actually closing them is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
jira.numOfIssues("labels=customer and created > startOfMonth(-3)") < 2
A reliable service or resource should not be a source of frequent customer-facing incidents.
jira.numOfIssues("labels=compliance") < 3
Make sure there are no outstanding compliance or legal issues affecting the service or resource.
Security
snyk != null
The first step in monitoring security is making sure each service has an associated Snyk project.
git.lastCommit().freshness < duration("P7D")
By confirming whether a service was updated within the last week, outdated code can be caught sooner. Plus, if there is a security issue, you can quickly determine which services have or have not been updated to patch the vulnerability.
ownership.allOwners().length > 0
Making sure each entity has at least one owner helps ensure updates don't fall through the cracks.
git.numOfRequiredApprovals() > 0
Changes should not be pushed through unless there is at least one approval.
sonarqube.metric("coverage") > 70
By monitoring code coverage, you can get a sense of how much of your code has been tested; entities with low scores are more likely to be vulnerable to attack.
git.branchProtection() != null
Make sure that your default branch is protected, as vulnerabilities here are critical.
sonarqube.freshness() < duration("P7D")
Check that a SonarQube analysis has been uploaded within the last seven days, so teams are monitoring for compliance with coding rules.
snyk.issues() < 5
sonarqube.metric("security_hotspots") < 5
sonarqube.metric("vulnerabilities") < 5
Once an entity is meeting core requirements, developers can start focusing on quality by making sure entities have a low number of Snyk issues, security hotspots, and/or vulnerabilities.
Migrations
custom("ci-platform-version") > semver("1.1.3")
Having every CI pipeline send a current version to Cortex on each master build lets you catch services or resources that rely on outdated versions of tooling, like CI or deploy scripts.
package("apache.commons.lang") > semver("1.2")
Cortex automatically parses dependency management files, so you can easily enforce library versions for platform migrations, security audits, and more.
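Version rules like the ones above rely on numeric, part-by-part comparison rather than plain string comparison. The Python sketch below shows the difference; `parse_version` is a hypothetical helper that handles only plain `MAJOR.MINOR.PATCH` strings (no pre-release or build tags) and is not Cortex's `semver` implementation.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers, padding missing parts with 0."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])

# Numeric comparison treats 10 as greater than 9.
print(parse_version("1.10.0") > parse_version("1.9.0"))  # True

# Plain string comparison gets this wrong, because "1" sorts before "9".
print("1.10.0" > "1.9.0")  # False

# A short version such as "1.2" is padded to (1, 2, 0).
print(parse_version("1.2"))  # (1, 2, 0)
```

This is why it helps to have CI pipelines report versions in a consistent `MAJOR.MINOR.PATCH` form when sending them as custom data.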
Best practices
Best practices are unique to every organization and every application, so make sure to work across teams to develop a Scorecard measuring your organization's standards.
The following example uses JavaScript best practices:
git.fileExists("yarn.lock") or git.fileExists("package-lock.json")
Make sure a lockfile is checked in to provide consistency in package installs.
git.fileExists(".prettierrc.json") or git.fileExists(".eslintrc.js")
Projects should have a standard linter.
jq(git.fileContents("package.json"), ".engines.node") != null
Node engine version should be specified in the package.json file.
jq(git.fileContents("package.json"), ".devDependencies | with_entries(select(.key == \"typescript\")) | length") == 0 or git.fileExists("tsconfig.json")
TypeScript projects should have a tsconfig checked in.
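If the jq filters in the two rules above are hard to read, the following Python sketch performs roughly equivalent lookups on a parsed `package.json`. The file contents are hypothetical and the variable names are chosen for this example only; the sketch is meant to clarify what the filters select, not to replace them.

```python
import json

# Hypothetical package.json contents, used only for illustration.
PACKAGE_JSON = """
{
  "name": "example-service",
  "engines": {"node": ">=18"},
  "devDependencies": {"typescript": "^5.4.0", "eslint": "^8.57.0"}
}
"""

pkg = json.loads(PACKAGE_JSON)

# Mirrors the jq path .engines.node: None when either key is missing.
node_engine = pkg.get("engines", {}).get("node")

# Mirrors the with_entries/select/length filter:
# count the devDependencies entries whose key is "typescript".
typescript_deps = sum(1 for name in pkg.get("devDependencies", {}) if name == "typescript")

print(node_engine)      # >=18
print(typescript_deps)  # 1
```

Here the count is 1, so the first half of the TypeScript rule is false and the rule passes only if `tsconfig.json` exists in the repository.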
jq(git.fileContents("package.json"), ".engines.yarn") == null or jq(git.fileContents("package.json"), ".engines.npm") == "please-use-yarn"
If a project is using yarn, it should not allow npm.
jq(git.fileContents("package.json"), ".engines.yarn") == null or !(semver("1.2.0") ~= semverRange(jq(git.fileContents("package.json"), ".engines.yarn")))
Finally, ensure that the yarn version being used is not deprecated.