The SOC metrics that actually tell you something

There is a particular kind of KPI that looks rigorous and turns out to be actively harmful. It produces a number. The number goes on a dashboard. The dashboard gets presented in QBRs. And somewhere in the background, the metric quietly incentivizes the exact behaviour you were trying to prevent.

Security operations is full of these. The most common offender is Mean Time to Respond (MTTR) — and the way it breaks is instructive for thinking about metrics more broadly.

Why MTTR breaks when applied to analysts

MTTR is a reasonable metric for measuring the performance of an engineering team responsible for tooling. If your detection infrastructure takes four hours to process a high-severity alert, that's a process problem worth measuring and reducing.

Applied to individual analysts as a performance measure, it breaks immediately. If "responding" means acknowledging an alert, you've created an incentive to acknowledge as many alerts as fast as possible. Analysts start context-switching constantly, juggling open investigations and losing the thread on complex cases. The MTTR numbers look great. The detection quality degrades invisibly.

High-performing analysts find this particularly corrosive. They know that good triage takes time. They know that rushing through alerts to hit a number is security theatre. Being measured on that number feels like being punished for doing the job properly.

"The metric told me to close tickets quickly. My instincts told me to spend more time on the weird ones. Every month, the metric won, and every month I trusted it a little less."

This is the failure mode of a poorly designed KPI: it produces the metric at the cost of the outcome the metric was supposed to represent.

What makes a KPI robust

A useful test for any KPI is to ask: what happens if someone optimises hard against this metric? If the answer is "they game it and the underlying outcome gets worse," the KPI is broken by design.

A good KPI remains beneficial even under significant effort to over-optimise it. It should have a clear ceiling or natural constraint that prevents perverse incentives at the extremes. It should be calculated automatically from defined inputs, not manually compiled — otherwise the calculation itself becomes a source of noise and conflict.

Three metrics that hold up

1. Estate coverage

Estate coverage measures the proportion of the client's environment (endpoints, cloud workloads, network segments) that falls within the SOC's field of view. A well-defined estate coverage metric, integrated with an accurate asset registry, has a clear ceiling of 100%, is easy to explain to non-technical stakeholders, and is structurally difficult to game.

More importantly, it directly addresses the question that should precede all other SOC performance questions: can we actually see what's happening across the environment? A SOC with excellent MTTR numbers but 60% estate coverage is blind to whatever happens in the other 40% of the environment. Estate coverage makes this visible.
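As a minimal sketch of the calculation, assume the asset registry and the set of assets currently sending telemetry are both available as lists of asset identifiers. The function names and the sample identifiers below are hypothetical, chosen only for illustration.

```python
def estate_coverage(registry_assets, reporting_assets):
    """Fraction of registered assets that are currently sending telemetry."""
    registry = set(registry_assets)
    if not registry:
        raise ValueError("asset registry is empty; coverage is undefined")
    # Intersect with the registry so that unknown hosts cannot push the
    # figure above 100%.
    covered = registry & set(reporting_assets)
    return len(covered) / len(registry)


def uncovered_assets(registry_assets, reporting_assets):
    """Registered assets with no telemetry, i.e. the blind spots."""
    return sorted(set(registry_assets) - set(reporting_assets))


# Illustrative data only.
registry = ["ws-001", "ws-002", "ws-003", "db-01", "vm-app-1"]
reporting = ["ws-001", "ws-003", "db-01", "rogue-host"]  # rogue-host is not registered

coverage = estate_coverage(registry, reporting)
blind_spots = uncovered_assets(registry, reporting)
print(f"coverage: {coverage:.0%}")  # coverage: 60%
print(blind_spots)                  # ['vm-app-1', 'ws-002']
```

Dividing by the registry rather than by the set of reporting hosts is what keeps the ceiling at 100%. It also means the number is only as trustworthy as the registry behind it.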

The metric is most useful when tracked over time. Coverage that drops from 94% to 88% in a week is a signal — something changed in the environment and the telemetry didn't keep up. That's worth investigating before the next incident review, not after.
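One way to surface that kind of drop automatically is to compare each period's coverage with the previous one and flag any fall larger than a threshold. This is a sketch under assumptions: the three-point threshold, the function name, and the weekly figures are illustrative, not recommendations.

```python
def coverage_drops(history, max_drop=0.03):
    """Flag period-over-period coverage drops larger than max_drop.

    history: chronological list of (label, coverage) pairs,
             with coverage expressed as a fraction between 0 and 1.
    Returns a list of (previous_label, label, drop) tuples.
    """
    flagged = []
    for (prev_label, prev), (label, cur) in zip(history, history[1:]):
        drop = prev - cur
        if drop > max_drop:
            flagged.append((prev_label, label, round(drop, 4)))
    return flagged


# Illustrative weekly figures, including a 94% -> 88% fall.
weekly = [("week 1", 0.93), ("week 2", 0.94), ("week 3", 0.88), ("week 4", 0.89)]
flagged = coverage_drops(weekly)
print(flagged)  # [('week 2', 'week 3', 0.06)]
```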

2. True positive rate by detection rule

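Aggregate false positive / true positive ratios obscure as much as they reveal. The useful version of this metric is disaggregated by detection rule. Here "true positive rate" means the share of a rule's alerts that analysts confirm as genuine, which is closer to precision than to the statistical true positive rate (recall). Which specific rules are generating high alert volume with low confirmation rates? Which rules are rarely firing but reliably accurate?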

This gives you two things: a tuning backlog (rules worth revisiting because their signal-to-noise ratio is poor) and a view of detection coverage quality that's actionable at an engineering level. It also tells you something about analytical load — analysts spending time triaging high-volume, low-fidelity rules are doing less useful work than analysts working high-fidelity alerts.

Practical note: This metric requires your SIEM alert data and analyst disposition data to be queryable together. In most environments, dispositions live in the ticketing system and alerts live in the SIEM — which means somebody has to join them. If that join doesn't happen automatically, this metric won't get tracked.
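A sketch of that join, assuming alerts have been exported from the SIEM as records with an alert ID and a rule name, and dispositions from the ticketing system as a mapping from alert ID to verdict. The field names, verdict strings, and rule names are hypothetical.

```python
from collections import defaultdict


def rule_confirmation_rates(alerts, dispositions):
    """Join alerts to analyst dispositions and summarise per detection rule.

    alerts:       iterable of dicts with "alert_id" and "rule" keys
    dispositions: dict mapping alert_id -> "true_positive" or "false_positive"
    """
    stats = defaultdict(lambda: {"alerts": 0, "dispositioned": 0, "true_positive": 0})
    for alert in alerts:
        s = stats[alert["rule"]]
        s["alerts"] += 1
        verdict = dispositions.get(alert["alert_id"])
        if verdict is None:
            continue  # no disposition recorded; count volume only
        s["dispositioned"] += 1
        if verdict == "true_positive":
            s["true_positive"] += 1

    result = {}
    for rule, s in stats.items():
        # The rate is undefined when nothing for the rule has been dispositioned.
        rate = s["true_positive"] / s["dispositioned"] if s["dispositioned"] else None
        result[rule] = {**s, "tp_rate": rate}
    return result


# Illustrative data only.
alerts = [
    {"alert_id": "a1", "rule": "impossible_travel"},
    {"alert_id": "a2", "rule": "impossible_travel"},
    {"alert_id": "a3", "rule": "impossible_travel"},
    {"alert_id": "a4", "rule": "impossible_travel"},
    {"alert_id": "a5", "rule": "lsass_dump"},
    {"alert_id": "a6", "rule": "lsass_dump"},
]
dispositions = {
    "a1": "false_positive",
    "a2": "false_positive",
    "a3": "true_positive",
    "a5": "true_positive",
    "a6": "true_positive",
}

rates = rule_confirmation_rates(alerts, dispositions)
for rule, s in sorted(rates.items()):
    print(rule, s["alerts"], s["dispositioned"], s["tp_rate"])
```

Keeping the count of alerts without a disposition separate from the rate matters: a rule with many undispositioned alerts has an unreliable rate, and that gap is itself worth reporting.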

3. SLA adherence by severity tier

SLA adherence is a common metric, but it is often reported in aggregate ("we met SLA on 94% of alerts this week"), which hides the variance that matters. Critical-severity alerts missing SLA is a different problem from low-severity alerts missing SLA. Aggregate reporting makes both look equivalent.

The more useful version tracks adherence broken out by severity tier — critical, high, medium, low — and further broken out by client if you're running a multi-client SOC. This lets you spot patterns: a particular client's environment generating an unusual proportion of SLA misses, or a specific severity tier that your team consistently struggles with during overnight shifts.
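The breakdown can be sketched as follows, assuming each ticket carries a client, a severity, and creation and response timestamps. The SLA targets, field names, and client names are hypothetical placeholders, and only tickets that already have a response are assumed to be passed in.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical response targets per severity tier.
SLA_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def sla_adherence(tickets):
    """Return {(client, severity): fraction of tickets answered within target}."""
    counts = defaultdict(lambda: [0, 0])  # (client, severity) -> [met, total]
    for t in tickets:
        key = (t["client"], t["severity"])
        elapsed = t["responded_at"] - t["created_at"]
        counts[key][1] += 1
        if elapsed <= SLA_TARGETS[t["severity"]]:
            counts[key][0] += 1
    return {key: met / total for key, (met, total) in counts.items()}


# Illustrative data only.
t0 = datetime(2024, 1, 8, 2, 0)
tickets = [
    {"client": "acme", "severity": "critical", "created_at": t0,
     "responded_at": t0 + timedelta(minutes=10)},
    {"client": "acme", "severity": "critical", "created_at": t0,
     "responded_at": t0 + timedelta(minutes=40)},
    {"client": "acme", "severity": "low", "created_at": t0,
     "responded_at": t0 + timedelta(hours=3)},
    {"client": "globex", "severity": "high", "created_at": t0,
     "responded_at": t0 + timedelta(minutes=30)},
]

adherence = sla_adherence(tickets)
for (client, severity), rate in sorted(adherence.items()):
    print(f"{client:8} {severity:9} {rate:.0%}")
```

In this sample the aggregate figure would be 75%, while the acme critical tier sits at 50%. That is the variance the aggregate hides.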

On the reporting of metrics to clients

There is a persistent instinct in security operations to limit what clients see — to report the numbers that look good and explain away the ones that don't. This is understandable and usually counterproductive.

Clients who receive clear, consistent, honest metrics, including metrics that show room for improvement, develop a more accurate model of what their MSSP does. That accurate model is what lets them defend the service to their own leadership when something goes wrong. It's also what makes contract renewals feel like continuations rather than negotiations.

The goal of SOC reporting is not to present a polished surface. It is to give clients the data they need to understand their risk posture and the value of the services they're paying for. Metrics that support that goal are worth tracking. Metrics that just fill a dashboard are noise.

A note on automation

All three of the metrics described above can be automated: estate coverage from the asset registry and telemetry sources, TP/FP rates from SIEM and ticketing data joined on alert ID, SLA adherence from ticket timestamps and severity classifications. When they're automated, they're consistent. When they're consistent, you can track trends. When you can track trends, the metrics start doing what metrics are supposed to do.

Manual metric compilation, by contrast, produces numbers that reflect not only the week's data but also the analyst's choices about which query to run. That's not a measurement. It's a reconstruction.