LLMs in the SOC (Part 1) | Why Benchmarks Fail Security Operations Teams

SentinelLABS’ analysis finds that current LLM benchmarks from major vendors reduce continuous, collaborative security work to isolated, static tasks, and therefore fail to measure the operational outcomes defenders need. Results from Microsoft’s ExCyTIn-Bench, Meta’s CyberSOCEval/CyberSecEval 3, and CTIBench show that LLMs struggle with multi-hop investigations and severity calibration, while the benchmarks’ own evaluation loops often rely on vendor models to judge vendor models. #ExCyTIn-Bench #CyberSOCEval

Keypoints

  • Existing LLM benchmarks overwhelmingly use MCQ/static QA formats that compress complex SOC and CTI workflows into single-shot tasks, misaligned with how defenders actually work (illustrated in the first sketch below this list).
  • ExCyTIn-Bench demonstrates that LLMs struggle with multi-hop investigations over realistic, heterogeneous logs; even the top models’ rewards fall well below reliable performance levels.
  • CyberSOCEval and CTIBench show models extract signals from malware sandbox logs and CTI reports but still miss most malware analysis questions and are unreliable on threat-intel reasoning and severity estimation.
  • Many benchmarks use the same or vendor-supplied LLMs to generate and judge test items, creating closed loops that are easy to overfit and hard to trust for governance decisions.
  • Important operational metrics (time-to-detect, time-to-contain, mean time to remediate) and dynamic, multi-source workflows are absent from current evaluations (see the second sketch below).
  • Statistical practices vary: some benchmarks report confidence intervals and drift checks, but others omit variance, contamination analysis, and judge-robustness testing, leaving headline scores fragile (see the third sketch below).
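
To make the first keypoint concrete, here is a minimal Python sketch of the gap between single-shot MCQ grading and a multi-hop investigation loop of the kind ExCyTIn-Bench models. Every name in it (ask_model, run_log_query, the item fields) is a hypothetical placeholder, not an API from any benchmark discussed above.

    def ask_model(prompt: str) -> str:
        """Placeholder for a chat-completion call to the model under test."""
        return "B"  # stub response

    def run_log_query(query: str) -> str:
        """Placeholder for running a query against SIEM-style log tables."""
        return "stub rows"

    def score_mcq(item: dict) -> float:
        # Single-shot: one prompt in, one letter out, exact-match grading.
        answer = ask_model(item["question"] + "\n" + item["choices"])
        return 1.0 if answer.strip().upper() == item["gold"] else 0.0

    def score_multi_hop(incident: dict, max_hops: int = 10) -> float:
        # Multi-hop: the model must chain log queries, carrying intermediate
        # findings forward, before it earns any credit for a final answer.
        findings = []
        for _ in range(max_hops):
            reply = ask_model(incident["question"] + "\nFindings so far: " + repr(findings))
            if reply.startswith("QUERY:"):
                findings.append(run_log_query(reply[len("QUERY:"):]))
            else:
                return 1.0 if reply.strip() == incident["gold"] else 0.0
        return 0.0  # hop budget exhausted without an answer

A model can look competent under score_mcq while failing score_multi_hop badly, which is the pattern the ExCyTIn-Bench results describe.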
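
Second, the operational metrics named above are straightforward to compute once an evaluation logs per-incident timestamps, suggesting their absence is a benchmark-design gap rather than a technical obstacle. A toy computation over invented timestamps:

    from datetime import datetime
    from statistics import mean

    # Hypothetical incident records; real evaluations would log these per run.
    incidents = [
        {"onset": datetime(2025, 1, 1, 9, 0),
         "detected": datetime(2025, 1, 1, 9, 40),
         "contained": datetime(2025, 1, 1, 11, 0)},
        {"onset": datetime(2025, 1, 2, 14, 0),
         "detected": datetime(2025, 1, 2, 16, 30),
         "contained": datetime(2025, 1, 2, 20, 0)},
    ]

    # Mean time to detect / mean time to contain, in minutes.
    mttd = mean((i["detected"] - i["onset"]).total_seconds() for i in incidents) / 60
    mttc = mean((i["contained"] - i["detected"]).total_seconds() for i in incidents) / 60
    print(f"MTTD = {mttd:.0f} min, MTTC = {mttc:.0f} min")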
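
Finally, for the variance question in the last keypoint: a percentile bootstrap over per-item scores is one standard way to attach a confidence interval to a headline accuracy number. The scores below are invented; only the resampling procedure is the point.

    import random

    def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
        """Percentile-bootstrap confidence interval for mean accuracy."""
        means = sorted(
            sum(random.choices(scores, k=len(scores))) / len(scores)
            for _ in range(n_resamples)
        )
        return (means[int(n_resamples * alpha / 2)],
                means[int(n_resamples * (1 - alpha / 2))])

    scores = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20  # hypothetical per-item results
    low, high = bootstrap_ci(scores)
    print(f"accuracy = {sum(scores) / len(scores):.2f}, 95% CI ({low:.2f}, {high:.2f})")

Headline scores reported without something like this, plus contamination checks and judge-robustness tests, are fragile in exactly the way the article argues.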

MITRE Techniques

  • [None] No specific MITRE ATT&CK technique IDs are named in the article – ‘…the model’s willingness to help with cyberattacks mapped to ATT&CK stages.’

Indicators of Compromise

  • [CTI reports] used for threat-intel reasoning in CyberSOCEval and CTIBench – 45 CTI reports referenced as a corpus for evaluation; no specific report filenames provided.
  • [Sandbox detonation logs] used for malware analysis in CyberSOCEval – real sandbox logs are cited as the data source, but no sample file names or hashes are listed.
  • [Vulnerability identifiers / CVE descriptions] used in CTIBench CVSS tasks – CVE descriptions are the input for severity estimation, but no explicit CVE numbers are included in the article.
  • [SIEM/log tables / incident graphs] presented in ExCyTIn-Bench – 57 Sentinel-style tables and 44 days of unified logs underpin the evaluation; no specific log indices or sample entries are shown.
  • [Attack chains / multi-stage incidents] contextual indicators used as evaluation scenarios – ExCyTIn-Bench includes eight multi-stage attacks described at a high level, with no named malware families or IOC values.


Read more: https://www.sentinelone.com/labs/llms-in-the-soc-part-1-why-benchmarks-fail-security-operations-teams/