“Revolutionizing Threat Hunting: Machine Learning as the Key to Uncovering Hidden Threats”

Kaspersky applied machine learning to Kaspersky Security Network (KSN) logs, using TF–IDF for text vectorization and Random Forest classifiers, to detect thousands of previously unknown advanced threats and new IoCs. Their process stresses careful dataset curation, preprocessing, hyperparameter tuning, deployment with monitoring for concept drift, and incremental learning to keep models accurate. #Kaspersky #RandomForest

Keypoints

  • Logs from endpoints, network devices, and applications were aggregated into a large labeled dataset representative of previously investigated incidents.
  • Preprocessing converted raw text telemetry into numerical features using TF–IDF with parameter tuning (frequency thresholds, max features, n‑gram ranges).
  • Random Forest was chosen for classification due to robustness against overfitting and ability to handle non‑linear, high‑dimensional data.
  • Model training used standard splits (train/validation/test), aimed for ~99% target accuracy, and included manual review for ambiguous cases.
  • Deployment relied on continuous monitoring to detect concept drift and required incremental learning to adapt to evolving threats.
  • Computational tradeoffs were addressed by tuning TF–IDF and Random Forest hyperparameters to balance accuracy and resource use.

MITRE Techniques

  • [T1486] Data Encrypted for Impact – Malware may encrypt files to disrupt access to data.
  • [T1041] Exfiltration Over Command and Control Channel – Data may be exfiltrated through established command and control channels.
  • [T1105] Ingress Tool Transfer (formerly Remote File Copy) – Malicious files may be copied to remote systems for execution.
  • [T1059] Command and Scripting Interpreter (formerly Command-Line Interface) – Attackers may use command-line interfaces to execute commands on compromised systems.

Indicators of Compromise

  • [Domain] domains used as IoCs across multiple target industries – cartoonplayer[.]com, microsoft.msonedriver[.]com, and 24 other domains observed in KSN findings

Prepare a high-quality labeled dataset representative of known incidents before model design: collect telemetry from endpoints, networks, and applications; perform manual verification of automatically gathered indicators to reduce noise; and split data into training, validation, and test sets to measure generalization. Apply robust preprocessing for text telemetry—cleaning, handling missing values, then vectorizing with TF–IDF while tuning parameters such as minimum/maximum term frequency thresholds, maximum number of features, and n‑gram ranges to limit sparsity and capture relevant token sequences.
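
The split and vectorization steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not Kaspersky's actual pipeline: the log lines, labels, and parameter values are invented for the example, and the source does not say which library or settings were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical, hand-written log lines standing in for real telemetry.
logs = [
    "powershell -enc download payload from remote host",
    "powershell -enc run hidden script",
    "cmd.exe /c vssadmin delete shadows",
    "rundll32 load dll from temp folder",
    "user login success from office workstation",
    "scheduled backup completed successfully",
    "browser update installed successfully",
    "user login success from vpn gateway",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malicious, 0 = benign (illustrative)

# Hold out a test set first, then carve a validation set from the remainder.
# Integer test_size values are absolute sample counts.
X_rest, X_test, y_rest, y_test = train_test_split(
    logs, labels, test_size=2, stratify=labels, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2, stratify=y_rest, random_state=0
)

# Assumed parameter values; in practice these are tuned against the
# validation set to trade vocabulary size (sparsity) for signal.
vectorizer = TfidfVectorizer(
    lowercase=True,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=1,             # drop terms seen in fewer than 1 document
    max_df=0.9,           # drop terms seen in more than 90% of documents
    max_features=5000,    # cap the vocabulary size
)

# Fit the vocabulary on training data only, then reuse it for the other
# splits so no information leaks from validation or test data.
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

print(X_train_vec.shape, X_val_vec.shape, X_test_vec.shape)
```

Fitting the vectorizer on the training split alone is a design choice worth keeping in any real pipeline: a vocabulary learned from all data would make validation and test scores look better than they are.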

Train a Random Forest classifier on TF–IDF features because it handles non‑linear relationships and, by averaging many trees, is less prone to overfitting than a single decision tree; tune hyperparameters (number of estimators, max tree depth, min samples split/leaf, impurity criteria) with cross‑validation and holdout validation to reach target performance (the project aimed for ~99% accuracy). Where required, add a manual triage step for borderline predictions. Monitor computational cost: TF–IDF can create large sparse matrices, and Random Forest training grows expensive with many trees and high dimensionality, so balance model complexity against available compute and use feature selection or dimensionality limits when necessary.
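
One possible way to wire up the tuning and triage steps described above is sketched below with scikit-learn. The sample logs, the parameter grid, and the `triage` helper with its 0.3/0.7 thresholds are all assumptions made for illustration; the source does not publish the actual grid or triage rule.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labeled log lines (1 = malicious, 0 = benign).
train_logs = [
    "powershell -enc download payload from remote host",
    "powershell -enc run hidden script",
    "cmd.exe /c vssadmin delete shadows quiet",
    "rundll32 load dll from temp folder",
    "certutil urlcache download payload to temp folder",
    "powershell hidden window download script from remote host",
    "user login success from office workstation",
    "scheduled backup completed successfully",
    "browser update installed successfully",
    "user login success from vpn gateway",
    "printer driver update installed successfully",
    "scheduled antivirus scan completed successfully",
]
train_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting TF-IDF inside the pipeline refits the vocabulary on each
# cross-validation fold, so held-out folds never influence it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Assumed search space covering the hyperparameters named in the text.
param_grid = {
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [None, 10],
    "forest__min_samples_leaf": [1, 2],
    "forest__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(train_logs, train_labels)
print("best parameters:", search.best_params_)


def triage(probability_malicious, low=0.3, high=0.7):
    """Route borderline scores to an analyst (thresholds are illustrative)."""
    if probability_malicious >= high:
        return "malicious"
    if probability_malicious <= low:
        return "benign"
    return "manual_review"


new_logs = [
    "powershell -enc download script from remote host",
    "scheduled backup completed successfully",
]
# Column 1 of predict_proba is the probability of class 1 (malicious).
scores = search.predict_proba(new_logs)[:, 1]
decisions = [triage(score) for score in scores]
print(list(zip(new_logs, decisions)))
```

A larger grid multiplies training cost quickly (every combination is fit once per fold), which is the compute tradeoff the paragraph above warns about; a randomized search over the same space is a common cheaper alternative.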

Deploy models with continuous monitoring for concept drift and automated metrics tracking; implement incremental learning or periodic retraining to incorporate new labeled data and keep the model accurate as threats evolve. Improve interpretability by extracting feature importance, visualizing decision trees for selected estimators, and logging examples that trigger high‑impact decisions to support analyst review. Combined, these practices (curated datasets, careful preprocessing, tuned Random Forest models, resource-aware deployment, and ongoing retraining) maintain robust, scalable threat detection from large log volumes.
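
The source does not describe how drift was measured. As one simple possibility, the sketch below tracks rolling accuracy over analyst-confirmed verdicts and flags drift when it falls noticeably below the accuracy measured at deployment. The `DriftMonitor` class and all of its thresholds are hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Rolling-accuracy check over analyst-confirmed predictions (a sketch)."""

    def __init__(self, baseline_accuracy, window_size=200,
                 tolerance=0.05, min_samples=50):
        self.baseline_accuracy = baseline_accuracy  # accuracy at deployment
        self.tolerance = tolerance                  # allowed drop before alerting
        self.min_samples = min_samples              # avoid alerts on tiny samples
        # Only the most recent window_size outcomes are kept.
        self._outcomes = deque(maxlen=window_size)

    def record(self, predicted, confirmed):
        """Store whether the model's verdict matched the analyst's verdict."""
        self._outcomes.append(predicted == confirmed)

    def rolling_accuracy(self):
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def drift_detected(self):
        if len(self._outcomes) < self.min_samples:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.99, window_size=100,
                       tolerance=0.05, min_samples=20)

# Simulated feedback: the model agrees with analysts at first...
for _ in range(100):
    monitor.record(predicted=1, confirmed=1)
print("drift after stable period:", monitor.drift_detected())

# ...then starts missing a new kind of threat.
for _ in range(30):
    monitor.record(predicted=0, confirmed=1)
print("rolling accuracy:", monitor.rolling_accuracy())
print("drift after degradation:", monitor.drift_detected())
```

A drift flag like this would typically trigger the retraining step mentioned above; checks on the input distribution (for example, the share of tokens missing from the TF–IDF vocabulary) can complement it when confirmed labels arrive slowly.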

Read more: https://securelist.com/machine-learning-in-threat-hunting/114016/