Key Points
- Logs from endpoints, network devices, and applications were aggregated into a large labeled dataset representative of previously investigated incidents.
- Preprocessing converted raw text telemetry into numerical features using TF-IDF with parameter tuning (frequency thresholds, max features, n-gram ranges).
- Random Forest was chosen for classification due to its robustness against overfitting and its ability to handle non-linear, high-dimensional data.
- Model training used standard train/validation/test splits, targeted roughly 99% accuracy, and included manual review for ambiguous cases.
- Deployment relied on continuous monitoring to detect concept drift and required incremental learning to adapt to evolving threats.
- Computational tradeoffs were addressed by tuning TF-IDF and Random Forest hyperparameters to balance accuracy and resource use.
MITRE Techniques
- [T1486] Data Encrypted for Impact: Malware may encrypt files to disrupt access to data.
- [T1041] Exfiltration Over Command and Control Channel: Data may be exfiltrated through established command and control channels.
- [T1105] Ingress Tool Transfer (formerly Remote File Copy): Malicious files may be copied to remote systems for execution.
- [T1059] Command and Scripting Interpreter (formerly Command-Line Interface): Attackers may use command-line interfaces to execute commands on compromised systems.
Indicators of Compromise
- [Domain] Domains used as IoCs across multiple target industries: cartoonplayer[.]com, microsoft.msonedriver[.]com, and 24 other domains observed in KSN findings
Prepare a high-quality labeled dataset representative of known incidents before model design: collect telemetry from endpoints, networks, and applications; manually verify automatically gathered indicators to reduce noise; and split the data into training, validation, and test sets to measure generalization. Apply robust preprocessing to text telemetry: clean it, handle missing values, then vectorize it with TF-IDF, tuning parameters such as minimum and maximum term frequency thresholds, the maximum number of features, and n-gram ranges to limit sparsity and capture relevant token sequences.
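The split and vectorization steps above can be sketched with scikit-learn. This is a minimal illustration, not the article's actual pipeline: the log lines, labels, the 60/20/20 split ratio, and every parameter value are hypothetical, and scikit-learn's `min_df`/`max_df` (document-frequency thresholds) are assumed to be a reasonable stand-in for the frequency thresholds mentioned in the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy telemetry; label 1 = malicious, 0 = benign.
logs = [
    "powershell -enc downloads payload from remote host",
    "cmd.exe /c vssadmin delete shadows /all /quiet",
    "powershell -enc connects to remote host over https",
    "cmd.exe /c whoami /all executed by unknown parent",
    "user login success from corporate workstation",
    "scheduled backup job completed without errors",
    "user login success from vpn gateway",
    "scheduled antivirus scan completed without errors",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Assumed 60/20/20 split into training, validation, and test sets,
# stratified so each set keeps the same class balance.
X_train, X_rest, y_train, y_rest = train_test_split(
    logs, labels, test_size=0.4, stratify=labels, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Illustrative values only; in practice these would be tuned.
vectorizer = TfidfVectorizer(
    min_df=2,            # drop terms seen in fewer than 2 documents
    max_df=0.9,          # drop terms seen in more than 90% of documents
    max_features=5000,   # cap vocabulary size to limit sparsity
    ngram_range=(1, 2),  # unigrams and bigrams
)

# Fit the vocabulary on training data only, then reuse it elsewhere
# so that validation and test data do not leak into the features.
train_matrix = vectorizer.fit_transform(X_train)
val_matrix = vectorizer.transform(X_val)
test_matrix = vectorizer.transform(X_test)
```

Fitting the vectorizer on the training split alone keeps the validation and test measurements honest, since the vocabulary and IDF weights never see held-out data.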
Train a Random Forest classifier on the TF-IDF features because it handles non-linear relationships and is less prone to overfitting than a single decision tree; tune hyperparameters (number of estimators, maximum tree depth, minimum samples per split and leaf, impurity criterion) with cross-validation and holdout validation to reach the target performance (the project aimed for ~99% accuracy). Where required, add a manual triage step for borderline predictions. Monitor computational cost: TF-IDF can create large sparse matrices, and Random Forest grows expensive with many trees and high dimensionality, so balance model complexity against available compute and use feature selection or dimensionality limits when necessary.
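One possible shape for the tuning and triage steps, again using scikit-learn. The training data, the small hyperparameter grid, the F1 scoring choice, the `triage` helper, and its 0.3/0.7 probability thresholds are all assumptions made for illustration; the source does not specify any of them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy training data; label 1 = malicious, 0 = benign.
train_logs = [
    "powershell -enc downloads payload from remote host",
    "cmd.exe /c vssadmin delete shadows /all /quiet",
    "powershell -enc connects to remote host over https",
    "cmd.exe /c whoami /all executed by unknown parent",
    "user login success from corporate workstation",
    "scheduled backup job completed without errors",
    "user login success from vpn gateway",
    "scheduled antivirus scan completed without errors",
] * 5
train_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Deliberately small grid; larger grids multiply training cost.
param_grid = {
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [None, 5],
    "forest__min_samples_leaf": [1, 2],
    "forest__criterion": ["gini", "entropy"],
}

# 3-fold cross-validation over every grid combination.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(train_logs, train_labels)


def triage(model, log_lines, low=0.3, high=0.7):
    """Route predictions between the two thresholds to manual review."""
    probabilities = model.predict_proba(log_lines)[:, 1]
    decisions = []
    for line, p in zip(log_lines, probabilities):
        if p >= high:
            verdict = "malicious"
        elif p <= low:
            verdict = "benign"
        else:
            verdict = "manual_review"
        decisions.append((line, float(p), verdict))
    return decisions


results = triage(
    search.best_estimator_,
    ["powershell -enc contacts remote host",
     "user login success from office laptop"],
)
```

The final check against a held-out test set is omitted here for brevity; the selected model would normally be scored once on data that played no part in the grid search.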
Deploy models with continuous monitoring for concept drift and automated metrics tracking; implement incremental learning or periodic retraining to incorporate new labeled data and keep the model effective as threats evolve. Improve interpretability by extracting feature importances, visualizing decision trees for selected estimators, and logging examples that trigger high-impact decisions to support analyst review. Combine these practices (curated datasets, careful preprocessing, tuned Random Forest models, resource-aware deployment, and ongoing retraining) to maintain robust, scalable threat detection from large log volumes.
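As a rough illustration of the monitoring idea, the sketch below tracks accuracy over a rolling window of analyst-confirmed verdicts and flags when retraining may be due. It is a simple stand-in for drift detection, not the method used in the article; the `DriftMonitor` class, the window size, and the thresholds are hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Rolling accuracy over the most recent analyst-confirmed verdicts."""

    def __init__(self, window_size=200, min_accuracy=0.95, min_samples=50):
        # Old outcomes fall off automatically once the window is full.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy
        self.min_samples = min_samples

    def record(self, predicted, confirmed):
        """Store whether the model's label matched the confirmed label."""
        self.outcomes.append(predicted == confirmed)

    def accuracy(self):
        """Return the windowed accuracy, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag retraining only once enough confirmed samples exist."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.min_accuracy


monitor = DriftMonitor(window_size=100, min_accuracy=0.9, min_samples=20)

# 50 confirmed-correct predictions: accuracy stays at 1.0.
for _ in range(50):
    monitor.record(predicted=1, confirmed=1)
stable = monitor.needs_retraining()

# 50 confirmed-wrong predictions: windowed accuracy falls to 0.5.
for _ in range(50):
    monitor.record(predicted=1, confirmed=0)
drifted = monitor.needs_retraining()
```

A falling windowed accuracy is only one possible drift signal; it depends on analysts confirming a steady sample of predictions, so the window size should reflect how quickly verified labels arrive.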
Read more: https://securelist.com/machine-learning-in-threat-hunting/114016/