Keypoints
- Gemini 1.5 Pro supports a context window of up to 1 million tokens, allowing whole decompiled or disassembled files to be analyzed in a single pass without fragmenting the code.
- After automated preprocessing with Hex‑Rays/IDA Pro, decompiled C or full assembly can be fed to the model; the examples were processed in 27–46 seconds.
- The model can interpret code intent (not just patterns), produce human‑readable summaries, and extract IOCs such as filenames, registry keys, mutexes, domains, and network behaviors.
- Case studies: two WannaCry binaries (decompiled C totaling >280k tokens) and an unknown medui.exe (189k tokens) were analyzed end‑to‑end, yielding malicious verdicts and detailed behavior descriptions.
- Gemini handled both decompiled (higher‑level) and disassembled (assembly) outputs, enabling hybrid analysis workflows depending on analyst goals.
- Limitations remain: heavy obfuscation/packing, ever‑growing binary sizes, and evolving attack techniques require robust preprocessing and hybrid static/dynamic context to maximize accuracy.
MITRE Techniques
- [T1046] Network Service Discovery – The malware generates IP addresses and scans the network for hosts listening on port 445/SMB (“…generate IP addresses and perform network scans to find targets on port 445/SMB to spread to other computers…”).
- [T1021.002] Remote Services: SMB/Windows Admin Shares – The sample propagates by targeting SMB/port 445 to move laterally (“…perform network scans to find targets on port 445/SMB to spread to other computers…”).
- [T1055] Process Injection – A binary was identified as a game cheat that injects a DLL into the Grand Theft Auto process (“…inject a game hack dynamic-link library (DLL) into the Grand Theft Auto video game process…”).
- [T1112] Modify Registry – Analysis extracted relevant registry entries used by the malware as part of its behavior/persistence (“…identifies URL/domain (WannaCry’s ‘killswitch’) and relevant registry key and mutex…”).
- [T1562.001] Impair Defenses: Disable or Modify Tools – The medui.exe analysis noted disabling of security software to evade detection (“…evading detection through the disabling of security software…”).
- [T1565] Data Manipulation – The model concluded one sample is designed to hijack Bitcoin transactions, altering transaction flows to steal cryptocurrency (“…steal cryptocurrency by hijacking Bitcoin transactions…”).
Indicators of Compromise
- [Filenames] Discussed samples and IOCs – lhdfrgui.exe (WannaCry dropper), medui.exe, and tasksche.exe (WannaCry cryptor).
- [SHA-256 Hashes] Example hashes from the article – lhdfrgui.exe: 24d004a104d4d54034dbcffc2a4b19a11f39008a575aa614ea04703480b1022c; medui.exe: 719b44d93ab39b4fe6113825349addfe5bd411b4d25081916561f9c403599e50 (and 2 more hashes listed).
- [Network/Port] Scanning and propagation context – scans targeting port 445/SMB to find and infect other hosts.
- [Artifacts] Persistence/identifiers – registry key and mutex referenced as IOCs (specific names not provided in article).
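As a minimal sketch of how indicators like those listed above could be pulled out of a free-text analysis report, the Python snippet below collects SHA-256 hashes and `.exe` filenames with regular expressions. The function name `extract_iocs`, the two patterns, and the sample report string are illustrative assumptions; the article does not describe its own extraction tooling.

```python
import re

# Assumed patterns for illustration: a SHA-256 digest is 64 hex characters,
# and a Windows executable name is a token ending in ".exe".
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")
EXE_RE = re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE)


def extract_iocs(report: str) -> dict:
    """Return de-duplicated hashes and filenames in order of first appearance."""
    hashes = list(dict.fromkeys(h.lower() for h in SHA256_RE.findall(report)))
    files = list(dict.fromkeys(f.lower() for f in EXE_RE.findall(report)))
    return {"sha256": hashes, "filenames": files}


# Sample text assembled from the indicators listed above.
sample_report = (
    "lhdfrgui.exe: 24d004a104d4d54034dbcffc2a4b19a11f39008a575aa614ea04703480b1022c; "
    "medui.exe: 719b44d93ab39b4fe6113825349addfe5bd411b4d25081916561f9c403599e50; "
    "drops tasksche.exe"
)
iocs = extract_iocs(sample_report)
```

Pattern matching of this kind only catches well-formed indicators; registry keys, mutexes, and domains would need their own patterns and manual review.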
The Gemini 1.5 Pro workflow compresses several analyst steps into an automated pipeline: preprocess the binary with a decompiler or disassembler (e.g., Hex‑Rays/IDA Pro), feed the resulting decompiled C or full assembly into the LLM prompt, and request a structured analyst‑style report (verdict, activity list, and IOCs). Because decompiled output is typically 5–10× more concise than raw assembly, decompilation is more efficient and fits more easily within the model's input limit. The system supports both forms, however, and can use a hybrid strategy when low‑level detail is required.
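The pipeline described above can be sketched in Python as follows. This is only an outline under stated assumptions: the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, the prompt wording is invented for illustration, and `ask_model` is a hypothetical caller-supplied function standing in for whatever LLM client is actually used.

```python
from typing import Callable, Optional

CONTEXT_LIMIT_TOKENS = 1_000_000  # context window size cited in the text
CHARS_PER_TOKEN = 4               # rough assumption, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Crude token estimate (ceiling division), used only for budgeting."""
    return -(-len(text) // CHARS_PER_TOKEN)


def choose_input(decompiled: Optional[str], assembly: Optional[str],
                 need_low_level: bool = False) -> str:
    """Prefer decompiled C; use assembly when low-level detail is required
    or when no decompilation is available."""
    if assembly is not None and (need_low_level or decompiled is None):
        chosen = assembly
    elif decompiled is not None:
        chosen = decompiled
    else:
        raise ValueError("no preprocessed output supplied")
    if estimate_tokens(chosen) > CONTEXT_LIMIT_TOKENS:
        raise ValueError("input exceeds the assumed context budget")
    return chosen


def build_prompt(code: str) -> str:
    """Request the three report parts named in the text."""
    return (
        "Act as a malware analyst. Analyze the code below and report:\n"
        "1. Verdict (malicious or benign)\n"
        "2. List of activities\n"
        "3. Indicators of compromise\n\n"
        + code
    )


def analyze(code: str, ask_model: Callable[[str], str]) -> str:
    """`ask_model` is a hypothetical stand-in for the real LLM client."""
    return ask_model(build_prompt(code))
```

Passing the model call in as a parameter keeps the sketch independent of any particular SDK and makes the budgeting and prompt-building steps easy to test on their own.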
In practice, the team processed decompiled WannaCry C files totaling more than 280k tokens and a 1.5 MB assembly output (from a 306.5 KB binary) in single passes, with timings ranging from 27 to 46 seconds. Across the samples, the model produced accurate behavior summaries: it identified ransomware behavior, file IOCs (e.g., c.wnry, tasksche.exe), SMB port‑scanning propagation logic, killswitch domains, registry keys, mutexes, DLL injection activity, and indicators suggesting cryptocurrency transaction hijacking. This demonstrates that a large token window lets the LLM maintain global context across the entire binary rather than working from fragmented snippets.
To operationalize this capability at scale, integrate robust preprocessing (unpacking/deobfuscation, static and dynamic telemetry) and enforce validation steps: corroborate LLM findings with sandbox runs, signature engines, or manual reverse engineering whenever the results affect blocking or remediation. Remaining technical challenges include binaries too large for current token limits, sophisticated packing/obfuscation that requires dynamic unpacking, and continuous attacker adaptation. For higher-confidence detections, combine LLM analysis with toolchains that supply richer contextual data.
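One way to express the corroboration step is as a small decision policy. The Python sketch below is an assumption-laden illustration, not the article's implementation: the `Evidence` fields, the `decide` function, and the rule that blocking requires the LLM verdict plus at least one independent source are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    llm_malicious: bool                       # verdict from the LLM report
    sandbox_malicious: Optional[bool] = None  # None means "not run"
    signature_hit: Optional[bool] = None      # None means "not scanned"
    analyst_confirmed: Optional[bool] = None  # None means "not reviewed"


def decide(e: Evidence) -> str:
    """Return 'block', 'review', or 'allow' under an assumed policy:
    blocking needs the LLM verdict plus one independent confirmation."""
    independent = [e.sandbox_malicious, e.signature_hit, e.analyst_confirmed]
    corroborated = any(v is True for v in independent)
    if e.llm_malicious and corroborated:
        return "block"
    if e.llm_malicious or corroborated:
        return "review"  # sources disagree or corroboration is missing
    return "allow"
```

For example, `decide(Evidence(llm_malicious=True))` returns `"review"` because no independent source has confirmed the verdict yet, which keeps an uncorroborated LLM finding from triggering an automatic block.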
Read more: https://cloud.google.com/blog/topics/threat-intelligence/gemini-for-malware-analysis/