Keypoints
- Proof-of-concept intercepts live calls as a man-in-the-middle and processes audio with speech-to-text, an LLM, and text-to-speech using cloned voices.
- Attack uses keyword-triggered replacements (demo used “bank account”) so only the sensitive token is altered while preserving surrounding context (a trigger-detection sketch follows this list).
- Possible delivery vectors include malware on victims’ phones, compromised or malicious VoIP services, or social-engineered calls that bridge two victims.
- The PoC workflow: capture audio → speech-to-text → LLM decides whether to modify → if modified, synthesize cloned-voice audio; otherwise replay original audio.
- Implementation is simple today: three seconds of recorded voice can enable convincing cloning; the main practical limits are API/GPU latency and voice-clone fidelity.
- Latency was mitigated in the demo with bridging phrases; sufficient local GPU resources can reduce latency to near real-time, increasing realism and scalability.
- Impacts include financial fraud (redirected payments), altered medical or operational instructions, and broader risks if applied to live video streams.
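To make the keyword-trigger idea concrete, here is a minimal Python sketch. It is not the PoC's actual code: the real demo delegated the modify/keep decision to an LLM, whereas a plain regex stands in for that decision here, and `ATTACKER_ACCOUNT` is a hypothetical placeholder.

```python
import re

# Hypothetical attacker-supplied value; the PoC swapped in the attacker's
# account details while leaving the rest of the sentence intact.
ATTACKER_ACCOUNT = "0000 0000"

# Trigger phrase from the demo; all other speech passes through untouched.
# The actual PoC used an LLM for this decision; the regex is an
# illustrative stand-in for "swap only the sensitive token".
PATTERN = re.compile(
    r"(bank account (?:number )?(?:is )?)([\d\s\-]+)", re.IGNORECASE
)

def rewrite_if_triggered(transcript: str) -> tuple[bool, str]:
    """Return (modified, text): replace only the digits after the trigger
    phrase and keep the surrounding sentence verbatim."""
    rewritten, n = PATTERN.subn(
        lambda m: m.group(1) + ATTACKER_ACCOUNT, transcript
    )
    return n > 0, rewritten

# Example: only the digits change; the surrounding context is preserved.
print(rewrite_if_triggered("Sure, my bank account number is 1234 5678."))
# -> (True, 'Sure, my bank account number is 0000 0000.')
```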
MITRE Techniques
- [T1557] Adversary-in-the-Middle – Intercepted and altered live VoIP/audio streams to modify spoken content on the fly (‘Our program acts as a man-in-the-middle, monitoring a live conversation.’)
- [T1566] Phishing – Identified as a likely initial compromise vector to obtain device access (‘Phishing … remain attackers’ top threat vectors of choice’)
- [T1078] Valid Accounts – Use of compromised credentials to gain access to victims’ devices or services (‘using compromised credentials remain attackers’ top threat vectors of choice’)
- [T1190] Exploit Public-Facing Application – Vulnerability exploitation of apps or VoIP services as an attack vector (‘vulnerability exploitation’)
Indicators of Compromise
- [Domain] PoC artifact hosting – securityintelligence.com (hosts the demonstration videos and article content)
- [MP4 video files] Demonstration audio samples – https://securityintelligence.com/wp-content/uploads/2024/02/LLMs-AudioJacking-Video_hijacked_short-1.mp4, https://securityintelligence.com/wp-content/uploads/2024/02/LLMs-AudioJacking-Video_original-call-1.mp4
- [URL] Source report – https://securityintelligence.com/posts/using-generative-ai-distort-live-audio-transactions/
The technical procedure chains four core components: audio capture, speech-to-text (STT), a large language model (LLM) for context-aware editing decisions, and text-to-speech (TTS) with a cloned voice to replay modified utterances. In the PoC, the program continuously captures microphone/VoIP audio, sends short segments to STT, and passes the transcript plus recent context to the LLM, which follows explicit rules (the article includes an example LLM prompt) to decide whether to modify a detected sensitive token (the demo targeted “bank account”). If the LLM flags a modification, the system synthesizes the altered sentence with a pre-cloned voice and plays it back in place of the original audio; otherwise it forwards the untouched audio stream.
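A compact sketch of that loop appears below. It assumes a chunked audio feed and uses trivial stand-ins (`transcribe`, `llm_decide`, `synthesize_cloned`) where a real implementation would call STT, LLM, and cloned-voice TTS services; none of these names come from the PoC.

```python
import queue

def transcribe(segment: bytes) -> str:
    """Stand-in for a real speech-to-text call."""
    return segment.decode(errors="ignore")

def llm_decide(text: str, context: list[str]) -> str | None:
    """Stand-in for prompting an LLM with explicit rules; returns the
    rewritten utterance, or None when nothing should change."""
    if "bank account" in text.lower():
        return text.replace("1234 5678", "0000 0000")  # hypothetical swap
    return None

def synthesize_cloned(text: str) -> bytes:
    """Stand-in for cloned-voice TTS synthesis."""
    return text.encode()

def run_pipeline(audio_in: queue.Queue, audio_out: queue.Queue) -> None:
    """Man-in-the-middle loop: forward audio untouched unless the LLM flags
    a sensitive utterance, in which case inject cloned-voice audio instead."""
    context: list[str] = []                  # recent transcript for LLM context
    while True:
        segment = audio_in.get()             # short captured chunk (bytes)
        if segment is None:                  # sentinel: end of call
            break
        text = transcribe(segment)
        context.append(text)
        rewrite = llm_decide(text, context[-5:])
        if rewrite is None:
            audio_out.put(segment)                     # replay original audio
        else:
            audio_out.put(synthesize_cloned(rewrite))  # inject cloned speech
```

Feeding the loop a chunk containing the trigger phrase yields injected audio; every other chunk passes through byte-for-byte, which is what keeps the rest of the conversation sounding authentic.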
Deployment options include implanting malware on endpoints to access microphones, compromising VoIP providers or endpoints to insert a man-in-the-middle, or socially engineering sessions that connect two victims through an attacker-controlled bridge. Practical constraints observed were processing latency (API/GPU delays) and clone fidelity: latency was masked with bridging phrases in the demo, while higher-quality cloning (capturing tone and timing) and local GPU capacity reduce timing artifacts and increase believability. The provided pseudocode and LLM instruction snippet illustrate the minimal logic required: detect, decide (modified vs. not), and either synthesize-and-play or replay the original audio.
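The bridging-phrase mitigation can be sketched like this: start synthesis in the background and, if it overruns a small latency budget, play a pre-synthesized filler in the cloned voice first. The 0.5 s budget and the `synthesize`/`play` callables are assumptions for illustration, not measured values from the demo.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as SynthTimeout

LATENCY_BUDGET_S = 0.5  # assumed tolerable gap before silence sounds unnatural
BRIDGE_AUDIO = b"<pre-synthesized 'sure, one second' in the cloned voice>"

_executor = ThreadPoolExecutor(max_workers=1)

def speak_with_bridging(rewrite: str, synthesize, play) -> None:
    """Run TTS in the background; if it overruns the budget, cover the
    delay with a bridging phrase (as the demo did), then play the result."""
    future = _executor.submit(synthesize, rewrite)
    try:
        audio = future.result(timeout=LATENCY_BUDGET_S)
    except SynthTimeout:
        play(BRIDGE_AUDIO)       # mask the synthesis latency
        audio = future.result()  # block until the cloned audio is ready
    play(audio)
```

With enough local GPU capacity the synthesis path usually finishes inside the budget, so the filler is rarely needed and the exchange approaches real time.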
Operational detection opportunities include monitoring for unusual local GPU usage consistent with on-device generative models, anomalous network calls to STT/TTS/LLM endpoints, and signs of initial compromise on VoIP services or endpoints (phishing, exploited vulnerabilities, stolen credentials). Mitigations center on preventing initial compromise (patching, phishing defenses, credential hygiene), validating sensitive spoken details by paraphrasing or repeating them back, and applying audio-deepfake detection where available.
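As one concrete form of the network-telemetry idea, a defender could scan DNS or proxy logs for lookups of hosted STT/TTS/LLM endpoints from unexpected hosts. The watchlist and the one-entry-per-line log format below are illustrative assumptions, not a vetted indicator set.

```python
# Illustrative watchlist; real API hostnames vary by provider and deployment.
WATCHLIST = {
    "api.openai.com",      # hosted LLM/STT endpoints
    "api.elevenlabs.io",   # hosted voice-cloning TTS
}

def flag_generative_ai_lookups(dns_log_lines):
    """Yield (client, domain) for lookups that match the watchlist.
    Assumes a simple '<client> <queried-domain>' per-line log format."""
    for line in dns_log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        client, domain = parts[0], parts[1].lower().rstrip(".")
        if any(domain == w or domain.endswith("." + w) for w in WATCHLIST):
            yield client, domain

# Example: flags the second line only.
log = ["10.0.0.5 example.com", "10.0.0.9 api.elevenlabs.io"]
print(list(flag_generative_ai_lookups(log)))  # [('10.0.0.9', 'api.elevenlabs.io')]
```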
Read more: https://securityintelligence.com/posts/using-generative-ai-distort-live-audio-transactions/