Token Injection: Crashing LLM Inference With Special Tokens

Presented at Black Hat Europe 2025

As large language models (LLMs) are deployed at scale, their underlying inference frameworks (e.g., vLLM, SGLang, TensorRT-LLM) have become critical operational pillars. These systems must splice user prompts into control structures, tokenise them, and schedule requests within milliseconds. Within this high-speed pipeline, we identify an underappreciated attack surface: special tokens. We introduce the first "Token Injection" attack model, showing how a single prompt composed solely of special tokens can trigger uncaught exceptions in the embedding and CUDA computation stages, resulting in denial of service (DoS) or full-service crashes. The same technique enables inference manipulation, such as chat interruption and context pollution. The attack requires no authentication and works via standard input interfaces, affecting both self-hosted and managed deployments. We validate its impact across multiple inference frameworks (vLLM, SGLang, TensorRT-LLM, MLX, Ollama, and Hugging Face TGI) and across major platforms, including NVIDIA NIM, Google Vertex AI, Azure AI Foundry, Hugging Face, Meta AI, and OpenRouter. This work shifts the AI security focus from "model output" to the security of inference infrastructure, offering practitioners a new perspective and a concrete defence paradigm.
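To make the attack shape concrete, the sketch below builds an OpenAI-compatible chat request whose user content consists solely of special tokens, and a matching input-sanitisation filter of the kind the defence paradigm implies. This is a minimal illustration, not the talk's actual payload: the token strings (`<|endoftext|>`, `<|im_start|>`, `<|im_end|>`) are common examples from GPT-2- and ChatML-style tokenizers, and the blocklist approach is an assumed mitigation.

```python
import json

# Example special/control tokens from common tokenizers (assumption:
# the real payloads used in the talk are not public).
SPECIAL_TOKENS = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]

def build_payload(model: str = "example-model") -> str:
    """Build an OpenAI-compatible chat request whose user message is
    nothing but special tokens, exercising the tokenizer/template path
    that the abstract identifies as the attack surface."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "".join(SPECIAL_TOKENS)}],
    }
    return json.dumps(body)

def strip_special_tokens(text: str) -> str:
    """Hypothetical defence: remove known special-token strings from
    untrusted user input before it reaches the chat template."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return text

if __name__ == "__main__":
    print(build_payload())
    print(strip_special_tokens("hello <|endoftext|> world"))
```

In practice, many frameworks expose a safer equivalent of this filter, e.g. tokenising user input with special-token parsing disabled so these strings are treated as literal text rather than control tokens.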