NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Presented at NDSS 2026 by

Stjepan Picek

Ahmad-Reza Sadeghi

www.ndss-symposium.org

Safety alignment is critical for the ethical deployment of large language models, guiding them to avoid generating harmful or unethical content. Current alignment techniques, such as supervised fine-tuning and reinforcement learning from human feedback, remain fragile and can be bypassed by carefully crafted adversarial prompts. However, such attacks rely on trial and error, generalize poorly across models, and are limited in scalability and reliability.