The 10th International Congress on Information and Communication Technology (ICICT 2025), concurrent with the ICT Excellence Awards, will be held in London, United Kingdom | February 18 - 21, 2025.
Friday February 21, 2025 9:30am - 11:00am GMT

Authors - Junichiro Ando, Satoshi Okada, Takuho Mitsunaga
Abstract - Large Language Models (LLMs) like ChatGPT and Claude have demonstrated exceptional capabilities in content generation but remain vulnerable to adversarial jailbreak attacks that bypass safety mechanisms to elicit harmful content. This study introduces a novel jailbreak method targeting Autodefense, a multi-agent defense framework designed to detect and mitigate such attacks. By combining obfuscation techniques with the injection of harmless plaintext, the proposed method achieved an attack success rate (ASR) of up to 95.3% across different obfuscation methods, a significant increase over the 7.95% ASR without the proposed method. Our experiments demonstrate the effectiveness of the proposed method in bypassing the Autodefense system.
Virtual Room E London, United Kingdom