The 10th International Congress on Information and Communication Technology, held in conjunction with the ICT Excellence Awards (ICICT 2025), will take place in London, United Kingdom | February 18–21, 2025.
Authors - Junichiro Ando, Satoshi Okada, Takuho Mitsunaga Abstract - Large Language Models (LLMs) such as ChatGPT and Claude have demonstrated exceptional capabilities in content generation but remain vulnerable to adversarial jailbreak attacks that bypass safety mechanisms to produce harmful content. This study introduces a novel jailbreak method targeting Autodefense, a multi-agent defense framework designed to detect and mitigate such attacks. By combining obfuscation techniques with the injection of harmless plaintext, our proposed method achieved a high attack success rate (ASR) of up to 95.3% across different obfuscation methods, a significant increase over the 7.95% ASR observed without our method. Our experiments demonstrate the effectiveness of the proposed method in bypassing the Autodefense system.