Safeguard for Human-AI Conversations." Meta AI Research, 2023.
2. Meta. "Prompt Guard." Hugging Face, 2024.
3. Deng, Y. et al. "Multilingual Jailbreak Challenges in Large Language Models." ICLR 2024.
4. National Institute of Informatics, Research and Development Center for Large Language Models. "AnswerCarefully Dataset." 2024.
5. Wang, Y. et al. "Do-Not-Answer: Evaluating Safeguards in LLMs." Findings of the Association for Computational Linguistics: EACL 2024, pages 896-911, 2024.
6. Anthropic. "Many-shot jailbreaking." Anthropic Research, 2024.
7. Gandhi, S. et al. "Gandalf the Red: Adaptive Security for LLMs." arXiv preprint arXiv:2501.07927, 2025.
8. Shen, X. et al. "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." ACM Conference on Computer and Communications Security (CCS), 2023.