Anthropic's Many-Shot Jailbreak and Constitutional Classifiers
Anthropic has identified a vulnerability in large language models (LLMs) termed "many-shot jailbreaking." The technique presents the model with a long series of fabricated dialogues in which an AI assistant appears to provide harmful or unethical responses. By flooding the prompt with such examples, attackers can override the model's safety training and induce it to generate outputs it would normally refuse. The exploit takes advantage of the expanded context windows of modern LLMs: the more fabricated dialogues the context can hold, the more effective the attack becomes. Anthropic's research underscores the need for stronger safeguards and for collaboration across the AI community to address this emerging threat.

As a countermeasure, Anthropic has developed Constitutional Classifiers, a system of classifiers that screen model inputs and outputs against a "constitution" of allowed and disallowed content, and has announced $15,000 rewards for anyone who can jailbreak it.
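To make the mechanism concrete, the sketch below shows only the structural idea: many fabricated user/assistant turns are concatenated into one long prompt ahead of the attacker's real question, exploiting the room a large context window provides. The function name and placeholder dialogues are illustrative assumptions, not taken from Anthropic's paper, and the example content is deliberately benign.

```python
# Minimal sketch of the many-shot prompt structure described above.
# The fabricated dialogues here are harmless placeholders; the attack
# Anthropic describes substitutes dialogues showing an assistant
# complying with harmful requests and relies on sheer volume.

def build_many_shot_prompt(faux_dialogues, final_question):
    """Concatenate many fabricated user/assistant turns into one prompt.

    faux_dialogues: list of (user_turn, assistant_turn) pairs.
    final_question: the query the attacker actually wants answered.
    """
    shots = [
        f"User: {user_turn}\nAssistant: {assistant_turn}"
        for user_turn, assistant_turn in faux_dialogues
    ]
    # A long context window lets an attacker pack in hundreds of such shots.
    return "\n\n".join(shots) + f"\n\nUser: {final_question}\nAssistant:"


# Example with benign placeholder content, just to show the shape and scale:
dialogues = [
    ("How do I boil an egg?", "Place the egg in boiling water for 7 minutes.")
] * 256
prompt = build_many_shot_prompt(dialogues, "How do I poach an egg?")
print(len(prompt), "characters of context consumed by the faux dialogues")
```

The key point the sketch illustrates is that the attack needs no special tooling: it is ordinary prompt construction, scaled up to fill the context window, which is why defenses have focused on classifying inputs and outputs rather than on the prompt format itself.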
Try it here: Constitutional Classifiers
