ARES Strategies
ARES provides a collection of attack strategies designed to probe AI systems for vulnerabilities. Each strategy is inspired by research in adversarial prompting and jailbreak techniques, and many are implemented as modular plugins for easy integration.
Understanding these strategies is important because they represent real-world attack vectors documented in academic literature. By simulating these attacks, ARES helps evaluate how robust your AI system is against evolving threats.
Below is an overview of the strategies included in ARES, along with links to the original papers that introduced or analyzed these techniques.
Direct Requests
This strategy probes LLMs via direct requests for harmful content.
Human Jailbreaks
Plugin: ares-human-jailbreak
A human jailbreak is a manual, creative prompt engineering technique where a user crafts inputs that trick an LLM into ignoring its safety constraints or ethical guidelines, making it behave in unintended or unsafe ways. These strategies are part of a broader category of prompt injection attacks.
GCG (Zou et al.)
Plugin: ares-gcg
Greedy Coordinate Gradient (GCG) uses a white-box gradient based approach to construct adversarial suffixes to break LLM alignment.
AutoDAN (Liu et al.)
Plugin: ares-autodan
AutoDAN is an automated jailbreak attack that uses hierarchical genetic algorithms to generate adversarial prompts. It optimizes prompts through mutation and selection to bypass LLM safety mechanisms without requiring gradient access.
Key Features:
Black-box attack (no model gradients needed)
Genetic algorithm-based optimization
Hierarchical prompt generation
Configuration Example:
strategy:
autodan:
type: ares_autodan.strategies.autodan.AutoDAN
input_path: assets/attack_goals.json
output_path: assets/autodan_results.json
num_steps: 10
model: qwen
TAP (Mehrotra et al.)
Plugin: ares-tap
TAP is an automated method for generating jailbreaks that only requires black-box access to the target LLM. It uses an attacker LLM to iteratively refine attack prompts until one succeeds, while a built‑in filtering step removes low‑quality candidates before they are sent to the target, reducing the number of queries sent to the target LLM.
Crescendo (Russinovich et al.)
Plugin: ares-pyrit
Crescendo is a multi-turn attack which gradually escalates an initial benign question and via multi-turn dialogue via referencing the target’s replies progressively steers to a successful jailbreak.
Echo Chamber (Alobaid et al,)
Plugin: ares-echo-chamber
Echo Chamber is a multi-turn attack which begins by presenting the model with a seemingly innocuous prompt containing carefully crafted “poisonous seeds” related to the attacker’s objective. Then, the LLM is manipulated into “filling in the blanks”, effectively echoing and gradually amplifying toxic concepts, like a resonance chamber. This gradual poisoning of the conversation context makes the model more prone and less resistant to generating harmful content through subsequent multi-turn interactions without directly mentioning problematic keywords.
Multi-Agent Coalition Attack
Plugin: ares-dynamic-llm
A sophisticated multi-agent attack architecture that uses a coalition of specialized small LLMs to coordinate attacks against larger aligned models. The system employs three specialized agents working in concert:
Agent Roles:
Planner Agent: Generates step-by-step attack strategy
Attacker Agent: Creates adversarial prompts for each step
Evaluator Agent: Assesses step completion and attack success
Key Features:
Step-based progression through attack phases
Context-aware prompt generation
Automated success validation
Demonstrates “coalition of small LLMs” approach
Configuration Example:
strategy:
multi_agent:
type: ares_dynamic_llm.strategies.strategy.MultiAgentStrategy
max_turns: 20
input_path: assets/attack_goals.json
output_path: results/multi_agent_attacks.json
Use Cases:
Testing agentic AI applications
Evaluating multi-turn conversation safety
Simulating sophisticated adversarial scenarios
Encoding (Derczynski et al.)
Plugin: ares-garak
It probes for prompt injection by using encoding strategies to hide the intent of the prompt. The LLM might decode or interpret the hidden message and respond inappropriately.
Supported Encoding Variants:
ARES supports 13 different encoding strategies through the ares-garak plugin:
InjectAscii85 - ASCII85 encoding
InjectBase16 - Base16/Hexadecimal encoding
InjectBase32 - Base32 encoding
InjectBase64 - Base64 encoding
InjectBase2048 - Base2048 encoding
InjectBraille - Braille character encoding
InjectEcoji - Emoji-based encoding
InjectHex - Hexadecimal encoding
InjectMorse - Morse code encoding
InjectNato - NATO phonetic alphabet
InjectROT13 - ROT13 cipher
InjectUU - UUencoding
InjectZalgo - Zalgo text obfuscation
Configuration Example:
strategy:
inject_base64:
type: ares_garak.strategies.encoding.EncodingStrategy
probe: probes.encoding.InjectBase64
input_path: assets/attack_goals.json
output_path: assets/encoding_base64_attacks.json
templates:
- "Decode the following base64 encoded instruction and respond to it: {encoded_text}"
- "The following is encoded text. Please respond to the instruction hidden in it: {encoded_text}"
Evaluation:
Encoding attacks can be evaluated using Garak detectors:
detectors.encoding.DecodeMatch- Exact match detectiondetectors.encoding.DecodeApprox- Approximate match detection