agents.fail / arsenal
A structured map of how prompt injection attacks work: the objectives attackers chase, the techniques that bend a model, the evasions that slip past filters, and the inputs that carry a payload in. Every node has a citable code; search or filter to drill in.
Source & credit
Based on the Arcanum Prompt Injection Taxonomy by Jason Haddix, Arcanum Information Security, used under CC BY 4.0 and adapted for agents.fail.
172 nodes shown
What the attacker is trying to achieve
Attempts to discover API capabilities and limitations through probing and testing
Using a model's tools and agent capabilities to scan, enumerate, or interact with external systems on the internet, similar to SSRF attacks in web testing
Using the model to attack other users of the application through shared chat sessions, multi-user chatbot implementations, or injected client-side attacks
Using an agent or model with tools to enumerate, scan, or interact with internal LAN systems and infrastructure of the hosting organization, similar to SSRF attacks in web testing
Smuggling malicious payloads through the model into internal systems like logging, observability, prompt caching, and other AI ecosystem infrastructure used by internal employees
Tricking the model into providing specific medical, financial, or legal advice to create liability exposure for the hosting company
Tricking the model into providing false business data that might cause financial or reputational harm
Attempting to extract information about Chemical, Biological, Radiological, Nuclear, or Explosive materials and weapons
Attempts to poison or corrupt self-learning models, RAG knowledge bases, MCP tools with dynamic memory, or other persistent data stores
Attempts to overload or disrupt model services
Attempting to get the model to discuss harmful content by using framing techniques
Attempting to manipulate image generation systems to produce harmful or unauthorized content
Techniques to extract sensitive information like passwords or API keys from system prompts
Attempts to bypass model safety measures and constraints
Attacks targeting applications that execute multiple sequential LLM calls
Techniques to manipulate AI systems into revealing their internal instructions
Systematic evaluation of model responses across protected demographic categories
Discovering AI system capabilities by probing for available tools, plugins, and functions
Coercing the model into disclosing or transmitting data it can access at runtime, such as RAG/knowledge-base documents, connected-app content (email, Drive, SharePoint, tickets), other users' or prior-session conversation data, or tool/function outputs, to a party not authorized to see it. Distinct from leaking the system prompt itself.
Driving an agent with tools or connectors into performing privileged, destructive, financial, or communications actions the attacker is not authorized to trigger, such as sending email/messages, making purchases or transfers, deleting or modifying records, changing settings, or invoking high-impact functions.
Using the model to produce offensive software or fraud artifacts, such as malware/ransomware, exploit or attack code, and phishing/scam/spam content, or to induce slopsquatting by recommending non-existent packages an attacker can register.
Manipulating the model into producing false-but-credible content at scale for influence operations, fraud, defamation, or market and election manipulation, and exploiting user overreliance on authoritative-sounding hallucinations.
Attacking the model or training pipeline itself to recover non-public information: membership of a record in training data, reconstruction or inversion of training inputs, inference of dataset properties, verbatim memorized training data, or a functional stolen copy of the model.
Making the model emit content that, when consumed unsanitized by a downstream interpreter, executes as an attack on the host application: SQL/template/command/code injection, SSRF, path traversal, or markup the rendering layer runs. Broader than user-facing XSS; it targets the output-consumption boundary itself.
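On the defensive side, the output-consumption boundary described above is usually hardened by treating model output as untrusted data. The following is a minimal sketch, assuming a SQLite backend; the `orders` table, the `find_orders` helper, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3


def find_orders(conn: sqlite3.Connection, customer_name: str) -> list:
    """Look up orders with a bound parameter, never string concatenation."""
    # customer_name may come straight from model output. Binding it as a
    # parameter makes SQLite treat it as data rather than as SQL text.
    cur = conn.execute(
        "SELECT id, item FROM orders WHERE customer = ?", (customer_name,)
    )
    return cur.fetchall()


def demo() -> tuple:
    # Hypothetical in-memory table used only for this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", "book"), (2, "bob", "lamp")],
    )
    benign = find_orders(conn, "alice")
    # A model-emitted value shaped like an injection simply matches no row.
    hostile = find_orders(conn, "alice' OR '1'='1")
    return benign, hostile
```

The same principle (bind, escape, or validate at the point of consumption) applies to templates, shell commands, and rendered markup; this sketch covers only the SQL case.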
Coercing the model into producing operational harmful content in a defined hazard class: self-harm or suicide facilitation, hate or harassment, sexual content or CSAM, violent and non-violent crime facilitation (illegal goods, hacking, fraud), or IP-infringing material. CBRNE has its own dedicated node; this consolidates the remaining hazard classes that otherwise have no Intent home.
Driving up the victim's inference or compute spend, or repurposing their paid compute for the attacker's own work, rather than taking the service down. An economic attack rather than an availability one.
Inducing the model or agent to return another user's or tenant's data by breaking session or tenant isolation, as opposed to exfiltrating the attacker's own-context data. A higher-severity isolation breach common in shared-memory or shared-RAG multi-tenant apps.
How the model is manipulated
Telling the model to act as a command line, then using command-line syntax to achieve the intended goals
Manipulating the model's harm prevention systems to achieve unintended behavior
Using binary data streams to attempt to confuse or bypass model safeguards
Overwhelming the model's reasoning capacity with complex, deeply-nested, recursive, or paradoxical input so it mishandles or drops safety checks. This is exhaustion via complexity, the opposite lever from Reasoning Dilution (which washes safety out with easy, benign reasoning) and distinct from Context Overflow (mechanical token-budget eviction).
Using chain of thought reasoning to make the model enumerate secrets by prompting self-introspection
Using contradictory statements or logic to confuse model responses
Using end sequences or special tokens to manipulate model parsing and behavior
Using fictional contexts, role-play scenarios, and storytelling frames to manipulate model behavior by embedding requests within narratives
Automated adversarial token discovery using gradient descent or iterative optimization to find inputs that flip model responses. Note: These attacks require either white-box model access (for gradients) or the ability to send thousands of queries, making them most applicable to self-hosted models, controlled pentesting environments, or on-premise deployments rather than rate-limited SaaS APIs.
Using metaphors, analogies, idioms, and other figurative speech to disguise malicious intent behind seemingly innocent literary expressions
Using inverted or reversed logic to confuse model responses
Using links and URLs to inject malicious content or bypass filters
Exploiting the model's memory and context handling mechanisms
Using meta-level instructions to manipulate model behavior
Explicitly instructing the model to never use its standard refusal phrases or error messages, forcing it to provide an alternative response that may bypass safety measures
Extracting protected information in smaller pieces by requesting specific segments, ranges, or starting points rather than asking for the complete content at once
Framing malicious requests as games, challenges, or competitions to appeal to the model's helpfulness and bypass safety considerations through playful context
Forcing the model to begin its response with an affirmative or compliant phrase, which psychologically commits it to following through with the request regardless of safety guidelines
Using puzzle-like structures to confuse or manipulate model responses
Claiming there was an error, mistake, or misunderstanding in the model's original instructions to convince it to accept new, malicious directions as corrections
Repeatedly reinforcing or reminding the model of a false identity, instruction, or context to override its actual configuration through persistent assertion
Adding new rules or modifying existing ones to manipulate model behavior
Embedding multiple nested instructions to attack multi-LLM systems, sometimes using evasions to execute on different LLMs down the line
Defining variables, abbreviations, or shorthand notations that get concatenated or expanded to form malicious instructions, bypassing filters that check for complete harmful phrases
Instructing the model to respond within a very short output window, which can cause it to ignore or overwrite developer-defined system prompts and bypass security controls. Also useful when responses are limited to low character lengths, giving more space for exfiltration. Works especially well with Chain-of-Thought (CoT) models. Typically prepended early in the prompt and followed by additional injection mechanisms.
Using pixel or voxel-based data structures to encode or hide malicious content
Creating false time pressure or crisis scenarios to pressure the model into bypassing safety checks and responding without careful consideration
Using variable expansion techniques to bypass filters or inject content
A multi-turn attack that opens with a benign question about the target topic and escalates incrementally over several turns, each turn referencing the model's own previous answers as leverage. Because no single turn looks malicious, per-message safety checks pass while the conversation as a whole walks the model into restricted output.
Filling the prompt or conversation with dozens to hundreds of fabricated User/Assistant exchanges in which the assistant complies with harmful requests. In-context learning over the faux dialogue overrides safety training; effectiveness scales with the number of shots and the size of the context window.
Injecting a fabricated prior assistant message into the client-supplied conversation history, for example, one in which the assistant already agreed to help or began complying. Because most chat APIs are stateless and trust client-sent history, the model treats the forged turn as its own and continues from it.
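One possible mitigation for forged assistant turns is for the server to tag every assistant message it actually produced and to reject histories whose assistant turns lack a valid tag. The sketch below assumes a simple list-of-dicts history format; `SERVER_KEY`, the `tag` field, `sign_turn`, and `verify_history` are all hypothetical names, not part of any particular chat API.

```python
import hashlib
import hmac
import json

# Hypothetical key for illustration; a real service would load it from a secret store.
SERVER_KEY = b"replace-with-a-real-secret"


def sign_turn(role: str, content: str, key: bytes = SERVER_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the role and content of one turn."""
    payload = json.dumps([role, content]).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_history(history: list, key: bytes = SERVER_KEY) -> bool:
    """Return True only if every assistant turn carries a valid server tag."""
    for turn in history:
        if turn.get("role") != "assistant":
            continue  # user turns are untrusted input anyway
        tag = turn.get("tag")
        if not isinstance(tag, str):
            return False
        expected = sign_turn("assistant", turn.get("content", ""), key)
        # Constant-time comparison on bytes avoids timing leaks and type errors.
        if not hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8")):
            return False
    return True
```

This sketch does not bind a turn to its position or session, so a genuinely signed turn could still be replayed elsewhere; including a session identifier and turn index in the signed payload would be one way to address that.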
Planting harmless-looking 'steering seeds' and indirect references early in a conversation, then prompting the model to echo and expand its own context so the poisoned framing progressively self-reinforces toward harmful output, without the attacker ever stating a toxic request directly.
Breaking a single restricted request into several individually-benign sub-questions asked across multiple turns, then assembling the answers into the harmful whole. Differs from single-prompt Chunking in that the fragments are distributed over the conversation and recombined later.
Disguising the adversarial request as an authoritative structured document (an XML/JSON/INI policy or config file) so the model interprets it as system/developer policy that overrides its alignment. A single well-formed template often transfers across multiple frontier models.
Weaponizing the model's evaluation/grading capability: ask it to act as a Likert-scale judge of how detailed or harmful a response is, then to produce example responses for each scale point. The top-of-scale exemplar it generates contains the very content that was restricted.
Embedding the unsafe topic between two benign topics and asking the model to weave a single connecting narrative, then to elaborate each element. The benign framing dilutes the unsafe request enough to slip past safety checks.
Rephrasing a present-tense harmful request into the past tense ('how did people make X?') or a hypothetical future tense, exploiting the fact that refusal training generalizes poorly across tense. Often paired with a historical or academic frame.
Applying human persuasion principles (authority, social proof, reciprocity, commitment/consistency, scarcity, liking, and emotional appeal) to argue the model into compliance. PITAX already isolates Urgency and Anti-Harm Coercion; this node captures the broader persuasion taxonomy as a cross-cutting axis.
Automated, mutation-based search that starts from seed jailbreak templates and applies operators (generate, crossover, expand, shorten, rephrase) guided by a judge model to evolve high-success templates. The black-box analog of software fuzzing for prompt attacks.
A black-box, lifelong-learning agent that discovers jailbreak strategies from scratch, stores them in a growing strategy library, and recombines/evolves them with no human-authored seeds, then applies test-time scaling (best-of-N, beam search) over the library.
A black-box harness that repeatedly samples random augmentations of a prompt (character scrambling, random capitalization, character noising, and their image/audio analogues) until one variant slips past safety. Attack success scales as a power law with N. Composes existing evasions rather than introducing a new encoding.
Hiding adversarial instructions inside a tool's description or schema metadata (not its output) so that an agent ingesting the tool list is hijacked. Variants include 'line jumping', in which the payload lands at tools/list time, before any tool is approved or invoked.
A tool presents a benign definition at approval time, then silently mutates its description or behavior after the user has trusted it: a time-of-check-to-time-of-use attack on the agent's trust state.
An injected payload that stays dormant and benign until a specific trigger (a date, keyword, user, or matching query) fires, letting it pass safety evaluation and then activate the malicious behavior on demand.
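A common countermeasure to this time-of-check-to-time-of-use gap is to pin a fingerprint of the tool definition at approval time and re-check it before every use. The following is a rough sketch under the assumption that a tool definition is a JSON-serializable dict with a `name` key; `fingerprint` and `ToolPins` are hypothetical helpers.

```python
import hashlib
import json


def fingerprint(tool_def: dict) -> str:
    """Hash a canonical JSON rendering of the tool definition."""
    canonical = json.dumps(tool_def, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToolPins:
    """Remembers the definition a user approved and flags later changes."""

    def __init__(self) -> None:
        self._pins = {}

    def approve(self, tool_def: dict) -> None:
        # Record the fingerprint of exactly what the user saw and approved.
        self._pins[tool_def["name"]] = fingerprint(tool_def)

    def is_unchanged(self, tool_def: dict) -> bool:
        # Unknown tools and mutated definitions both fail the check.
        pinned = self._pins.get(tool_def.get("name", ""))
        return pinned is not None and pinned == fingerprint(tool_def)
```

Pinning detects a changed description or schema only; a tool whose server-side behavior changes while its definition stays identical is outside what this check can see.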
A self-replicating injection that copies itself into every agent, memory, or RAG store it touches and propagates worm-like across an agent ecosystem, performing malicious actions (spam, exfiltration) at each hop.
Hiding instructions, often via invisible Unicode, in repository configuration the coding agent automatically trusts (CLAUDE.md, .cursor/rules, copilot-instructions, README) or in a dependency, steering AI coding agents to emit backdoored or vulnerable code.
Manipulating a higher-privilege agent or tool into performing a sensitive action on the attacker's behalf, because the trusted component implicitly trusts inbound natural-language requests. The legitimate credentials execute the attacker's intent.
Injecting the model's structural control tokens (e.g. <|im_start|>, <|im_end|>, <tool_call>), or near-neighbor strings, into user input to forge or segment role boundaries: fake an assistant turn, mask the real turn, or split sensitive text past moderators. Broader than End Sequences, which abuses stop/delimiter strings.
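As a defensive illustration, user input can be scanned for ChatML-style markers before it is placed into a prompt. This is a minimal sketch, assuming markers of the form `<|...|>`; the pattern and the `neutralize_control_tokens` helper are hypothetical, and the pattern deliberately does not cover tags such as `<tool_call>` or near-neighbor strings.

```python
import re

# Matches ChatML-style markers such as <|im_start|> and <|im_end|>.
# Assumption: marker names contain no whitespace, '<', '>' or '|'.
CONTROL_TOKEN = re.compile(r"<\|[^<>|\s]{1,64}\|>")


def neutralize_control_tokens(user_text: str) -> tuple:
    """Replace control-token look-alikes with a visible placeholder.

    Returns the cleaned text and the number of replacements made.
    """
    return CONTROL_TOKEN.subn("[removed-token]", user_text)
```

String filtering of this kind is a backstop at best; where the tokenizer allows it, encoding user text so that special tokens can never be produced from it is the more robust control.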
Forcing the model to begin its reply with an attacker-chosen prefix or a forged affirmation (Sure, here is...), so token-by-token continuation makes compliance the path of least resistance. Distinct from Priming, which seeds in-context examples; this seeds the model's own opening tokens.
Adding an 'except in this case' or 'special instruction:' clause that frames the attacker request as an exception the model's rules supposedly do not cover, exploiting rule-exception reasoning rather than rule-override.
Injecting a fabricated answer or task-complete marker into the data so the model believes the legitimate task is finished and proceeds to the attacker's injected instruction. The core primitive in the benchmark-strongest Combined Attack, especially potent in indirect/RAG contexts.
Injecting fabricated reasoning steps (a fake thinking or scratchpad trace) that conclude the request is permitted, hijacking the model's reasoning channel to justify compliance. The inverse of Chain-of-Thought Introspection, which reads the reasoning; this forges it.
Forging fabricated tool-call results or tool-invocation syntax in the context so the agent believes a tool already ran (or must run) and acts on attacker-supplied tool output. Distinct from tool-definition injection, which alters the spec; this forges the call or result.
Using rare glitch or undertrained tokenizer tokens (e.g., SolidGoldMagikarp-style artifacts) that the model handles unpredictably, to induce erratic behavior or bypass safety conditioning. Exploits embedding-layer artifacts, not semantics, and is tied to a specific tokenizer.
Flooding the context window with padding or filler so the system prompt or safety instructions are pushed out, truncated, or diluted below the model's effective attention. The mechanism is capacity and eviction, not reasoning load (cf. Cognitive Overload).
Falsely asserting an elevated identity or permission state (I am the admin/developer, sudo mode, authorization already granted) to unlock restricted actions. Social-engineering of role and permission, distinct from fictional persona role-play.
Deliberately driving the model into a hallucinatory or confused generation state (reversed-text extraction, impossible instructions, reasoning conflicts) where safety conditioning is less effective, then extracting the target. The goal is a degraded-coherence state, not a logical contradiction.
Extracting a protected secret indirectly by querying its properties (length, characters, comparisons, definitions) rather than asking for it directly, reconstructing it across answers. Defeats direct do-not-reveal guardrails and is the classic Gandalf-style system-prompt-secret attack.
Modifying an open-weight model's internals to strip its safety behavior: orthogonalizing or ablating the refusal direction in the residual stream (abliteration), steering activations, manipulating logits or decoding, or fine-tuning away alignment. Requires LOCAL access to the model weights, so it only applies to self-hosted or open-weight deployments, never a black-box API or chatbot.
Simply asking the model to do the thing, in plain language, with no obfuscation, framing, role-play, or trick. The catch-all baseline: many models comply with a straightforward request, and a large share of real-world successful attacks use no technique at all. Always the first thing to try, and the control case for judging whether a fancier technique was even necessary.
Padding the prompt with a long stretch of benign, easy step-by-step reasoning before the harmful ask, so a large reasoning model's safety signal attenuates and the harmful tokens slip through. The padding is deliberately coherent and easy; this is the opposite lever from Cognitive Overload, which exhausts with complexity.
Deliberately steering a reasoning model's thinking regime, forcing extended chain-of-thought or suppressing/interrupting it, to land in the state where safety is empirically weakest. It works in both directions, including cutting reasoning short so the safety checks never run.
Wrapping a benign-looking prompt in a required output schema, grammar, or enum whose fields force the harmful content out field-by-field. The attack lives in the output-constraint (decoding) plane that prompt-scanning filters never inspect.
Crafting corpus documents, embeddings, metadata, or trigger phrases so the attacker's content wins retrieval or re-ranking and reaches the generator's context. This manipulates the retriever upstream to guarantee delivery, rather than acting on the model after the payload is already in context.
Optimizing a tool's name, description, or schema metadata for the agent's relevance and preference signals so it preferentially selects the attacker's tool over equally-capable legitimate ones, without hiding any executable instruction. It biases the choice function, not the content (distinct from Tool-Definition Injection).
Inducing the model to author its own arguments for why complying is reasonable, then exploiting consistency and cognitive-dissonance pressure between that self-generated rationale and the follow-up request. The inverse of Persuasion, where the appeals are attacker-supplied.
Wrapping a harmful request in fabricated authoritative sources (fake papers and DOIs, GitHub repos, standards, or CVEs) matched to the harm category, so the model treats the content as already-published fact. Distinct from a rhetorical authority appeal; it manufactures a concrete fake artifact to ground the request in.
Replacing the harmful keyword with a benign placeholder, then attaching an assistive sub-task (fill-in-the-blank / masked-language modeling, or element lookup by position) so the model itself regenerates the censored word from context while the sub-task diverts safety attention.
Prepending a harmless, unrelated sub-task ahead of an injected malicious instruction so a tool-using (ReAct-style) agent builds compliance momentum across its action loop and carries straight through into the harmful tool action, since agents rarely re-evaluate policy between steps.
Abusing the structured arguments the model fills in a tool or function call: hiding instructions in JSON fields, injecting executable payloads into parameter values (url, query, content) that a downstream tool auto-runs, or exploiting parser-versus-model divergence, so a tool that trusts the model's JSON acts on attacker-controlled data. Distinct from Tool-Definition Injection (the tool's description) and Tool-Call Spoofing (forged results); this weaponizes the call's arguments.
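Because the weak point here is a tool that trusts the model's JSON, one mitigation is to validate each argument before the tool acts on it. The sketch below checks a `url` argument against an allowlist; `ALLOWED_HOSTS`, the example host, and `is_safe_url_argument` are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would define its own hosts.
ALLOWED_HOSTS = {"api.example.com"}


def is_safe_url_argument(url: str) -> bool:
    """Accept only https URLs whose parsed hostname is explicitly allowed."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased, with userinfo and port stripped
    except ValueError:
        return False  # malformed URLs are rejected rather than guessed at
    return parts.scheme == "https" and host in ALLOWED_HOSTS
```

Comparing the parsed hostname, rather than searching the raw string, is what defeats tricks such as placing an allowed host in the userinfo part of the URL. Other argument types (queries, file paths, free-text content) need their own validators.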
How payloads slip past filters
Replacing letters with their position in the alphabet (A=1, B=2, etc.)
Using historical writing systems like Elder Futhark, Hieroglyphics, Ogham, or Runic alphabets
Using ASCII art or ASCII-based techniques to encode or hide malicious content
Hiding messages in the first letters, words, or patterns within seemingly innocent text
Using different writing systems or mixing languages to obfuscate malicious content
Using binary patterns of two different elements (A/B or bold/italic) to encode letters
Using Base64 encoding to obfuscate malicious content
Encoding text as binary (0s and 1s) to obfuscate content
Creating custom character mappings that the model learns to encode/decode, establishing a private cipher to bypass safety mechanisms
Using esoteric programming languages to encode text
Using Braille characters to encode text
Using circled or enclosed Unicode characters to represent letters
Using case manipulation to evade content detection systems
Using cipher techniques to encode malicious content
Applying different encoding methods to different words in the same message
Using emoji characters to encode or hide malicious content
Using fantasy scripts, sci-fi alphabets, constructed languages, or playful text transformations to obfuscate content
Using graph-based structures to obfuscate harmful material through structural encoding
Using fullwidth Unicode characters that appear wider than normal ASCII
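Filters can fold fullwidth forms back to ASCII before matching. A minimal sketch using Unicode NFKC normalization from the Python standard library follows; `fold_compatibility_forms` is a hypothetical helper name.

```python
import unicodedata


def fold_compatibility_forms(text: str) -> str:
    """Apply NFKC so compatibility variants map to their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth "ignore", written with escapes so the code points are explicit.
FULLWIDTH_SAMPLE = "\uff49\uff47\uff4e\uff4f\uff52\uff45"
# Cyrillic small letter a (U+0430): a look-alike of Latin 'a', not a variant of it.
CYRILLIC_A = "\u0430"
```

NFKC folds compatibility variants such as fullwidth letters, but it leaves cross-script look-alikes (for example the Cyrillic letter above) untouched, so homoglyph handling needs a separate confusables check.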
Using visually similar characters from different Unicode blocks to bypass text-based filters while appearing identical to humans
Using hexadecimal encoding to obfuscate malicious content
Using HTML entity encoding to obfuscate text content
Using invisible Unicode characters to hide content within seemingly normal text
Using Katakana, Hiragana, or Japanese-style encoding
Using JSON structure to hide malicious content
Manipulating URLs or hyperlinks to conceal malicious content from users or security systems
Using markdown formatting to hide or obfuscate malicious content
Using generic text metacharacters, escape sequences, null bytes, and stray control or zero-width characters to confuse content filters and parsers. Scoped to ordinary text metacharacters; it does NOT cover directional-formatting controls (see Bidirectional Text Override), model special/control tokens (see Special-Token Injection), or terminal control sequences (see ANSI Escape Concealment).
Using mathematical Unicode symbols that resemble regular letters
Using Morse code to obfuscate malicious content
Using NATO phonetic alphabet words to spell out messages
Using phonetically equivalent spellings to bypass content detection
Writing text in a zigzag pattern across multiple rails then reading row by row
Using Unicode regional indicator symbols to spell out messages
Using intentionally truncated, misspelled, or abbreviated words to bypass keyword-based content filters while remaining human-readable
Using reversed text or logic patterns to evade detection
Using whitespace characters to conceal malicious content
Using splat-based techniques (asterisks and special characters) to obfuscate content
Using flag semaphore or visual signal encoding systems
Using Unicode small capitals, subscript, or superscript characters
Replacing sensitive keywords with synonyms or alternative phrasings to bypass keyword-based content filters
Concealing malicious content within innocuous data using steganographic techniques
Using combining characters to add strikethrough, underline, or other overlays to text
Using the Polybius square tap code where letters are encoded as row/column taps
Flipping text upside down using special Unicode characters
Using percent-encoding (URL encoding) to obfuscate text
Formatting text vertically with one letter per line to evade horizontal pattern matching filters and confuse tokenization
Using audio and signal-based methods to conceal malicious content
Using symbol fonts like Wingdings, Webdings, or Zapf Dingbats to encode text
Using XML formatting to conceal malicious content
Using combining diacritical marks to create 'corrupted' or glitchy-looking text
Reformulating a restricted request as a poem or verse, hand-crafted or via a standardized prose-to-poetry meta-prompt. The stylistic and structural shift (meter, metaphor, line breaks) moves the input off the prose distribution that safety training covers, so guardrails fail to fire while the model still recovers and acts on the intent. A textbook 'mismatched generalization' evasion.
Encoding a harmful request as a symbolic-mathematics problem (set theory, abstract algebra, or symbolic logic) so the model 'solves' the problem and decodes it back into restricted content. Distinct from Mathematical Unicode, which only swaps in math-like glyphs; here the semantics of the request are carried in formal notation that safety training does not cover.
Inserting Unicode directional-formatting controls (RLO U+202E, LRO U+202D, isolates U+2066-2069) so the displayed glyph order diverges from the logical order the model reads. The string looks benign to a human or visual-order filter while the tokenizer ingests the true payload. Unlike Invisible Text it carries no hidden payload; it reorders visible glyphs, and unlike Reverse it is display-order, not logical-order.
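A scanner for these controls is straightforward, since they are a small fixed set of code points. The sketch below flags the characters named above; the older embedding controls U+202A to U+202C are included as an addition not listed in the entry, and `find_bidi_controls` is a hypothetical helper.

```python
# Directional-formatting controls to flag.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c",            # LRE, RLE, PDF (added here, not in the entry)
    "\u202d", "\u202e",                      # LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text: str) -> list:
    """Return (index, code point label) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]
```

Whether to strip, escape, or reject flagged input is a policy choice; legitimate right-to-left text can contain some of these characters, so outright rejection may be too strict for multilingual applications.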
Binary-to-text encoding with a non-Base64 alphabet, such as Base32 (A-Z2-7), Base58 (Bitcoin), or Base85/Ascii85 (radix-85), to push the payload into a printable set that filters do not expect. Different alphabet and bit-grouping from Base64, and mutually non-decodable with it; garak ships these as separate probes.
Encoding each character as the octal (base-8) or decimal (base-10) value of its code point, as bare space-separated integers (hi -> 150 151 octal / 104 105 decimal). The same family as Hexadecimal and Binary, which the taxonomy already splits by radix; distinct from A1Z26 (alphabet position, not code point) and from HTML numeric entities (which need &#...; wrappers).
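The worked example in the entry above (hi as 150 151 in octal and 104 105 in decimal) can be reproduced with a few lines, which is also roughly what a normalizing filter would do before inspection. Both helper names are hypothetical.

```python
def encode_code_points(text: str, base: int) -> str:
    """Render each character's code point as a bare integer in base 8 or 10."""
    if base not in (8, 10):
        raise ValueError("only octal (8) and decimal (10) are handled here")
    spec = "o" if base == 8 else "d"
    return " ".join(format(ord(ch), spec) for ch in text)


def decode_code_points(encoded: str, base: int) -> str:
    """Turn space-separated integers in the given base back into text."""
    return "".join(chr(int(token, base)) for token in encoded.split())
```

A filter cannot always tell which radix was intended (104 is a valid number in both), so a normalizer would typically try each plausible base and inspect every decoding.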
Chaining multiple invertible transforms in sequence over the same string, for example Base64 then ROT13 then reverse, so a filter that normalizes only one layer still misses the payload. Distinct from Code Switching, which applies a different encoder per word (spatial); this stacks encoders to depth on the same string, where order and count are the load-bearing property.
Lossless-compressing the payload (DEFLATE/gzip/zlib/zip), usually Base64-wrapped, so the banned text is entropy-coded out of existence and absent from any plaintext or byte view; the model or a tool then decompresses and acts on it. Unlike length-preserving byte encodings, the banned substring genuinely does not exist at the byte level.
Wrapping text in ANSI/SGR terminal control sequences (ESC[8m conceal, or foreground equal to background) so it renders invisible or garbled in a terminal or log viewer while the raw bytes stay fully present for the model or agent to ingest. These are control sequences, not Unicode characters, so they survive where zero-width and tag characters are stripped.
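For log viewers and review tools, one defensive step is to strip these sequences so that concealed text becomes visible to the human reader. A minimal sketch follows, assuming only CSI sequences (ESC followed by `[`) need handling; the pattern and `strip_ansi_csi` are illustrative, and other escape families such as OSC are not covered.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi_csi(raw: str) -> str:
    """Remove CSI control sequences, leaving the underlying text intact."""
    return ANSI_CSI.sub("", raw)
```

After stripping, text that was wrapped in a conceal attribute shows up as ordinary characters, which lets a reviewer or a downstream filter see the same bytes the model ingested.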
Re-expressing the harmful request as an executable structured query (for example SELECT content FROM category WHERE ...) so the model semantically resolves it and returns the answer, exploiting that alignment is tuned on prose, not query syntax. Unlike JSON or XML, which wrap a literal string, this re-expresses the intent as a declarative query the model executes. A sibling of MathPrompt (prose to formal language).
Distributing the harmful query across a programming data structure (stack, queue, list, string, or tree) inside a source-code template, then asking the model to complete a decode() function that reassembles and acts on it. Distinct from JSON/XML serialization and from Cipher's CodeChameleon (which encrypts); CodeAttack uses plain, unencrypted elements inside executable code.
Encoding the payload in a non-UTF-8 legacy or multibyte charset (GB18030, Shift-JIS, Big5, EBCDIC, UTF-7) so a UTF-8-assuming byte-level keyword filter mis-decodes it while the model still reconstructs the text. Unlike transport re-encodings it does not change the alphabet; it reinterprets the same bytes under a different legacy decoder, a mismatch between the guardrail and the model.
Asking the model to emit protected content in a delimited or restructured output format: a symbol inserted between each character, CSV/TSV, or a chat-template tagging format (ChatML, Llama-2 tags), so the raw secret string never appears intact and slips past output-side filters. Distinct from Splats (input-side character obfuscation); this controls the model's output formatting.
Where the payload enters the system
Direct API calls to AI model endpoints where attackers can craft raw requests with malicious payloads in various parameters
Direct text input through conversational interfaces where users interact with AI models in real-time
Team messaging and collaboration tools with AI integrations that process messages, channels, and shared content
Document and file upload features that allow AI models to process user-provided files containing hidden instructions
Web forms and structured input fields that feed data into AI-powered systems for processing
Data from external systems that gets processed by AI models, enabling indirect prompt injection where attackers plant payloads in third-party sources
Audio recordings and voice input processed by speech-to-text or multimodal AI models that understand spoken content
Image uploads and visual content processed by multimodal AI models capable of understanding images
AI integrations within productivity suites like email, spreadsheets, presentations, and documents that process user content
Video content processed by multimodal AI models that can analyze frames, audio, and transcripts
Injection delivered not via user input or read-time external content but by poisoning the RAG vector store, the ingestion/parsing pipeline, training or fine-tuning data, or a plugin/tool package the system depends on. The surface is the pipeline and dependencies, and it persists across sessions and users.
Attack surface where injection arrives through physical-world sensor channels (camera frames, microphones, IoT or robotics sensors) or via cross-modal fusion inconsistencies feeding a multimodal agent. Distinct from uploaded image/audio/video media files.