agents.fail / arsenal

Prompt InjectionTaxonomy

A structured map of how prompt injection attacks work - the objectives attackers chase, the techniques that bend a model, the evasions that slip past filters, and the inputs that carry a payload in. Every node has a citable code; search or filter to drill in.

Source & credit

Based on the Arcanum Prompt Injection Taxonomy by Jason Haddix, Arcanum Information Security, used under CC BY 4.0 and adapted for agents.fail.

172
Nodes
27
Intents
70
Techniques
63
Evasions
12
Inputs
Pillar
Delivery

172 nodes shown

Attack Intents

PIT-I · 27

What the attacker is trying to achieve

PIT-I-01
Direct

API Enumeration

Attempts to discover API capabilities and limitations through probing and testing

PIT-I-02
Direct

Attack External Systems

Using a model's tools and agent capabilities to scan, enumerate, or interact with external systems on the internet, similar to SSRF attacks in web testing

PIT-I-03
Direct / Indirect

Attack External Users

Using the model to attack other users of the application through shared chat sessions, multi-user chatbot implementations, or injected client-side attacks

PIT-I-04
Direct

Attack Internal Systems

Using an agent or model with tools to enumerate, scan, or interact with internal LAN systems and infrastructure of the hosting organization, similar to SSRF attacks in web testing

PIT-I-05
Direct / Indirect

Attack Internal Users

Smuggling malicious payloads through the model into internal systems like logging, observability, prompt caching, and other AI ecosystem infrastructure used by internal employees

PIT-I-06
Direct

Unauthorized Professional Advice

Tricking the model into providing specific medical, financial, or legal advice to create liability exposure for the hosting company

PIT-I-07
Direct

Business Integrity

Tricking the model into providing false business data that might cause financial or reputational harm

PIT-I-08
Direct

CBRNE Information

Attempting to extract information about Chemical, Biological, Radiological, Nuclear, or Explosive materials and weapons

PIT-I-09
Indirect

Data Poisoning

Attempts to poison or corrupt self-learning models, RAG knowledge bases, MCP tools with dynamic memory, or other persistent data stores

PIT-I-10
Direct

Denial of Service

Attempts to overload or disrupt model services

PIT-I-11
Direct

Discuss Harm

Attempting to get the model to discuss harmful content by using framing techniques

PIT-I-12
Direct

Generate Image

Attempting to manipulate image generation systems to produce harmful or unauthorized content

PIT-I-13
Direct

Get Prompt Secret

Techniques to extract sensitive information like passwords or API keys from system prompts

PIT-I-14
Direct

Jailbreak

Attempts to bypass model safety measures and constraints

PIT-I-15
Direct / Indirect

Multi-Chain Attacks

Attacks targeting applications that execute multiple sequential LLM calls

PIT-I-16
Direct

System Prompt Leak

Techniques to manipulate AI systems into revealing their internal instructions

PIT-I-17
Direct

Test Bias

Systematic evaluation of model responses across protected demographic categories

PIT-I-18
Direct

Tool Enumeration

Discovering AI system capabilities by probing for available tools, plugins, and functions

PIT-I-19
Direct / Indirect

Sensitive Data Exfiltration

Coercing the model into disclosing or transmitting data it can access at runtime, such as RAG/knowledge-base documents, connected-app content (email, Drive, SharePoint, tickets), other users' or prior-session conversation data, or tool/function outputs, to a party not authorized to see it. Distinct from leaking the system prompt itself.

PIT-I-20
Direct / Indirect

Unauthorized Action Execution

Driving an agent with tools or connectors into performing privileged, destructive, financial, or communications actions the attacker is not authorized to trigger, such as sending email/messages, making purchases or transfers, deleting or modifying records, changing settings, or invoking high-impact functions.

PIT-I-21
Direct

Generate Malicious Artifacts

Using the model to produce offensive software or fraud artifacts, such as malware/ransomware, exploit or attack code, and phishing/scam/spam content, or to induce slopsquatting by recommending non-existent packages an attacker can register.

PIT-I-22
Direct

Generate Disinformation

Manipulating the model into producing false-but-credible content at scale for influence operations, fraud, defamation, or market and election manipulation, and exploiting user overreliance on authoritative-sounding hallucinations.

PIT-I-23
Direct

Model & Data Privacy Extraction

Attacking the model or training pipeline itself to recover non-public information: membership of a record in training data, reconstruction or inversion of training inputs, inference of dataset properties, verbatim memorized training data, or a functional stolen copy of the model.

PIT-I-24
Direct / Indirect

Output-Handling Exploitation

Making the model emit content that, when consumed unsanitized by a downstream interpreter, executes as an attack on the host application: SQL/template/command/code injection, SSRF, path traversal, or markup the rendering layer runs. Broader than user-facing XSS; it targets the output-consumption boundary itself.

PIT-I-25
Direct

Harmful Content Generation

Coercing the model into producing operational harmful content in a defined hazard class: self-harm or suicide facilitation, hate or harassment, sexual content or CSAM, violent and non-violent crime facilitation (illegal goods, hacking, fraud), or IP-infringing material. CBRNE has its own dedicated node; this consolidates the remaining hazard classes that otherwise have no Intent home.

PIT-I-26
Direct

Denial of Wallet

Driving up the victim's inference or compute spend, or repurposing their paid compute for the attacker's own work, rather than taking the service down. An economic attack rather than an availability one.

PIT-I-27
Direct / Indirect

Cross-Tenant Data Leakage

Inducing the model or agent to return another user's or tenant's data by breaking session or tenant isolation, as opposed to exfiltrating the attacker's own-context data. A higher-severity isolation breach common in shared-memory or shared-RAG multi-tenant apps.

Attack Techniques

PIT-T · 70

How the model is manipulated

PIT-T-01
Direct

Act as Interpreter

Tell model to act as a command line, then use cmdline syntax to achieve intended goals

PIT-T-02
Direct

Anti-Harm Coercion

Manipulating model's harm prevention systems to achieve unintended behavior

PIT-T-03
Direct

Binary Streams

Using binary data streams to attempt to confuse or bypass model safeguards

PIT-T-04
Direct

Cognitive Overload

Overwhelming the model's reasoning capacity with complex, deeply-nested, recursive, or paradoxical input so it mishandles or drops safety checks. This is exhaustion via complexity, the opposite lever from Reasoning Dilution (which washes safety out with easy, benign reasoning) and distinct from Context Overflow (mechanical token-budget eviction).

PIT-T-05
Direct

Chain of Thought Introspection

Using chain of thought reasoning to make the model enumerate secrets by prompting self-introspection

PIT-T-06
Direct

Contradiction

Using contradictory statements or logic to confuse model responses

PIT-T-07
Direct

End Sequences

Using end sequences or special tokens to manipulate model parsing and behavior

PIT-T-08
Direct

Narrative Injection (aka Framing)

Using fictional contexts, role-play scenarios, and storytelling frames to manipulate model behavior by embedding requests within narratives

PIT-T-09
DirectLocal / White-box

Gradient-Based Attacks

Automated adversarial token discovery using gradient descent or iterative optimization to find inputs that flip model responses. Note: These attacks require either white-box model access (for gradients) or the ability to send thousands of queries, making them most applicable to self-hosted models, controlled pentesting environments, or on-premise deployments rather than rate-limited SaaS APIs.

PIT-T-10
Direct

Figurative Language

Using metaphors, analogies, idioms, and other figurative speech to disguise malicious intent behind seemingly innocent literary expressions

PIT-T-11
Direct

Inversion

Using inverted or reversed logic to confuse model responses

PIT-T-12
Direct / Indirect

Link Injection

Using links and URLs to inject malicious content or bypass filters

PIT-T-13
Direct / Indirect

Memory Exploitation

Exploiting model's memory and context handling mechanisms

PIT-T-14
Direct

Meta Prompting

Using meta-level instructions to manipulate model behavior

PIT-T-15
Direct

Anti-Refusal

Explicitly instructing the model to never use its standard refusal phrases or error messages, forcing it to provide an alternative response that may bypass safety measures

PIT-T-16
Direct

Chunking

Extracting protected information in smaller pieces by requesting specific segments, ranges, or starting points rather than asking for the complete content at once

PIT-T-17
Direct

Competition

Framing malicious requests as games, challenges, or competitions to appeal to the model's helpfulness and bypass safety considerations through playful context

PIT-T-18
Direct

Priming

Forcing the model to begin its response with an affirmative or compliant phrase, which psychologically commits it to following through with the request regardless of safety guidelines

PIT-T-19
Direct

Puzzling

Using puzzle-like structures to confuse or manipulate model responses

PIT-T-20
Direct

Reorientation

Claiming there was an error, mistake, or misunderstanding in the model's original instructions to convince it to accept new, malicious directions as corrections

PIT-T-21
Direct

Reiteration

Repeatedly reinforcing or reminding the model of a false identity, instruction, or context to override its actual configuration through persistent assertion

PIT-T-22
Direct

Rule Addition

Adding new rules or modifying existing ones to manipulate model behavior

PIT-T-23
Direct / Indirect

Russian Doll

Embedding multiple nested instructions to attack multi-LLM systems, sometimes using evasions to execute on different LLMs down the line

PIT-T-24
Direct

Shortcuts

Defining variables, abbreviations, or shorthand notations that get concatenated or expanded to form malicious instructions, bypassing filters that check for complete harmful phrases

PIT-T-25
Direct

Truncated Instructions

Instructing the model to respond within a very short output window, which can cause it to ignore or overwrite developer-defined system prompts and bypass security controls. Also useful when responses are limited to low character lengths, giving more space for exfiltration. Works especially well with Chain-of-Thought (CoT) models. Typically prepended early in the prompt and followed by additional injection mechanisms.

PIT-T-26
Direct

Spatial Byte Arrays

Using pixel or voxel-based data structures to encode or hide malicious content

PIT-T-27
Direct

Urgency

Creating false time pressure or crisis scenarios to pressure the model into bypassing safety checks and responding without careful consideration

PIT-T-28
Direct

Variable Expansion

Using variable expansion techniques to bypass filters or inject content

PIT-T-29
Direct

Crescendo (Gradual Escalation)

A multi-turn attack that opens with a benign question about the target topic and escalates incrementally over several turns, each turn referencing the model's own previous answers as leverage. Because no single turn looks malicious, per-message safety checks pass while the conversation as a whole walks the model into restricted output.

PIT-T-30
Direct

Many-Shot Jailbreaking

Filling the prompt or conversation with dozens to hundreds of fabricated User/Assistant exchanges in which the assistant complies with harmful requests. In-context learning over the faux dialogue overrides safety training; effectiveness scales with the number of shots and the size of the context window.

PIT-T-31
Direct

History Fabrication (Fake Assistant Turn)

Injecting a fabricated prior assistant message into the client-supplied conversation history, for example, one in which the assistant already agreed to help or began complying. Because most chat APIs are stateless and trust client-sent history, the model treats the forged turn as its own and continues from it.

PIT-T-32
Direct / Indirect

Echo Chamber (Context Poisoning)

Planting harmless-looking 'steering seeds' and indirect references early in a conversation, then prompting the model to echo and expand its own context so the poisoned framing progressively self-reinforces toward harmful output, without the attacker ever stating a toxic request directly.

PIT-T-33
Direct

Multi-Turn Decomposition (Sub-Query Splitting)

Breaking a single restricted request into several individually-benign sub-questions asked across multiple turns, then assembling the answers into the harmful whole. Differs from single-prompt Chunking in that the fragments are distributed over the conversation and recombined later.

PIT-T-34
Direct

Policy-File Framing (Policy Puppetry)

Disguising the adversarial request as an authoritative structured document, an XML/JSON/INI policy or config file, so the model interprets it as system/developer policy that overrides its alignment. A single well-formed template often transfers across multiple frontier models.

PIT-T-35
Direct

Evaluator-Role Abuse (Bad Likert Judge)

Weaponizing the model's evaluation/grading capability: ask it to act as a Likert-scale judge of how detailed or harmful a response is, then to produce example responses for each scale point. The top-of-scale exemplar it generates contains the very content that was restricted.

PIT-T-36
Direct

Distraction Sandwich (Deceptive Delight)

Embedding the unsafe topic between two benign topics and asking the model to weave a single connecting narrative, then to elaborate each element. The benign framing dilutes the unsafe request enough to slip past safety checks.

PIT-T-37
Direct

Tense Reformulation (Past / Future Tense)

Rephrasing a present-tense harmful request into the past tense ('how did people make X?') or a hypothetical future tense, exploiting the fact that refusal training generalizes poorly across tense. Often paired with a historical or academic frame.

PIT-T-38
Direct

Persuasion (Social-Engineering Levers)

Applying human persuasion principles, authority, social proof, reciprocity, commitment/consistency, scarcity, liking, and emotional appeal, to argue the model into compliance. PITAX already isolates Urgency and Anti-Harm Coercion; this node captures the broader persuasion taxonomy as a cross-cutting axis.

PIT-T-39
Direct

Fuzzing-Based Jailbreak

Automated, mutation-based search that starts from seed jailbreak templates and applies operators (generate, crossover, expand, shorten, rephrase) guided by a judge model to evolve high-success templates. The black-box analog of software fuzzing for prompt attacks.

PIT-T-40
Direct

Autonomous Strategy Discovery (AutoDAN-Turbo)

A black-box, lifelong-learning agent that discovers jailbreak strategies from scratch, stores them in a growing strategy library, and recombines/evolves them with no human-authored seeds, then applies test-time scaling (best-of-N, beam search) over the library.

PIT-T-41
Direct

Best-of-N (Augmentation Sampling)

A black-box harness that repeatedly samples random augmentations of a prompt, character scrambling, random capitalization, character noising (and image/audio analogues), until one variant slips past safety. Attack success scales as a power law with N. Composes existing evasions rather than introducing a new encoding.

PIT-T-42
Indirect

Tool-Definition Injection (MCP Tool Poisoning)

Hiding adversarial instructions inside a tool's description or schema metadata (not its output) so that an agent ingesting the tool list is hijacked. Variants include 'line jumping', the payload lands at tools/list time, before any tool is approved or invoked.

PIT-T-43
Indirect

Tool Rug Pull (TOCTOU Mutation)

A tool presents a benign definition at approval time, then silently mutates its description or behavior after the user has trusted it, a time-of-check-to-time-of-use attack on the agent's trust state.

PIT-T-44
Direct / Indirect

Conditional / Trigger-Gated Payload (Sleeper)

An injected payload that stays dormant and benign until a specific trigger, a date, keyword, user, or matching query, fires, letting it pass safety evaluation and then activate the malicious behavior on demand.

PIT-T-45
Indirect

Prompt Worm (Self-Replication)

A self-replicating injection that copies itself into every agent, memory, or RAG store it touches and propagates worm-like across an agent ecosystem, performing malicious actions (spam, exfiltration) at each hop.

PIT-T-46
Indirect

Agent Instruction-File Injection (Rules-File Backdoor)

Hiding instructions, often via invisible Unicode, in repository configuration the coding agent automatically trusts (CLAUDE.md, .cursor/rules, copilot-instructions, README) or in a dependency, steering AI coding agents to emit backdoored or vulnerable code.

PIT-T-47
Direct

Confused Deputy (Agent Authority Confusion)

Manipulating a higher-privilege agent or tool into performing a sensitive action on the attacker's behalf, because the trusted component implicitly trusts inbound natural-language requests. The legitimate credentials execute the attacker's intent.

PIT-T-48
Direct / Indirect

Special-Token Injection

Injecting the model's structural control tokens (e.g. <|im_start|>, <|im_end|>, <tool_call>), or near-neighbor strings, into user input to forge or segment role boundaries: fake an assistant turn, mask the real turn, or split sensitive text past moderators. Broader than End Sequences, which abuses stop/delimiter strings.

PIT-T-49
Direct

Output Priming (Prefix Injection)

Forcing the model to begin its reply with an attacker-chosen prefix or a forged affirmation (Sure, here is...), so token-by-token continuation makes compliance the path of least resistance. Distinct from Priming, which seeds in-context examples; this seeds the model's own opening tokens.

PIT-T-50
Direct

Special-Case Exception

Adding an 'except in this case' or 'special instruction:' clause that frames the attacker request as an exception the model's rules supposedly do not cover, exploiting rule-exception reasoning rather than rule-override.

PIT-T-51
Direct / Indirect

Fake Completion

Injecting a fabricated answer or task-complete marker into the data so the model believes the legitimate task is finished and proceeds to the attacker's injected instruction. The core primitive in the benchmark-strongest Combined Attack, especially potent in indirect/RAG contexts.

PIT-T-52
Direct

Chain-of-Thought Spoofing

Injecting fabricated reasoning steps (a fake thinking or scratchpad trace) that conclude the request is permitted, hijacking the model's reasoning channel to justify compliance. The inverse of Chain-of-Thought Introspection, which reads the reasoning; this forges it.

PIT-T-53
Direct

Tool-Call Spoofing

Forging fabricated tool-call results or tool-invocation syntax in the context so the agent believes a tool already ran (or must run) and acts on attacker-supplied tool output. Distinct from tool-definition injection, which alters the spec; this forges the call or result.

PIT-T-54
Direct

Glitch / Anomalous Tokens

Using rare glitch or undertrained tokenizer tokens (e.g., SolidGoldMagikarp-style artifacts) that the model handles unpredictably, to induce erratic behavior or bypass safety conditioning. Exploits embedding-layer artifacts, not semantics, and is tied to a specific tokenizer.

PIT-T-55
Direct

Context Overflow (Window Flooding)

Flooding the context window with padding or filler so the system prompt or safety instructions are pushed out, truncated, or diluted below the model's effective attention. The mechanism is capacity and eviction, not reasoning load (cf. Cognitive Overload).

PIT-T-56
Direct

Authority Impersonation

Falsely asserting an elevated identity or permission state (I am the admin/developer, sudo mode, authorization already granted) to unlock restricted actions. Social-engineering of role and permission, distinct from fictional persona role-play.

PIT-T-57
Direct

Induced Hallucination

Deliberately driving the model into a hallucinatory or confused generation state (reversed-text extraction, impossible instructions, reasoning conflicts) where safety conditioning is less effective, then extracting the target. The goal is a degraded-coherence state, not a logical contradiction.

PIT-T-58
Direct

Secret Probing (Oracle Extraction)

Extracting a protected secret indirectly by querying its properties (length, characters, comparisons, definitions) rather than asking for it directly, reconstructing it across answers. Defeats direct do-not-reveal guardrails and is the classic Gandalf-style system-prompt-secret attack.

PIT-T-59
DirectLocal / White-box

Abliteration / Weight Ablation

Modifying an open-weight model's internals to strip its safety behavior: orthogonalizing or ablating the refusal direction in the residual stream (abliteration), steering activations, manipulating logits or decoding, or fine-tuning away alignment. Requires LOCAL access to the model weights, so it only applies to self-hosted or open-weight deployments, never a black-box API or chatbot.

PIT-T-60
Direct / Indirect

Direct Request (Plain Prompting)

Simply asking the model to do the thing, in plain language, with no obfuscation, framing, role-play, or trick. The catch-all baseline: many models comply with a straightforward request, and a large share of real-world successful attacks use no technique at all. Always the first thing to try, and the control case for judging whether a fancier technique was even necessary.

PIT-T-61
Direct

Reasoning Dilution (CoT Hijacking)

Padding the prompt with a long stretch of benign, easy step-by-step reasoning before the harmful ask, so a large reasoning model's safety signal attenuates and the harmful tokens slip through. The padding is deliberately coherent and easy; this is the opposite lever from Cognitive Overload, which exhausts with complexity.

PIT-T-62
Direct

Thinking-Mode Manipulation (Reasoning-Budget Steering)

Deliberately steering a reasoning model's thinking regime, forcing extended chain-of-thought or suppressing/interrupting it, to land in the state where safety is empirically weakest. It works in both directions, including cutting reasoning short so the safety checks never run.

PIT-T-63
Direct

Structured-Output Coercion (Constrained Decoding)

Wrapping a benign-looking prompt in a required output schema, grammar, or enum whose fields force the harmful content out field-by-field. The attack lives in the output-constraint (decoding) plane that prompt-scanning filters never inspect.

PIT-T-64
Indirect

Retrieval Ranking Manipulation (RAG Poisoning)

Crafting corpus documents, embeddings, metadata, or trigger phrases so the attacker's content wins retrieval or re-ranking and reaches the generator's context. This manipulates the retriever upstream to guarantee delivery, rather than acting on the model after the payload is already in context.

PIT-T-65
Indirect

Tool-Preference Manipulation (Tool Squatting)

Optimizing a tool's name, description, or schema metadata for the agent's relevance and preference signals so it preferentially selects the attacker's tool over equally-capable legitimate ones, without hiding any executable instruction. It biases the choice function, not the content (distinct from Tool-Definition Injection).

PIT-T-66
Direct

Self-Persuasion (Self-Generated Rationalization)

Inducing the model to author its own arguments for why complying is reasonable, then exploiting consistency and cognitive-dissonance pressure between that self-generated rationale and the follow-up request. The inverse of Persuasion, where the appeals are attacker-supplied.

PIT-T-67
Direct / Indirect

Fake-Citation Grounding (DarkCite)

Wrapping a harmful request in fabricated authoritative sources, fake papers and DOIs, GitHub repos, standards, or CVEs, matched to the harm category, so the model treats the content as already-published fact. Distinct from a rhetorical authority appeal; it manufactures a concrete fake artifact to ground the request in.

PIT-T-68
Direct

Masked-Word Reconstruction (SATA)

Replacing the harmful keyword with a benign placeholder, then attaching an assistive sub-task (fill-in-the-blank / masked-language modeling, or element lookup by position) so the model itself regenerates the censored word from context while the sub-task diverts safety attention.

PIT-T-69
Direct / Indirect

Agentic Compliance Momentum (Foot-in-the-Door)

Prepending a harmless, unrelated sub-task ahead of an injected malicious instruction so a tool-using (ReAct-style) agent builds compliance momentum across its action loop and carries straight through into the harmful tool action, since agents rarely re-evaluate policy between steps.

PIT-T-70
Direct / Indirect

Function-Call Parameter Smuggling

Abusing the structured arguments the model fills in a tool or function call: hiding instructions in JSON fields, injecting executable payloads into parameter values (url, query, content) that a downstream tool auto-runs, or exploiting parser-versus-model divergence, so a tool that trusts the model's JSON acts on attacker-controlled data. Distinct from Tool-Definition Injection (the tool's description) and Tool-Call Spoofing (forged results); this weaponizes the call's arguments.

Attack Evasions

PIT-E · 63

How payloads slip past filters

PIT-E-01
Direct

A1Z26 Number Substitution

Replacing letters with their position in the alphabet (A=1, B=2, etc.)

PIT-E-02
Direct

Ancient Scripts

Using historical writing systems like Elder Futhark, Hieroglyphics, Ogham, or Runic alphabets

PIT-E-03
Direct

ASCII

Using ASCII art or ASCII-based techniques to encode or hide malicious content

PIT-E-04
Direct

Acrostics

Hiding messages in the first letters, words, or patterns within seemingly innocent text

PIT-E-05
Direct

Alternative Language

Using different writing systems or mixing languages to obfuscate malicious content

PIT-E-06
Direct

Baconian Cipher

Using binary patterns of two different elements (A/B or bold/italic) to encode letters

PIT-E-07
Direct

Base64

Using Base64 encoding to obfuscate malicious content

PIT-E-08
Direct

Binary

Encoding text as binary (0s and 1s) to obfuscate content

PIT-E-09
Direct

Bijection Learning

Creating custom character mappings that the model learns to encode/decode, establishing a private cipher to bypass safety mechanisms

PIT-E-10
Direct

Brainfuck / Esoteric Languages

Using esoteric programming languages to encode text

PIT-E-11
Direct

Braille

Using Braille characters to encode text

PIT-E-12
Direct

Bubble / Enclosed Text

Using circled or enclosed Unicode characters to represent letters

PIT-E-13
Direct

Case Changing

Using case manipulation to evade content detection systems

PIT-E-14
Direct

Cipher

Using cipher techniques to encode malicious content

PIT-E-15
Direct

Code Switching / Randomizer

Applying different encoding methods to different words in the same message

PIT-E-16
Direct

Emoji

Using emoji characters to encode or hide malicious content

PIT-E-17
Direct

Fictional & Constructed Languages

Using fantasy scripts, sci-fi alphabets, constructed languages, or playful text transformations to obfuscate content

PIT-E-18
Direct

Graph Nodes

Using graph-based structures to obfuscate harmful material through structural encoding

PIT-E-19
Direct

Fullwidth Characters

Using fullwidth Unicode characters that appear wider than normal ASCII

PIT-E-20
Direct

Homoglyphs

Using visually similar characters from different Unicode blocks to bypass text-based filters while appearing identical to humans

PIT-E-21
Direct

Hexadecimal

Using hexadecimal encoding to obfuscate malicious content

PIT-E-22
Direct

HTML Entities

Using HTML entity encoding to obfuscate text content

PIT-E-23
Direct / Indirect

Invisible Text

Using invisible Unicode characters to hide content within seemingly normal text

PIT-E-24
Direct

Japanese Scripts

Using Katakana, Hiragana, or Japanese-style encoding

PIT-E-25
Direct

JSON

Using JSON structure to hide malicious content

PIT-E-26
Direct / Indirect

Link Smuggling

Manipulating URLs or hyperlinks to conceal malicious content from users or security systems

PIT-E-27
Direct / Indirect

Markdown

Using markdown formatting to hide or obfuscate malicious content

PIT-E-28
Direct

Metacharacter Confusion

Using generic text metacharacters, escape sequences, null bytes, and stray control or zero-width characters to confuse content filters and parsers. Scoped to ordinary text metacharacters; it does NOT cover directional-formatting controls (see Bidirectional Text Override), model special/control tokens (see Special-Token Injection), or terminal control sequences (see ANSI Escape Concealment).

PIT-E-29
Direct

Mathematical Unicode

Using mathematical Unicode symbols that resemble regular letters

PIT-E-30
Direct

Morse Code

Using Morse code to obfuscate malicious content

PIT-E-31
Direct

NATO Phonetic Alphabet

Using NATO phonetic alphabet words to spell out messages

PIT-E-32
Direct

Phonetic Substitution

Using phonetically equivalent spellings to bypass content detection

PIT-E-33
Direct

Rail Fence Cipher

Writing text in a zigzag pattern across multiple rails then reading row by row

PIT-E-34
Direct

Regional Indicators

Using Unicode regional indicator symbols to spell out messages

PIT-E-35
Direct

Truncation & Misspelling

Using intentionally truncated, misspelled, or abbreviated words to bypass keyword-based content filters while remaining human-readable

PIT-E-36
Direct

Reverse

Using reversed text or logic patterns to evade detection

PIT-E-37
Direct

Spaces / Whitespace

Using whitespace characters to conceal malicious content

PIT-E-38
Direct

Splats

Using splat-based techniques (asterisks and special characters) to obfuscate content

PIT-E-39
Direct

Semaphore

Using flag semaphore or visual signal encoding systems

PIT-E-40
Direct

Small Caps / Subscript / Superscript

Using Unicode small capitals, subscript, or superscript characters

PIT-E-41
Direct

Synonyms / Word Substitution

Replacing sensitive keywords with synonyms or alternative phrasings to bypass keyword-based content filters

PIT-E-42
Direct / Indirect

Steganography

Concealing malicious content within innocuous data using steganographic techniques

PIT-E-43
Direct

Strikethrough / Underline / Overlay

Using combining characters to add strikethrough, underline, or other overlays to text

PIT-E-44
Direct

Tap Code

Using the Polybius square tap code where letters are encoded as row/column taps

PIT-E-45
Direct

Upside Down Text

Flipping text upside down using special Unicode characters

PIT-E-46
Direct

URL Encoding

Using percent-encoding (URL encoding) to obfuscate text

PIT-E-47
Direct

Vertical Text

Formatting text vertically with one letter per line to evade horizontal pattern matching filters and confuse tokenization

PIT-E-48
Direct

Waveforms / Frequencies

Using audio and signal-based methods to conceal malicious content

PIT-E-49
Direct

Wingdings / Symbol Fonts

Using symbol fonts like Wingdings, Webdings, or Zapf Dingbats to encode text

PIT-E-50
Direct

XML

Using XML formatting to conceal malicious content

PIT-E-51
Direct

Zalgo Text

Using combining diacritical marks to create 'corrupted' or glitchy-looking text

PIT-E-52
Direct

Adversarial Poetry

Reformulating a restricted request as a poem or verse, hand-crafted or via a standardized prose-to-poetry meta-prompt. The stylistic and structural shift (meter, metaphor, line breaks) moves the input off the prose distribution that safety training covers, so guardrails fail to fire while the model still recovers and acts on the intent. A textbook 'mismatched generalization' evasion.

PIT-E-53
Direct

Symbolic Math Encoding (MathPrompt)

Encoding a harmful request as a symbolic-mathematics problem, set theory, abstract algebra, or symbolic logic, so the model 'solves' the problem and decodes it back into restricted content. Distinct from Mathematical Unicode, which only swaps in math-like glyphs; here the semantics of the request are carried in formal notation that safety training does not cover.

PIT-E-54
Direct

Bidirectional Text Override (Trojan Source)

Inserting Unicode directional-formatting controls (RLO U+202E, LRO U+202D, isolates U+2066-2069) so the displayed glyph order diverges from the logical order the model reads. The string looks benign to a human or visual-order filter while the tokenizer ingests the true payload. Unlike Invisible Text it carries no hidden payload; it reorders visible glyphs, and unlike Reverse it is display-order, not logical-order.

PIT-E-55
Direct

Alternative Base Encodings (Base32 / Base58 / Base85)

Binary-to-text encoding with a non-Base64 alphabet, Base32 (A-Z2-7), Base58 (Bitcoin), or Base85/Ascii85 (radix-85), to push the payload into a printable set filters do not expect. Different alphabet and bit-grouping from Base64, and mutually non-decodable with it; garak ships these as separate probes.

PIT-E-56
Direct

Numeric Code-Point Encoding (Octal / Decimal)

Encoding each character as the octal (base-8) or decimal (base-10) value of its code point, as bare space-separated integers (hi -> 150 151 octal / 104 105 decimal). The same family as Hexadecimal and Binary, which the taxonomy already splits by radix; distinct from A1Z26 (alphabet position, not code point) and from HTML numeric entities (which need &#...; wrappers).

PIT-E-57
Direct

Layered Encoding (Encoding Chains)

Chaining multiple invertible transforms in sequence over the same string, for example Base64 then ROT13 then reverse, so a filter that normalizes only one layer still misses the payload. Distinct from Code Switching, which applies a different encoder per word (spatial); this stacks encoders to depth on the same string, where order and count are the load-bearing property.

PIT-E-58
Direct

Compression Encoding (gzip / zlib)

Lossless-compressing the payload (DEFLATE/gzip/zlib/zip), usually Base64-wrapped, so the banned text is entropy-coded out of existence and absent from any plaintext or byte view, then the model or a tool decompresses and acts on it. Unlike length-preserving byte encodings, the banned substring genuinely does not exist at the byte level.

PIT-E-59
Direct

ANSI Escape Concealment (Terminal Codes)

Wrapping text in ANSI/SGR terminal control sequences (ESC[8m conceal, or foreground equal to background) so it renders invisible or garbled in a terminal or log viewer while the raw bytes stay fully present for the model or agent to ingest. These are control sequences, not Unicode characters, so they survive where zero-width and tag characters are stripped.

PIT-E-60
Direct

Query-Language Encoding (QueryAttack)

Re-expressing the harmful request as an executable structured query (for example SELECT content FROM category WHERE ...) so the model semantically resolves it and returns the answer, exploiting that alignment is tuned on prose, not query syntax. Unlike JSON or XML, which wrap a literal string, this re-expresses the intent as a declarative query the model executes. A sibling of MathPrompt (prose to formal language).

PIT-E-61
Direct

Code-Structure Encoding (CodeAttack)

Distributing the harmful query across a programming data structure (stack, queue, list, string, or tree) inside a source-code template, then asking the model to complete a decode() function that reassembles and acts on it. Distinct from JSON/XML serialization and from Cipher's CodeChameleon (which encrypts); CodeAttack uses plain, unencrypted elements inside executable code.

PIT-E-62
Direct

Legacy Charset Confusion (GB18030 / UTF-7)

Encoding the payload in a non-UTF-8 legacy or multibyte charset (GB18030, Shift-JIS, Big5, EBCDIC, UTF-7) so a UTF-8-assuming byte-level keyword filter mis-decodes it while the model still reconstructs the text. Unlike transport re-encodings it does not change the alphabet; it reinterprets the same bytes under a different legacy decoder, a mismatch between the guardrail and the model.

PIT-E-63
Direct

Delimiter

Asking the model to emit protected content in a delimited or restructured output format: a symbol inserted between each character, CSV/TSV, or a chat-template tagging format (ChatML, Llama-2 tags), so the raw secret string never appears intact and slips past output-side filters. Distinct from Splats (input-side character obfuscation); this controls the model's output formatting.

Injection Inputs

PIT-N · 12

Where the payload enters the system

PIT-N-01
Direct

API Request

Direct API calls to AI model endpoints where attackers can craft raw requests with malicious payloads in various parameters

PIT-N-02
Direct

Chat Interface

Direct text input through conversational interfaces where users interact with AI models in real-time

PIT-N-03
Indirect

Collaboration Platforms

Team messaging and collaboration tools with AI integrations that process messages, channels, and shared content

PIT-N-04
Direct / Indirect

File Upload

Document and file upload features that allow AI models to process user-provided files containing hidden instructions

PIT-N-05
Direct

Form

Web forms and structured input fields that feed data into AI-powered systems for processing

PIT-N-06
Indirect

Indirect Input

Data from external systems that gets processed by AI models, enabling indirect prompt injection where attackers plant payloads in third-party sources

PIT-N-07
Direct / Indirect

Audio

Audio recordings and voice input processed by speech-to-text or multimodal AI models that understand spoken content

PIT-N-08
Direct / Indirect

Image

Image uploads and visual content processed by multimodal AI models capable of understanding images

PIT-N-09
Indirect

Productivity Applications

AI integrations within productivity suites like email, spreadsheets, presentations, and documents that process user content

PIT-N-10
Direct / Indirect

Video

Video content processed by multimodal AI models that can analyze frames, audio, and transcripts

PIT-N-11
Indirect

Supply Chain / Pipeline

Injection delivered not via user input or read-time external content but by poisoning the RAG vector store, the ingestion/parsing pipeline, training or fine-tuning data, or a plugin/tool package the system depends on. The surface is the pipeline and dependencies, and it persists across sessions and users.

PIT-N-12
Indirect

Sensor / Cross-Modal

Attack surface where injection arrives through physical-world sensor channels (camera frames, microphones, IoT or robotics sensors) or via cross-modal fusion inconsistencies feeding a multimodal agent. Distinct from uploaded image/audio/video media files.

Know your attack surface.

Based on the Arcanum Prompt Injection Taxonomy by Jason Haddix, Arcanum Information Security - CC BY 4.0. Adapted for agents.fail.