What is prompt injection?

Prompt injection is when input the model reads — from a user, a retrieved document, an integrated tool — overrides the system prompt or causes the model to take unintended actions. Direct injection comes from the user; indirect injection comes from data the model retrieves or processes.

Can prompt injection be fully prevented?

Not at the model level alone. Layered defenses — input/output filtering, structural separation between instructions and data, restricted tool permissions, output validation — reduce risk significantly. Treating the model as untrusted output, not trusted reasoning, is the right framing.

What is the biggest risk for RAG features?

Indirect prompt injection through retrieved documents. A document the model fetches can contain instructions that override your system prompt. Trust boundaries should treat retrieved content as user-equivalent input, not as authoritative context.

How do attackers exfiltrate data via LLMs?

Common patterns include eliciting system-prompt content, leaking other users' context in shared environments, embedding data in URLs the model is asked to render, and using tool calls to write data to attacker-controlled destinations.

Is the OWASP Top 10 for LLM Applications worth following?

Yes, as a checklist. The 2025 version of the OWASP Top 10 for LLM Applications covers prompt injection, sensitive information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption.

LLM Security: Preventing Prompt Injection and Data Leakage

Orientation

The security issues that show up in real LLM-feature engagements are not exotic. They are the same shapes of bug that show up in any feature: untrusted input that overrides intended behavior, data leakage across trust boundaries, abuse of authorized actions. The model is the new variable, and it changes how each of these manifests — but the categories are recognizable.

This guide covers the classes you should expect to test for in any feature that ships an LLM in front of users. The framing follows OWASP Top 10 for LLM Applications categories where they apply.

Direct prompt injection

The shape. A user submits input that overrides the system prompt, exfiltrates instructions, or coerces the model into producing content the developer did not intend. The simplest examples are "ignore previous instructions and ..." or jailbreak prompts. Real attacks are more subtle: structured input that mimics system instructions, or input that exploits how the model handles tokenization at the boundary between system and user content.

How to test. Adversarial corpora across categories: instruction override, role assumption, data exfiltration, output coercion, and policy circumvention. Test against the actual feature surface — chat, summarization, agent — because behavior shifts with prompt structure.

What helps. Layered defenses: input filtering, structural separation between system and user content, output validation, sensitive-action gating, and treating the model output as untrusted-by-default.

Indirect prompt injection

The shape. The model retrieves or processes content from a third party — a document the user uploads, a web page the agent fetches, an email the assistant summarizes — and that content contains instructions that override the system prompt. The user does not need to be malicious; the source of the instructions is anywhere the model reads from.

How to test. Craft documents, web pages, and emails that include adversarial content and observe whether the model follows them. Test the realistic ingestion path — the document upload UI, the URL ingestion, the email integration — not just direct prompts.

What helps. Treat retrieved content as user-equivalent input. Strip or escape instruction-like patterns at retrieval time. Restrict the model's ability to take consequential actions based on retrieved content alone (require explicit user confirmation for sensitive actions).

RAG leakage and retrieval boundaries

The shape. Retrieval-augmented generation features fetch documents from your knowledge store and include them in the prompt. Two failure modes:

Cross-tenant retrieval. Retrieval that does not enforce tenant boundaries returns documents owned by other tenants. The model summarizes or quotes them in the response.
Source-poisoning resilience. An attacker who can write documents to the knowledge store (directly or via an upload feature) can poison retrieval for other users.

How to test. Across two tenants, ask questions that should retrieve only the calling tenant's documents and verify nothing from the other tenant appears. Upload poisoned content and verify the system isolates it.

What helps. Per-tenant filtering at the retrieval layer, not in the prompt. Default-deny query patterns scoped to the caller. Provenance tracking so the response can attribute content to specific sources.

Data exposure and tenancy

The shape. Beyond RAG, conversation context can leak across users in shared environments. System prompts can be elicited and shared. Cached context can return for the wrong user. Embeddings can be reversed to reveal source content.

How to test. Ask the model to summarize "the previous conversation" in a fresh session, then in a returning user's session, and observe what it has access to. Try elicitation patterns for system-prompt content. Test cache invalidation across sessions and users.

What helps. Hard isolation of conversation context per user/tenant. No shared memory across users. System prompts treated as data that may be leaked — never store secrets there. Embedding stores audited as sensitive data, not just performance optimization.

Tool-use safety

The shape. Agents that can call tools — APIs, search, file operations, payment actions — have a much larger blast radius. The risk is not just what the model says; it is what the model causes.

How to test. For each tool the model can call, evaluate: can the model be coerced into calling it? With what parameters? Can chains of tool calls amplify damage? Does the system require user confirmation before consequential actions?

What helps. Tool selection through a separate planning layer that is not user-prompt-driven. Per-tool allow-lists at the API layer. Hard limits on tool-chain depth and total cost. Mandatory user confirmation for any action that writes data, sends communication, or moves money.

Improper output handling

The shape. The application takes the model output and uses it directly — rendered as HTML, executed as code, passed as a SQL parameter, sent to another API. Model output is untrusted; treating it as trusted is XSS, SQLi, or arbitrary code execution waiting to happen.

How to test. Observe how output is consumed downstream. Submit prompts that elicit output containing dangerous payloads (HTML, JS, SQL, shell). Verify the downstream consumer escapes appropriately.

What helps. Output handled with the same care as user input. Escape, encode, or validate at every consumption boundary.

Rate limits and cost controls

The shape. LLM calls cost money. An attacker (or buggy client) submitting requests in volume drives up your bill — or pushes you into rate-limit failures that cascade to denial of service.

How to test. Submit requests in volume from a single account, multiple accounts, and unauthenticated paths. Verify per-identity quotas and per-endpoint limits.

What helps. Per-identity cost quotas with hard caps. Token-budget enforcement. Async processing for any operation that could chain expensive calls.

A practical threat-model template

For any LLM feature, walk through these questions:

What does the model read? (User input, retrieved documents, third-party content, tool output)
What can the model do? (Generate text, call tools, write data, send communication, move money)
Who can influence the model's input — and how is that input authenticated?
What happens if the model output is wrong, malicious, or coerced?
What user confirmation is required for consequential actions?
How are conversation, embedding, and cache stores isolated per tenant?
What rate and cost limits prevent abuse?

The honest framing for LLM security is that the model is untrusted output of trusted reasoning, not trusted output of authoritative reasoning. Every protection downstream of the model — output handling, tool gating, user confirmation, tenant isolation — should assume the model can be made to say anything. That assumption is what makes the rest of the system safe.

What we'd test for this

AI security testing

Prompt injection, RAG leakage, tool-use safety, and the surrounding service — aligned to the OWASP Top 10 for LLM Applications.

See the engagement Common in this industry

AI / Machine learning

Prompt injection, RAG leakage, tool-use safety, AI Act readiness.

See industry scope

AI Security

LLM Security: Preventing Prompt Injection and Data Leakage

Orientation

Direct prompt injection

Indirect prompt injection

RAG leakage and retrieval boundaries

Data exposure and tenancy

Tool-use safety

Improper output handling

Rate limits and cost controls

A practical threat-model template

AI security testing

AI / Machine learning

AI Red Teaming: Testing LLMs and Generative AI

OWASP API Security Top 10: A Practical Testing Guide

Authenticated vs Unauthenticated Penetration Testing

LLM security — common questions

Want a credible answer when a customer, auditor, or your board asks how secure you are?

LLM Security: Preventing Prompt Injection and Data Leakage

Orientation

Direct prompt injection

Indirect prompt injection

RAG leakage and retrieval boundaries

Data exposure and tenancy

Tool-use safety

Improper output handling

Rate limits and cost controls

A practical threat-model template

AI security testing

AI / Machine learning

Related articles

AI Red Teaming: Testing LLMs and Generative AI

OWASP API Security Top 10: A Practical Testing Guide

Authenticated vs Unauthenticated Penetration Testing

LLM security — common questions

Want a credible answer when a customer, auditor, or your board asks how secure you are?