Orientation
The security issues that show up in real LLM-feature engagements are not exotic. They are the same shapes of bug that show up in any feature: untrusted input that overrides intended behavior, data leakage across trust boundaries, abuse of authorized actions. The model is the new variable, and it changes how each of these manifests — but the categories are recognizable.
This guide covers the classes you should expect to test for in any feature that ships an LLM in front of users. The framing follows OWASP Top 10 for LLM Applications categories where they apply.
Direct prompt injection
The shape. A user submits input that overrides the system prompt, exfiltrates instructions, or coerces the model into producing content the developer did not intend. The simplest examples are "ignore previous instructions and ..." or jailbreak prompts. Real attacks are more subtle: structured input that mimics system instructions, or input that exploits how the model handles tokenization at the boundary between system and user content.
How to test. Adversarial corpora across categories: instruction override, role assumption, data exfiltration, output coercion, and policy circumvention. Test against the actual feature surface — chat, summarization, agent — because behavior shifts with prompt structure.
What helps. Layered defenses: input filtering, structural separation between system and user content, output validation, sensitive-action gating, and treating the model output as untrusted-by-default.
Indirect prompt injection
The shape. The model retrieves or processes content from a third party — a document the user uploads, a web page the agent fetches, an email the assistant summarizes — and that content contains instructions that override the system prompt. The user does not need to be malicious; the source of the instructions is anywhere the model reads from.
How to test. Craft documents, web pages, and emails that include adversarial content and observe whether the model follows them. Test the realistic ingestion path — the document upload UI, the URL ingestion, the email integration — not just direct prompts.
What helps. Treat retrieved content as user-equivalent input. Strip or escape instruction-like patterns at retrieval time. Restrict the model's ability to take consequential actions based on retrieved content alone (require explicit user confirmation for sensitive actions).
RAG leakage and retrieval boundaries
The shape. Retrieval-augmented generation features fetch documents from your knowledge store and include them in the prompt. Two failure modes:
- Cross-tenant retrieval. Retrieval that does not enforce tenant boundaries returns documents owned by other tenants. The model summarizes or quotes them in the response.
- Source-poisoning resilience. An attacker who can write documents to the knowledge store (directly or via an upload feature) can poison retrieval for other users.
How to test. Across two tenants, ask questions that should retrieve only the calling tenant's documents and verify nothing from the other tenant appears. Upload poisoned content and verify the system isolates it.
What helps. Per-tenant filtering at the retrieval layer, not in the prompt. Default-deny query patterns scoped to the caller. Provenance tracking so the response can attribute content to specific sources.
Data exposure and tenancy
The shape. Beyond RAG, conversation context can leak across users in shared environments. System prompts can be elicited and shared. Cached context can return for the wrong user. Embeddings can be reversed to reveal source content.
How to test. Ask the model to summarize "the previous conversation" in a fresh session, then in a returning user's session, and observe what it has access to. Try elicitation patterns for system-prompt content. Test cache invalidation across sessions and users.
What helps. Hard isolation of conversation context per user/tenant. No shared memory across users. System prompts treated as data that may be leaked — never store secrets there. Embedding stores audited as sensitive data, not just performance optimization.
Tool-use safety
The shape. Agents that can call tools — APIs, search, file operations, payment actions — have a much larger blast radius. The risk is not just what the model says; it is what the model causes.
How to test. For each tool the model can call, evaluate: can the model be coerced into calling it? With what parameters? Can chains of tool calls amplify damage? Does the system require user confirmation before consequential actions?
What helps. Tool selection through a separate planning layer that is not user-prompt-driven. Per-tool allow-lists at the API layer. Hard limits on tool-chain depth and total cost. Mandatory user confirmation for any action that writes data, sends communication, or moves money.
Improper output handling
The shape. The application takes the model output and uses it directly — rendered as HTML, executed as code, passed as a SQL parameter, sent to another API. Model output is untrusted; treating it as trusted is XSS, SQLi, or arbitrary code execution waiting to happen.
How to test. Observe how output is consumed downstream. Submit prompts that elicit output containing dangerous payloads (HTML, JS, SQL, shell). Verify the downstream consumer escapes appropriately.
What helps. Output handled with the same care as user input. Escape, encode, or validate at every consumption boundary.
Rate limits and cost controls
The shape. LLM calls cost money. An attacker (or buggy client) submitting requests in volume drives up your bill — or pushes you into rate-limit failures that cascade to denial of service.
How to test. Submit requests in volume from a single account, multiple accounts, and unauthenticated paths. Verify per-identity quotas and per-endpoint limits.
What helps. Per-identity cost quotas with hard caps. Token-budget enforcement. Async processing for any operation that could chain expensive calls.
A practical threat-model template
For any LLM feature, walk through these questions:
- What does the model read? (User input, retrieved documents, third-party content, tool output)
- What can the model do? (Generate text, call tools, write data, send communication, move money)
- Who can influence the model's input — and how is that input authenticated?
- What happens if the model output is wrong, malicious, or coerced?
- What user confirmation is required for consequential actions?
- How are conversation, embedding, and cache stores isolated per tenant?
- What rate and cost limits prevent abuse?
The honest framing for LLM security is that the model is untrusted output of trusted reasoning, not trusted output of authoritative reasoning. Every protection downstream of the model — output handling, tool gating, user confirmation, tenant isolation — should assume the model can be made to say anything. That assumption is what makes the rest of the system safe.
AI security testing
Prompt injection, RAG leakage, tool-use safety, and the surrounding service — aligned to the OWASP Top 10 for LLM Applications.
See the engagement Common in this industryAI / Machine learning
Prompt injection, RAG leakage, tool-use safety, AI Act readiness.
See industry scopeRelated articles
Preparing for your first pentest? Download the SMB Pentest Readiness Checklist →