Large language models are confident even when they are wrong. If you have wired Claude or GPT into an n8n workflow and watched it invent a policy number or cite a doc that does not exist, you already know the failure mode. The fix is not a bigger model — it is retrieval-augmented generation (RAG): ground every answer in your documents, and force the model to cite what it used.
This guide builds a two-part RAG system entirely in n8n: an ingestion pipeline that chunks and embeds your docs into a Qdrant vector store, and a query pipeline that retrieves the most relevant chunks and hands them to Claude with a strict “cite or refuse” prompt. You get working node configs, the JSON snippets that matter, and the latency and cost numbers we measured on a self-hosted instance. This assumes you know what an API, a webhook, and JSON are, but not necessarily how n8n handles AI nodes.
Why build RAG in n8n instead of a Python script
A bare LangChain script is faster to prototype, but it rots in production: no retry UI, no run history, no easy way to let a non-engineer swap the system prompt. n8n gives you durable execution, visible run logs, credential management, and a webhook front door for free. The trade-off is that you think in nodes instead of functions — which is exactly what makes the pipeline auditable later. If you have already followed our MCP Server Trigger walkthrough, this is the natural next step: instead of exposing a tool, you are grounding a model.
Architecture at a glance
Two independent workflows share one Qdrant collection:
- Ingestion (batch, scheduled or manual): Load docs → split into chunks → embed each chunk → upsert vectors + metadata into Qdrant.
- Query (real time, webhook): Receive a question → embed it → vector-search Qdrant for top-k chunks → build a grounded prompt → call Claude → return the answer with source citations.
Keeping ingestion separate from query means you can re-index nightly without touching the live endpoint, and you can scale the two paths independently — relevant if you run n8n in queue mode under load.
Prerequisites
- An n8n instance (self-hosted Docker or n8n Cloud). RAG nodes work on both, but self-host lets you keep document text on your own infra.
- A running Qdrant instance. The fastest path is Docker:
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
- An Anthropic API key for Claude, and an embeddings provider. We used OpenAI text-embedding-3-small (1536 dims, cheap and good enough); the n8n OpenAI integration covers credential setup.
Part 1 — The ingestion workflow, node by node
1. Trigger and load documents
Start with a Manual Trigger (swap for a Schedule Trigger once it is stable). Add a Read/Write Files from Disk node, or an HTTP Request node if your docs live in a CMS. For this example we load Markdown files from /data/docs.
2. Split into chunks
Use the Recursive Character Text Splitter sub-node. Oversized chunks dilute retrieval; tiny chunks lose context. Our tested sweet spot for technical docs:
{
"chunkSize": 800,
"chunkOverlap": 120,
"separators": ["\n## ", "\n### ", "\n\n", "\n", " "]
}
The 120-character overlap (the splitter counts characters, not tokens) keeps a sentence that straddles a boundary retrievable from either chunk. It was the single change that moved our answer accuracy the most.
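To make the effect of overlap concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in, not the Recursive Character Text Splitter itself (which also tries the separators in order), and chunkWithOverlap is a hypothetical helper name; the sizes are shrunk so the output is easy to read.

```javascript
// Simplified stand-in for a text splitter: fixed-size character windows
// that overlap. Illustration only; n8n's recursive splitter is smarter.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";
const chunks = chunkWithOverlap(sample, 10, 3);
console.log(chunks);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both neighbours.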
3. Embed and upsert into Qdrant
Add the Qdrant Vector Store node in Insert mode, with an Embeddings OpenAI sub-node attached. Configure the collection once so dimensions match your model:
// Qdrant collection config (create once via HTTP Request or Qdrant UI)
PUT /collections/n8n_docs
{
"vectors": { "size": 1536, "distance": "Cosine" }
}
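If you would rather create the collection from a script than from the Qdrant UI, the same request can be sketched as below. This assumes Qdrant is reachable at http://localhost:6333 (the port mapped in the Docker command) and a runtime with a global fetch (Node 18+); the helper names are ours, not part of any library.

```javascript
// Builds the HTTP request that creates a Qdrant collection.
// Assumption: baseUrl points at your Qdrant instance (Docker default port 6333).
function buildCreateCollectionRequest(baseUrl, collection, size) {
  return {
    url: `${baseUrl}/collections/${encodeURIComponent(collection)}`,
    options: {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      // size must equal the embedding model's output dimension
      body: JSON.stringify({ vectors: { size, distance: "Cosine" } }),
    },
  };
}

// Sends the request. Nothing is sent until you call this yourself.
async function createCollection(baseUrl, collection, size) {
  const { url, options } = buildCreateCollectionRequest(baseUrl, collection, size);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Qdrant returned HTTP ${res.status}`);
  return res.json();
}

const req = buildCreateCollectionRequest("http://localhost:6333", "n8n_docs", 1536);
console.log(req.options.method, req.url, req.options.body);
```

Keeping the request builder separate from the network call makes it easy to inspect the payload before anything is sent.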
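Store useful metadata alongside each vector so the query side can cite sources. In current n8n versions the metadata fields are typically set under the Metadata option of the Default Data Loader sub-node attached to the Vector Store node; the values below show the intent: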
{
"collectionName": "n8n_docs",
"metadata": {
"source": "={{ $json.fileName }}",
"title": "={{ $json.title }}",
"url": "={{ $json.url }}"
}
}
Run it once. A few hundred chunks upsert in seconds. You now have a searchable knowledge base.
Part 2 — The query workflow with citations
1. Webhook entry point
Add a Webhook node (POST, path /rag/ask) expecting { "question": "..." }. This is your API. Anything that speaks HTTP — a Slack slash command, a support widget, another workflow — can call it.
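Because the endpoint is open to any HTTP caller, it can help to reject malformed bodies before spending an embedding call. The guide's workflow does not include this step; the following is an optional sketch of the check you might put in a Code node right after the Webhook, with parseQuestion as a hypothetical helper name.

```javascript
// Validates the webhook body shape { "question": "..." }.
// Returns a result object instead of throwing so the workflow can branch on it.
function parseQuestion(body) {
  const question =
    typeof body?.question === "string" ? body.question.trim() : "";
  if (question.length === 0) {
    return {
      ok: false,
      status: 400,
      error: 'Expected a JSON body like { "question": "..." }',
    };
  }
  return { ok: true, question };
}

console.log(parseQuestion({ question: "  How do refunds work?  " }));
console.log(parseQuestion({ query: "wrong key" }));
```

In n8n the parsed body is available as $json.body on the Webhook node's output, so that is the value you would pass in.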
2. Retrieve top-k chunks
Add a second Qdrant Vector Store node in Get Many mode (the "load" value in the config below), attached to the same Embeddings sub-node you used for ingestion. It embeds the question and pulls the closest chunks:
{
"mode": "load",
"collectionName": "n8n_docs",
"prompt": "={{ $json.body.question }}",
"topK": 5
}
Five chunks is a deliberate balance: enough context to answer multi-part questions, few enough to keep the prompt cheap and on-topic. Above eight, we saw Claude start blending unrelated chunks.
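For intuition about what top-k retrieval does, here is a brute-force sketch that scores every stored vector against the query with cosine similarity and keeps the best k. Qdrant does not work this way internally (it uses an approximate HNSW index), and the three-dimensional vectors and ids are made-up sample data.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score every point, sort descending, keep the first k.
function topK(queryVector, points, k) {
  return points
    .map((p) => ({ ...p, score: cosineSimilarity(queryVector, p.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Made-up sample data standing in for embedded chunks.
const points = [
  { id: "refunds", vector: [1, 0, 0] },
  { id: "shipping", vector: [0, 1, 0] },
  { id: "returns", vector: [0.9, 0.1, 0] },
];
const hits = topK([1, 0, 0], points, 2);
console.log(hits.map((h) => h.id)); // [ 'refunds', 'returns' ]
```

Raising k pulls in lower-scoring neighbours such as "shipping" here, which is the blending effect described above.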
3. Assemble the grounded prompt
Use a Code node to format retrieved chunks into a numbered context block the model can cite by index:
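// Get Many usually returns each hit as { document: { pageContent, metadata }, score };
// fall back to the flat shape in case your node version differs.
const chunks = $input.all().map((item, i) => {
  const doc = item.json.document || item.json;
  const m = doc.metadata || {};
  return `[${i + 1}] (${m.title || m.source || "untitled"})\n${doc.pageContent}`;
}).join("\n\n");

// One output item carrying the original question and the numbered context block.
return [{ json: {
  question: $('Webhook').first().json.body.question,
  context: chunks
}}];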
4. Call Claude with a “cite or refuse” system prompt
Add the Anthropic Chat Model node (model claude-sonnet-4-6 at the time of writing). The system prompt is where grounding is enforced — this is the most important text in the whole pipeline:
System: You answer ONLY from the numbered CONTEXT below.
Cite sources inline as [1], [2] for every claim.
If the context does not contain the answer, reply exactly:
"I don't have that in the knowledge base." Do not use outside knowledge.
User:
CONTEXT:
{{ $json.context }}
QUESTION: {{ $json.question }}
The explicit refusal string is what turns a plausible-sounding guess into a trustworthy “I don’t know” — the behavior that makes RAG safe to put in front of customers. The same grounding discipline powers our Claude ticket-triage agent.
5. Return the answer
Finish with a Respond to Webhook node returning the model output plus the source list, so the caller can render clickable citations.
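One way to assemble that payload is to parse the [n] markers out of the answer and map them back to the retrieved chunks. The sketch below assumes the answer is a plain string and each chunk carries title and url metadata; buildResponse, the response field names, and the sample answer are hypothetical.

```javascript
// Must match the refusal string in the system prompt exactly.
const REFUSAL = "I don't have that in the knowledge base.";

// Maps inline [n] citations back to retrieved chunks (1-based, as in the prompt).
function buildResponse(answer, chunks) {
  const cited = new Set();
  for (const match of answer.matchAll(/\[(\d+)\]/g)) {
    cited.add(Number(match[1]));
  }
  const sorted = [...cited].sort((a, b) => a - b);
  const inRange = (n) => n >= 1 && n <= chunks.length;
  return {
    answer,
    refused: answer.trim() === REFUSAL,
    // Only sources the model actually cited, in index order.
    sources: sorted.filter(inRange).map((n) => ({
      index: n,
      title: chunks[n - 1].title,
      url: chunks[n - 1].url,
    })),
    // Citations pointing at chunks that were never supplied.
    invalidCitations: sorted.filter((n) => !inRange(n)),
  };
}

// Made-up sample data.
const retrieved = [
  { title: "Refund policy", url: "https://example.com/refunds" },
  { title: "Shipping", url: "https://example.com/shipping" },
];
const out = buildResponse("Refunds take 5 days [1]. See also [3].", retrieved);
console.log(JSON.stringify(out, null, 2));
```

A non-empty invalidCitations list is a cheap signal that the model cited something it was not given, which is worth logging.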
Results we measured
Running on a single self-hosted n8n container (2 vCPU, 4 GB RAM) with Qdrant local and a 4,200-chunk corpus of technical docs:
- End-to-end latency: 1.6–2.3 s per question (embed ~120 ms, Qdrant search ~25 ms, Claude the rest).
- Cost: roughly $0.004 per query with text-embedding-3-small + Sonnet, dominated by the model call.
- Grounding: with the refusal prompt, out-of-scope questions were declined in 47 of 50 adversarial tests, versus 9 of 50 without it. That is roughly 5x more refusals, and it cut confidently answered out-of-scope questions from 41 to 3.
- Re-index time: the full corpus re-embedded in about 90 s, cheap enough to schedule nightly.
The takeaway: most of your accuracy gains come from two unglamorous knobs — chunk overlap and a strict refusal instruction — not from a more expensive model.
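The grounding figures can be unpacked with a few lines of arithmetic. The inputs are the counts reported above; the variable names are ours.

```javascript
// Counts from the adversarial out-of-scope test described above.
const total = 50;
const declinedWith = 47;   // refusal prompt in place
const declinedWithout = 9; // same retrieval, no refusal rule

// Questions the model answered even though they were out of scope.
const answeredWith = total - declinedWith;       // 3
const answeredWithout = total - declinedWithout; // 41

const stats = {
  refusalRateWith: declinedWith / total,             // 0.94
  refusalRateWithout: declinedWithout / total,       // 0.18
  refusalIncrease: declinedWith / declinedWithout,   // about 5.2
  answeredReduction: answeredWithout / answeredWith, // about 13.7
};
console.log(stats);
```

Refusals go up a little over fivefold, while out-of-scope questions answered anyway fall from 41 to 3.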
Where to take it next
Add a re-ranking step (a cheap cross-encoder) between retrieval and the prompt to squeeze the top-k. Log every question and its retrieved chunks to a database so you can spot gaps in your knowledge base — the same feedback loop we use for SEO content briefs. And if you want this pipeline callable as a tool by an external agent, expose it through the MCP Server Trigger instead of a plain webhook.
Frequently asked questions
Do I need a paid vector database for n8n RAG?
No. Qdrant is open source and runs in a single Docker container, which is what this guide uses. Managed options (Qdrant Cloud, Pinecone, Weaviate) save you ops work but are not required to get a production-quality pipeline running.
Can I use Claude for both embeddings and generation?
Anthropic provides the generation model; for embeddings you pair it with a dedicated embeddings provider such as OpenAI’s text-embedding-3-small or an open model via Ollama. n8n lets you mix providers by attaching whichever Embeddings sub-node you prefer to the Vector Store node.
How big can my document corpus be before this slows down?
Qdrant handles millions of vectors with sub-100 ms search using HNSW indexing, so retrieval is rarely the bottleneck. The practical limit is your embedding cost at ingestion and the context window you pass to Claude — which is why top-k retrieval matters more than corpus size.
How do I stop the model from answering outside the documents?
Enforce it in the system prompt with an explicit refusal string (“I don’t have that in the knowledge base”) and an instruction to use only the numbered context. In our tests this raised refusals on out-of-scope questions from 9 of 50 to 47 of 50 compared with the same retrieval and no refusal rule.