
RAG Chatbot Architecture: Feeding Claude Your Docs Without Hallucinations

The architecture decisions that separate a RAG chatbot demo from one that actually answers customers without making things up — from agency builds in India.

Rehdhil Siyad

Founder · Neogen Media

23 June 2026

8 min read


A RAG (retrieval-augmented generation) chatbot answers questions by first searching your own documents, then asking a large language model like Claude or GPT to write an answer using only what it found. That retrieval step is what keeps it honest: instead of guessing from training memory, the model is handed your real product docs and told to ground every reply in them.

We build these for clients every week, and the gap between a weekend tutorial and a chatbot you can put in front of paying customers is enormous. A demo that works on five PDFs falls apart on five thousand. This is the architecture playbook we wish more teams read before they shipped: the decisions that determine whether your bot cites a real policy or invents one.

What is a RAG chatbot, and why does it hallucinate less?

A RAG chatbot is an AI assistant that retrieves relevant passages from your knowledge base before generating a reply, so its answers are anchored to your source material rather than the model's general training. It hallucinates less because the model is given the facts at query time and instructed not to answer beyond them.

The technique comes from a 2020 paper by Patrick Lewis and colleagues at Facebook AI Research. They found that RAG models "generate more specific, diverse and factual language" than a model relying on its parameters alone. More than five years on, that finding is the backbone of nearly every serious business chatbot, because a model that quotes your refund policy is worth far more than one that confidently makes one up.

The pipeline has three moving parts: ingestion (your docs are chunked and converted to vectors), retrieval (a user question pulls the most relevant chunks from a vector database), and generation (the model writes an answer grounded in those chunks). Every hallucination problem you will ever hit lives in one of these three stages.
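The three stages can be sketched in a few lines of Python. This is a toy illustration, not a production build: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names and sample documents are made up for the example. The generation stage is shown only as the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(docs: dict[str, str]) -> list[dict]:
    """Stage 1: split each document into paragraph chunks and store a vector per chunk."""
    index = []
    for name, text in docs.items():
        for para in text.split("\n\n"):
            if para.strip():
                index.append({"source": name, "text": para.strip(), "vector": embed(para)})
    return index

def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    """Stage 2: return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)[:k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Stage 3 (input side): the grounded prompt that would be sent to the model."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical sample documents.
docs = {
    "refunds.md": "Refunds are issued within 14 days of purchase.\n\nContact support to start a refund.",
    "shipping.md": "Orders ship within 2 business days.",
}
index = ingest(docs)
top = retrieve(index, "How many days do I have for refunds?", k=1)
prompt = build_prompt("How many days do I have for refunds?", top)
```

Even in this toy version, a failure can be traced to one stage: a chunk split badly at ingestion, the wrong chunk ranked first at retrieval, or a prompt that lets the model wander at generation.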

What's the difference between an AI chatbot and a RAG chatbot?

A standard AI chatbot answers from the model's frozen training data, so it knows nothing about your business and will invent plausible-sounding details when asked. A RAG chatbot adds a live retrieval layer over your own content, so it answers from your documents and can cite exactly where each fact came from.

The practical difference shows up the moment a customer asks something specific. Ask a plain chatbot "what's your cancellation window in India?" and it guesses. Ask a RAG chatbot the same thing and it retrieves your actual terms, answers in one line, and links the clause. If you want the broader landscape of chatbot types before going deeper, our guide to AI chatbots for websites covers where RAG fits among the alternatives.

How do you choose an embedding model and vector store?

Pick an embedding model based on your content's language and domain, and a vector store based on scale and how much infrastructure you want to run. For most business knowledge bases, a managed embedding API plus a managed vector database beats self-hosting — you are buying reliability, not bragging rights.

Embeddings turn each chunk of text into a numerical fingerprint so the system can find passages by meaning, not keyword. The common choices:

OpenAI text-embedding-3 — strong general-purpose default, cheap, easy to wire up.
Cohere Embed — excellent multilingual coverage, useful when your docs mix English and Indian languages.
Voyage AI — high retrieval accuracy on technical and long-form content.

The vector store is where those fingerprints live and get searched. Pinecone is the fully-managed option that just works at scale; Qdrant is the open-source choice when you want to self-host or keep data on your own infrastructure for compliance. The honest rule: choose the embedding model for accuracy in your domain, choose the vector store for how much operational burden you can carry.

Whatever you pick, the retrieval quality of your embeddings sets the ceiling on the whole system. A weak embedding model means the right passage never gets retrieved — and the model cannot ground an answer in a document it was never shown.
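One way to make that ceiling visible is to measure it before choosing a model. The sketch below computes a hit rate at k: the share of test questions for which the expected chunk shows up in the top k results. The questions, chunk IDs, and the two sets of results are hypothetical; in practice they would come from your own labelled questions and from running each candidate embedding model against your index.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    """Share of questions whose expected chunk appears in the top-k retrieved chunk IDs."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for question, gold_id in expected.items()
        if gold_id in retrieved.get(question, [])[:k]
    )
    return hits / len(expected)

# Hypothetical labelled questions and the chunk each one should retrieve.
expected = {
    "What is the refund window?": "refunds#2",
    "Do you ship to Kochi?": "shipping#1",
    "How do I cancel?": "cancel#1",
    "Is GST included?": "pricing#4",
}

# Hypothetical ranked results under two candidate embedding models.
model_a = {
    "What is the refund window?": ["refunds#2", "refunds#1"],
    "Do you ship to Kochi?": ["shipping#1"],
    "How do I cancel?": ["cancel#1", "refunds#1"],
    "Is GST included?": ["pricing#1", "pricing#2", "pricing#3"],
}
model_b = {
    "What is the refund window?": ["refunds#1", "shipping#1", "cancel#1"],
    "Do you ship to Kochi?": ["shipping#1"],
    "How do I cancel?": ["pricing#4", "refunds#2", "refunds#1"],
    "Is GST included?": ["pricing#4"],
}

score_a = hit_rate_at_k(model_a, expected)
score_b = hit_rate_at_k(model_b, expected)
```

Any question that misses here is a question the model can never answer from your documents, however good the prompt is.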

How should you chunk and re-rank documents to avoid wrong answers?

Chunk your documents into passages small enough to be specific but large enough to keep their meaning, usually a few hundred tokens with some overlap, and add a re-ranking step that re-scores the retrieved chunks for true relevance before they reach the model. Chunking and re-ranking are the two cheapest levers for accuracy, and the two that most teams skip.

Bad chunking is the silent killer of RAG quality. Split a refund policy mid-sentence and the retriever pulls half a clause; the model fills the gap and hallucinates the rest. We chunk on semantic boundaries — headings, list items, full clauses — not arbitrary character counts, and we attach context to each chunk so an isolated passage still knows which document and section it came from.
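A minimal sketch of boundary-aware chunking, assuming the source documents use Markdown-style `#` headings and blank lines between paragraphs. The `Chunk` class, `chunk_by_section`, and the `max_words` limit are illustrative names and values, not a specific library's API, and overlap between chunks is left out to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    document: str
    section: str
    text: str

    def for_embedding(self) -> str:
        """Prefix the passage with its document and section so it keeps its context."""
        return f"{self.document} > {self.section}\n{self.text}"

def chunk_by_section(document: str, raw: str, max_words: int = 120) -> list[Chunk]:
    """Split on headings and paragraph breaks, never in the middle of a paragraph."""
    chunks: list[Chunk] = []
    section = "Introduction"  # label used for text that appears before any heading
    buffer: list[str] = []

    def flush() -> None:
        # Emit whatever paragraphs are buffered as one chunk under the current section.
        if buffer:
            chunks.append(Chunk(document, section, "\n\n".join(buffer)))
            buffer.clear()

    for block in raw.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # close the previous section before switching
            section = block.lstrip("#").strip()
            continue
        words_so_far = sum(len(b.split()) for b in buffer)
        if buffer and words_so_far + len(block.split()) > max_words:
            flush()  # start a new chunk at a paragraph boundary
        buffer.append(block)
    flush()
    return chunks

# Hypothetical policy document.
raw = (
    "# Refunds\n\nRefunds are issued within 14 days.\n\nContact support to start one."
    "\n\n# Shipping\n\nOrders ship in 2 business days."
)
chunks = chunk_by_section("policy.md", raw)
```

Embedding the output of `for_embedding()` rather than the bare text is one simple way to keep an isolated passage tied to its document and section.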

Re-ranking is the upgrade that pays for itself. The first retrieval pass is fast but rough; a re-ranker then re-orders the top candidates so the genuinely most relevant passage lands first. Anthropic's engineering team reported that contextual retrieval cut the rate of failed retrievals by 49%, and by 67% once a re-ranking step was added. Every retrieval failure you remove is one less answer where the model has to fill a gap, so a large share of the hallucinations those failures cause goes with them.
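The two-pass idea can be sketched as follows. Both scoring functions are deliberately simple stand-ins: `first_pass` counts shared words, and `toy_relevance` uses Jaccard overlap as a placeholder for a real re-ranking model (typically a cross-encoder or a hosted re-rank endpoint). The passages and function names are made up for illustration.

```python
import re
from typing import Callable

def tokens(text: str) -> set[str]:
    """Lowercased word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def first_pass(question: str, passages: list[str], k: int = 10) -> list[str]:
    """Fast, rough recall stage: rank by the raw count of shared words."""
    q = tokens(question)
    return sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)[:k]

def toy_relevance(question: str, passage: str) -> float:
    """Placeholder re-ranker score: Jaccard overlap, which penalises long, loosely related text."""
    q, p = tokens(question), tokens(passage)
    return len(q & p) / len(q | p) if (q | p) else 0.0

def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float] = toy_relevance,
    keep: int = 3,
) -> list[str]:
    """Second stage: re-score the candidates and keep only the best few."""
    return sorted(candidates, key=lambda c: score(question, c), reverse=True)[:keep]

# Hypothetical passages.
precise = "The refund window is 14 days."
rambling = (
    "What the team checks is on the refund form: the order date, "
    "the window for returns, and whether the seller is at fault."
)
unrelated = "Orders ship within 2 business days."

question = "What is the refund window?"
candidates = first_pass(question, [precise, rambling, unrelated])
best = rerank(question, candidates, keep=1)
```

Here the rough first pass puts the rambling passage on top because it shares more words with the question, and the second pass promotes the passage that actually answers it.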

How do you actually stop a RAG chatbot from hallucinating?

You stop hallucinations with grounding rules and visible citations: instruct the model to answer only from retrieved context, to say "I don't know" when the context is thin, and to attach a source link to every claim. Retrieval reduces hallucination; explicit grounding instructions and citation display close the gap.

Three enforcement layers we put on every production build:

Hard grounding in the system prompt — the model is told, in plain terms, to use only the supplied passages and to refuse rather than guess when they don't cover the question.
A confidence floor — if the best retrieved chunk scores below a relevance threshold, the bot hands off to a human or asks a clarifying question instead of answering.
Citation display in the UI — every answer shows the source it used, so a user (and your team) can verify it in one click. A bot that shows its work is a bot people trust.
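The three layers above can be sketched as one function. Everything here is illustrative: the wording of `GROUNDING_RULES`, the `0.5` floor, and the `generate` callable (a stand-in for whatever model call you use) are assumptions for the example, and a sensible floor depends on how your retriever scores relevance.

```python
from typing import Callable, NamedTuple

class Retrieved(NamedTuple):
    score: float   # relevance score from the retriever or re-ranker
    source: str    # where the passage came from, shown to the user as a citation
    text: str

# Layer 1: hard grounding instructions (example wording only).
GROUNDING_RULES = (
    "Answer only from the numbered passages below. "
    "If they do not cover the question, reply exactly: I don't know. "
    "Cite the passage number for every claim."
)

def answer_or_handoff(
    question: str,
    retrieved: list[Retrieved],
    generate: Callable[[str], str],
    floor: float = 0.5,
) -> dict:
    """Apply the confidence floor, then build a grounded prompt and return sources for display."""
    # Layer 2: if nothing clears the floor, do not let the model answer at all.
    best = max((r.score for r in retrieved), default=0.0)
    if best < floor:
        return {"action": "handoff", "answer": None, "sources": []}

    usable = [r for r in retrieved if r.score >= floor]
    context = "\n".join(f"[{i}] ({r.source}) {r.text}" for i, r in enumerate(usable, 1))
    prompt = f"{GROUNDING_RULES}\n\nPassages:\n{context}\n\nQuestion: {question}"

    # Layer 3: return the sources alongside the answer so the UI can show them.
    return {"action": "answer", "answer": generate(prompt), "sources": [r.source for r in usable]}

# Hypothetical usage with a fake model call.
fake_model = lambda prompt: "The refund window is 14 days [1]."
hits = [
    Retrieved(0.82, "refund-policy#3", "Refunds are accepted within 14 days."),
    Retrieved(0.31, "shipping#1", "Orders ship in 2 business days."),
]
result = answer_or_handoff("What is the refund window?", hits, fake_model)
weak = answer_or_handoff("Do you sell gift cards?", hits[1:], fake_model)
```

Keeping the floor check outside the prompt means a weak retrieval never reaches the model, so the refusal does not depend on the model following instructions.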

This is also where most teams underestimate the effort. The retrieval pipeline is maybe 40% of the job; grounding behaviour, fallback logic, and citation UX are the rest. We orchestrate the whole flow in n8n — retrieval, the model call to Claude or Gemini, the confidence check, and the human handoff — so the rules live in one auditable workflow rather than scattered across code. If you'd rather have this built and maintained for you, our AI website chatbot service ships exactly this architecture, grounded in your own documents.

What are the 7 types of RAG?

The seven commonly cited RAG architectures are Naïve, Advanced, Modular, Graph, Hybrid, Agentic, and Multi-Hop RAG. They form a ladder of sophistication — each one adds a way to retrieve more precisely or reason over multiple sources before answering.

Naïve RAG — basic retrieve-then-generate; fine for small, clean knowledge bases.
Advanced RAG — adds pre-retrieval query rewriting and post-retrieval re-ranking for accuracy.
Modular RAG — swappable components so you can tune each stage independently.
Graph RAG — retrieves over a knowledge graph to follow relationships between entities.
Hybrid RAG — blends keyword (BM25) and vector search so exact terms and meaning both count.
Agentic RAG — the model decides when and what to retrieve, and can call tools mid-answer.
Multi-Hop RAG — chains several retrievals to answer questions that span multiple documents.

Most business chatbots only need Advanced or Hybrid RAG done well. Reaching for Agentic or Graph RAG before you have nailed chunking and re-ranking is how projects stall. The Reddit threads full of "my RAG bot became my worst nightmare" are almost always a case of skipped fundamentals, not of an architecture that was too simple.
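For Hybrid RAG, the blending step is often the only new code. One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF); it is shown here as one option, not as the only way to build a hybrid retriever. The chunk IDs are hypothetical, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs; items ranked high in any list float up."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the same question from two retrievers.
keyword_hits = ["refunds#2", "pricing#4", "shipping#1"]   # e.g. from BM25
vector_hits = ["pricing#4", "shipping#1", "refunds#2"]    # e.g. from embedding search

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that keyword scores and vector similarities sit on different scales.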

Frequently Asked Questions

Is ChatGPT a RAG chatbot?

Not by default. The base ChatGPT answers from its training data with no retrieval. It becomes a RAG system only when you connect it to external data — through browsing, file uploads, or a custom knowledge base via the API. A purpose-built RAG chatbot wires that retrieval layer to your specific documents so every answer is grounded in your content.

Does a RAG chatbot eliminate hallucinations completely?

No — it reduces them sharply but does not eliminate them. Hallucinations still occur when retrieval misses the right passage or the model strays beyond the context. Strong chunking, re-ranking, hard grounding instructions, and a confidence floor that triggers a human handoff are what push the residual error rate low enough for production use.

Which language model is best for a RAG chatbot?

Claude, GPT, and Gemini all work well; the model matters less than your retrieval quality. We pick based on context window, cost, and how strictly a model follows grounding instructions — Claude and Gemini both handle long retrieved context and refusal behaviour reliably. Invest your effort in retrieval first; a great model on weak retrieval still hallucinates.

How long does it take to build a production RAG chatbot?

A working prototype takes days; a production system that handles real customer traffic without embarrassing answers usually takes a few weeks. The extra time goes into chunking strategy, re-ranking, grounding rules, citation display, fallback logic, and testing against real questions. The retrieval pipeline is the fast part — trustworthy behaviour is the work.

Can a RAG chatbot work with documents in Indian languages?

Yes. With a multilingual embedding model such as Cohere Embed and a capable model like Gemini or Claude, a RAG chatbot can retrieve and answer across English and major Indian languages. The key is choosing embeddings that represent your languages well, since retrieval accuracy in each language depends heavily on the embedding model's coverage of that language.

Build it once, build it right

A RAG chatbot is only as trustworthy as its architecture. Get chunking, retrieval, re-ranking, and grounding right and you have a bot that answers from your real documents and shows its sources. Skip them and you have a confident liar with your logo on it. If you want a grounded, citation-backed chatbot built on your own product docs, talk to our team and we'll architect it with you.

Rehdhil Siyad, Founder · Neogen Media

Founder and Director at Neogen Media. Writing field notes on AI automation, growth systems, and the integrated playbook we ship for Indian SMBs. Based in Kochi.
