Skip to content

Design: RAG System for Customer Support

This is the #1 AI system design question. A must-know for any LLM/AI role.


The Question

"Design a RAG-based customer support chatbot that answers questions from a company's knowledge base."


Step 1: Clarify

  • Knowledge base: ~500-5000 articles (help docs, FAQs, product guides)
  • Users: Customer-facing, ~1000 concurrent users at peak
  • Latency: < 3 seconds for response
  • Accuracy: Must be grounded — no hallucination on policy/pricing questions
  • Features: Citations, "I don't know" for gaps, multi-turn conversation
  • Updates: Knowledge base updated weekly

Step 2: High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                     INGESTION (Offline)                       │
│  Documents → Chunking → Embedding → Vector Store (FAISS)     │
│  (Scheduled pipeline, runs on KB updates)                    │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     SERVING (Online)                          │
│  User Query                                                  │
│    → Embedding                                               │
│    → Vector Search (top 10)                                  │
│    → Cross-encoder Reranking (top 3)                         │
│    → Relevance Filter (threshold)                            │
│    → LLM (system prompt + context + query)                   │
│    → Response + Citations                                    │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     API LAYER                                │
│  FastAPI → Auth → Rate Limiting → Chat Endpoint              │
│  WebSocket for streaming responses                           │
└─────────────────────────────────────────────────────────────┘

Step 3: Deep Dive

Ingestion Pipeline

  • Document loading: Parse HTML/MD/PDF → plain text
  • Chunking: Semantic chunking (group by meaning, not fixed size)
  • Why: "In my project, semantic chunking reduced 322 noisy chunks to 82 high-quality ones"
  • Embedding: Sentence Transformers (all-MiniLM-L6-v2) for cost-efficiency, or OpenAI ada-002 for quality
  • Storage: FAISS for < 100K chunks. Pinecone/Qdrant for scale + filtering
  • Schedule: Re-run ingestion on KB updates (cron or event-triggered)

Retrieval

  • Two-stage pipeline:
  • FAISS bi-encoder search → top 10 candidates (fast, ~50ms)
  • Cross-encoder reranking → top 3 (accurate, ~200ms)
  • Relevance threshold: If best score < threshold, return "I don't have information about that" instead of hallucinating
  • Why two stages: "Cross-encoder is too slow for full corpus, bi-encoder too imprecise for final ranking. Two stages give you speed + accuracy."

LLM Layer

  • Model: GPT-4 / Claude for quality, Llama 3 via Groq for cost/speed
  • System prompt: Persona + rules (only use provided context, cite sources, say "I don't know")
  • Context window management: Only inject top 3 chunks to avoid noise
  • Streaming: Token-by-token via SSE/WebSocket for perceived speed

API Layer

  • FastAPI with async endpoints
  • JWT authentication for user sessions
  • Rate limiting (10 req/min per user)
  • Conversation history stored in Redis (TTL: 30 min)

Step 4: Edge Cases

Scenario Handling
KB doesn't have the answer Relevance threshold → "I don't know" + suggest contacting support
Hallucination risk System prompt grounding + only use retrieved context + citations
Adversarial queries Content filter on input + output, refuse policy-violating queries
Stale KB Scheduled re-ingestion + cache invalidation
High latency Cache frequent queries, pre-compute common embeddings
Multi-language Multilingual embedding model, or translate → retrieve → translate back

Step 5: Scale & Monitor

Scaling

  • Vector DB: Move from FAISS to Pinecone (serverless, auto-scales)
  • API: Horizontal scaling behind load balancer
  • Caching: Redis cache for frequent queries (cache key = embedding hash)
  • Async: Background ingestion with Celery/Airflow

Monitoring

  • Latency: P50, P95, P99 per component (retrieval, reranking, LLM)
  • Quality: User feedback (thumbs up/down), automated eval (RAGAS)
  • Cost: Token usage per query, cost per conversation
  • Alerts: Latency > 5s, error rate > 1%, low user satisfaction

Improvement Loop

User Feedback → Identify Bad Responses → Improve Chunks/Prompts → A/B Test → Deploy

How This Maps to Bubble

Design Component Your Bubble Implementation
Chunking Semantic chunking (322 → 82 chunks)
Embedding Sentence Transformers (HuggingFace)
Vector Store FAISS
Two-stage retrieval Bi-encoder + Cross-encoder reranking
Relevance filtering Configurable threshold
LLM Groq (Llama 3)
Agent orchestration LangGraph
Config-driven Pydantic + YAML (swap models without code changes)

Interview tip: "I've actually built this. Let me walk you through my implementation and the decisions I made..."