Design: RAG System for Customer Support¶
This is the #1 AI system design question. A must-know for any LLM/AI role.
The Question¶
"Design a RAG-based customer support chatbot that answers questions from a company's knowledge base."
Step 1: Clarify¶
- Knowledge base: ~500-5000 articles (help docs, FAQs, product guides)
- Users: Customer-facing, ~1000 concurrent users at peak
- Latency: < 3 seconds for response
- Accuracy: Must be grounded — no hallucination on policy/pricing questions
- Features: Citations, "I don't know" for gaps, multi-turn conversation
- Updates: Knowledge base updated weekly
Step 2: High-Level Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ INGESTION (Offline) │
│ Documents → Chunking → Embedding → Vector Store (FAISS) │
│ (Scheduled pipeline, runs on KB updates) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SERVING (Online) │
│ User Query │
│ → Embedding │
│ → Vector Search (top 10) │
│ → Cross-encoder Reranking (top 3) │
│ → Relevance Filter (threshold) │
│ → LLM (system prompt + context + query) │
│ → Response + Citations │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ API LAYER │
│ FastAPI → Auth → Rate Limiting → Chat Endpoint │
│ WebSocket for streaming responses │
└─────────────────────────────────────────────────────────────┘
Step 3: Deep Dive¶
Ingestion Pipeline¶
- Document loading: Parse HTML/MD/PDF → plain text
- Chunking: Semantic chunking (group by meaning, not fixed size)
- Why: "In my project, semantic chunking reduced 322 noisy chunks to 82 high-quality ones"
- Embedding: Sentence Transformers (all-MiniLM-L6-v2) for cost-efficiency, or OpenAI ada-002 for quality
- Storage: FAISS for < 100K chunks. Pinecone/Qdrant for scale + filtering
- Schedule: Re-run ingestion on KB updates (cron or event-triggered)
Retrieval¶
- Two-stage pipeline:
- FAISS bi-encoder search → top 10 candidates (fast, ~50ms)
- Cross-encoder reranking → top 3 (accurate, ~200ms)
- Relevance threshold: If best score < threshold, return "I don't have information about that" instead of hallucinating
- Why two stages: "Cross-encoder is too slow for full corpus, bi-encoder too imprecise for final ranking. Two stages give you speed + accuracy."
LLM Layer¶
- Model: GPT-4 / Claude for quality, Llama 3 via Groq for cost/speed
- System prompt: Persona + rules (only use provided context, cite sources, say "I don't know")
- Context window management: Only inject top 3 chunks to avoid noise
- Streaming: Token-by-token via SSE/WebSocket for perceived speed
API Layer¶
- FastAPI with async endpoints
- JWT authentication for user sessions
- Rate limiting (10 req/min per user)
- Conversation history stored in Redis (TTL: 30 min)
Step 4: Edge Cases¶
| Scenario | Handling |
|---|---|
| KB doesn't have the answer | Relevance threshold → "I don't know" + suggest contacting support |
| Hallucination risk | System prompt grounding + only use retrieved context + citations |
| Adversarial queries | Content filter on input + output, refuse policy-violating queries |
| Stale KB | Scheduled re-ingestion + cache invalidation |
| High latency | Cache frequent queries, pre-compute common embeddings |
| Multi-language | Multilingual embedding model, or translate → retrieve → translate back |
Step 5: Scale & Monitor¶
Scaling¶
- Vector DB: Move from FAISS to Pinecone (serverless, auto-scales)
- API: Horizontal scaling behind load balancer
- Caching: Redis cache for frequent queries (cache key = embedding hash)
- Async: Background ingestion with Celery/Airflow
Monitoring¶
- Latency: P50, P95, P99 per component (retrieval, reranking, LLM)
- Quality: User feedback (thumbs up/down), automated eval (RAGAS)
- Cost: Token usage per query, cost per conversation
- Alerts: Latency > 5s, error rate > 1%, low user satisfaction
Improvement Loop¶
How This Maps to Bubble¶
| Design Component | Your Bubble Implementation |
|---|---|
| Chunking | Semantic chunking (322 → 82 chunks) |
| Embedding | Sentence Transformers (HuggingFace) |
| Vector Store | FAISS |
| Two-stage retrieval | Bi-encoder + Cross-encoder reranking |
| Relevance filtering | Configurable threshold |
| LLM | Groq (Llama 3) |
| Agent orchestration | LangGraph |
| Config-driven | Pydantic + YAML (swap models without code changes) |
Interview tip: "I've actually built this. Let me walk you through my implementation and the decisions I made..."