I Built a Production RAG System from Scratch
Real-world lessons from building, deploying, and scaling a knowledge retrieval system
I’ve been writing on Substack for five months now. I started with 0 readers. Today, more than 1,300 of you hear what I have to say each week, and I couldn’t be more grateful. 🙏
The more I write, the more I find myself repeating the same answers in DMs or emails or LinkedIn messages:
“How did you break into data science without a PhD?”
"How can I learn AI without a technical background?"
"What projects should I build to stand out?"
I was asking these questions myself six years ago, and I’ve written extensively about them. But it gets tedious digging through my own archive, copy-pasting advice, trying to remember what I said where.
So I thought: What if I built a RAG system that could surface my past insights instantly?
Not just another ChatGPT wrapper, but something that actually understands the context of my writing and responds in my voice: an AI chatbot that grounds its answers in my own posts to prevent hallucinations.
The Technical Challenge
Building a RAG system that feels conversational turned out to be harder than I expected. Here's how I did it:
System Architecture
Embeddings: OpenAI's text-embedding-3-small (a good sweet spot between cost and performance)
Vector DB: Supabase with pgvector
LLM: GPT-4o for generation
Frontend: Streamlit with streaming responses
Data Pipeline: RSS parsing + manual exports (Substack has no public API)
Here’s what happens under the hood:
User types a question
System generates embedding for the query
Cosine similarity search retrieves top 5 relevant documents
Retrieved context gets fed to GPT-4o with custom prompts designed to maintain my writing voice and expertise
Response streams back in real time
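The retrieval half of those steps can be sketched in a few lines. This is a minimal, offline illustration: `embed` is a toy stand-in for OpenAI's text-embedding-3-small (the real call is `client.embeddings.create(...)`), and an in-memory dict stands in for Supabase/pgvector. The function names are mine, not the production system's.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for text-embedding-3-small: a bag-of-letters vector,
    # so this example runs offline. In production, call the OpenAI API.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query: str, corpus: dict[str, list[float]], k: int = 5) -> list[str]:
    # Step 2: embed the query; Step 3: rank documents by cosine similarity.
    q = embed(query)
    ranked = sorted(corpus.items(), key=lambda kv: cosine_similarity(q, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]

corpus = {title: embed(title) for title in [
    "How I broke into data science",
    "Portfolio projects that stand out",
    "Lessons from FinTech",
]}
top = retrieve_top_k("data science career advice", corpus, k=2)
```

In the live system the cosine search runs inside Postgres via pgvector rather than in Python, but the logic is the same: embed once, rank everything, keep the top 5, hand that context to GPT-4o.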
The Chunking Decision
I kept this simpler than most RAG tutorials suggest. Instead of breaking posts into paragraphs or fixed token chunks, I treat each blog post as one unit. This works for my content because:
My posts are typically 500-1500 words (fits comfortably in context window)
Each post covers one coherent topic
Retrieval happens at the post level
For massive archives or very long-form content, you’d definitely want smarter chunking, but I haven’t run into this problem yet.
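The whole-post-as-one-chunk rule is easy to sanity-check at ingestion time. Here's a rough sketch; the ~1.3 tokens-per-word ratio is a common heuristic, not an exact count (use tiktoken against the target model for real numbers), and the 4,000-token cutoff is illustrative:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: English prose runs ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)

def make_chunks(posts: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Treat each post as a single chunk; flag anything too long to embed whole."""
    chunks, too_long = [], []
    for post in posts:
        n = approx_tokens(post["body"])
        (chunks if n <= max_tokens else too_long).append({**post, "approx_tokens": n})
    if too_long:
        # These posts would need paragraph- or token-window chunking instead.
        print(f"{len(too_long)} post(s) exceed {max_tokens} tokens")
    return chunks

posts = [{"title": "My first post", "body": "word " * 1000}]
chunks = make_chunks(posts)
```

The nice side effect of flagging oversized posts at ingestion is that you only pay the complexity cost of smarter chunking when a post actually demands it.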
For a more technical deep-dive, check out my complete guide with code snippets here.
Early Results and Learnings
This is launching with this post, so I don’t have usage data yet. But I've been testing it myself, and here's what I've noticed:
Query Types that Work Well
Career transition questions (pulls from my personal story posts)
Technical project advice (references my portfolio posts)
Industry insights (draws from my FinTech experience posts)
Where It Struggles
Time-based queries: "What did you write about in 2023?" or "Your latest thoughts on X" don't work well since embeddings don't capture temporal relationships. A hybrid approach with metadata filtering would solve this.
Relevance boundaries: The system sometimes tries to answer questions I've never written about, leading to generic responses. I handle this with a similarity score threshold: if no retrieved content scores above it, the system returns "I haven't written about this yet" instead of hallucinating.
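The guardrail itself is only a few lines. The 0.75 threshold below is illustrative, not my production value; you'd tune it against your own corpus:

```python
FALLBACK = "I haven't written about this yet."

def guarded_answer(scored_docs: list[tuple[str, float]], threshold: float = 0.75):
    """Hand context to the LLM only when retrieval is confident enough."""
    relevant = [doc for doc, score in scored_docs if score >= threshold]
    if not relevant:
        return FALLBACK  # refuse rather than let GPT-4o improvise
    return relevant  # in production, this context goes into the prompt

weak = guarded_answer([("post A", 0.42), ("post B", 0.31)])
strong = guarded_answer([("post A", 0.91), ("post B", 0.40)])
```

An honest "I don't know" costs you nothing; a confident fabrication in your own voice costs you trust.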
Technical Performance
Response time: ~2-3 seconds including retrieval
Retrieval relevance: Pretty good for topics I've covered extensively
Cost: Minimal so far (~$2/month in API calls during testing)
Why RAG Matters
I started this as a personal productivity project, but I came away with an important lesson:
Most valuable knowledge is locked away in unstructured content, spread out across multiple sources.
Whether you’re a content creator with dozens of blog posts, or an enterprise with years of data, you face the same problem. The challenge is to make expertise searchable and accessible through effective information retrieval systems.
The technical barriers are getting lower, but the implementation details still decide whether you end up with a functional, effective system:
Having good retrieval accuracy
Maintaining a consistent voice
Building for scale
Achieving all this requires understanding both the technology and business context.
Try It & Give Me Feedback
Test the system: assistant.ds-claudia.com
I'd love your feedback on:
What works
What doesn't
What questions it can't answer yet
Comment below! Your input will help me improve the system and guide future content.



You’re taking a really different approach from how I built my RAG, and I love it!
The way you’re handling hallucination is super considerate. Definitely something I want to try in my own setup.
Do you think OpenAI will bake this into version 5? It seems like a no-brainer.