Introduction
Retrieval-Augmented Generation (RAG) is revolutionizing AI applications by combining the power of retrieval-based search with generative models. But how do you ensure fast, scalable, and efficient AI-driven knowledge retrieval? In this guide, we explore a powerful open-source stack featuring Couchbase and Gemma 3 to build a high-performance RAG system.
Why Couchbase?
Couchbase is a NoSQL database designed for speed, scalability, and flexibility, making it a perfect fit for AI-powered applications. Unlike traditional relational databases, Couchbase offers:
- Memory-First Architecture — Ensures ultra-fast queries by keeping active data in memory.
- Multi-Model Support — Supports key-value, document, and SQL++ (N1QL) queries.
- Built-in Full-Text Search — Ideal for retrieving contextually relevant data in RAG pipelines.
- Automatic Sharding & Replication — Enhances fault tolerance and scalability.
- Edge Computing Ready — Works seamlessly in distributed environments.
Compared to general-purpose databases like PostgreSQL or MongoDB, Couchbase's memory-first architecture and built-in full-text search are a good match for the low-latency retrieval that AI-driven use cases demand.
What is Gemma 3?
Gemma 3 is a lightweight, open-source generative AI model developed by Google DeepMind. It is optimized for efficiency and can run on consumer hardware, making it an ideal choice for RAG-based applications.
Why Choose Gemma 3 Over Other LLMs?
- Optimized for Low Compute — Unlike hosted GPT models, Gemma 3 can run efficiently on local machines (see the quick sketch after this list).
- Open-Source & Customizable — No vendor lock-in, with full flexibility for fine-tuning.
- Strong Performance in Embeddings — Excellent at generating dense vector representations for search.
- Privacy-Friendly — Can be deployed on-prem, ensuring data security.
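As a quick illustration of the local-inference point, here is a minimal sketch that loads a small Gemma 3 checkpoint through the Hugging Face transformers pipeline. The model ID (google/gemma-3-1b-it) and available hardware are assumptions, and the gated weights require accepting Google's license on Hugging Face.
from transformers import pipeline

# Small, instruction-tuned Gemma 3 checkpoint (assumed ID); runs on CPU or a single consumer GPU
generator = pipeline("text-generation", model="google/gemma-3-1b-it")

out = generator(
    "Explain retrieval-augmented generation in one sentence.",
    max_new_tokens=60,
)
print(out[0]["generated_text"])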
Tech Stack Overview
1. Couchbase (Vector Database)
Used to store embeddings and indexed documents for fast retrieval.
2. Google’s Gemma 3 (LLM)
Generates embeddings and powers the generative response system.
3. FAISS (Vector Similarity Search)
Optimizes nearest-neighbor searches for retrieval.
4. FastAPI (Backend Framework)
Used to serve the RAG pipeline efficiently.
5. Docker (Containerization)
Deploys the entire stack in an isolated, reproducible environment.
Implementation: Building the RAG System
Step 1: Setting Up Couchbase
Install and run Couchbase using Docker:
docker run -d --name couchbase -p 8091-8096:8091-8096 -p 11210:11210 couchbase
Once running, access the Couchbase Web UI at http://localhost:8091 and create a bucket to store documents (the code below assumes one named rag_data).
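If you prefer the command line to the Web UI, the same setup can be scripted with couchbase-cli. This is a sketch assuming the admin/password credentials and the rag_data bucket used in the code below.
docker exec couchbase couchbase-cli cluster-init \
  --cluster localhost \
  --cluster-username admin --cluster-password password \
  --services data,index,query,fts \
  --cluster-ramsize 1024

docker exec couchbase couchbase-cli bucket-create \
  --cluster localhost --username admin --password password \
  --bucket rag_data --bucket-type couchbase --bucket-ramsize 256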
Step 2: Installing Dependencies
pip install couchbase google-generativeai fastapi faiss-cpu numpy uvicorn
Step 3: Storing Documents & Embeddings
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator
import google.generativeai as genai
import os
import numpy as np
import faiss
from fastapi import FastAPI, HTTPException
# FastAPI Initialization
app = FastAPI()
# Connect to Couchbase
cluster = Cluster('couchbase://localhost', ClusterOptions(PasswordAuthenticator('admin', 'password')))
bucket = cluster.bucket('rag_data')
collection = bucket.default_collection()
# Configure the google-generativeai SDK (API key read from the environment);
# embeddings here come from Google's embedding-001 model (768-dimensional vectors)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
EMBEDDING_MODEL = "models/embedding-001"
# FAISS index for fast similarity search
d = 768  # dimensionality of embedding-001 vectors
index = faiss.IndexFlatL2(d)
doc_ids = []  # maps FAISS row positions back to Couchbase document IDs

# Embed a text and store it in both Couchbase and FAISS
def store_embedding(doc_id, text):
    embedding = np.array(
        genai.embed_content(model=EMBEDDING_MODEL, content=text)["embedding"],
        dtype="float32",
    )
    collection.upsert(doc_id, {"text": text, "embedding": embedding.tolist()})
    index.add(np.array([embedding]))
    doc_ids.append(doc_id)
@app.post("/store")
def store_document(doc_id: str, text: str):
    try:
        store_embedding(doc_id, text)
        return {"message": "Document stored successfully"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
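With the API running (see the run step at the end of Step 4), a document can be stored with a plain HTTP call; the document ID and text below are placeholders.
curl -X POST "http://localhost:8000/store?doc_id=doc_1&text=Couchbase%20is%20a%20memory-first%20NoSQL%20database"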
Step 4: Querying with RAG
@app.get("/search")
def retrieve_relevant_docs(query: str):
    try:
        query_embedding = np.array(
            genai.embed_content(model=EMBEDDING_MODEL, content=query)["embedding"],
            dtype="float32",
        )
        _, indices = index.search(np.array([query_embedding]), 5)
        results = [
            collection.get(doc_ids[i]).content_as[dict]["text"]
            for i in indices[0]
            if i != -1
        ]
        return {"results": results}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
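To try the retrieval path end to end, start the server and issue a search query; the query string is just an example, and the file name app.py matches the app:app module path used in the Dockerfile below.
python app.py   # or: uvicorn app:app --host 0.0.0.0 --port 8000
curl "http://localhost:8000/search?query=What%20is%20Couchbase"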
How FAISS Enhances Retrieval
FAISS (Facebook AI Similarity Search) is a library optimized for fast nearest-neighbor searches in high-dimensional spaces. It lets our RAG system quickly identify the most relevant embeddings in memory and then fetch the matching documents from Couchbase. By integrating FAISS, we:
- Speed up similarity searches using optimized indexing techniques.
- Handle large-scale embeddings efficiently.
- Improve accuracy in document retrieval by ranking results based on vector distance.
Optimizing FAISS for Large-Scale Use
- IVF Indexing — IndexFlatL2 performs an exact, brute-force search that is fine for small datasets; for larger corpora, IVF indexes partition vectors into clusters and probe only a subset of them per query (see the sketch after this list).
- HNSW — Graph-based indexing that improves recall and speed on large collections of embeddings.
- On-Disk Storage — Use FAISS’s disk-based indexing for handling billions of vectors efficiently.
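A minimal sketch of these index types, using random vectors as stand-ins for real embeddings; the nlist, nprobe, and HNSW parameters are illustrative starting points, not tuned values.
import faiss
import numpy as np

d = 768                                    # embedding dimensionality, as in the pipeline above
xb = np.random.rand(10_000, d).astype("float32")  # placeholder embeddings

# IVF: partition vectors into nlist clusters, then probe only a few clusters per query
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                              # IVF indexes must be trained before adding vectors
ivf.add(xb)
ivf.nprobe = 10                            # clusters probed per query: recall vs. speed trade-off

# HNSW: graph-based index, no training step, strong recall on large collections
hnsw = faiss.IndexHNSWFlat(d, 32)          # 32 neighbors per node in the graph
hnsw.add(xb)

# Persist an index to disk and reload it later
faiss.write_index(ivf, "ivf.index")
ivf = faiss.read_index("ivf.index")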
Benchmarks & Comparisons
| Model | Query Latency | Memory Usage |
| --- | --- | --- |
| Gemma 3 + FAISS | ~5ms | Low |
| OpenAI + Pinecone | ~10ms | Medium |
| PostgreSQL + pgvector | ~20ms | High |
Deployment with Docker
Create a Dockerfile to containerize the API:
FROM python:3.9
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
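The Dockerfile expects a requirements.txt next to the application code; one that mirrors the pip install from Step 2 would contain:
couchbase
google-generativeai
fastapi
faiss-cpu
numpy
uvicorn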
Run the container:
docker build -t rag-api .
docker run -d -p 8000:8000 rag-api
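To run the database and the API together, a minimal docker-compose.yml along these lines can replace the two docker commands; the service names are illustrative, and the Python code would then connect to couchbase://couchbase instead of localhost.
version: "3.8"
services:
  couchbase:
    image: couchbase
    ports:
      - "8091-8096:8091-8096"
      - "11210:11210"
  rag-api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - couchbase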
Conclusion
By combining Couchbase’s scalable vector search with Gemma 3’s efficient embeddings and FAISS’s fast similarity search, you can build a powerful and production-ready RAG system. Whether you’re a beginner or an AI expert, this stack provides flexibility, speed, and open-source freedom.
🔗 Check out the full implementation on GitHub: RAG-Couchbase-Gemma 🚀