Introducing our team’s latest creation – a revolutionary approach to local RAG applications
TL;DR: We built LEANN, the world's most lightweight semantic search backend: it achieves 97% storage savings compared to traditional solutions while maintaining high accuracy and performance. Perfect for privacy-focused RAG applications on your local machine.
🚀 Quick Start
Want to try it right now? Run this single command on your MacBook:
uv pip install leann
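Once installed, the basic flow is: build an index, then query it. The sketch below is illustrative only – the entry-point names here are a rough outline and may lag behind the repo, so treat the README as the source of truth:

```python
# Illustrative indexing + search flow. Entry-point names are a sketch
# and may differ from the current release -- see the README for the
# authoritative API.
from leann import LeannBuilder, LeannSearcher

builder = LeannBuilder(backend_name="hnsw")
builder.add_text("Berkeley EECS freshmen typically take 3-4 courses per semester.")
builder.build_index("./demo.leann")

searcher = LeannSearcher("./demo.leann")
for hit in searcher.search("How many courses do EECS freshmen take?", top_k=3):
    print(hit)
```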
📚 Repository & Paper
- GitHub: https://github.com/yichuan-w/LEANN ⭐ (Star us!)
- Paper: Available on arXiv
What is RAG Everything?
RAG (Retrieval-Augmented Generation) has become the first true “killer application” of the LLM era. It seamlessly integrates private data that wasn’t part of the training set into large model inference pipelines.
Privacy-sensitive scenarios are the most important deployment direction – especially for your personal data and in highly sensitive domains like healthcare and finance.
RAG Everything starts from the most essential needs on a personal laptop. We natively support a range of out-of-the-box scenarios (currently macOS and Linux; Windows users need WSL):
🔍 Supported Applications
1. File System RAG
Replace Spotlight search entirely. Spotlight consumes disk space yet does only keyword matching; LEANN turns file search into a true semantic search powerhouse.
2. Apple Mail RAG
Easily find answers to personal questions (like “How many courses should Berkeley EECS freshmen take in their first semester?”).
3. Google Browser History RAG
Track down browsing history you only half-remember – the pages you have a fuzzy impression of but could never find again by keyword.
4. WeChat Chat History RAG
This is what I use most! I've used LEANN to summarize conversations with friends and to distill research ideas and slides from them. We implemented a small hack to get past WeChat's database encryption and extract chat records – don't worry, everything stays local with zero leakage.
5. Claude Code Semantic Search Enhancement 🔥
One of Claude Code’s biggest pain points is that it’s always grepping and finding nothing. LEANN is one of the first open-source projects to bring true semantic search to Claude Code through an MCP server – enabling it with just one line of code.
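For example, registering the server through Claude Code's MCP command is a one-liner (the server name and command below are illustrative – check the repo for the exact invocation):
claude mcp add leann-server -- leann_mcp  # illustrative names; see the repo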
These are just the scenarios we think have the most potential – we'll keep integrating features based on user feedback until LEANN becomes a personalized local agent that holds your LLM's long-term memory and has command of all your private data.
Why LEANN? The Technical Deep Dive
The Problem with Current Vector Databases
Current mainstream vector databases excel at latency – most queries complete within 10-100ms even with millions of data points. But in RAG's search + generation pipeline, search time is far below generation time, especially with reasoning models and their long chains of thought.
Latency isn’t the bottleneck in RAG – storage is.
The most important RAG deployment scenario is privacy, especially on personal computers where resources are naturally scarce. Consider this reality check:
For high recall in text RAG, you need fine-grained chunks, which makes embedding storage balloon to 3-10x the size of the original text. A real example: 70GB of raw data turned into 220GB+ of index storage.
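To make that concrete, here's a back-of-envelope estimate with illustrative numbers (chunk size, embedding dimension, and precision all vary by setup):

```python
# Back-of-envelope embedding-storage estimate (illustrative parameters).
raw_text_bytes = 70e9          # 70 GB of raw text
chunk_bytes = 1024             # ~256 tokens per chunk for fine-grained recall
embedding_dim = 768            # a small BERT-class embedding model
bytes_per_vector = embedding_dim * 4   # float32

num_chunks = raw_text_bytes / chunk_bytes
index_bytes = num_chunks * bytes_per_vector
print(f"{num_chunks / 1e6:.0f}M chunks -> ~{index_bytes / 1e9:.0f} GB of embeddings")
# ~68M chunks -> ~210 GB of embeddings, i.e. ~3x the raw text,
# before any graph overhead is even counted
```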
Our Solution: Trade Storage for Compute
LEANN makes a bold design choice: replace storage with recomputation.
Core Innovation
Key observation: in a graph-based index, a single query actually visits only a tiny fraction of the nodes. So why store all the embeddings?
Our pipeline (see the sketch after this list):
- Build a normal vector store
- Delete all embeddings, keeping only the proximity graph that records the relationships between data chunks
- Replace loading embeddings from memory with recomputing them at query time
- Use a lightweight embedding model to make graph-guided recomputation fast
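Conceptually, the search loop looks like the sketch below. This is a minimal illustration of graph traversal with on-the-fly recomputation, not LEANN's actual implementation; `embed` stands in for the lightweight embedding model, and `graph`/`chunks` are the pruned proximity graph and the original text chunks:

```python
import heapq
import numpy as np

def search(query_vec, graph, chunks, embed, entry_id, top_k=3, beam=32):
    """Best-first search over a proximity graph. Instead of loading stored
    embeddings from disk, every visited chunk is re-embedded on the fly --
    and a query only ever visits a tiny fraction of the nodes."""
    def distance(node_id):
        vec = embed(chunks[node_id])    # recomputation replaces storage
        return float(np.linalg.norm(query_vec - vec))

    visited = {entry_id}
    frontier = [(distance(entry_id), entry_id)]   # min-heap by distance
    best = list(frontier)                          # running beam of results

    while frontier:
        d, node = heapq.heappop(frontier)
        # stop once the closest unexplored node can't improve the beam
        if len(best) >= beam and d > max(b[0] for b in best):
            break
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                dn = distance(nbr)
                heapq.heappush(frontier, (dn, nbr))
                best.append((dn, nbr))
        best = heapq.nsmallest(beam, best)

    return [node for _, node in heapq.nsmallest(top_k, best)]
```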
Graph Structure Pruning
We observed heavily skewed node-visit patterns in graphs built with RNG-style (relative neighborhood graph) pruning. Our strategy (sketched after this list):
- Keep high-degree nodes to ensure connectivity
- Limit out-edges for low-degree nodes while allowing unlimited in-edges
- Use heuristics to preserve only essential high-degree nodes
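A minimal sketch of this kind of degree-based pruning (the parameter values and the hub-selection rule here are illustrative, not LEANN's exact heuristic):

```python
def prune_graph(graph, degree_cap=32, hub_fraction=0.02):
    """Cap out-degree everywhere except at high-degree hubs, which keep
    all their edges to preserve connectivity. Only out-edge lists are
    capped; no limit is imposed on how many edges may point *into* a node.
    `graph` maps node id -> list of (neighbor id, distance) out-edges."""
    n_hubs = max(1, int(len(graph) * hub_fraction))
    hubs = set(sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:n_hubs])

    pruned = {}
    for node, edges in graph.items():
        if node in hubs:
            pruned[node] = list(edges)                   # hubs keep everything
        else:
            nearest = sorted(edges, key=lambda e: e[1])  # keep closest neighbors
            pruned[node] = nearest[:degree_cap]
    return pruned
```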
Results That Matter
✅ 97%+ reduction in index size
✅ <2 seconds retrieval time on 3090-level hardware
✅ 90%+ Top-3 recall on real RAG benchmarks
✅ Zero embedding storage – even for corpora whose full embeddings would exceed 200GB
Note: at this compression rate, PQ, OPQ, and even the state-of-the-art RaBitQ cannot maintain high accuracy – see the comparisons in our paper.
Performance Optimizations
- Adaptive pipeline combining coarse-grained and accurate search
- Efficient GPU batching for better utilization
- ZMQ communication that ships distances instead of full embeddings
- CPU/GPU overlapping
- Selective caching of high-degree nodes
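Two of these – selective caching and batched recomputation – fit in a short sketch (illustrative, not LEANN's actual code):

```python
import numpy as np

class SelectiveCache:
    """Keep the embeddings of frequently-visited high-degree nodes in RAM,
    and recompute each search hop's cache misses in one batched model call
    to keep the GPU well utilized."""

    def __init__(self, chunks, embed_batch, hot_ids):
        self.chunks = chunks                 # node id -> chunk text
        self.embed_batch = embed_batch       # list[str] -> np.ndarray [n, d]
        hot_ids = list(hot_ids)              # e.g. the highest-degree nodes
        self.cache = dict(zip(hot_ids, embed_batch([chunks[i] for i in hot_ids])))

    def get(self, node_ids):
        """Return embeddings for one hop's candidate nodes."""
        misses = [i for i in node_ids if i not in self.cache]
        fresh = {}
        if misses:   # one batched forward pass instead of per-node calls
            fresh = dict(zip(misses, self.embed_batch([self.chunks[i] for i in misses])))
        return np.stack([self.cache[i] if i in self.cache else fresh[i]
                         for i in node_ids])
```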
The Vision: RAG Everything
We’re continuously maintaining this open-source project at Berkeley SkyLab with full-stack optimization across algorithms, applications, system design, vector databases, and kernel acceleration.
Our Goals
🎯 Seamlessly connect all your private data
🧠 Build long-term local AI memory and agents
💻 Zero cloud dependency, low-cost operation
Technical Details & Future Work
If you want to dive deeper into implementation details, check our arXiv paper and repository. I can write a follow-up post covering all implementation specifics if there’s interest.
We hope LEANN inspires more vector search researchers to think about vector databases from a different angle, especially in popular RAG settings. We were fortunate to discuss our work at SIGMOD/ICML vector search workshops this year and received great recognition from the community.
Get Involved
- ⭐ Star our repository
- 🤝 Contribute to the project
- 🔗 Join our Berkeley SkyLab team
Ready to transform your local machine into a RAG powerhouse?
uv pip install leann
What private data would you want to RAG first? Drop a comment below! 👇
Tags
#rag
#vectordatabase
#semanticsearch
#privacy
#opensource
#machinelearning
#ai