On-Device RAG for App Developers: Embeddings, Vector Search, and Beyond

Giving your Offline AI agent memory — the ability to search and retrieve your private data

We’ve explored why offline AI agents matter and how to give them tools with function calling. Now let’s complete the picture by giving them memory — the ability to search and retrieve your private data using RAG (Retrieval-Augmented Generation).

When I started building Flutter Gemma, the first question developers asked was: “How do I make the AI know about MY data?” Not Wikipedia, not general knowledge — their app’s data. Users, contacts, documents. That’s the gap RAG fills. Imagine building a personal CRM app with an AI assistant. Your user asks: “Who should I follow up with from last week’s meetings?” The challenge? The AI needs access to your data — your contacts, your conversation history, your business context — not general knowledge from its training.

The Problem: LLMs Don’t Know Your Data

Large Language Models are trained on vast amounts of public data: Wikipedia, books, websites. They’re remarkably capable at general reasoning. But they have no idea about your contacts, your meeting notes, or yesterday’s conversation with a client.

Ask a vanilla LLM “Who should I follow up with from last week?” and you’ll get either “I don’t have access to your contacts” — honest but useless — or a hallucinated answer with confidently wrong names and details.

There are several ways to give an LLM access to your data. Let’s look at each of them.

Approach A: Stuff Everything in the Prompt

The obvious fix: dump all your data into the context.

Prompt: “Here are all my contacts:
– Sarah Chen, Google, enterprise lead…
– Mike Ross, Meta, prospect…
– … (498 more)
Who should I follow up with?”

Problems:

  • Context limits: On-device models typically support 8K-32K tokens. If you have a lot of data, it simply won’t fit.
  • Slow and expensive: Every token in context adds to computation time and memory. Longer context = slower response = drained battery.
  • Lost in the noise: When asking about a contact, why process 499 other contacts? Worse, LLMs tend to forget details buried in the middle of long contexts — the one contact you need will likely be overlooked.
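A rough back-of-the-envelope makes the first point concrete (assuming ~50 tokens per contact record, an illustrative number, not a measurement):

```dart
// Tokens needed to inline every record into the prompt.
int contextTokens({required int records, required int tokensPerRecord}) =>
    records * tokensPerRecord;

// 500 contacts at ~50 tokens each already crowds out
// a typical 8K on-device context window.
final needed = contextTokens(records: 500, tokensPerRecord: 50);
```

With 500 contacts that's 25,000 tokens of context before the user has even asked a question.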

Approach B: Fine-tune the Model

Train the model on your data so it “knows” it. But:

  • Stale quickly: Your CRM data changes daily. Retraining every day?
  • Expensive: Fine-tuning costs compute and time.
  • Doesn’t scale: Each user has different data. Technically, you can fine-tune per user with LoRA, but it’s complex and expensive to maintain.

For static knowledge, fine-tuning works great — I covered this in detail in Fine-tuning Gemma with LoRA for On-Device Inference. But for frequently changing data, it’s not the answer.

Approach C: RAG — Retrieve, Then Generate (WINNER)

Instead of giving the LLM everything, retrieve only relevant information for each query:

  • Ask about a contact → retrieve only that contact’s info → send to LLM
  • Ask about a company → retrieve only that company’s contacts → send to LLM
  • Ask about follow-ups → retrieve contacts needing follow-up → send to LLM

Why it works:

  • Always fresh: Data is retrieved at query time, not baked into the model.
  • Scales to any user: Same model, different data — each device has its own local storage.
  • Focused context: LLM only sees what it needs — no noise, no “lost in the middle”.

This is RAG (Retrieval-Augmented Generation) — and it’s the solution.

The best part? RAG can run fully on-device — both retrieval and generation happen locally, no cloud required.

But what does “retrieve relevant information” actually mean? How do we find the right data among thousands of records? Let’s dig into the mechanics.

RAG Fundamentals

So, RAG — retrieve relevant data, then generate a response. But how do we actually find the “relevant” data?

Finding the Right Records

To answer a user’s question, we need to find the right records in our database. There are many retrieval techniques — keyword search, hybrid search, graph-based retrieval, and more. In this article, I’ll focus on one that’s easy to implement with Flutter Gemma right now: semantic search (also called vector search).

What is Semantic Search?

Traditional keyword search is brittle. Search for “big company” and you won’t find “enterprise client” — even though they mean the same thing. Semantic search solves this by understanding meaning, not just matching words. The idea: convert text into numbers (vectors) that capture its meaning. Similar meanings → similar vectors. Then finding relevant records becomes a math problem — find vectors closest to your query.

Embeddings: Converting Text to Vectors

To make semantic search work, we need to convert our data into vectors. This is where embedding models come in. An embedding model takes text and outputs a vector — typically 768 or more numbers that represent the text’s meaning:

"enterprise sales lead"  →  [0.12, -0.45, 0.78, 0.23, ...]  (768 numbers)
"big company executive" → [0.11, -0.44, 0.77, 0.25, ...] (similar!)
"banana smoothie recipe" → [0.89, 0.12, -0.34, 0.56, ...] (very different)

In September 2025, Google released EmbeddingGemma — a 308M parameter embedding model based on Gemma 3, designed specifically for on-device use. You can find the model on HuggingFace. As soon as I saw the announcement, I knew Flutter Gemma needed embedding support — and that meant implementing embedding generation for three platforms (desktop support wasn’t there yet).

On Android, Google provides an official AI Edge LocalAgents RAG SDK, so integration was straightforward. iOS and Web were a different story — no official SDK, so I built the entire pipeline from scratch: TensorFlow Lite interpreter for iOS, LiteRT.js for Web, SentencePiece tokenizers for both, manual tensor handling, and careful memory management. Desktop support is coming soon (I hope).

While I was at it, I decided that supporting just EmbeddingGemma wasn’t enough — so I also added Gecko, an older embedding model from Google released in March 2024. Gecko uses knowledge distillation from large LLMs to pack strong retrieval performance into just 110M parameters. It’s 2.6x faster than EmbeddingGemma but less accurate — useful when you need real-time search on resource-constrained devices and can trade some quality for speed.

Here’s how to generate embeddings with Flutter Gemma:

final embedder = await FlutterGemma.getActiveEmbedder();
final vector = await embedder.generateEmbedding("enterprise sales lead");
// vector: [0.12, -0.45, 0.78, ...] — ready for search

Storing Vectors: Vector Databases

Once you have vectors, you need somewhere to store them and search efficiently. This is what vector databases do.

Cloud solutions like Pinecone, Chroma, or Qdrant are popular — but they require network calls, which defeats the purpose of on-device AI. For embedded use, ObjectBox is a solid choice — it’s an on-device database with built-in HNSW vector search, designed specifically for mobile and IoT. ObjectBox has a Flutter SDK, so you can easily use it together with Flutter Gemma for embeddings. But if you don’t want to add another database dependency to your app, Flutter Gemma offers a simpler approach out of the box: plain SQLite with HNSW indexing. No extra dependencies — SQLite is already baked into every platform Flutter supports.

Under the hood, the implementation is consistent across platforms but uses platform-specific APIs:

  • Android uses `SQLiteOpenHelper` with embeddings stored as binary BLOBs (float32). Each 768-dimensional vector takes just 3KB — compact and fast to read.
  • iOS uses the SQLite3 C API directly, with the same BLOB storage format. Vectors are encoded as little-endian float32 arrays, identical to Android, so the database files are binary-compatible.
  • Web was trickier — browsers don’t have native SQLite. I used wa-sqlite, a WebAssembly port of SQLite, with OPFS (Origin Private File System) for persistent storage. The catch: OPFS requires a Web Worker, so all database operations run in a dedicated worker thread with message passing to the main thread.

All three platforms use the same schema, the same BLOB encoding, and the same cosine similarity calculation — so your search results are consistent regardless of where the app runs.
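For illustration, the BLOB layout described above is straightforward to reproduce in plain Dart. These are hypothetical helpers, not the plugin's internals — just a sketch of the little-endian float32 encoding:

```dart
import 'dart:typed_data';

// Encode an embedding as a little-endian float32 BLOB (4 bytes per value).
Uint8List encodeEmbedding(List<double> vector) {
  final bytes = ByteData(vector.length * 4);
  for (var i = 0; i < vector.length; i++) {
    bytes.setFloat32(i * 4, vector[i], Endian.little);
  }
  return bytes.buffer.asUint8List();
}

// Decode the BLOB back into a vector.
List<double> decodeEmbedding(Uint8List blob) {
  final bytes = ByteData.sublistView(blob);
  return [
    for (var i = 0; i < blob.length; i += 4)
      bytes.getFloat32(i, Endian.little),
  ];
}
```

A 768-dimensional vector encodes to 768 × 4 = 3,072 bytes, which is the ~3KB figure mentioned above.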

How Search Works

When a user asks a question, we convert it to a vector using the same embedding model, then find the closest vectors in our database.

“Closest” is measured by cosine similarity — it calculates the angle between two vectors in high-dimensional space. The score ranges from -1 to 1: 1 means the vectors point in the same direction (identical meaning), 0 means they’re perpendicular (unrelated), and -1 means opposite directions. In practice, anything above 0.3–0.4 is usually relevant — the exact threshold depends on your data and use case.


The simplest approach is brute-force: compare your query against every document in the database, calculate cosine similarity for each, return the top matches. This works fine for small datasets — hundreds, even thousands of records. But with 100,000 contacts, every search requires 100,000 similarity calculations. That’s slow.
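To make the math concrete, here's what brute-force search looks like in plain Dart — a standalone sketch, not the plugin's code:

```dart
import 'dart:math';

// Cosine similarity: dot product divided by the product of magnitudes.
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

// Brute-force top-K: score every document, sort descending, take the best K.
List<int> bruteForceTopK(List<double> query, List<List<double>> docs, int k) {
  final scored = [
    for (var i = 0; i < docs.length; i++)
      (index: i, score: cosineSimilarity(query, docs[i])),
  ];
  scored.sort((x, y) => y.score.compareTo(x.score));
  return [for (final s in scored.take(k)) s.index];
}
```

This is the O(n) path: every query touches every document, which is exactly what HNSW avoids.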

HNSW (Hierarchical Navigable Small World) solves this with a clever multi-layer graph structure. Think of it like a subway map: express lines connect major hubs (top layers), while local lines connect every station (bottom layer). Search starts at the top, takes big jumps to get close, then descends to local layers for precision.


The result: O(log n) search instead of O(n). For 100,000 contacts, that’s ~17 comparisons instead of 100,000. Flutter Gemma’s HNSW implementation uses an “over-fetch and rerank” strategy — HNSW returns 2x candidates (fast, approximate), then exact cosine similarity filters to top-K. Speed of approximate search, accuracy of exact search.

The API stays the same — Flutter Gemma handles the optimization under the hood:

final results = await FlutterGemmaPlugin.instance.searchSimilar(
  query: "enterprise contacts",
  topK: 5,
  threshold: 0.3,
);
// results: records semantically similar to "enterprise contacts"

Chunking: Preparing Your Data

Before indexing, you need to decide how to split your data into searchable units — this is called chunking. For a CRM, the natural chunk is one contact = one document. But for longer content like meeting notes or emails, you need a strategy: fixed-size splits (simple but may cut mid-sentence), semantic splits (respects meaning boundaries), or document structure (paragraphs, sections). A good practice is 10–20% overlap between chunks to prevent losing context at boundaries.
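As an illustration of the fixed-size strategy, a minimal chunker with overlap might look like this (plain Dart, hypothetical helper, not part of the plugin):

```dart
import 'dart:math';

// Split text into fixed-size chunks with a fractional overlap
// (e.g. overlap: 0.15 repeats 15% of each chunk at the start of the next).
List<String> chunkText(String text,
    {int chunkSize = 500, double overlap = 0.15}) {
  final step = (chunkSize * (1 - overlap)).floor();
  final chunks = <String>[];
  for (var start = 0; start < text.length; start += step) {
    final end = min(start + chunkSize, text.length);
    chunks.add(text.substring(start, end));
    if (end == text.length) break;
  }
  return chunks;
}
```

A semantic splitter would instead break on sentence or paragraph boundaries, trading simplicity for cleaner chunks.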

Implementing RAG with Flutter Gemma

Now that we understand the theory, let’s see how it all comes together in practice. The Flutter Gemma example app includes a complete RAG implementation — check it out for the full code.

There are two main workflows: data ingestion (getting your data into the vector store) and retrieval (searching at query time).

Data Ingestion

Before you can search anything, all your data needs to be vectorized and stored. This happens once when the app first loads, and then incrementally as data changes.

Initial indexing — on first launch, process all existing records:

Future<void> indexAllContacts(List<Contact> contacts) async {
  final embedder = await FlutterGemma.getActiveEmbedder();

  for (final contact in contacts) {
    final embedding = await embedder.generateEmbedding(
      contact.toSearchableText(),
    );

    await FlutterGemmaPlugin.instance.addDocumentWithEmbedding(
      id: contact.id,
      content: contact.toSearchableText(),
      embedding: embedding,
      metadata: jsonEncode(contact.toJson()),
    );
  }
}

Keeping data in sync — when records change, the vector store must follow. Set up listeners for your data source:

// Listen for changes and update vector store
contactsRepository.onContactChanged.listen((contact) async {
  final embedder = await FlutterGemma.getActiveEmbedder();
  final embedding = await embedder.generateEmbedding(
    contact.toSearchableText(),
  );

  // addDocumentWithEmbedding uses INSERT OR REPLACE —
  // same ID overwrites the old record
  await FlutterGemmaPlugin.instance.addDocumentWithEmbedding(
    id: contact.id,
    content: contact.toSearchableText(),
    embedding: embedding,
  );
});

The key insight: your vector store is a derived cache of your primary data. If it gets out of sync, search results become stale or wrong.

Retrieval

When the user asks a question, convert it to a vector and find similar documents:

Future<String> answerQuestion(String userQuery) async {
  // 1. Search for relevant context
  final results = await FlutterGemmaPlugin.instance.searchSimilar(
    query: userQuery,
    topK: 5,
    threshold: 0.3,
  );

  // 2. Build context from results
  final context = results
      .map((r) => r.content)
      .join('\n\n');

  // 3. Pass to LLM with context
  final prompt = '''
Based on the following information:
$context

Answer the user's question: $userQuery
''';

  final chat = await model.createChat();
  await chat.addQuery(Message(text: prompt, isUser: true));
  final response = await chat.generateChatResponse();
  return (response as TextResponse).token;
}

The LLM now has access to relevant data from your database — without stuffing everything into the prompt.

From Search to Understanding

Semantic search opens doors that keyword matching never could. “Find my enterprise contacts” retrieves records tagged with “Fortune 500”, notes mentioning “big deal”, contacts at “Alphabet” — even though none contain the word “enterprise”. The embedding model understands meaning, not just text.

But try “Who did I talk to yesterday?” and the magic breaks.

“Yesterday” produces an embedding for the concept of yesterday — similar to “recent”, “past”, “before”. Your contact records contain dates: “2026-01-19”, “January 19th”. These aren’t semantically similar: there’s no region of vector space where “yesterday” and “2026-01-19” point in the same direction.

And dates aren’t the only problem. Queries like “leads without email” or “prospects from Q4 last year” require reasoning — field filtering, sorting, joins — not similarity matching. We need the LLM to understand intent and translate natural language into structured queries.

LLM-Enhanced RAG with Function Calling

We’ve seen that semantic search fails when queries require reasoning — date parsing, field filtering, sorting. The solution? Let the LLM handle the reasoning, then call a function to execute the structured query.

This is Agentic RAG — the dominant pattern in 2025–2026 where the LLM dynamically decides how and when to retrieve information.

Instead of passing the user’s query directly to vector search, we let the LLM parse it first. For this agentic behavior to work on-device, you need a model that understands function calling — not just chat. Google released FunctionGemma specifically for this: a 270M parameter model tuned to parse user intent and call functions with the right parameters. I covered how to use it in detail in On-Device Function Calling with FunctionGemma.

The LLM handles reasoning (date math, intent extraction), while functions handle data retrieval.
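Conceptually, the function you expose to the model is just a named schema it can fill in. The sketch below describes one as a plain Dart map — the shape is illustrative only, the real declaration API in flutter_gemma may differ, and the parameter names simply mirror the searchContacts example in this article:

```dart
// Illustrative tool description for the LLM. The actual schema format
// expected by the function-calling API is an assumption here.
final searchContactsTool = {
  'name': 'searchContacts',
  'description': 'Search CRM contacts with semantic and structured filters',
  'parameters': {
    'semantic_query': {'type': 'string', 'description': 'Free-text meaning to match'},
    'company': {'type': 'string', 'description': 'Company name filter'},
    'last_contact_before': {'type': 'string', 'description': 'ISO-8601 date'},
    'limit': {'type': 'integer', 'description': 'Max results to return'},
  },
};
```

Given “Google contacts I talked to last month”, the model's job is to emit a call with company and date parameters filled in, nothing more.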

Hybrid Search: Combining RAG with Filters

Once FunctionGemma extracts parameters from the user’s query, your function combines semantic search with structured filtering:

Future<List<Contact>> searchContacts(Map<String, dynamic> params) async {
  List<Contact> contacts;

  // If semantic query provided, start with RAG
  if (params['semantic_query'] != null) {
    final results = await FlutterGemmaPlugin.instance.searchSimilar(
      query: params['semantic_query'],
      topK: 50,
      threshold: 0.25,
    );
    contacts = results
        .map((r) => Contact.fromJson(jsonDecode(r.metadata!)))
        .toList();
  } else {
    contacts = await contactsRepository.getAll();
  }

  // Apply structured filters extracted by LLM
  if (params['company'] != null) {
    contacts = contacts.where((c) =>
      c.company.toLowerCase().contains(params['company'].toLowerCase())
    ).toList();
  }

  if (params['last_contact_before'] != null) {
    final before = DateTime.parse(params['last_contact_before']);
    contacts = contacts.where((c) => c.lastContact.isBefore(before)).toList();
  }

  return contacts.take((params['limit'] as int?) ?? 10).toList();
}

The key insight: RAG handles the semantic part (“interested in enterprise pricing”), while structured filters handle the logical part (dates, status, company). FunctionGemma decides which parameters to use based on the query.

Now Everything Works

// ✅ Temporal queries
await query("Who did I talk to yesterday?");
// LLM extracts: last_contact_after="2026-01-18", last_contact_before="2026-01-19"

// ✅ Structured filters
await query("Leads assigned to me without email");
// LLM extracts: status="lead", owner="current_user", has_email=false

// ✅ Combined semantic + structured
await query("Google contacts interested in enterprise pricing");
// LLM extracts: company="Google", semantic_query="interested in enterprise pricing"

// ✅ Complex reasoning
await query("Prospects I should follow up with from Q4 last year");
// LLM calculates Q4 2025 dates, adds semantic_query="follow up needed"

The LLM does what it’s good at (understanding language, reasoning about dates), while structured queries do what they’re good at (filtering, sorting, joins).

Conclusion

RAG bridges the gap between general-purpose LLMs and your private data. With Flutter Gemma, the entire pipeline runs on-device — embeddings, vector search, and generation — no cloud required.

Start with semantic search for queries where meaning matters. When you hit the limits — dates, filters, sorting — add function calling with FunctionGemma to let the LLM orchestrate structured retrieval.

The Flutter Gemma example app includes a complete RAG implementation. Clone it, try it with your data, and see what’s possible when AI runs entirely in your users’ pockets.


On-Device RAG for App Developers: Embeddings, Vector Search, and Beyond was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.
