
Phase 1: The Beginner’s Guide — The “Why”
The Cloud is Overrated
For years, AI on mobile meant sending your data to a remote server for processing. It worked, but it came with serious drawbacks:
- Latency: Even the speed of light introduces delay.
- Privacy: Your data leaves your device.
- Cost: Every API call costs money.
On-Device AI flips the model entirely.
Instead of shipping your prompt to the cloud, the model runs on your phone. It works offline, feels instant, and your data never leaves your pocket.
Meet the Star: Gemma 3n
Google’s Gemma family of open models is designed for edge use. Gemma 3n is an instruction-tuned model with roughly 2 billion effective parameters: small enough to fit on-device, yet powerful enough to handle reasoning, context, and code generation.
Key advantages:
- Approximately 2 GB in int4 form
- High instruction-following capability
- Efficient enough for real-time mobile inference
Phase 2: The Intermediate Guide — The “How”
Understanding the Tooling: Google LiteRT
TensorFlow Lite has evolved into LiteRT (Lite Runtime): Google’s high-performance inference engine for generative AI. It provides a Kotlin/Java-friendly API on top of a highly optimized C++ backend.
LiteRT handles:
- Tokenization
- Model execution
- Hardware acceleration
- Streaming text generation
Building the App
A proper LLM app is more than a terminal. We want a modern, polished chat experience.
1. The Setup
Add the LiteRT-LM dependency:
implementation("com.google.ai.edge.litert:litert-genai:1.0.0-beta01")
2. The Engine Lifecycle
Loading a 2 GB model is expensive, so you should not rebuild the engine on every configuration change. Wrap it in a singleton to keep it alive:
object LiteRTLMManager {
    private var engine: GenAI? = null

    fun initialize(modelPath: String, context: Context) {
        val options = GenAIModelOptions().apply {
            setModelPath(modelPath)
            setBackend(Backend.GPU) // GPU for broad compatibility
        }
        engine = GenAI.createModel(options)
    }
}
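To actually keep the engine alive for the whole process, initialize the singleton once in Application.onCreate. A minimal sketch (the model filename and on-device location are placeholders; you ship or download the file yourself):

import android.app.Application
import java.io.File

class ChatApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        // Placeholder filename: copy or download the converted model here first.
        val modelFile = File(filesDir, "gemma-3n-int4.task")
        if (modelFile.exists()) {
            // One-time init; the singleton then survives configuration changes
            // and activity recreation.
            LiteRTLMManager.initialize(modelFile.absolutePath, this)
        }
    }
}

Remember to register the class in AndroidManifest.xml via android:name=".ChatApplication".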
3. UX Design: Streaming and Adapters
A modern LLM interface must stream partial tokens, similar to ChatGPT. LiteRT provides token streams, and Kotlin Flows make the integration seamless; a minimal bridge is sketched after the list below.
UI essentials:
- A floating input bar that handles keyboard insets
- A RecyclerView or LazyColumn that scrolls smoothly as tokens arrive
- Markdown rendering for code blocks and formatting
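To bridge a callback-style token stream into Compose or RecyclerView updates, kotlinx.coroutines' callbackFlow works well. A minimal sketch, assuming a hypothetical listener-based generate method on the engine (the exact LiteRT callback names may differ):

import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Hypothetical callback shape; adapt to the actual LiteRT listener API.
interface TokenListener {
    fun onToken(token: String)
    fun onDone()
}

fun streamResponse(prompt: String): Flow<String> = callbackFlow {
    val listener = object : TokenListener {
        // Each partial token is emitted into the Flow as it arrives.
        override fun onToken(token: String) { trySend(token) }
        override fun onDone() { close() }
    }
    LiteRTLMManager.generate(prompt, listener) // hypothetical method
    awaitClose { /* cancel generation here if the API supports it */ }
}

Collecting this Flow in a ViewModel and appending each token to the last chat message gives the familiar typewriter effect.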
Phase 3: The Expert’s Guide — The “Wow”
Understanding Gemma 3n Architecture: MatFormer
Gemma 3n is not merely “smaller.” It uses a more flexible Transformer derivative called MatFormer (Matryoshka Transformer), a nested Transformer design.
Differences from traditional Transformers:
- Traditional models use fixed dimensions for all layers.
- MatFormer allows hierarchical nesting of dimensions.
- Enables more efficient scaling and slicing of the model.
With Per-Layer Embedding (PLE), Gemma 3n achieves a level of parameter efficiency that allows a 2B model to perform closer to a 7B model in many tasks.
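The nesting idea is easy to make concrete: in a MatFormer-style layer, a smaller sub-model uses only a prefix of the full layer's hidden units, so one set of weights serves several model sizes. A deliberately simplified toy sketch (not Gemma's actual weights or API):

// Toy feed-forward layer: the "small" model reuses the first rows of the
// same weight matrix the "large" model uses.
fun feedForward(x: DoubleArray, weights: Array<DoubleArray>, hiddenWidth: Int): DoubleArray =
    DoubleArray(hiddenWidth) { i ->
        // Hidden unit i = dot product of the input with weight row i.
        weights[i].indices.sumOf { j -> weights[i][j] * x[j] }
    }

fun main() {
    val x = doubleArrayOf(1.0, 2.0, 3.0)
    val weights = Array(8) { i -> DoubleArray(x.size) { j -> 0.1 * (i + j) } }

    val large = feedForward(x, weights, hiddenWidth = 8) // full slice
    val small = feedForward(x, weights, hiddenWidth = 4) // nested sub-model

    // The nested model's activations are a strict prefix of the full model's.
    println(large.take(4) == small.toList()) // prints: true
}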
The Qualcomm QNN Factor
Running LLMs on CPU is slow. GPUs (via OpenCL) provide strong performance but use more power. NPUs are designed specifically for high-throughput AI workloads.
Qualcomm’s QNN (Qualcomm Neural Network) runtime allows offloading LLM computation to the Hexagon DSP.
Key components:
- HTP Backend: The Hexagon Tensor Processor accelerates matrix operations.
- Quantization: NPU-friendly models must be in int8 or int4 formats. Gemma 3n generally uses int4.
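Quantization itself is simple to picture: float weights are mapped to a small integer range with a scale factor and dequantized on the fly during compute. A toy symmetric int8 example (real int4 pipelines use per-channel or per-group scales and packed storage):

import kotlin.math.abs
import kotlin.math.roundToInt

// Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
fun quantize(weights: FloatArray): Pair<ByteArray, Float> {
    val scale = weights.maxOf { abs(it) } / 127f
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return q to scale
}

fun dequantize(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { i -> q[i] * scale }

fun main() {
    val w = floatArrayOf(0.12f, -0.5f, 0.33f, -0.07f)
    val (q, scale) = quantize(w)
    // 4x smaller storage; values recovered to within one quantization step.
    println(dequantize(q, scale).toList())
}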
Integrating QNN Through LiteRT
LiteRT supports QNN through hardware delegates. Selecting Backend.NPU activates the flow:
- LiteRT checks for QNN drivers (libQnnHtp.so).
- It converts the model graph to a QNN-compatible format.
- Execution is pushed to the Hexagon DSP.
- If something is incompatible, LiteRT silently falls back to CPU or GPU.
QNN is sensitive to:
- Model export formats
- Delegate versions
- Firmware differences
- Operator coverage
Because of this, for a general-release Android app, GPU (OpenCL) is currently the most stable backend. However, the architecture supports NPU switching for devices with compatible binaries.
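Given that fragility, a pragmatic pattern is to probe for NPU support and fall back to GPU explicitly, rather than relying only on LiteRT's silent fallback. A hedged sketch reusing the Backend enum from the earlier snippet (the driver path is illustrative and varies by device):

import java.io.File

// Illustrative check: QNN HTP driver libraries typically live under
// /vendor/lib64 on Qualcomm devices; absence means the NPU path cannot work.
fun hasQnnDriver(): Boolean =
    File("/vendor/lib64/libQnnHtp.so").exists()

fun pickBackend(): Backend =
    if (hasQnnDriver()) Backend.NPU // fastest path when drivers and ops line up
    else Backend.GPU                // OpenCL: the stable default for general release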
Performance Metrics
Real-world performance varies by backend:
Backend | Tokens/sec | Notes
--- | --- | ---
CPU | 4–5 | Usable only for tests
GPU (Adreno, OpenCL) | 30–50 | Smooth, real-time conversation
NPU (Hexagon QNN) | 50–80+ | Fastest and most efficient
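To reproduce numbers like these on your own device, count tokens against wall-clock time while collecting the stream. A small sketch using the streamResponse Flow from the streaming section (throughput only; it ignores first-token warm-up latency):

import kotlinx.coroutines.flow.onCompletion
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    var tokens = 0
    val start = System.nanoTime()
    streamResponse("Explain MatFormer in one paragraph.")
        .onCompletion {
            val seconds = (System.nanoTime() - start) / 1e9
            // Rough steady-state throughput for this prompt and backend.
            println("%.1f tokens/sec".format(tokens / seconds))
        }
        .collect { tokens++ }
}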
Conclusion
We built more than a chat interface.
We built a private, offline, hardware-accelerated AI assistant using:
- Google’s latest GenAI runtime (LiteRT)
- A cutting-edge 2B model (Gemma 3n with MatFormer)
- Qualcomm’s mobile AI silicon (QNN/HTP)
The result demonstrates a clear shift: the future of AI is not just in the cloud. It is also on-device, running where users are.
Source Code
Repository:
https://github.com/carrycooldude/Gemma3n-QNN-LiteRT
Clone it, open it in Android Studio, and run a state-of-the-art LLM directly on your phone.