From Zero to Hero: Running Google’s Gemma 3n on Android with LiteRT & Qualcomm QNN

A complete journey from “What is On-Device AI?” to deploying a state-of-the-art 2B LLM on your phone using Neural Processing Units.

Phase 1: The Beginner’s Guide — The “Why”

The Cloud is Overrated

For years, AI on mobile meant sending your data to a remote server for processing. It worked, but it came with serious drawbacks:

  • Latency: Even the speed of light introduces delay.
  • Privacy: Your data leaves your device.
  • Cost: Every API call costs money.

On-Device AI flips the model entirely.
Instead of shipping your prompt to the cloud, the model runs on your phone. It works offline, feels instant, and your data never leaves your pocket.

Meet the Star: Gemma 3n

Google’s Gemma family is open-source and designed for edge use. Gemma 3n is an instruction-tuned model with 2 billion parameters, small enough to fit on-device yet powerful enough to handle reasoning, context, and code generation.

Key advantages:

  • Approximately 2 GB in int4 form
  • High instruction-following capability
  • Efficient enough for real-time mobile inference
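The "approximately 2 GB" figure follows from simple arithmetic on parameter count and bits per weight. A back-of-envelope sketch (the parameter count and the overhead note are illustrative assumptions, not official numbers for any specific Gemma 3n export):

```kotlin
// Back-of-envelope on-disk size of a quantized model.
fun approxModelSizeGiB(paramCount: Long, bitsPerWeight: Int): Double =
    paramCount.toDouble() * bitsPerWeight / 8.0 / (1 shl 30)

fun main() {
    val params = 2_000_000_000L  // ~2B parameters (illustrative)
    println("int4: %.2f GiB".format(approxModelSizeGiB(params, 4)))  // ~0.93 GiB of raw weights
    println("int8: %.2f GiB".format(approxModelSizeGiB(params, 8)))  // ~1.86 GiB
    // Embedding tables, metadata, and tokenizer assets push the
    // shipped file toward the ~2 GB figure quoted above.
}
```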

Phase 2: The Intermediate Guide — The “How”

Understanding the Tooling: Google LiteRT

TensorFlow Lite has evolved into LiteRT (Lite Runtime): Google’s high-performance inference engine for generative AI. It provides a Kotlin/Java-friendly API on top of a highly optimized C++ backend.

LiteRT handles:

  • Tokenization
  • Model execution
  • Hardware acceleration
  • Streaming text generation

Building the App

A proper LLM app is more than a terminal. We want a modern, polished chat experience.

1. The Setup

Add the LiteRT-LM dependency:

implementation("com.google.ai.edge.litert:litert-genai:1.0.0-beta01")

2. The Engine Lifecycle

Loading a 2 GB model is expensive, so the engine should not be rebuilt on every configuration change. Wrap it in a singleton to keep it alive for the life of the process:

import android.content.Context

object LiteRTLMManager {
    private var engine: GenAI? = null

    fun initialize(modelPath: String, context: Context) {
        if (engine != null) return  // already initialized; keep the loaded model
        val options = GenAIModelOptions().apply {
            setModelPath(modelPath)
            setBackend(Backend.GPU) // GPU for broad compatibility
        }
        engine = GenAI.createModel(options)
    }
}

3. UX Design: Streaming and Adapters

A modern LLM interface must stream partial tokens, similar to ChatGPT. LiteRT provides token streams; Kotlin Flows make integration seamless.

UI essentials:

  • A floating input bar that handles keyboard insets
  • A RecyclerView or LazyColumn that scrolls smoothly as tokens arrive
  • Markdown rendering for code blocks and formatting
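The core of the streaming UX is accumulating tokens into the current message and re-rendering on each arrival. A minimal sketch of that pattern with stand-in names (TokenListener and fakeGenerate are hypothetical; LiteRT's real generation API differs, but the accumulation loop is the same):

```kotlin
// Hypothetical callback interface standing in for the engine's token stream.
fun interface TokenListener {
    fun onToken(token: String)
}

// Stand-in for the engine: emits tokens one at a time.
fun fakeGenerate(prompt: String, listener: TokenListener) {
    listOf("Hello", ", ", "world", "!").forEach(listener::onToken)
}

class ChatMessage {
    private val text = StringBuilder()
    fun append(token: String) { text.append(token) }  // a real UI re-renders here
    fun current(): String = text.toString()
}

fun main() {
    val message = ChatMessage()
    fakeGenerate("Hi") { message.append(it) }
    println(message.current())  // Hello, world!
}
```

In the app, the same callback would feed a Kotlin Flow or StateFlow that the LazyColumn observes.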

Phase 3: The Expert’s Guide — The “Wow”

Understanding Gemma 3n Architecture: MatFormer

Gemma 3n is not merely “smaller.” It uses MatFormer (Matryoshka Transformer), a nested Transformer architecture.

Differences from traditional Transformers:

  • Traditional models use fixed dimensions for all layers.
  • MatFormer allows hierarchical nesting of dimensions.
  • Enables more efficient scaling and slicing of the model.

With Per-Layer Embedding (PLE), Gemma 3n achieves a level of parameter efficiency that allows a 2B model to perform closer to a 7B model in many tasks.
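The nesting idea can be made concrete: in a MatFormer-style layer, a smaller sub-model’s weight matrix is a leading slice of the full model’s matrix, so one checkpoint can be served at several sizes. A toy illustration (dimensions are invented, not Gemma 3n’s real shapes):

```kotlin
// The nested sub-layer reuses the top-left block of the full weight matrix.
fun sliceWeights(full: Array<DoubleArray>, subDim: Int): Array<DoubleArray> =
    Array(subDim) { row -> full[row].copyOfRange(0, subDim) }

fun main() {
    // "Full" 4x4 layer; the nested 2x2 sub-layer shares its leading block.
    val full = Array(4) { r -> DoubleArray(4) { c -> (r * 4 + c).toDouble() } }
    val sub = sliceWeights(full, 2)
    println(sub.map { it.toList() })  // [[0.0, 1.0], [4.0, 5.0]]
}
```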

The Qualcomm QNN Factor

Running LLMs on CPU is slow. GPUs (via OpenCL) provide strong performance but use more power. NPUs are designed specifically for high-throughput AI workloads.

Qualcomm’s QNN (Qualcomm Neural Network) runtime allows offloading LLM computation to the Hexagon DSP.

Key components:

  • HTP Backend: The Hexagon Tensor Processor accelerates matrix operations.
  • Quantization: NPU-friendly models must be in int8 or int4 formats. Gemma 3n generally uses int4.
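To see why int8/int4 works at all, here is a minimal symmetric int8 quantization sketch. This shows the general technique, not LiteRT’s actual converter, which uses calibrated, typically per-channel schemes:

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Map floats onto signed 8-bit integers with a single shared scale.
fun quantize(weights: DoubleArray): Pair<ByteArray, Double> {
    val scale = weights.maxOf { abs(it) } / 127.0
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return q to scale
}

fun dequantize(q: ByteArray, scale: Double): DoubleArray =
    DoubleArray(q.size) { i -> q[i] * scale }

fun main() {
    val w = doubleArrayOf(0.8, -0.5, 0.1)
    val (q, scale) = quantize(w)
    // Each value is recovered to within one quantization step (the scale).
    println(dequantize(q, scale).toList())
}
```

int4 follows the same idea with 16 levels instead of 255, trading accuracy for a file roughly half the size.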

Integrating QNN Through LiteRT

LiteRT supports QNN through hardware delegates. Selecting Backend.NPU activates the flow:

  1. LiteRT checks for QNN drivers (libQnnHtp.so).
  2. It converts the model graph to a QNN-compatible format.
  3. Execution is pushed to the Hexagon DSP.
  4. If something is incompatible, LiteRT silently falls back to CPU or GPU.

QNN is sensitive to:

  • Model export formats
  • Delegate versions
  • Firmware differences
  • Operator coverage

Because of this, for a general-release Android app, GPU (OpenCL) is currently the most stable backend. However, the architecture supports NPU switching for devices with compatible binaries.
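The NPU-first, GPU-fallback policy described above can be sketched as ordinary Kotlin. Backend and the availability probe here are stand-ins, not the LiteRT delegate API:

```kotlin
// Preference order mirrors the fallback chain: NPU, then GPU, then CPU.
enum class Backend { NPU, GPU, CPU }

fun selectBackend(isAvailable: (Backend) -> Boolean): Backend =
    listOf(Backend.NPU, Backend.GPU, Backend.CPU)
        .first { it == Backend.CPU || isAvailable(it) }  // CPU is the last resort

fun main() {
    // Simulate a device without QNN drivers (no libQnnHtp.so) but with OpenCL:
    val chosen = selectBackend { backend -> backend == Backend.GPU }
    println(chosen)  // GPU
}
```

In a real app the probe would check for the QNN runtime libraries before requesting the NPU delegate.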

Performance Metrics

Real-world performance varies by backend:

Backend                 Tokens/sec   Notes
CPU                     4–5          Usable only for tests
GPU (Adreno, OpenCL)    30–50        Smooth, real-time conversation
NPU (Hexagon QNN)       50–80+       Fastest and most efficient
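Figures like these are typically measured by counting decoded tokens over wall-clock time. A trivial sketch (the token count and elapsed time are example inputs, not a real benchmark):

```kotlin
// Throughput = tokens decoded per second of wall-clock time.
fun tokensPerSecond(tokenCount: Int, elapsedMillis: Long): Double =
    tokenCount * 1000.0 / elapsedMillis

fun main() {
    // e.g. 256 tokens decoded in 6.4 s = 40 tok/s, in the GPU range above
    println("%.1f tok/s".format(tokensPerSecond(256, 6_400)))
}
```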

Conclusion

We built more than a chat interface.
We built a private, offline, hardware-accelerated AI assistant using:

  • Google’s latest GenAI runtime (LiteRT)
  • A cutting-edge 2B model (Gemma 3n with MatFormer)
  • Qualcomm’s mobile AI silicon (QNN/HTP)

The result demonstrates a clear shift: the future of AI is not just in the cloud. It is also on-device, running where users are.

Source Code

Repository:
https://github.com/carrycooldude/Gemma3n-QNN-LiteRT

Clone it, open it in Android Studio, and run a state-of-the-art LLM directly on your phone.


From Zero to Hero: Running Google’s Gemma 3n on Android with LiteRT & Qualcomm QNN was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.
