
Phase 1: The Beginner’s Guide — The “Why”
The Cloud is Overrated
For years, AI on mobile meant sending your data to a remote server for processing. It worked, but it came with serious drawbacks:
- Latency: Even the speed of light introduces delay.
- Privacy: Your data leaves your device.
- Cost: Every API call costs money.
On-Device AI flips the model entirely.
Instead of shipping your prompt to the cloud, the model runs on your phone. It works offline, feels instant, and your data never leaves your pocket.
Meet the Star: Gemma 3n
Google’s Gemma family of open models is designed for edge use. Gemma 3n is an instruction-tuned model with roughly 2 billion effective parameters: small enough to fit on-device, yet powerful enough to handle reasoning, context, and code generation.
Key advantages:
- Approximately 2 GB in int4 form
- High instruction-following capability
- Efficient enough for real-time mobile inference
Phase 2: The Intermediate Guide — The “How”
Understanding the Tooling: Google LiteRT
TensorFlow Lite has evolved into LiteRT (Lite Runtime): Google’s high-performance inference engine for generative AI. It provides a Kotlin/Java-friendly API on top of a highly optimized C++ backend.
LiteRT handles:
- Tokenization
- Model execution
- Hardware acceleration
- Streaming text generation
Building the App
A proper LLM app is more than a terminal. We want a modern, polished chat experience.
1. The Setup
Add the LiteRT-LM dependency:
implementation("com.google.ai.edge.litert:litert-genai:1.0.0-beta01")
2. The Engine Lifecycle
Loading a 2 GB model is expensive, so you should not rebuild the engine on every configuration change. Wrap it in a singleton to keep it alive:
object LiteRTLMManager {
    private var engine: GenAI? = null

    fun initialize(modelPath: String, context: Context) {
        val options = GenAIModelOptions().apply {
            setModelPath(modelPath)
            setBackend(Backend.GPU) // GPU for broad compatibility
        }
        engine = GenAI.createModel(options)
    }
}
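To actually keep the engine alive for the whole process, initialize the singleton once in Application.onCreate. A minimal sketch (the model filename and on-device location are placeholders; you ship or download the file yourself):

import android.app.Application
import java.io.File

class ChatApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        // Placeholder filename: copy or download the converted model here first.
        val modelFile = File(filesDir, "gemma-3n-int4.task")
        if (modelFile.exists()) {
            // One-time init; the singleton then survives configuration changes
            // and activity recreation.
            LiteRTLMManager.initialize(modelFile.absolutePath, this)
        }
    }
}

Remember to register the class in AndroidManifest.xml via android:name=".ChatApplication".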
3. UX Design: Streaming and Adapters
A modern LLM interface must stream partial tokens, similar to ChatGPT. LiteRT provides token streams, and Kotlin Flows make the integration seamless; a minimal bridge is sketched after the list below.
UI essentials:
- A floating input bar that handles keyboard insets
- A RecyclerView or LazyColumn that scrolls smoothly as tokens arrive
- Markdown rendering for code blocks and formatting
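To bridge a callback-style token stream into Compose or RecyclerView updates, kotlinx.coroutines' callbackFlow works well. A minimal sketch, assuming a hypothetical listener-based generate method on the engine (the exact LiteRT callback names may differ):

import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Hypothetical callback shape; adapt to the actual LiteRT listener API.
interface TokenListener {
    fun onToken(token: String)
    fun onDone()
}

fun streamResponse(prompt: String): Flow<String> = callbackFlow {
    val listener = object : TokenListener {
        // Each partial token is emitted into the Flow as it arrives.
        override fun onToken(token: String) { trySend(token) }
        override fun onDone() { close() }
    }
    LiteRTLMManager.generate(prompt, listener) // hypothetical method
    awaitClose { /* cancel generation here if the API supports it */ }
}

Collecting this Flow in a ViewModel and appending each token to the last chat message gives the familiar typewriter effect.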
Phase 3: The Expert’s Guide — The “Wow”
Understanding Gemma 3n Architecture: MatFormer
Gemma 3n is not merely “smaller.” It uses a more flexible Transformer derivative called MatFormer (Matryoshka Transformer), a nested Transformer design.
Differences from traditional Transformers:
- Traditional models use fixed dimensions for all layers.
- MatFormer allows hierarchical nesting of dimensions.
- Enables more efficient scaling and slicing of the model.
With Per-Layer Embedding (PLE), Gemma 3n achieves a level of parameter efficiency that allows a 2B model to perform closer to a 7B model in many tasks.
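The nesting idea is easy to make concrete: in a MatFormer-style layer, a smaller sub-model uses only a prefix of the full layer's hidden units, so one set of weights serves several model sizes. A deliberately simplified toy sketch (not Gemma's actual weights or API):

// Toy feed-forward layer: the "small" model reuses the first rows of the
// same weight matrix the "large" model uses.
fun feedForward(x: DoubleArray, weights: Array<DoubleArray>, hiddenWidth: Int): DoubleArray =
    DoubleArray(hiddenWidth) { i ->
        // Hidden unit i = dot product of the input with weight row i.
        weights[i].indices.sumOf { j -> weights[i][j] * x[j] }
    }

fun main() {
    val x = doubleArrayOf(1.0, 2.0, 3.0)
    val weights = Array(8) { i -> DoubleArray(x.size) { j -> 0.1 * (i + j) } }

    val large = feedForward(x, weights, hiddenWidth = 8) // full slice
    val small = feedForward(x, weights, hiddenWidth = 4) // nested sub-model

    // The nested model's activations are a strict prefix of the full model's.
    println(large.take(4) == small.toList()) // prints: true
}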
The Qualcomm QNN Factor
Running LLMs on CPU is slow. GPUs (via OpenCL) provide strong performance but use more power. NPUs are designed specifically for high-throughput AI workloads.
Qualcomm’s QNN (Qualcomm Neural Network) runtime allows offloading LLM computation to the Hexagon DSP.
Key components:
- HTP Backend: The Hexagon Tensor Processor accelerates matrix operations.
- Quantization: NPU-friendly models must be in int8 or int4 formats. Gemma 3n generally uses int4.
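Quantization itself is simple to picture: float weights are mapped to a small integer range with a scale factor and dequantized on the fly during compute. A toy symmetric int8 example (real int4 pipelines use per-channel or per-group scales and packed storage):

import kotlin.math.abs
import kotlin.math.roundToInt

// Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
fun quantize(weights: FloatArray): Pair<ByteArray, Float> {
    val scale = weights.maxOf { abs(it) } / 127f
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return q to scale
}

fun dequantize(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { i -> q[i] * scale }

fun main() {
    val w = floatArrayOf(0.12f, -0.5f, 0.33f, -0.07f)
    val (q, scale) = quantize(w)
    // 4x smaller storage; values recovered to within one quantization step.
    println(dequantize(q, scale).toList())
}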
Integrating QNN Through LiteRT
LiteRT supports QNN through hardware delegates. Selecting Backend.NPU activates the flow:
- LiteRT checks for QNN drivers (libQnnHtp.so).
- It converts the model graph to a QNN-compatible format.
- Execution is pushed to the Hexagon DSP.
- If something is incompatible, LiteRT silently falls back to CPU or GPU.
QNN is sensitive to:
- Model export formats
- Delegate versions
- Firmware differences
- Operator coverage
Because of this, for a general-release Android app, GPU (OpenCL) is currently the most stable backend. However, the architecture supports NPU switching for devices with compatible binaries.
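Given that fragility, a pragmatic pattern is to probe for NPU support and fall back to GPU explicitly, rather than relying only on LiteRT's silent fallback. A hedged sketch reusing the Backend enum from the earlier snippet (the driver path is illustrative and varies by device):

import java.io.File

// Illustrative check: QNN HTP driver libraries typically live under
// /vendor/lib64 on Qualcomm devices; absence means the NPU path cannot work.
fun hasQnnDriver(): Boolean =
    File("/vendor/lib64/libQnnHtp.so").exists()

fun pickBackend(): Backend =
    if (hasQnnDriver()) Backend.NPU // fastest path when drivers and ops line up
    else Backend.GPU                // OpenCL: the stable default for general release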
Performance Metrics
Real-world performance varies by backend:
Backend | Tokens/sec | Notes
--- | --- | ---
CPU | 4–5 | Usable only for tests
GPU (Adreno, OpenCL) | 30–50 | Smooth, real-time conversation
NPU (Hexagon QNN) | 50–80+ | Fastest and most efficient
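To reproduce numbers like these on your own device, count tokens against wall-clock time while collecting the stream. A small sketch using the streamResponse Flow from the streaming section (throughput only; it ignores first-token warm-up latency):

import kotlinx.coroutines.flow.onCompletion
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    var tokens = 0
    val start = System.nanoTime()
    streamResponse("Explain MatFormer in one paragraph.")
        .onCompletion {
            val seconds = (System.nanoTime() - start) / 1e9
            // Rough steady-state throughput for this prompt and backend.
            println("%.1f tokens/sec".format(tokens / seconds))
        }
        .collect { tokens++ }
}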
Conclusion
We built more than a chat interface.
We built a private, offline, hardware-accelerated AI assistant using:
- Google’s latest GenAI runtime (LiteRT)
- A cutting-edge 2B model (Gemma 3n with MatFormer)
- Qualcomm’s mobile AI silicon (QNN/HTP)
The result demonstrates a clear shift: the future of AI is not just in the cloud. It is also on-device, running where users are.
Source Code
Repository:
https://github.com/carrycooldude/Gemma3n-QNN-LiteRT
Clone it, open it in Android Studio, and run a state-of-the-art LLM directly on your phone.