Prologue
Lately, I’ve become deeply interested in working with Edge AI. What fascinates me most is the potential it unlocks. From my perspective, the future belongs to this approach. AI models are becoming increasingly powerful and compact, reaching a point where they deliver impressive results while requiring far fewer resources. At the same time, devices are growing more capable, making it possible to run even resource-intensive models directly on them.
Imagine having AI agents that don’t need an internet connection to function effectively. No recurring costs to keep them operational. They could learn from your actions while keeping all your data safely on your device, never transmitting it elsewhere. A dream, right? No — it’s no longer just a dream. This is already on the verge of becoming a practical reality, and I firmly believe this field will evolve at an incredible pace in the coming years.
While experimenting with these technologies, I realized how fragmented the available information is. For mobile developers just beginning to dive into the world of machine learning, navigating it all can be a daunting task. That’s why I decided to write a series of articles where I’ll explain in detail, step by step, with examples, how to work with open AI models and deploy them on mobile devices and in browsers.
To start this series, I’ll focus on a use case I’ve wanted to explore in detail for a long time: how to fine-tune a model with your own data for use on mobile devices. But instead of uploading the fine-tuned model every time, you’ll update only the additional LoRA weights, keeping the base model unchanged. This approach is not only efficient but also highly practical, and I’m excited to share all the details with you.
The terms first
Before diving into the topic, let’s clarify the terms in the title, just in case some of them are unfamiliar: Gemma, Fine-Tuning, LoRA, On-Device Inference, and MediaPipe.
Each of these points could easily be the subject of an entire article, but I won’t do that because such articles already exist. You can check them out by following the links provided; here, I’ll stick to brief definitions.
- Gemma: A powerful and flexible open-weight AI model from Google, designed for various use cases. It’s compact enough to be used on devices while still offering great performance, making it ideal for edge AI applications.
- Fine-Tuning: A process of training an existing model on new, specific data to adapt it for a particular task. Fine-tuning allows you to leverage the knowledge of a pre-trained model while tailoring it to your needs without retraining the model from scratch.
- LoRA (Low-Rank Adaptation): A clever technique for fine-tuning large models. Instead of modifying the entire model, LoRA adds small, efficient layers to adapt the model for specific tasks. This keeps the base model unchanged, making updates lightweight and efficient.
- On-Device Inference: The ability to run AI models directly on a device (like a smartphone or laptop) rather than relying on a cloud server. This approach improves privacy, reduces latency, and eliminates the need for an internet connection.
- MediaPipe: A versatile framework developed by Google for building AI pipelines, especially for on-device applications. MediaPipe makes it easy to integrate machine learning models into real-time mobile or desktop apps.
So, the Title of the Article Can Be Expanded Like This:
In this article, you will learn how to customize a powerful and flexible AI model (Gemma) by training it on your own data (fine-tuning) to adapt it for a specific task. Instead of modifying the entire model, you’ll add small, efficient layers that make updates lightweight and keep the base model unchanged (LoRA). Once the model is adapted, you will be able to run it directly on a device (On-Device Inference), ensuring privacy, reducing latency, and removing the need for an internet connection, using a versatile framework (MediaPipe) to integrate the customized model into real-time applications running locally.
Chapter I: On-Device Inference
Let’s start from the end. To run a model locally on your device, you’ll need MediaPipe. It provides SDKs for Android, iOS, and JavaScript, which you can use in native mobile applications and in the browser. Additionally, for Flutter applications, there’s a plugin called flutter_gemma (which I also developed). This plugin allows you to integrate MediaPipe into Flutter apps seamlessly.
MediaPipe works with models in the LiteRT format (formerly known as TensorFlow Lite). The good news is that the team at Google has already prepared models in this format for you! You can find Gemma and Gemma-2 on Kaggle, download them to your computer, and then transfer them to your phone for local use.
Kaggle is an online platform for data science and machine learning competitions, where users can access datasets, write and share code, collaborate with other data scientists, and participate in challenges to build predictive models. It also offers free cloud-based Jupyter notebooks and educational resources for learning AI and data science.

I covered running models on devices in my other article, so for detailed instructions on how to get a model onto a device, please read “The Dawn of Offline AI Agents in Your Pocket”. But in this article, I want to focus on something different: what if you have your own dataset that you want to train a model on, and only then use it on your device? How can you achieve that?
Chapter II: Fine-Tuning — Theory
I’ve already given a definition of fine-tuning earlier, but let’s go over the process once again.
We start with a pre-trained model, which has already learned general patterns from a large dataset. Then, we train it further using additional data from our own dataset. As a result, the fine-tuned model adapts to our specific task and provides responses that take this new data into account.
There are many different methods for fine-tuning, including full fine-tuning, top-layer fine-tuning, adapter-based fine-tuning and others. Each approach has its own advantages and drawbacks, making it suitable for different types of tasks.
For our case, LoRA (Low-Rank Adaptation) is the best choice. This method allows us to achieve excellent results while keeping resource usage minimal. Additionally, LoRA offers several other advantages that make it particularly useful for our scenario — I’ll go over those in more detail later.
Low Rank Adaptation (LoRA) is a fine-tuning technique which greatly reduces the number of trainable parameters for downstream tasks by freezing the weights of the model and inserting a smaller number of new weights into the model. This makes training with LoRA much faster and more memory-efficient, and produces smaller model weights (a few hundred MBs), all while maintaining the quality of the model outputs.
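If it helps to see the idea in code, here’s a toy sketch in PyTorch of a linear layer with a frozen base weight and a small trainable low-rank update. This is deliberately simplified and is not how the peft library implements it internally:
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    """Toy sketch of a LoRA-augmented linear layer (illustrative, not the peft implementation)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained weights stay frozen
        # Two small trainable matrices: A (rank x in_features) and B (out_features x rank)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank
    def forward(self, x):
        # Frozen output plus the scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
Only lora_A and lora_B receive gradients, which is why a LoRA checkpoint stays tiny compared to the base model.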
Take a look at this visualization: Imagine the large frozen robot as a pre-trained AI model. It’s powerful and packed with knowledge, but it’s static and not specialized for a specific task. The small robot represents LoRA, a lightweight adaptive layer that guides the large model without modifying its core. Instead of retraining the entire massive robot (which would require enormous resources), the small robot helps fine-tune and adjust its behavior with minimal effort and computational cost.
In this way, LoRA enables us to achieve great results without “melting” the entire frozen model — just by adding small but effective adjustments in the right places. That’s why this method is so efficient for adapting large models to new tasks.
But the most exciting feature of this approach is that it allows LoRA weights to be stored separately from the model itself. This means we don’t have to download the entire multi-gigabyte model every time we fine-tune it — a game-changer for mobile applications.
In a standard fine-tuning scenario, each time we gather new data and retrain the model, we would have to re-download the entire model.
Two articles describe this approach: “Fine Tune — Gemma 2b-it model” by Aashi Dutt and “Deploy Gemma on Android” by Nitin Tiwari. I highly recommend taking a look at them as well.
However, with LoRA, I can download the base model only once and, after each fine-tuning session, update only a small file containing LoRA weights.

This is an incredible advantage, especially in resource-constrained environments like mobile or web applications, and this is exactly the approach I want to explore in detail here!
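To make this concrete on the training side, here’s a small illustration with the peft library (the adapter paths are hypothetical): the multi-gigabyte base model is downloaded once, and every fine-tuning run produces only a small adapter folder that can be swapped in.
from transformers import AutoModelForCausalLM
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")      # big download, done once
quotes_model = PeftModel.from_pretrained(base, "adapters/quotes")   # few-MB LoRA adapter
# Switching tasks only means loading another small adapter folder, e.g.:
# support_model = PeftModel.from_pretrained(base, "adapters/support-bot")
MediaPipe applies the same idea on the device: one converted base model file plus a small lora.bin per task, as we’ll see in the conversion step.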
MediaPipe supports several open models for the LLM Inference Task, making it easy to run powerful AI models directly on devices. Below are some of the supported models:
- Gemma 2B & Gemma 7B — Open-weight language models developed by Google, optimized for efficiency and performance in NLP tasks. These models are great for text generation, chatbots, and summarization.
- Gemma-2 2B — An updated version of Gemma with improvements in efficiency and adaptability for on-device inference.
- Phi-2 — A compact yet powerful model designed for reasoning and general-purpose NLP tasks, optimized for smaller hardware while maintaining strong performance.
- Falcon-RW-1B — A lightweight model from the Falcon series, ideal for resource-efficient language generation and understanding.
- StableLM-3B — A model designed for open-ended text generation, providing robust language capabilities with a relatively small footprint.
In my example, I am using Gemma-2B, but the fine-tuning and deployment approach remains similar for all these models, allowing flexibility based on the use case and hardware constraints.
Chapter III: Fine-Tuning — Practice
You can, of course, fine-tune a model on your laptop, but this is a resource-intensive process, and it’s better to use more powerful machines. I used Google Colab, a cloud-based platform that provides free and paid access to GPUs and TPUs, making it much easier to train AI models without needing high-end hardware.
Google Colab (or Colaboratory) is a cloud-based Jupyter notebook environment that allows you to write and execute Python code in your browser. It provides free access to GPUs and TPUs, making it ideal for machine learning, deep learning, and AI model training without needing a powerful local computer.
✅ No setup required — Runs entirely in the cloud.
✅ Free GPU/TPU access — Great for training AI models.
✅ Supports Python and popular ML libraries like TensorFlow, PyTorch, and Hugging Face.
✅ Collaboration-friendly — Share and edit notebooks with others in real time.
🔗 Here’s the link to my Colab workshop, where I use an open dataset of famous quotes to train the model to continue a prompt in a specific style if it starts like a quote. You can follow this workshop step by step, either replicating my process or tweaking it to fit your own dataset and goals.
However, for convenience, I’ll also break down the steps in detail below.
Step 1: Selecting the Runtime
To run this tutorial efficiently, you’ll need access to a GPU in Google Colab:
- Click on Runtime → Change runtime type
- Under Hardware Accelerator, select T4 GPU or A100 GPU (recommended, if available)
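Once the runtime is switched, you can optionally confirm from a code cell that the GPU is actually available:
import torch
# Quick sanity check that the GPU runtime is active
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))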
Step 2: Setting Up Environment Variables
Before running the code, make sure you’ve added your Hugging Face token to Colab’s Secrets (this is what userdata reads from):
- In the left sidebar, open Secrets (the 🔑 key icon) and allow notebook access
- Add HF_TOKEN as the name and your Hugging Face token as the value. For the workshop, we will be loading the model from Hugging Face.
import os
from google.colab import userdata
# Set environment variables
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["WANDB_MODE"] = "offline"
Hugging Face is a leading platform for AI and NLP models, providing a vast repository of pre-trained models, datasets, and tools for machine learning. It offers Transformers, an open-source library for working with deep learning models, and the Model Hub, where users can find, share, and deploy models for various AI tasks
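Optionally, you can verify that the token is picked up correctly before downloading the gated Gemma weights. This assumes the huggingface_hub package is available in your runtime (it comes preinstalled in Colab):
import os
from huggingface_hub import whoami
# Should print your Hugging Face username if the token is valid
print(whoami(token=os.environ["HF_TOKEN"])["name"])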
Step 3: Installing Required Libraries
Install all necessary dependencies for fine-tuning, exporting weights, and integrating them with MediaPipe:
!pip install -q \
  transformers \
  mediapipe \
  bitsandbytes \
  peft \
  trl \
  datasets \
  fsspec==2024.6.1 \
  gcsfs==2024.9.0
Here’s a brief explanation of each library:
- transformers — Provides pre-trained AI models and tools for natural language processing (NLP) tasks.
- mediapipe — A framework for running AI models efficiently on mobile and web applications.
- bitsandbytes — Provides memory-efficient optimizers and quantization techniques, helping run large models efficiently on limited hardware.
- peft — A library for parameter-efficient fine-tuning (LoRA, adapters, etc.) of large models.
- trl — A library for reinforcement learning (RLHF) and fine-tuning large language models.
- datasets — A Hugging Face library for accessing and managing datasets for ML training.
- fsspec — A file system abstraction library for handling storage across different environments.
- gcsfs — A library for interacting with Google Cloud Storage (GCS) using Python.
These libraries together enable fine-tuning, optimizing, and deploying AI models efficiently for on-device and cloud-based inference.
Step 4: Loading and Saving the Pretrained Gemma Model
To fine-tune Gemma, we first need to load the pre-trained model and its associated tokenizer from Hugging Face’s Model Hub. The model will be saved locally so that it can be used for training later.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define model ID
model_id = "google/gemma-2b"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("/content/gemma2b")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Save model locally
model.save_pretrained("/content/gemma2b")
Remark: If you’re wondering why I’m not using 4-bit quantization, it’s because this model currently has issues with saving LoRA weights for MediaPipe. Once I find a solution, I’ll update this step accordingly.
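For reference only, and with the caveat above in mind, 4-bit loading with bitsandbytes would look roughly like this once the export issue is resolved; it isn’t used anywhere else in this workshop:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Hypothetical 4-bit setup (not used in this workshop)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)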
Step 5: Checking Inference with the Pre-Trained Model
Before fine-tuning, let’s test the pre-trained Gemma-2B model to see how it generates text based on a given prompt. This helps us understand the model’s baseline performance before any modifications.
text = "Quote: Imagination is"
device = "cuda:0" # Ensure the model runs on GPU if available
# Tokenize the input and move it to the correct device
inputs = tokenizer(text, return_tensors="pt").to(device)
# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=20)
# Decode and print the generated text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected Output Example:
Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. - Albert Einstein
This step helps verify that the model is correctly loaded and functioning as expected before we begin fine-tuning.
Remark: I took the quotes example from a great article, “Your Ultimate Guide to Instinct Fine-Tuning and Optimizing Google’s Gemma 2B Using LoRA” by Mohammed Ashraf.
Step 6: Loading a Dataset for Fine-Tuning
To fine-tune the Gemma-2B model, we need a dataset that contains text examples related to our target task. In this step:
- Load a dataset — Retrieve an open dataset from Hugging Face’s datasets library.
- Format the dataset — Convert the raw data into a format that our model can use.
- Prepare for tokenization — Transform the text into a structured format for fine-tuning.
In this case, we use a dataset of English quotes, but you can replace it with any dataset relevant to your use case.
from datasets import load_dataset
# Load a dataset of English quotes
data = load_dataset("Abirate/english_quotes")
# Function to format dataset properly
def formatting_func(example):
    if isinstance(example["quote"], list):
        return [
            f"Quote: {quote}\nAuthor: {author}"
            for quote, author in zip(example["quote"], example["author"])
        ]
    return [f"Quote: {example['quote']}\nAuthor: {example['author']}"]
print(formatting_func(data["train"]))
There is a nice article, “Step-by-Step Dataset Creation- Unstructured to Structured” by Aashi Dutt, with an explanation of how to prepare your own dataset.
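If you want to plug in your own data instead, here’s a minimal sketch assuming a hypothetical my_quotes.jsonl file with the same quote and author fields, so formatting_func keeps working unchanged:
from datasets import load_dataset
# Hypothetical JSON Lines file with {"quote": ..., "author": ...} records
my_data = load_dataset("json", data_files="my_quotes.jsonl")
print(my_data["train"][0])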
Step 7: Configuring and Applying LoRA for Fine-Tuning
LoRA (Low-Rank Adaptation) is a method that enables efficient fine-tuning of large models by adding small trainable layers instead of modifying the entire model.
Steps:
- Define LoRA parameters — Configure key settings such as rank, target modules, and task type.
- Apply LoRA to the model — Attach LoRA layers to specific transformer components to minimize memory usage.
from peft import LoraConfig, get_peft_model

# Configure LoRA settings
lora_config = LoraConfig(
    r=8,  # Rank of the LoRA adaptation matrices
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],  # Layers to apply LoRA
    task_type="CAUSAL_LM",  # Task type: Causal Language Modeling
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
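An optional sanity check makes the “small trainable layers” point tangible: PEFT can report how many parameters will actually be trained.
# With r=8, only a small fraction of the 2B parameters is trainable
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...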
Step 8: Fine-Tuning the Model
Once LoRA has been applied to the model, we move on to fine-tuning, where we update only the LoRA-adapted layers while keeping the base model frozen. This allows us to efficiently adapt the model to a specific dataset without the need for extensive computational resources.
Breaking Down the Fine-Tuning Process:
1️⃣ Define Training Arguments
We set up key parameters that control the training process:
- Batch Size (per_device_train_batch_size) – Defines how many samples are processed at once.
- Gradient Accumulation Steps (gradient_accumulation_steps) – Accumulates gradients over multiple steps to simulate a larger batch size.
- Warmup Steps (warmup_steps) – Gradually increases the learning rate at the beginning to stabilize training.
- Max Steps (max_steps) – Specifies the total number of training iterations.
- Learning Rate (learning_rate) – Controls how much the model updates its parameters with each step.
- FP16 Training (fp16=True) – Uses mixed precision for faster and more memory-efficient training.
- Logging Steps (logging_steps) – Determines how often training metrics are logged.
- Optimizer (optim="paged_adamw_8bit") – Uses a memory-efficient 8-bit version of AdamW to optimize training.
2️⃣ Use SFTTrainer for Efficient Training
SFTTrainer is a utility from trl designed for supervised fine-tuning of large language models, simplifying training and integrating seamlessly with LoRA.
3️⃣ Train the Model
Only the LoRA layers are updated while the base model remains unchanged. This dramatically reduces training time and memory usage.
4️⃣ Save the Fine-Tuned Model
After training, we store the LoRA fine-tuned model weights so they can be used later without retraining.
from transformers import TrainingArguments
from trl import SFTTrainer

# Define the training arguments
trainer = SFTTrainer(
    model=model,  # LoRA-enhanced model
    train_dataset=data["train"],  # Training dataset
    args=TrainingArguments(
        per_device_train_batch_size=1,  # Batch size per device
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps
        warmup_steps=10,  # Steps to warm up the learning rate
        max_steps=100,  # Total number of training steps
        learning_rate=2e-4,  # Learning rate
        fp16=True,  # Use mixed precision for faster training
        logging_steps=1,  # Log metrics every step
        output_dir="outputs",  # Directory for saving checkpoints and logs
        optim="paged_adamw_8bit"  # Use 8-bit AdamW optimizer for memory efficiency
    ),
    peft_config=lora_config,  # LoRA configuration
    formatting_func=formatting_func,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.model.save_pretrained("/content/gemma2b/lora")
Step 9: Testing the Fine-Tuned Model with Inference
After fine-tuning the Gemma-2B model with LoRA, it’s crucial to test it by running inference. We are using the exact same prompt as before tuning and will verify that the result is now different. This will allow us to clearly compare how the changes have impacted the model’s output and whether it has become closer to the expected result.
text = "Quote: Imagination is"
device = "cuda:0" # Ensure the model runs on GPU if available
# Tokenize the input and move it to the correct device
inputs = tokenizer(text, return_tensors="pt").to(device)
# Generate text with the model
outputs = model.generate(**inputs, max_new_tokens=20)
# Decode and print the generated text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Expected Output Example:
Quote: Imagination is more important than knowledge.
Author: Albert Einstein
Why is the output different now?
Since we fine-tuned the model using LoRA, it has learned from the new dataset and adjusted its responses accordingly. Previously, the model generated a generic response based on its original training data. However, after fine-tuning, the model’s output is now influenced by the new patterns, style, and information it was trained on.
In this case, if we fine-tuned it on a dataset of philosophical quotes, we might now see responses that better reflect those ideas. If we trained it on technical documentation, its completions would be more structured and factual. This confirms that fine-tuning successfully adjusted the model’s behavior while keeping the base model unchanged!
Step 10: Converting the Fine-Tuned Model to MediaPipe Format
After fine-tuning, the next step is to convert the model into a format compatible with MediaPipe, specifically LiteRT (formerly TFLite). This ensures that the model can run efficiently on mobile devices and web browsers without requiring heavy computational resources.
1️⃣ Define Conversion Configuration
We create a ConversionConfig object that specifies:
- Model Checkpoint Path (input_ckpt) – The directory where the base model is saved.
- Checkpoint Format (ckpt_format) – Defines the format of the saved model (e.g., safetensors).
- Model Type (model_type) – Identifies the AI model being converted (GEMMA_2B).
- Inference Backend (backend) – Defines the computation backend (e.g., gpu or cpu).
- Tokenizer Path (vocab_model_file) – Ensures the tokenizer is linked to the converted model.
- LoRA Checkpoint Path (lora_ckpt) – The fine-tuned LoRA weights to be merged.
- LoRA Rank (lora_rank) – Ensures the conversion matches the LoRA configuration.
- Output Paths — Specifies where the final TFLite models should be stored.
2️⃣ Convert the Model and LoRA Weights
The base model and the fine-tuned LoRA weights are converted separately to ensure flexibility in updates.
3️⃣ Save the Converted Models
Both the base TFLite model and the LoRA-adapted weights are stored as .bin files, which can be loaded dynamically on devices.
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

# Define conversion config
conversion_config = converter.ConversionConfig(
    input_ckpt="/content/gemma2b",  # Path to the original Gemma-2B model checkpoint
    ckpt_format="safetensors",  # Format of the checkpoint
    model_type="GEMMA_2B",  # Model type
    backend="gpu",  # Backend for inference (gpu or cpu)
    combine_file_only=False,  # Whether to merge files into one binary
    output_tflite_file="/content/output/gemma2b.bin",  # Path for the converted base model
    vocab_model_file="/content/gemma2b/tokenizer.model",  # Path to tokenizer vocab file
    output_dir="/content/output",  # Directory to save the converted outputs
    lora_ckpt="/content/gemma2b/lora",  # Path to the fine-tuned LoRA checkpoint
    lora_rank=8,  # Rank of the LoRA configuration
    lora_output_tflite_file="/content/output/lora.bin"  # Path for the converted LoRA weights
)

# Convert the model to TensorFlow Lite format
converter.convert_checkpoint(conversion_config)
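Before moving the files anywhere, a quick optional check that the conversion produced both binaries:
import os
# Expect gemma2b.bin and lora.bin in the output directory
print(os.listdir("/content/output"))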
Step 11: Saving the Converted Model to Google Drive 🚀
Alright, we’ve fine-tuned our Gemma-2B model, converted it to LiteRT, and now it’s just sitting there in Colab. But we need to get it out somehow, right? There are tons of ways to do this — downloading it manually, using Google Drive, sending it via email (please, don’t 😂)… For this example, I’ll simply copy both files to Google Drive:
from google.colab import drive
# Mount your Google Drive
drive.mount('/content/drive')
# Copy the converted base model to Google Drive
!cp /content/output/gemma2b.bin /content/drive/MyDrive/gemma2b-base.bin
# Copy the converted LoRA weights to Google Drive
!cp /content/output/lora.bin /content/drive/MyDrive/gemma2b-lora.bin
In the current example, I’m saving the model to Google Drive, but in the near future, I plan to add an option to save it directly to Firebase Storage. The reason for this is that Firebase Storage makes it much easier to load the model directly onto a mobile device. Instead of manually transferring files or downloading them separately, you’ll be able to fetch the model straight from Firebase into your app with just a few lines of code. This will streamline the deployment process and make on-device AI even more seamless.
And don’t worry — I’ll explain all the details in the next article!

Stay tuned! 📲🔥
Conclusion
Congratulations! 🎉 You’ve successfully:
- Fine-tuned the Gemma-2B model using LoRA
- Converted it into LiteRT (formerly TensorFlow Lite) format for MediaPipe
- Prepared it for on-device inference on mobile and web applications
This workflow enables privacy-preserving, cost-efficient AI without relying on cloud infrastructure. Now you can integrate your fine-tuned model into real applications and optimize further based on your needs.
If you have any questions, feel free to reach out:
📧 Email: denisov.shureg@gmail.com
🔗 LinkedIn: Sasha Denisov