Pipeline to create .task files for MediaPipe LLM Inference API

Written by Georgios Soloupis, AI and Android GDE.

MediaPipe Solutions offers a powerful suite of libraries and tools designed to help you quickly integrate artificial intelligence (AI) and machine learning (ML) into your applications. These solutions are ready to use out of the box, fully customizable, and compatible across multiple development platforms.

The LLM Inference API enables on-device execution of large language models (LLMs), allowing your applications to perform a wide range of tasks, including text generation, natural language information retrieval, and document summarization, without relying on cloud services.

In this showcase, we’ll demonstrate how to create a .task bundle, the format that packages and configures your model for use with the LLM Inference API. This format ensures efficient on-device deployment and seamless integration with your applications.

AI Edge Torch

AI Edge Torch is a Python library that converts PyTorch models into the .tflite format, which can then be run with TensorFlow Lite and MediaPipe. This enables Android, iOS, and IoT applications that run models completely on-device. AI Edge Torch offers broad CPU coverage, with initial GPU and NPU support, and integrates closely with PyTorch by building on top of torch.export().
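
To illustrate the general conversion flow (a minimal sketch based on the library’s generic convert API, separate from the Gemma-specific script used later in this post), converting an ordinary PyTorch module looks roughly like this:

import torch
import torchvision
import ai_edge_torch

# Any eval-mode PyTorch model plus sample inputs used for tracing/export.
model = torchvision.models.resnet18(weights=None).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert (torch.export() under the hood) and write a .tflite flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("resnet18.tflite")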

All steps have been verified in a Google Colab environment.

1. Clone the repository and move into it.

!git clone https://github.com/google-ai-edge/ai-edge-torch.git
%cd ai-edge-torch

2. Install the version of TensorFlow to use.

!pip install tensorflow==2.19.0

3. Install additional libraries.

!pip install torch-xla2
!pip install ai_edge_quantizer
!pip install ai_edge_litert

4. Install the huggingface_hub library.

!pip install -U "huggingface_hub[cli]"

5. Log in to Hugging Face in case you want to download models that need authentication.

from huggingface_hub import login
from google.colab import userdata

# Read the token stored in Colab's "Secrets" panel under the name HF_READ.
HF_TOKEN = userdata.get('HF_READ')

if HF_TOKEN:
    login(HF_TOKEN)
    print("Successfully logged in to Hugging Face!")
else:
    print("Token is not set. Please save the token first.")

6. Download the Gemma 3 1B model.

!huggingface-cli download google/gemma-3-1b-pt --repo-type model --local-dir ./models/google/gemma-3-1b-pt

7. Use the script to convert the model and create the .tflite file.

!python -m ai_edge_torch.generative.examples.gemma3.convert_gemma3_to_tflite \
  --quantize="dynamic_int8" \
  --checkpoint_path=/content/ai-edge-torch/models/google/gemma-3-1b-pt \
  --output_path=/content/ai-edge-torch \
  --prefill_seq_lens=2048 --kv_cache_max_len=4096
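
Here, --prefill_seq_lens and --kv_cache_max_len control the prefill signature length and the maximum KV-cache (context) size baked into the exported graph. A quick way to confirm the quantized model was written (assuming the output directory used above) is to list it:

!ls -lh /content/ai-edge-torch/*.tflite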

8. Create the tokenizer.model that is going to be used by the LLM Inference API.

%cd ai_edge_torch/generative/tools

!python tokenizer_to_sentencepiece.py \
  --checkpoint=google/gemma-3-1b-pt \
  --output_path=/content/tokenizer.model
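
As an optional sanity check (a small sketch assuming the sentencepiece package is available in the Colab runtime), the exported tokenizer can be loaded and round-tripped on a sample string:

import sentencepiece as spm

# Load the exported SentencePiece model and encode/decode a sample string.
sp = spm.SentencePieceProcessor(model_file="/content/tokenizer.model")
ids = sp.encode("Hello from MediaPipe!")
print(ids)
print(sp.decode(ids))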

9. Install the MediaPipe library.

!pip install mediapipe

10. Bundle everything together into a .task file.

TFLITE_MODEL = "/content/ai-edge-torch/gemma3-1b_q8_ekv4096.tflite"
TOKENIZER_MODEL = "/content/tokenizer.model"

# Gemma control tokens and chat-template markers expected by the LLM Inference API.
START_TOKEN = "<bos>"
STOP_TOKENS = ["<eos>", "<end_of_turn>"]

from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model=TFLITE_MODEL,
    tokenizer_model=TOKENIZER_MODEL,
    start_token=START_TOKEN,
    stop_tokens=STOP_TOKENS,
    output_filename="/content/gemma3.task",
    prompt_prefix="<start_of_turn>user\n",
    prompt_suffix="<end_of_turn>\n<start_of_turn>model\n",
)
bundler.create_bundle(config)
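
Once the bundler finishes, the .task file can be checked and pulled out of the Colab runtime (a small convenience snippet, assuming the /content paths used above):

import os
from google.colab import files

# Confirm the bundle exists, then download it to the local machine.
task_path = "/content/gemma3.task"
print(f"{os.path.getsize(task_path) / (1024 * 1024):.1f} MB")
files.download(task_path)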

The full notebook with all the steps can be found here.

An Android project that can immediately use the generated .task file is this one.

Conclusion

MediaPipe Solutions offers a comprehensive suite of tools that make it easy to integrate AI and ML into applications across multiple platforms. Among these, the LLM Inference API enables large language models (LLMs) to run fully on-device, unlocking real-time capabilities without relying on the cloud.

To support this, we showcased the creation of the .task bundle format, MediaPipe’s deployable package for on-device inference. Using steps that run seamlessly in Colab notebooks, we outlined the full pipeline for converting a pre-trained language model from Hugging Face into a .task file. The workflow involves four stages: setting up the environment with dependencies such as tensorflow, ai-edge-torch, and mediapipe; downloading the model using a Hugging Face token; converting it to an optimized TensorFlow Lite (.tflite) format with quantization for mobile efficiency; and finally, bundling the model with its tokenizer into a .task file using MediaPipe’s bundler. This last step also defines critical metadata, such as the prompt format and start/stop tokens, to ensure correct inference behavior. The resulting .task file enables seamless integration into GenAI applications on mobile and edge devices.

