PaliGemma on Android using Hugging Face API

Introduction

At Google I/O 2024, Google unveiled a new addition to the Gemma family: PaliGemma, alongside several new variants of the Gemma model. I had been eagerly anticipating the release of an open vision language model from Google, and the arrival of PaliGemma was an exciting development.

Why is this significant? Unlike the Gemini models, which require API keys, Gemma models are completely open and provide a perfect opportunity for developers to build exciting projects. While we’re already familiar with Gemma’s impressive text understanding capabilities, it’s time to explore what PaliGemma offers in terms of vision capabilities.

Before we start, let’s understand the technical details starting with its architecture.

PaliGemma Architecture

PaliGemma, a 3B parameter VLM, is based on the SigLIP vision model and Gemma as the underlying language model. It supports a range of tasks, including image captioning, visual question-answering, optical character recognition (OCR), zero-shot object detection, and segmentation.

SigLIP, which stands for Sigmoid Loss for Language Image Pre-Training, is an enhanced version of CLIP. It replaces the softmax loss function with the sigmoid loss function, leading to improved accuracy and performance.
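
To make that difference concrete, here is a rough PyTorch sketch of the pairwise sigmoid loss described in the SigLIP paper. The embedding sizes and the temperature/bias values are placeholders, not the actual training configuration.

import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # img_emb, txt_emb: (N, D) L2-normalized embeddings; t and b are learnable in the real model.
    logits = img_emb @ txt_emb.T * t + b            # (N, N) pairwise image-text similarities
    labels = 2 * torch.eye(len(img_emb)) - 1        # +1 for matched pairs (diagonal), -1 otherwise
    # Every pair gets its own independent sigmoid (binary) objective instead of a softmax over the batch.
    return -F.logsigmoid(labels * logits).sum(dim=-1).mean()

img = F.normalize(torch.randn(4, 16), dim=-1)
txt = F.normalize(torch.randn(4, 16), dim=-1)
print(siglip_loss(img, txt))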

Below is the architecture of a typical Vision Language Model.

Vision Language Model Architecture
  • Image Encoder: Encodes the input image into a dense vector representation that captures its visual information. PaliGemma uses the SigLIP image encoder.
  • Multimodal Projector: A fully connected layer that projects high-dimensional encoded vectors into a common feature space, enabling the model to integrate visual and textual data for various tasks.
  • Text Decoder: The base language model responsible for generating the final textual output. In the case of PaliGemma, this is Gemma 2B.
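
To see how these three pieces fit together, here is a deliberately simplified sketch of the data flow. The linear layers and dimensions are placeholders standing in for the real SigLIP encoder and Gemma decoder.

import torch
import torch.nn as nn

vision_dim, text_dim, vocab_size = 1152, 2048, 257152    # illustrative sizes only

image_encoder = nn.Linear(3 * 224 * 224, vision_dim)     # stand-in for the SigLIP image encoder
projector = nn.Linear(vision_dim, text_dim)              # multimodal projector
text_decoder = nn.Linear(text_dim, vocab_size)           # stand-in for the Gemma text decoder

image = torch.rand(1, 3 * 224 * 224)                     # flattened input image
prompt_embeds = torch.rand(1, 8, text_dim)               # embedded text prompt (8 tokens)

# Encode the image, project it into the text embedding space, and prepend it to the prompt tokens.
image_tokens = projector(image_encoder(image)).unsqueeze(1)    # (1, 1, text_dim)
sequence = torch.cat([image_tokens, prompt_embeds], dim=1)     # (1, 9, text_dim)
logits = text_decoder(sequence)                                # next-token logits -> generated text
print(logits.shape)                                            # torch.Size([1, 9, 257152])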

Now that you understand the basics of how VLMs work, let’s move on to practical implementation.

In this article, I’ll walk you through some interesting examples of PaliGemma for various tasks. By the end, you’ll have an Android app that lets you explore the PaliGemma model right at your fingertips.

Alright, let’s get into action.

PaliGemma on Colab

I’ve created a repository of Colab notebooks to try out PaliGemma. Before you can use them, you need access to the model on Hugging Face 🤗.

Clone the GitHub repository below and explore the notebooks one by one.

# Clone the repository.
git clone https://github.com/NSTiwari/PaliGemma.git

# Install dependencies and libraries.
!pip install bitsandbytes transformers accelerate peft -q
!pip install -U "huggingface_hub[cli]"
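
Since access to the PaliGemma weights is gated on Hugging Face, authenticate with your access token before loading the model; in Colab, for example:

# Authenticate with your Hugging Face token so the gated PaliGemma weights can be downloaded.
from huggingface_hub import notebook_login
notebook_login()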

Import dependencies and load the pre-trained PaliGemma model.

# Import dependencies.
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)
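
The notebooks also install bitsandbytes; if the Colab GPU runs short on memory, the same checkpoint can optionally be loaded in 4-bit. A minimal sketch using the transformers quantization config (not something the code above requires):

# Optional: load PaliGemma with 4-bit quantization to reduce GPU memory usage.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
# With device_map="auto" the weights are already placed on the GPU, so skip the later model.to(device) call.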

Pass input image and text prompt to the model.

from PIL import Image, ImageDraw, ImageFont

# Configure input image and text prompt.
input_image = "input_image.jpg" # @param {type: "string"}
input_img = Image.open(input_image)

prompt = "detect person, car" # @param {type: "string"}

# Pass the input image and prompt to PaliGemma.
inputs = processor(text=prompt, images=input_img,
                   padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
model.to(device)
inputs = inputs.to(dtype=model.dtype)

# Get model response.
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

# Parse the model response for further processing.
paligemma_response = processor.decode(output[0], skip_special_tokens=True)[len(prompt):].lstrip("\n")

Zero-shot Object Detection

PaliGemma has an interesting capability for zero-shot object detection. It accepts prompts in the format: detect [object-name], with each object name separated by a semicolon. It then returns the bounding box of each detected object as four <loc[value]> location tokens, with coordinates normalized to an image size of 1024 x 1024.
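
The raw response therefore needs a little post-processing before the boxes can be drawn. Below is a minimal parsing sketch, assuming the usual y_min, x_min, y_max, x_max ordering of the location tokens; the helper name and example values are only for illustration.

import re

def parse_detections(response: str, img_width: int, img_height: int):
    # Each detection looks like: <loc....><loc....><loc....><loc....> label
    boxes = []
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
    for y1, x1, y2, x2, label in re.findall(pattern, response):
        boxes.append({
            "label": label.strip(),
            # Location tokens are normalized to a 1024 x 1024 grid; rescale to the real image size.
            "box": (int(x1) / 1024 * img_width, int(y1) / 1024 * img_height,
                    int(x2) / 1024 * img_width, int(y2) / 1024 * img_height),
        })
    return boxes

print(parse_detections("<loc0100><loc0200><loc0800><loc0900> person", 640, 480))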

Prompt: Detect person; phone; bottle

Zero-shot Object Detection using Python and OpenCV

Reference Expression Segmentation

Another interesting feature of PaliGemma is reference expression segmentation. It uses a prompt format similar to object detection: segment [object-name], with each object separated by a semicolon.

The response contains, for each object, four <loc[value]> tokens describing its bounding box followed by sixteen <seg[value]> tokens encoding its segmentation mask.
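
For illustration, here is a sketch (assuming the layout above; the helper name is hypothetical) that splits the response per object. Turning the <seg> token indices into an actual pixel mask additionally requires the separate mask decoder from Google's big_vision codebase.

import re

def parse_segments(response: str):
    # Each object: four <loc....> box tokens, sixteen <seg...> mask tokens, then the label.
    objects = []
    pattern = r"((?:<loc\d{4}>){4})((?:<seg\d{3}>){16})\s*([^;<]+)"
    for locs, segs, label in re.findall(pattern, response):
        objects.append({
            "label": label.strip(),
            "box_tokens": [int(v) for v in re.findall(r"<loc(\d{4})>", locs)],   # y_min, x_min, y_max, x_max on a 1024 grid
            "mask_tokens": [int(v) for v in re.findall(r"<seg(\d{3})>", segs)],  # indices into the mask codebook
        })
    return objects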

Prompt: Segment person; mug; book

Reference Image Segmentation using Python and OpenCV

Get the complete code here: https://github.com/NSTiwari/PaliGemma

PaliGemma on Android

While Colab notebooks are a great way to play with PaliGemma, they aren’t always practical, especially when you don’t have your laptop available.

Sagar Malhotra, Savio Rodrigues and I thought of creating a small project that allows you to explore PaliGemma directly on your Android phone.

Note: This isn’t an on-device deployment of PaliGemma; rather, the Android app runs inference remotely through a custom Hugging Face 🤗 REST API hosted on a Python server.

Pipeline

In this project, we’ve developed a pipeline that sends an image and a text prompt from an Android app to a Django server. The server then communicates with the Big_Vision HF Space via the Gradio Client API, processes the model’s response, and returns the results to the Android app for rendering.

Let’s break down the steps to understand the tools and approaches we used to make this whole thing work.

  • Retrofit2 and OkHttp: Used for sending HTTP POST requests from the Android app to the Python server (a Python sketch of an equivalent request follows this list).
interface CoordinatesModelApi {

    @POST("/api/detect")
    @Multipart
    suspend fun getCoordinatesModel(
        @Part("prompt") text: RequestBody?,
        @Part("width") width: RequestBody?,
        @Part("height") height: RequestBody?,
        @Part image: MultipartBody.Part,
    ): Response<CoordinatesModel> // CoordinatesModel: the app's response data class (name assumed)

    companion object {
        private val client: OkHttpClient =
            OkHttpClient
                .Builder()
                .connectTimeout(120, TimeUnit.SECONDS)
                .writeTimeout(240, TimeUnit.SECONDS)
                .readTimeout(240, TimeUnit.SECONDS)
                .build()

        val instance by lazy {
            Retrofit.Builder()
                .baseUrl("https://paligemma.onrender.com/")
                .addConverterFactory(GsonConverterFactory.create())
                .client(client)
                .build()
                .create(CoordinatesModelApi::class.java)
        }
    }
}
  • Django: Employed to create the Python server.
  • Gradio Client: Utilized for sending API requests to Big_Vision HF Spaces.
@api.post('/detect')
def detect(request, prompt: Form[str], image: File[UploadedFile], width: Form[int], height: Form[int]):

    client = Client("big-vision/paligemma")
    prompt_obj = ImageDetection.objects.create(
        prompt=prompt,
        image=image
    )
    cwd = pathlib.Path(os.getcwd())
    image_path = pathlib.Path(prompt_obj.image.url[1:])  # Skip the leading slash so pathlib doesn't treat it as an absolute path.
    img_path = pathlib.Path(cwd, image_path)
    media_path = os.getcwd() + '/media/images/'

    # Resize the image to the width and height sent by the app.
    img = Image.open(img_path)
    img = img.convert('RGB')
    resized_img = img.resize((width, height), Image.Resampling.LANCZOS)
    resized_img_path = media_path + 'resized_' + str(image)
    resized_img.save(resized_img_path)

    result = client.predict(
        handle_file(resized_img_path),   # Input image
        prompt,                          # 'Prompt' Textbox component
        "paligemma-3b-mix-224",          # 'Model' Dropdown component
        "greedy",                        # 'Decoding' Dropdown component
        api_name="/compute"
    )

    # Post-process the model response further.

Note: This project relies on the Big_Vision HF Spaces. If this Space is unavailable, the app will not get a response to its API requests.

  • Render: Serves as the hosting platform for the Django server 🚀.
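
Before wiring everything into the app, the endpoint can be exercised straight from Python with a multipart request that mirrors what Retrofit sends. A sketch, assuming the /api/detect path and Render host from the snippets above (adjust to your own deployment):

import requests

url = "https://paligemma.onrender.com/api/detect"
with open("input_image.jpg", "rb") as f:
    response = requests.post(
        url,
        data={"prompt": "detect person; car", "width": 640, "height": 480},  # form fields expected by the Django view
        files={"image": ("input_image.jpg", f, "image/jpeg")},               # image sent as a multipart file
        timeout=240,
    )
print(response.status_code, response.json())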

Based on the prompt type (detection, captioning, or visual Q&A), the model response is post-processed and then sent back to the Android app, where the results are rendered on screen.

Here’s how the final app looks:

PaliGemma Android App — [Demo 1]
PaliGemma Android App — [Demo 2]

Note: Currently, we’ve added support for image captioning, visual question-answering, and zero-shot object detection on the Android app. Segmentation is still under development.

You can find the APK, along with the complete code for the Android app and Python server, at: https://github.com/NSTiwari/PaliGemma-Android-HF

We hope you liked this project and learned something valuable. If you appreciated our work, please consider giving the repo a ⭐.

If you have any questions, feel free to reach out to us — [Sagar Malhotra], [Savio Rodrigues], [Nitin Tiwari].

Until then, stay tuned for more engaging blogs.

References & Resources

Acknowledgment

This project was developed during Google’s ML Developer Programs AI Sprint. Thanks to the MLDP team for providing Google Cloud credits to support this project.

