The Google Agent Development Kit (ADK) is a modular framework for building agentic workflows. While it integrates deeply with Gemini, its model-agnostic design allows you to plug in local models served via vLLM.
Getting the LiteLLM configuration right for this setup, however, can be surprisingly tricky. To save you the trial and error, this guide provides a battle-tested approach to connecting local inference with Google’s agent tooling.

In this tutorial, you’ll learn how to serve Gemma 3 (4B) with vLLM on high-performance GPUs (e.g., single or dual NVIDIA A40s) and integrate it with the Google ADK using a custom LiteLLM configuration.
I’ve also used this approach in a more complex project, whose source code can be found here; what we cover in this tutorial is just the prototype version.
🛠️ Phase 0: Installation & Environment Setup (using uv)
To handle the heavy-duty dependencies for vLLM and Google ADK efficiently, we’ll use uv, the extremely fast Python package manager and resolver.
1. Install uv and create the Environment
First, ensure uv is installed and create a synchronized virtual environment from the terminal.
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a new project environment
uv venv --python 3.11
source .venv/bin/activate
2. Install Dependencies
We need the Google ADK for the agent logic and vllm for the inference server.
# Install the core stack
uv pip install google-adk vllm huggingface_hub pydantic
🏗️ Phase 1: Serving Gemma 3 with vLLM
Gemma 3 4B is a highly capable multimodal model. To use it for agentic tasks (such as tool calling), we need to serve it via an OpenAI-compatible API and specify the tool parser.
1. Environment Preparation
Before launching the server, configure your environment to optimize for the A40s. On enterprise GPUs like the A40, standard P2P/IB communications can sometimes cause NCCL timeouts depending on your PCIe topology. We disable these for maximum stability.
export VLLM_USE_V1=0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
2. Download the Model Weights
As per the official Hugging Face documentation for Gemma 3, you should use the huggingface-cli to fetch the weights.
💡 Since Gemma 3 is a gated model, ensure you have accepted the license agreement on the model card and are authenticated in your terminal.
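A minimal sketch of the download step (authentication is required because the model is gated; the weights land in the Hugging Face cache, which vLLM reads automatically):

```shell
# Authenticate with your Hugging Face token (the account must have
# accepted the Gemma license on the model card)
huggingface-cli login

# Fetch the Gemma 3 4B instruction-tuned weights into the local HF cache
huggingface-cli download google/gemma-3-4b-it
```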
3. Launching the Inference Server
We will now launch the server with the downloaded weights. If you are running on two A40s, set --tensor-parallel-size to 2 to split the model across both GPUs.
💡 To ensure the server keeps running even if your SSH connection breaks, we will use nohup to run the process in the background. This is, of course, optional.
nohup vllm serve "google/gemma-3-4b-it" \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --distributed-executor-backend mp > gemma-4b.log 2>&1 &
Note: You can monitor the logs in real-time using tail -f gemma-4b.log.
4. Verification
To ensure everything is wired correctly, you can run a quick test curl from your terminal to the vLLM server independently of the ADK:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "What is the best city for tech in Germany?"}],
    "temperature": 0.7
  }'
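The same check can be scripted. Here is a minimal sketch using only the Python standard library; the endpoint and model name match the server launched above:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # adjust if you changed --port


def build_payload(prompt: str, model: str = "google/gemma-3-4b-it") -> dict:
    """Build the same request body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def chat(prompt: str) -> str:
    """POST the prompt to the local vLLM server and return the reply text."""
    request = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the vLLM server running:
# print(chat("What is the best city for tech in Germany?"))
```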
🐍 Phase 2: Connecting Google ADK to Local vLLM
The Google ADK uses LiteLlm as its bridge to various backends. To force the agent to use your local vLLM instance instead of a cloud API, we use a helper function to configure the model parameters, including Guided JSON for structured outputs.
1. The Model Configuration Function
Create a file named agent.py.
💡 Note the use of extra_body to enforce schema adherence via guided_json; this is completely optional if you’re just building a prototype.
import asyncio

from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import InMemoryRunner
from google.genai import types
from pydantic import BaseModel


# 1. Define the expected response structure for Guided Decoding
class AgentResponse(BaseModel):
    answer: str
    confidence: float


# 2. Custom LiteLLM configuration function
def _litellm(
    model: str,
    api_base: str,
    max_tokens: int,
    temperature: float = 0.7,
) -> LiteLlm:
    return LiteLlm(
        model=model,
        api_base=api_base,
        custom_llm_provider="openai",
        api_key="not-needed",  # vLLM doesn't require a key by default
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        extra_body={
            "guided_json": AgentResponse.model_json_schema(),
        },
    )


# 3. Initialize the Model
# Point this to your local vLLM instance
local_model = _litellm(
    model="google/gemma-3-4b-it",
    api_base="http://localhost:8000/v1",
    max_tokens=8192,
)

# 4. Initialize the Agent
# Note: Use 'instruction' (singular) for the system prompt
root_agent = LlmAgent(
    name="GemmaLocalAgent",
    model=local_model,
    instruction="You are a travel expert. Answer the user's query.",  # placeholder instruction
)


# 5. Simple runner for testing
# ADK agents are executed through a Runner; InMemoryRunner wires up an
# in-memory session service, which is enough for local experiments.
async def main():
    print("Querying local Gemma 3...")
    runner = InMemoryRunner(agent=root_agent)
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="local-user"
    )
    message = types.Content(
        role="user", parts=[types.Part(text="What is the capital of Germany?")]
    )
    async for event in runner.run_async(
        user_id=session.user_id, session_id=session.id, new_message=message
    ):
        if event.is_final_response():
            print(f"\nResponse: {event.content.parts[0].text}")


if __name__ == "__main__":
    asyncio.run(main())
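With the vLLM server from Phase 1 still running, the script can be launched directly from the activated environment:

```shell
# Run the standalone agent script against the local vLLM server
python agent.py

# Alternatively, if agent.py lives inside an agent package folder
# (e.g. my_agent/agent.py plus an __init__.py exposing root_agent),
# the ADK dev UI can load it for interactive testing:
# adk web
```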
Troubleshooting Tips
- OOM (Out of Memory): If you are running on a single A40, set --tensor-parallel-size 1. If the model still doesn’t fit, reduce --max-model-len.
- Tool Calling Issues: If the agent fails to trigger tools, check the vLLM logs. Gemma 3 requires specific prompt formatting; ensure your --tool-call-parser matches the model’s expected format.
- Port Conflicts: If port 8000 is taken, use --port 8080 in the vLLM command and update the api_base in your Python code accordingly.
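For the OOM case in particular, it helps to watch GPU memory while the weights load; a quick sketch using standard nvidia-smi flags:

```shell
# Print per-GPU memory usage every second while vLLM loads the model
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```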
Ready to build the API layer? Head over to my FastAPI + ADK deep dive to see how to connect everything we built today to a production-grade backend.
Although this tutorial focuses on Gemma 3 4B, you can use any open-source Hugging Face model supported by LiteLLM.
✨ If you liked the article, please subscribe to get my latest ones.
To get in touch, contact me on LinkedIn or via ashmibanerjee.com.
GenAI usage disclosure: GenAI models were used to check for grammar inconsistencies in the blog and to refine the text for greater clarity. The authors accept full responsibility for the content presented in this blog.
[Tutorial] Running Google’s Gemma-4b locally with Google ADK and dual A40 GPUs was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.