The Google Agent Development Kit (ADK) is a modular framework for building agentic workflows. While it integrates deeply with Gemini, its model-agnostic design allows you to plug in local models served via vLLM.
Getting the LiteLLM configuration right for this setup, however, can be surprisingly tricky. To save you the trial and error, this guide provides a battle-tested approach to connecting local inference with Google’s agent tooling.

In this tutorial, you’ll learn how to serve Gemma 3 (4B) with vLLM on high-performance GPUs (e.g., single or dual NVIDIA A40s) and integrate it with the Google ADK using a custom LiteLLM configuration.
I’ve also used this approach in a more complex project, whose source code can be found here; what we cover in this tutorial is just the prototype version.
🛠️ Phase 0: Installation & Environment Setup (using uv)
To handle the heavy-duty dependencies for vLLM and Google ADK efficiently, we’ll use uv, the extremely fast Python package manager and resolver.
1. Install uv and create the Environment
First, ensure uv is installed and create a synchronized virtual environment from the terminal.
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a new project environment
uv venv --python 3.11
source .venv/bin/activate
2. Install Dependencies
We need the Google ADK for the agent logic and vllm for the inference server.
# Install the core stack
uv pip install google-adk vllm huggingface_hub pydantic
🏗️ Phase 1: Serving Gemma 3 with vLLM
Gemma 3 4B is a highly capable multimodal model. To use it for agentic tasks (such as tool calling), we need to serve it via an OpenAI-compatible API and specify the tool parser.
1. Environment Preparation
Before launching the server, configure your environment to optimize for the A40s. On enterprise GPUs like the A40, standard P2P/IB communications can sometimes cause NCCL timeouts depending on your PCIe topology. We disable these for maximum stability.
export VLLM_USE_V1=0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
2. Download the Model Weights
As per the official Hugging Face documentation for Gemma 3, you should use the huggingface-cli to fetch the weights.
💡 Since Gemma 3 is a gated model, ensure you have accepted the license agreement on the model card and are authenticated in your terminal.
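A minimal sketch of the download step (authentication is required because the model is gated; the weights land in the Hugging Face cache, which vLLM reads automatically):

```shell
# Authenticate with your Hugging Face token (the account must have
# accepted the Gemma license on the model card)
huggingface-cli login

# Fetch the Gemma 3 4B instruction-tuned weights into the local HF cache
huggingface-cli download google/gemma-3-4b-it
```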
3. Launching the Inference Server
We will now launch the server with the downloaded weights. If you are running on two A40s, set --tensor-parallel-size to 2 to split the model across both GPUs.
💡 To ensure the server keeps running even if your SSH connection breaks, we will use nohup to run the process in the background. This is, of course, optional.
nohup vllm serve "google/gemma-3-4b-it" \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --distributed-executor-backend mp > gemma-4b.log 2>&1 &
Note: You can monitor the logs in real-time using tail -f gemma-4b.log.
4. Verification
To ensure everything is wired correctly, you can run a quick test curl from your terminal to the vLLM server independently of the ADK:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [{"role": "user", "content": "What is the best city for tech in Germany?"}],
    "temperature": 0.7
  }'
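The same check can be scripted. Here is a minimal sketch using only the Python standard library; the endpoint and model name match the server launched above:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # adjust if you changed --port


def build_payload(prompt: str, model: str = "google/gemma-3-4b-it") -> dict:
    """Build the same request body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def chat(prompt: str) -> str:
    """POST the prompt to the local vLLM server and return the reply text."""
    request = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With the vLLM server running:
# print(chat("What is the best city for tech in Germany?"))
```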
🐍 Phase 2: Connecting Google ADK to Local vLLM
The Google ADK uses LiteLlm as its bridge to various backends. To force the agent to use your local vLLM instance instead of a cloud API, we use a helper function to configure the model parameters, including Guided JSON for structured outputs.
1. The Model Configuration Function
Create a file named agent.py.
💡 Note the use of extra_body to enforce schema adherence via guided_json; this is completely optional if you’re just building a prototype.
import asyncio

from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import InMemoryRunner
from google.genai import types
from pydantic import BaseModel


# 1. Define the expected response structure for Guided Decoding
class AgentResponse(BaseModel):
    answer: str
    confidence: float


# 2. Custom LiteLLM configuration function
def _litellm(
    model: str,
    api_base: str,
    max_tokens: int,
    temperature: float = 0.7,
) -> LiteLlm:
    return LiteLlm(
        model=model,
        api_base=api_base,
        custom_llm_provider="openai",
        api_key="not-needed",  # vLLM doesn't require a key by default
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        extra_body={
            "guided_json": AgentResponse.model_json_schema(),
        },
    )


# 3. Initialize the Model
# Point this to your local vLLM instance
local_model = _litellm(
    model="google/gemma-3-4b-it",
    api_base="http://localhost:8000/v1",
    max_tokens=8192,
)

# 4. Initialize the Agent
# Note: Use 'instruction' (singular) for the system prompt
root_agent = LlmAgent(
    name="GemmaLocalAgent",
    model=local_model,
    instruction="You are a travel expert. Answer the user's query.",  # placeholder instruction
)


# 5. Simple runner for testing
# ADK agents are executed through a Runner; InMemoryRunner wires up an
# in-memory session service, which is enough for local experiments.
async def main():
    print("Querying local Gemma 3...")
    runner = InMemoryRunner(agent=root_agent)
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="local-user"
    )
    message = types.Content(
        role="user", parts=[types.Part(text="What is the capital of Germany?")]
    )
    async for event in runner.run_async(
        user_id=session.user_id, session_id=session.id, new_message=message
    ):
        if event.is_final_response():
            print(f"\nResponse: {event.content.parts[0].text}")


if __name__ == "__main__":
    asyncio.run(main())
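With the vLLM server from Phase 1 still running, the script can be launched directly from the activated environment:

```shell
# Run the standalone agent script against the local vLLM server
python agent.py

# Alternatively, if agent.py lives inside an agent package folder
# (e.g. my_agent/agent.py plus an __init__.py exposing root_agent),
# the ADK dev UI can load it for interactive testing:
# adk web
```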
Troubleshooting Tips
- OOM (Out of Memory): If you are running on a single A40, set --tensor-parallel-size 1. If the model still doesn’t fit, reduce --max-model-len.
- Tool Calling Issues: If the agent fails to trigger tools, check the vLLM logs. Gemma 3 requires specific prompt formatting; ensure your --tool-call-parser matches the model’s expected format.
- Port Conflicts: If port 8000 is taken, use --port 8080 in the vLLM command and update the api_base in your Python code accordingly.
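For the OOM case in particular, it helps to watch GPU memory while the weights load; a quick sketch using standard nvidia-smi flags:

```shell
# Print per-GPU memory usage every second while vLLM loads the model
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```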
Ready to build the API layer? Head over to my FastAPI + ADK deep dive to see how to connect everything we built today to a production-grade backend.
Although this tutorial focuses on Gemma 3 4B, you can use any open-source Hugging Face model supported by LiteLLM.
✨ If you liked the article, please subscribe to get my latest ones.
To get in touch, contact me on LinkedIn or via ashmibanerjee.com.
GenAI usage disclosure: GenAI models were used to check for grammar inconsistencies in the blog and to refine the text for greater clarity. The authors accept full responsibility for the content presented in this blog.
[Tutorial] Running Google’s Gemma-4b locally with Google ADK and dual A40 GPUs was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.