
TL;DR: We built Gemini PowerPoint Sage, a production-ready, 12-agent AI system that transforms static slides into engaging, multilingual video lessons. Powered by Google’s Agent Development Kit (ADK) and Gemini models, it automates speaker notes, generates visuals, synthesizes “voice-acted” audio, and creates videos in styles ranging from Corporate to Cyberpunk — all while supporting 16+ languages.
🎓 The Learning Challenge We All Face
We live in a globalized world, yet our educational materials often remain stuck in the past.
- For Educators: You have brilliant knowledge, but you’re bogged down by slide design and writing speaker notes, leaving little energy for actual teaching.
- For Students: If you don’t speak the language of instruction fluently, you’re left behind. Even translated content often feels robotic and dry.
- For Global Teams: Training materials lose their “soul” when translated manually. A fun, engaging onboarding deck in English becomes a dry policy document in Japanese.
Demo: Hong Kong Comic Style, Cyber Security Essentials: Information Security Concepts Lecture
English
https://medium.com/media/e58e687b9e64ebe1039e4da5216cb525/href
Hong Kong Cantonese
https://medium.com/media/43f95124864445145276a60df5eab449/href
Mandarin (華語/普通話/國語)
https://medium.com/media/be42c22b82af35b3ec456e7b8b309509/href
Playlist 港漫講IT
https://medium.com/media/46bc0c76196a372b58cb2475a9fb1c2d/href
How does it work?

Phase 1: Initialization & Context Building
1. Style Configuration: The system loads a style.yaml file to define the presentation’s thematic identity (visual aesthetic, tone of voice, and persona).
2. Asset Ingestion: The pipeline reads the source PPTX and generates a PDF reference. The PDF is used to extract high-fidelity images of each slide for computer vision analysis.
3. Global Analysis: Gemini processes all slide images and existing speaker notes. It also uses Grounding with Google Search to cross-reference facts, which helps keep the information up to date.
4. Context Generation: A “Global Context” summary is created to maintain narrative consistency and flow throughout the entire presentation.
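To make the style configuration step concrete, here is a minimal sketch of how a parsed style.yaml might be represented in Python. The field names (`name`, `visual_aesthetic`, `tone_of_voice`, `persona`) are assumptions based on the three aspects listed above, not the project's actual schema, and the dictionary stands in for whatever a YAML parser would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleConfig:
    """Hypothetical in-memory form of one style.yaml file."""
    name: str
    visual_aesthetic: str
    tone_of_voice: str
    persona: str

    @classmethod
    def from_dict(cls, raw: dict) -> "StyleConfig":
        # Fail early if the style file is missing a required key.
        required = ("name", "visual_aesthetic", "tone_of_voice", "persona")
        missing = [key for key in required if key not in raw]
        if missing:
            raise ValueError(f"style config missing keys: {missing}")
        return cls(**{key: str(raw[key]) for key in required})

    def as_guidelines(self) -> str:
        # Flatten the style into text that can be handed to a prompt.
        return (
            f"Style: {self.name}\n"
            f"Visuals: {self.visual_aesthetic}\n"
            f"Tone: {self.tone_of_voice}\n"
            f"Persona: {self.persona}"
        )


# Stand-in for the dictionary a YAML parser would return (assumed values).
cyberpunk = StyleConfig.from_dict({
    "name": "Cyberpunk",
    "visual_aesthetic": "neon-lit, high-contrast cityscapes",
    "tone_of_voice": "confident and slightly edgy",
    "persona": "a street-smart tech insider",
})
print(cyberpunk.as_guidelines())
```

Validating the keys up front means a malformed style file fails before any model calls are made.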
Phase 2: Content & Visual Generation
5. Meta-Prompting (Writer): The system dynamically generates a Styled Writer Prompt by merging the style.yaml rules with the base writer instructions.
6. Narrative Creation: Utilizing the Supervisor-Worker Pattern, the system generates speaker notes for each slide. It triangulates data from the slide image, original notes, and Global Context to ensure accuracy and engagement.
7. Visual Consistency: A Styled Image Prompt is generated. The system creates new visuals for each slide by analyzing the current content and referencing previous slide images (context retention) to ensure the design style remains consistent across the deck.
8. Multi-language Translation: The system translates the styled speaker notes into multiple languages and regenerates each slide image in the selected language.
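The context-retention idea in step 7 can be sketched as a sliding window over previously generated images. This is an illustration only: `build_image_requests`, the default window of 2, and the request fields are assumptions, and the actual image model call is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRequest:
    """Hypothetical payload for one slide image generation call."""
    slide_index: int
    prompt: str
    reference_images: list[str] = field(default_factory=list)


def build_image_requests(slide_notes, style_prompt, window=2):
    """Build one request per slide, referencing the last `window` images."""
    generated = []
    requests = []
    for index, notes in enumerate(slide_notes, start=1):
        # Earlier outputs act as visual references for consistency.
        references = generated[-window:] if window > 0 else []
        requests.append(
            ImageRequest(
                slide_index=index,
                prompt=f"{style_prompt}\n\nSlide content:\n{notes}",
                reference_images=references,
            )
        )
        # In a real pipeline this path would come from the image model.
        generated.append(f"slide_{index:02d}.png")
    return requests


requests = build_image_requests(
    ["Intro to security", "The CIA triad", "Threat actors"],
    style_prompt="Hong Kong comic style, bold ink lines",
)
for request in requests:
    print(request.slide_index, request.reference_images)
```

A bounded window keeps each request small while still passing recent images as style references.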
Phase 3: Multimedia Synthesis
9. Style-Aware TTS: A Styled TTS Prompt directs the audio generation. The system converts the new speaker notes into emotive MP3 speech that matches the presentation’s persona.
10. Video Assembly: The pipeline renders each slide as a video segment (MP4), synchronizing the generated image with its specific audio track.
11. Final Production: All segments are stitched together into the final, styled presentation video.
The whole process then repeats for each style file.
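Steps 10 and 11 can be sketched with ffmpeg, which is an assumption here because the article does not name the tool used for rendering. The helper names and file paths are hypothetical. The functions only build the command lines and the concat list, so nothing is executed.

```python
from pathlib import Path


def segment_command(image: Path, audio: Path, output: Path) -> list[str]:
    """Build an ffmpeg command pairing one still image with one audio track."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),  # repeat the still image as video
        "-i", str(audio),
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",           # widely compatible pixel format
        "-c:a", "aac",
        "-shortest",                     # stop when the audio ends
        str(output),
    ]


def concat_list(segments: list[Path]) -> str:
    """Return the text of a file list for ffmpeg's concat demuxer."""
    return "".join(f"file '{segment.as_posix()}'\n" for segment in segments)


def concat_command(list_file: Path, output: Path) -> list[str]:
    """Build an ffmpeg command that stitches segments without re-encoding."""
    return [
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", str(list_file), "-c", "copy", str(output),
    ]


segments = [Path(f"out/slide_{n:02d}.mp4") for n in (1, 2)]
print(segment_command(Path("out/slide_01.png"), Path("out/slide_01.mp3"), segments[0]))
print(concat_list(segments), end="")
print(concat_command(Path("out/segments.txt"), Path("out/final.mp4")))
```

On a machine with ffmpeg installed, each command list could be passed to `subprocess.run(cmd, check=True)`. Stream copying in the final step only works when every segment shares the same codec settings, which holds here because all segments come from the same `segment_command`.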
🌟 The AI Production Studio
We believe learning should be fun, accessible, and culturally relevant. Whether you are teaching quantum physics or quarterly goals, the content should capture the imagination.
To achieve this, we didn’t just build a chatbot; we built a Digital Production Studio. We created Gemini PowerPoint Sage, a sophisticated orchestration of 12 specialized AI agents working in harmony.
Think of it as a film crew:
- 🎬 The Supervisor directs the workflow.
- 🎭 The Storyteller writes engaging scripts.
- 🎨 The Designer creates stunning visuals.
- 🌍 The Translator culturally adapts the message.
- 🎥 The Editor stitches it all into a video.
🧠 The Architecture: Meet Your AI Dream Team
The Supervisor-Worker Pattern
Forget the “one AI to rule them all” approach. Complex tasks require specialists. Our architecture uses the Supervisor-Worker pattern via Google’s Agent Development Kit (ADK).
Instead of asking one overwhelmed model to “fix the presentation,” our Supervisor Agent orchestrates a precise, 5-step workflow for every single slide:
- 🔍 Audit: The Auditor Agent checks existing notes. If they are good, we keep them (saving time/cost).
- 👁️ Analyze: The Analyst Agent (using Gemini Vision) “looks” at the slide to understand context that text misses.
- ✍️ Write: The Writer Agent drafts a narrative based on the visual analysis and global context.
- 🎨 Design: The Designer Agent generates prompt-aligned visuals.
- 📤 Output: The Supervisor compiles the results without hallucinating extra commentary.
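As a plain-Python illustration of the five steps above, the sketch below wires stub functions together in the order the Supervisor follows. In the real system these steps are LLM agents coordinated through ADK. The function names, the `SlideResult` type, the word-count audit rule, and the choice to skip analysis when notes pass the audit are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SlideResult:
    notes: str
    image_prompt: str
    reused_existing_notes: bool


def process_slide(
    existing_notes: str,
    slide_image: str,
    global_context: str,
    audit: Callable[[str], bool],
    analyze: Callable[[str], str],
    write: Callable[[str, str], str],
    design: Callable[[str], str],
) -> SlideResult:
    """Run audit -> analyze -> write -> design for a single slide."""
    # 1. Audit: keep good notes and skip analysis and writing (assumed).
    if audit(existing_notes):
        notes, reused = existing_notes, True
    else:
        # 2. Analyze the slide image, then 3. write new notes.
        analysis = analyze(slide_image)
        notes, reused = write(analysis, global_context), False
    # 4. Design a visual from the final notes.
    image_prompt = design(notes)
    # 5. Output: return only the collected results, nothing extra.
    return SlideResult(notes, image_prompt, reused)


# Stub workers so the sketch runs without any model calls.
result = process_slide(
    existing_notes="",
    slide_image="slide_03.png",
    global_context="Intro course on information security",
    audit=lambda notes: len(notes.split()) >= 20,
    analyze=lambda image: f"chart found in {image}",
    write=lambda analysis, ctx: f"[{ctx}] Narration based on: {analysis}",
    design=lambda notes: f"Illustration for: {notes}",
)
print(result)
```

Passing the workers in as callables keeps the control flow testable without a model, which is one reason a supervisor-style design is convenient to reason about.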
Here is the actual Python code defining our Supervisor:
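from google.adk.agents import LlmAgent

# SUPERVISOR_PROMPT and tool_factory are defined elsewhere in the project.
supervisor_agent = LlmAgent(
    name="supervisor",
    model="gemini-2.5-flash",
    description="The orchestrator that manages the slide generation workflow",
    instruction=SUPERVISOR_PROMPT,
    # Each worker agent is exposed to the supervisor as a callable tool.
    tools=[
        tool_factory.create_auditor_tool(),
        tool_factory.create_analyst_tool(),
        tool_factory.create_writer_tool(),
        tool_factory.create_translator_tool(),
    ],
)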
🎨 The “Style Engine”: AI Inception
How do you make 12 different AI agents all adhere to a specific “Cyberpunk” or “Star Wars” aesthetic? Hardcoding prompts is a maintenance nightmare.
Our Innovation: The Prompt Rewriter Agent. We built a meta-agent that rewrites other agents’ instructions. Before processing starts, this agent analyzes your chosen style and reprograms the Writer, Designer, and Translator agents to embody that persona.
🔄 How It Works: AI Inception
At startup, this meta-agent:
1. Reads – base instructions for Writer, Designer, Translator agents
2. Analyzes – your chosen style (Cyberpunk, Star Wars, Corporate, etc.)
3. Rewrites – each agent’s core personality to embody that style
4. Deploys – the newly styled agents to process your presentation
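The four steps can be sketched as a small function that sends each base prompt through a rewriter and collects the styled versions. The `rewrite` callable stands in for the Prompt Rewriter Agent's model call, and the agent names and base prompts are placeholders rather than the project's real instructions.

```python
from typing import Callable

# Placeholder base instructions; the real prompts live in the project.
BASE_PROMPTS = {
    "writer": "Write clear speaker notes for the slide.",
    "designer": "Describe an image that illustrates the slide.",
    "translator": "Translate the speaker notes faithfully.",
}


def style_agents(
    base_prompts: dict[str, str],
    style_guidelines: str,
    rewrite: Callable[[str, str], str],
) -> dict[str, str]:
    """Return a styled prompt per agent; base prompts are left untouched."""
    return {
        agent: rewrite(prompt, style_guidelines)
        for agent, prompt in base_prompts.items()
    }


def fake_rewrite(base_prompt: str, style: str) -> str:
    # Stand-in for an LLM call; a real rewriter would weave the style
    # through the instructions rather than prefix it.
    return f"[{style}] {base_prompt}"


styled = style_agents(BASE_PROMPTS, "Gundam mecha-pilot briefing", fake_rewrite)
for agent, prompt in styled.items():
    print(f"{agent}: {prompt}")
```

Because the base prompts are never mutated, the same set can be restyled once per style file.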
Here’s the actual “prompt for prompts” code:
PROMPT_REWRITER_PROMPT = """You are an expert prompt engineer...
YOUR TASK:
Take a base agent prompt and style guidelines, then rewrite the prompt to
deeply integrate the style throughout the instructions.
REWRITING PRINCIPLES:
1. Deep Integration: Don't just append the style - weave it throughout
2. Contextual Placement: Insert style requirements where most relevant
3. Emphasis: Make style adherence feel mandatory, not optional...
"""
🤖 The Result: Style-Obsessed AI Agents
Choose “Gundam” style? Your Writer agent becomes obsessed with:
– ⚔️ Dramatic flair and philosophical musings
– 🤖 Mecha terminology and epic narratives
– 🌟 Heroic themes and technological wonder
Meanwhile, your Designer agent gets reprogrammed to demand:
– ⚡ High-contrast mecha aesthetics
– 🔥 Bold, angular visual elements
– 🎯 Futuristic color schemes
Every agent becomes a style specialist — automatically.
🎭 Style Transformation Examples
The original slide

Choose “Cyberpunk” style? Your AI narrator becomes a Night City edgerunner with street attitude:
Alright chooms, listen up — got a preem gig to discuss. These corpos been running legacy rigs while we’re packing nova cloud tech. Time to jack in and show ’em what real edgerunner innovation looks like!

Switch to “Gundam” style? The same content becomes an epic mecha pilot briefing:
Pilots, prepare for sortie! Our mobile suit specifications demonstrate superior performance metrics — 15% increased combat effectiveness. This is the power that will determine humanity’s future!

Star Wars

Hong Kong Comics

🎵 Audio & Video: Beyond Robotic Text-to-Speech
Accessibility isn’t just about translation; it’s about delivery. We engineered a Dual-Engine TTS (Text-to-Speech) System that acts as an AI voice actor.
The Logic
We use a TTS Style Adapter that analyzes the speaker notes and generates voice directions (e.g., “Speak with a confident, tech-savvy tone, emphasizing the data spike”).
- Gemini TTS Engine (Primary): Handles 19+ languages with full style awareness. It creates emotive, character-driven narration.
- Traditional TTS Engine (Fallback): Provides specialized support for Cantonese and acts as a rock-solid backup for production reliability.
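Here is a minimal sketch of the dual-engine routing, assuming the fallback engine is preferred for Cantonese and is also used whenever the primary engine raises an error. The engine callables, the `yue` language code, and the routing rule are illustrative assumptions; the article does not show the real selection logic.

```python
from typing import Callable

TtsEngine = Callable[[str, str, str], bytes]  # (text, language, voice_prompt)

# Assumed set of languages routed straight to the traditional engine.
FALLBACK_FIRST_LANGUAGES = {"yue"}  # "yue" is the ISO 639-3 code for Cantonese


def synthesize(
    text: str,
    language: str,
    voice_prompt: str,
    primary: TtsEngine,
    fallback: TtsEngine,
) -> tuple[str, bytes]:
    """Return (engine_name, audio_bytes), falling back if the primary fails."""
    if language in FALLBACK_FIRST_LANGUAGES:
        return "fallback", fallback(text, language, voice_prompt)
    try:
        return "primary", primary(text, language, voice_prompt)
    except Exception:
        # Any primary failure degrades to the fallback engine.
        return "fallback", fallback(text, language, voice_prompt)


def flaky_primary(text: str, language: str, voice_prompt: str) -> bytes:
    # Stub engine that simulates an outage for one language.
    if language == "ja":
        raise RuntimeError("simulated outage")
    return f"styled:{text}".encode()


def plain_fallback(text: str, language: str, voice_prompt: str) -> bytes:
    # Stub engine that ignores the voice prompt.
    return f"plain:{text}".encode()


for lang in ("en", "yue", "ja"):
    print(lang, synthesize("Hello", lang, "confident tone", flaky_primary, plain_fallback))
```

Returning the engine name alongside the audio makes it easy to log how often the fallback path is taken.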
🎭 Gemini TTS Engine: Style-Aware Voice Generation
The magic happens in our TTS Style Adapter — an AI system that analyzes your speaker notes and generates voice acting instructions:
Input (Boring):
This slide shows our quarterly performance metrics with a 15% revenue increase.
Style Analysis:
– Content type: Business metrics
– Tone: Professional but engaging
– Theme: Cyberpunk (from presentation style)
Generated Voice Prompt:
Speak like a tech-savvy data analyst in a cyberpunk world. Use confident, slightly edgy tone with technical metaphors. Emphasize the 15% spike like it’s breaking through digital barriers.
Result: AI voice delivers with appropriate cyberpunk attitude and technical confidence.
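To show the shape of the adapter's input and output, here is a deterministic sketch that assembles a voice direction from a theme, a tone, and the notes. The real TTS Style Adapter is described as an AI system, so this template-based function, its names, and its persona table are assumptions for illustration only.

```python
import re

# Hypothetical persona hints per presentation theme.
PERSONAS = {
    "cyberpunk": "a tech-savvy data analyst in a cyberpunk world",
    "corporate": "a polished executive presenter",
}


def find_emphasis(notes: str) -> list[str]:
    """Pick out percentages in the notes as candidates for vocal emphasis."""
    return re.findall(r"\d+(?:\.\d+)?%", notes)


def build_voice_prompt(notes: str, theme: str, tone: str) -> str:
    """Assemble a voice direction string for the TTS engine."""
    persona = PERSONAS.get(theme.lower(), "an engaging lecturer")
    prompt = f"Speak like {persona}. Use a {tone} tone."
    figures = find_emphasis(notes)
    if figures:
        prompt += f" Emphasize {', '.join(figures)}."
    return prompt


notes = "This slide shows our quarterly performance metrics with a 15% revenue increase."
print(build_voice_prompt(notes, "Cyberpunk", "confident, slightly edgy"))
```

A model-based adapter would produce richer directions, but the inputs (notes plus style) and the output (a short voice prompt) have the same shape.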
🚀 Real-World Impact: Inclusive Education
The true power of Gemini PowerPoint Sage is seen in mixed-ability classrooms.
The Inclusive Classroom
Imagine a lecture hall with:
- Visually Impaired Students: Need rich audio descriptions of charts.
- International Students: Need content in their native language.
- Neurodiverse Students: Engage better with narrative-driven content.
Our Solution Delivers:
- 📝 Audio Descriptions: The Analyst Agent sees the charts and writes detailed descriptions for the TTS to read.
- 🌍 Native Language: The Translator Agent adapts content culturally, not just linguistically.
- 🎭 Engagement: The Style Engine turns dry lectures into stories, keeping attention high.
By automating this, we let educators focus on teaching, not formatting.
🏆 Conclusion
Gemini PowerPoint Sage demonstrates that AI in education is about more than just summarizing text. It’s about orchestrating a team of specialists to create immersive, accessible, and fun learning experiences.
By combining the coordination power of the ADK, the visual intelligence of Gemini on Google Cloud Vertex AI, and a creative multi-agent architecture, we are breaking down barriers to education, one slide at a time.
About the Author

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) @ IVE (Lee Wai Lee), where he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events. With his extensive knowledge and expertise, he has earned prestigious recognitions such as AWS AI Hero, Microsoft Azure AI MVP, and Google Developer Expert for AI & Google Cloud Platform.
#AISprintH2
🌍 Making Learning Fun and Accessible: How Gemini 3.0 Transforms Presentations for Global Education was originally published in Google Developer Experts on Medium.