
Conversation

mkmeral commented Oct 10, 2025

Add Gemini Live API Support for Bidirectional Streaming

Description

This PR adds support for Google's Gemini Live API as a bidirectional streaming model provider, enabling real-time audio conversations with native audio input/output, image/video input, and automatic transcription.

Key Features

Gemini Live Model Provider (gemini_live.py)

  • Uses official google-genai SDK for robust WebSocket communication
  • Native audio streaming with 16kHz input and 24kHz output
  • Real-time audio transcription of both input and output (see the receive-loop sketch after this list)
  • Image/video frame input support for multimodal conversations
  • Automatic VAD-based interruption handling
  • Tool calling integration
  • Message history support
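
To make the event flow concrete, here is a minimal receive-loop sketch. The receive() iterator and the event keys (audioOutput, transcript, interruptionDetected) are assumptions inferred from the feature list above, and play_audio/stop_playback are hypothetical placeholders, not SDK functions:

def play_audio(chunk: bytes) -> None:
    """Placeholder: feed a 24 kHz PCM chunk to the speaker."""

def stop_playback() -> None:
    """Placeholder: flush queued audio after a VAD interruption."""

async def handle_events(agent) -> None:
    """Route session events by type (event keys are assumed, not verified)."""
    async for event in agent.receive():
        if "audioOutput" in event:
            play_audio(event["audioOutput"])           # model speech, 24 kHz PCM
        elif "transcript" in event:
            print("Transcript:", event["transcript"])  # input or output transcription
        elif "interruptionDetected" in event:
            stop_playback()                            # user spoke over the model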

Enhanced Bidirectional Streaming

  • Added ImageInputEvent type for sending images/video frames
  • Added TranscriptEvent type for audio transcriptions (separate from text output)
  • Extended BidirectionalAgent.send() to accept text, audio, and image inputs (as sketched after this list)
  • Updated abstract BidirectionalModelSession interface with send_image_content()
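
A hypothetical call site for the extended send(); the dict shapes below mirror the audio and ImageInputEvent type names in this PR but are assumptions, not verified schemas:

async def send_multimodal(agent, pcm_chunk: bytes, jpeg_bytes: bytes) -> None:
    """Feed one text, audio, and image input through the unified send()."""
    await agent.send("What do you see?")   # plain text

    await agent.send({                     # assumed audio event shape
        "audioData": pcm_chunk,            # raw 16 kHz PCM, the input rate per this PR
        "format": "pcm",
        "sampleRate": 16000,
    })

    await agent.send({                     # assumed ImageInputEvent shape
        "imageData": jpeg_bytes,           # one JPEG-encoded camera frame
        "mimeType": "image/jpeg",
    })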

Test Suite Enhancements

  • Updated test to support both Gemini Live and Nova Sonic
  • Added camera capture for real-time video frame streaming at 1 FPS (see the capture sketch after this list)
  • Demonstrates audio + video multimodal interaction
  • Falls back to Nova Sonic if no Gemini API key provided
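
A rough sketch of the 1 FPS capture loop: the OpenCV calls are standard, while the send() payload shape is an assumption based on the ImageInputEvent type added in this PR:

import asyncio

import cv2

async def stream_camera(agent, device: int = 0) -> None:
    """Capture webcam frames and forward them to the session at ~1 FPS."""
    cap = cv2.VideoCapture(device)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            encoded, jpeg = cv2.imencode(".jpg", frame)  # JPEG-encode the frame
            if encoded:
                await agent.send({"imageData": jpeg.tobytes(),
                                  "mimeType": "image/jpeg"})
            await asyncio.sleep(1.0)                     # throttle to ~1 FPS
    finally:
        cap.release()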

Implementation Details

The implementation follows the same architectural patterns as Nova Sonic:

  • Provider-agnostic event conversion
  • Clean separation between session management and model interface
  • Simplified configuration: all Gemini Live API parameters pass through directly to the SDK
  • Proper async/await patterns, with a context manager for the connection lifecycle (sketched after this list)
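
For reference, the underlying google-genai pattern the provider wraps is an async context manager; this is a simplified sketch of that SDK usage, not the provider's actual code:

import asyncio

from google import genai

async def main() -> None:
    client = genai.Client(api_key="your-api-key")
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview-09-2025",
        config={"response_modalities": ["AUDIO"]},
    ) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello"}]}
        )
        async for message in session.receive():  # server events for this turn
            ...

asyncio.run(main())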

Configuration Example

from strands.experimental.bidirectional_streaming.models.gemini_live import GeminiLiveBidirectionalModel

model = GeminiLiveBidirectionalModel(
    model_id="gemini-2.5-flash-native-audio-preview-09-2025",
    api_key="your-api-key",
    params={
        "response_modalities": ["AUDIO"],
        "input_audio_transcription": {},   # Enable input transcription
        "output_audio_transcription": {},  # Enable output transcription
    }
)
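
A hypothetical way to put the configured model to work; the BidirectionalAgent constructor and start() call are assumptions based on this PR's description, not verified signatures:

from strands.experimental.bidirectional_streaming.agent.agent import BidirectionalAgent

async def run(model) -> None:
    agent = BidirectionalAgent(model=model)  # assumed constructor
    await agent.start()                      # assumed session start
    await agent.send("Hello!")               # text, audio, and images all go through send()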

Related Issues

Documentation PR

Type of Change

New feature

Testing

How have you tested the change?

  • Tested real-time audio conversations with Gemini Live API
  • Verified audio transcription (input and output) works correctly
  • Tested image/video frame streaming from camera
  • Verified tool calling integration
  • Tested message history support
  • Confirmed interruption handling via VAD
  • Verified fallback to Nova Sonic when no API key provided
  • Ran hatch fmt for code formatting

Test Environment

  • Python 3.12+
  • Dependencies: google-genai, pyaudio, opencv-python, pillow
  • Tested with GOOGLE_AI_API_KEY environment variable

Files Changed

  1. New: src/strands/experimental/bidirectional_streaming/models/gemini_live.py (501 lines)
  2. Modified: src/strands/experimental/bidirectional_streaming/agent/agent.py - Added image input support
  3. Modified: src/strands/experimental/bidirectional_streaming/models/bidirectional_model.py - Added abstract send_image_content() method
  4. Modified: src/strands/experimental/bidirectional_streaming/models/novasonic.py - Added stub for image input (not supported)
  5. Modified: src/strands/experimental/bidirectional_streaming/types/bidirectional_streaming.py - Added ImageInputEvent and TranscriptEvent types
  6. Modified: src/strands/experimental/bidirectional_streaming/tests/test_bidirectional_streaming.py - Enhanced test with Gemini Live and camera support

Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Murat Kaan Meral added 8 commits October 6, 2025 15:22
- Add input_audio_transcription and output_audio_transcription parameter pass-through in _build_live_config()
- These parameters enable real-time transcription of both user speech (input) and model audio responses (output)
- Remove debug logging and temporary debug files (gemini_live_events.jsonl, debug_transcripts.py)
- Clean up unused json import

The transcription parameters were being set in the test configuration but weren't being passed through to the SDK because _build_live_config() only handled specific parameters. Now transcription events will be properly emitted via the transcript event type.
Instead of cherry-picking specific parameters, just pass through all config from params directly to the SDK. This is simpler and more flexible - users can configure any Gemini Live API parameter without us having to explicitly handle each one.

The previous approach was unnecessarily complicated with manual parameter filtering.
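
As a rough illustration of that pass-through (the _build_live_config name comes from this commit message; forwarding params as keyword arguments to the SDK's LiveConnectConfig is an assumption):

from google.genai import types

def _build_live_config(params: dict | None) -> types.LiveConnectConfig:
    # Forward every user-supplied parameter to the SDK unchanged,
    # rather than cherry-picking known keys.
    return types.LiveConnectConfig(**(params or {}))
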
- Add proper error logging in close() method
- Remove empty line in send_tool_result() try block
- Add newline at end of file
- Improve code consistency
- Add GeminiLiveBidirectionalModel and GeminiLiveSession to models __init__.py
- Add ImageInputEvent and TranscriptEvent to types __init__.py
- Ensures new types and model are properly exported for external use

Owner commented:

suggestion: I think for testing purposes I would prefer a new test file dedicated to the Gemini model provider.
