Problem Statement
The Strands SDK's ContentBlock type currently supports image, video, and document content but lacks audio support. As multimodal AI models increasingly support audio input/output, this gap forces developers to use untyped workarounds, breaking type safety and SDK consistency.
The recently merged LlamaCpp provider (PR #585) demonstrates this limitation: to handle audio for models like Qwen2.5-Omni, it must cast content blocks to Dict[str, Any], losing the type safety that makes Strands reliable.
Proposed Solution
Extend the SDK's media types to include audio, following the pattern already established for image, video, and document content:
# src/strands/types/media.py
from typing import Literal, TypedDict

AudioFormat = Literal["wav", "mp3", "flac"]

class AudioSource(TypedDict):
    """Contains the content of audio data."""

    bytes: bytes

class AudioContent(TypedDict):
    """Audio to include in a message."""

    format: AudioFormat
    source: AudioSource

# src/strands/types/content.py
class ContentBlock(TypedDict, total=False):
    # ... existing fields ...
    audio: AudioContent  # Add alongside image, video, document
Use Case
This enhancement would benefit:
- Model Providers: Bedrock (Nova Sonic), LlamaCpp (Qwen2.5-Omni); see the provider-side sketch after this list
- Applications: Voice assistants, transcription services, audio analysis, real-time conversation
Alternative Solutions
No response
Additional Context
The LlamaCpp implementation currently handles audio as:
# Current workaround in llamacpp.py (excerpt; imports added here for context)
import base64
from typing import Any, Dict, cast

if "audio" in content:
    audio_content = cast(Dict[str, Any], content)  # Loss of type safety
    audio_data = base64.b64encode(audio_content["audio"]["source"]["bytes"])

With native support, all model providers could handle audio consistently and safely within the SDK's type system.