SteerMoE enables powerful audio-language models that understand both speech and text while preserving the full reasoning capabilities of large language models.
Unlike traditional approaches that compromise language understanding for audio processing, our method keeps the LLM completely frozen, ensuring your audio-language model maintains sophisticated textual inference, reasoning, and generation abilities while achieving state-of-the-art performance on audio understanding tasks.
Our models can:
- ✅ Transcribe speech with high accuracy (4.5% CER on LibriSpeech)
- ✅ Answer questions about audio (72.1% accuracy on ClothoAQA)
- ✅ Reason about audio content using the LLM's powerful inference
- ✅ Maintain full textual capabilities (the frozen LLM preserves all language understanding)
- ✅ Work across multiple languages (English, Chinese, etc.)
Problem: Traditional audio-language models fine-tune the LLM, which degrades its sophisticated language reasoning abilities.
Our Solution: Keep both the audio encoder AND the language decoder completely frozen. Train only a lightweight alignment module (~2M parameters) that bridges the two modalities.
Result: the best of both worlds, excellent audio understanding with fully preserved LLM reasoning.
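In practice, "frozen" simply means gradients are disabled for both backbones, so only the aligner's parameters are ever updated. A minimal sketch of that setup (module names such as `aligner` are illustrative, not the repository's API):

import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients so the module's weights never change during training."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# Usage (hypothetical names): both backbones frozen, only the aligner trains.
# whisper_encoder = freeze(whisper_encoder)
# llm_decoder     = freeze(llm_decoder)
# optimizer = torch.optim.AdamW(aligner.parameters(), lr=1e-2)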
LibriSpeech ASR (test-clean):
| Approach | CER ↓ | WER ↓ | Textual Reasoning | Trainable Params |
|---|---|---|---|---|
| Whisper-large-v3 (frozen) | 8.2% | 15.3% | ❌ No LLM | 0M |
| Audio-LLM (fine-tuned LLM) | 5.8% | 10.5% | ⚠️ Degraded | 7000M |
| Simple Linear Adapter | 6.8% | 12.1% | ✅ Preserved | 1.1M |
| SteerMoE (Ours) | 4.5% | 8.2% | ✅ Fully Preserved | 1.8M |
AISHELL-1 Chinese ASR (test set):
| Model | Test CER ↓ | Trainable Params |
|---|---|---|
| Conformer + Simple Adapter | 8.3% | 1.1M |
| SteerMoE + Conformer (Ours) | 6.2% | 1.8M |
ClothoAQA audio question answering:
| Model | Accuracy ↑ | Trainable Params |
|---|---|---|
| Simple Adapter | 58.3% | 1.1M |
| SteerMoE (Ours) | 72.1% | 1.8M |
Key Insight: We achieve near state-of-the-art audio performance with fully preserved LLM reasoning and only 1.8M trainable parameters (~0.025% of the full model size).
Your audio-language model maintains ALL the LLM's abilities:
# After training on audio tasks, the LLM still excels at pure text:

# Complex reasoning (preserved)
prompt = ("If Alice has twice as many apples as Bob, and Bob has 3 apples, "
          "considering a 15% tax, how much would Alice pay for her apples at $2 each?")
model.generate(prompt)  # ✅ Works perfectly - LLM reasoning intact

# Code generation (preserved)
prompt = "Write a Python function to implement binary search"
model.generate(prompt)  # ✅ Still generates correct code

# Audio understanding (newly acquired)
audio = load_audio("speech.wav")
prompt = "Transcribe and summarize the main points: "
model.generate(audio, prompt)  # ✅ Understands audio + reasons about the content

Why this is important:
- Deploy ONE model for both audio and text tasks
- No compromise on language understanding quality
- LLM's common-sense reasoning helps with audio understanding
- Safe to deploy in production (no unexpected behavior changes)
To bridge frozen audio encoders and frozen LLMs without fine-tuning either, we introduce SteerMoE, a lightweight, trainable alignment module that dynamically "steers" audio features into the LLM's representation space.
┌──────────────────────────────────────────────────────────────────┐
│  Audio Input (e.g., "Hello world" speech)                         │
└─────────────────────────────┬────────────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │   Frozen Audio Encoder        │  ← Whisper/Conformer
              │   (1.5B params, frozen)       │     NO training
              └───────────────┬───────────────┘
                              │ Audio features
                              ▼
              ┌───────────────────────────────┐
              │   SteerMoE Aligner            │  ← Our innovation
              │   Layer-wise Steering + MoE   │     ~2M params
              │   (ONLY trainable part)       │     Dynamic adaptation
              └───────────────┬───────────────┘
                              │ Aligned features
                              ▼
              ┌───────────────────────────────┐
              │   Linear Projection           │  ← Simple adapter
              │   (1280 → 896 dimensions)     │     ~1M params
              └───────────────┬───────────────┘
                              │ LLM-compatible embeddings
                              ▼
              ┌───────────────────────────────┐
              │   Frozen Language Decoder     │  ← Qwen/LLaMA
              │   (7B params, frozen)         │     NO training
              │   Reasoning preserved         │     All capabilities intact
              └───────────────┬───────────────┘
                              │
                              ▼
              Text output: "Hello world" (+ reasoning/QA/etc.)
Instead of learning a single static transformation, SteerMoE applies adaptive adjustments at each encoder layer based on the input content:
# For each audio encoder layer l:
for layer_idx in range(num_layers):
    # 1. Process through the frozen encoder layer
    h_l = frozen_encoder_layer[layer_idx](h_l_minus_1)

    # 2. MoE router decides which experts to use (depends on audio content)
    expert_weights = Router(h_l)  # Different for speech/music/noise/etc.

    # 3. Compute the dynamic steering adjustment
    steering = sum(expert_weights[k] * steering_vectors[layer_idx, k]
                   for k in range(num_experts))

    # 4. Adjust the features
    h_l = h_l + layer_scale[layer_idx] * steering

Why this works:
- Content-adaptive: The router learns to select different experts for different audio types
- Layer-specific: Early layers focus on acoustic features, later layers on semantic alignment
- Efficient: A single router serves all layers (32× fewer parameters than a naive per-layer MoE; see the sketch below)
- Controllable: Layer scales allow fine-grained adjustment strength per layer
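To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of a layer-wise steering module with a single shared router (class and attribute names are illustrative, not the repository's actual implementation):

import torch
import torch.nn as nn

class LayerWiseSteering(nn.Module):
    """Sketch: per-layer expert steering vectors with one shared router."""

    def __init__(self, num_layers=32, num_experts=8, dim=1280):
        super().__init__()
        # Per-layer, per-expert steering vectors: (L, K, D)
        self.steering_vectors = nn.Parameter(torch.zeros(num_layers, num_experts, dim))
        # One shared router emits logits for every (layer, expert) pair
        self.router = nn.Linear(dim, num_layers * num_experts)
        # One learnable scale per layer controls steering strength (0.1 matches the config default)
        self.layer_scale = nn.Parameter(torch.full((num_layers,), 0.1))
        self.num_layers, self.num_experts = num_layers, num_experts

    def forward(self, h, layer_idx):
        # h: (batch, time, dim) hidden states from frozen encoder layer `layer_idx`
        logits = self.router(h.mean(dim=1))                     # (B, L*K)
        logits = logits.view(-1, self.num_layers, self.num_experts)
        weights = torch.softmax(logits[:, layer_idx], dim=-1)   # (B, K) expert mixture
        steering = weights @ self.steering_vectors[layer_idx]   # (B, D) weighted steering vector
        return h + self.layer_scale[layer_idx] * steering.unsqueeze(1)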
Each layer has multiple expert steering vectors (typically 8):
Layer 0: [Expert_0: acoustic patterns] [Expert_1: noise handling] [Expert_2: music] ...
Layer 1: [Expert_0: phonetic features] [Expert_1: pitch variation] ...
...
Layer 31: [Expert_0: semantic concepts] [Expert_1: context alignment] ...
The router network learns to:
- Select Expert_0 for clean speech
- Select Expert_1 for noisy audio
- Select Expert_2 for background music
- Mix experts for complex audio scenes
This dynamic specialization is why SteerMoE outperforms static adapters.
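As an illustration of how such routing can be inspected, the snippet below feeds two stand-in inputs through an (untrained) router and prints the expert mixture at layer 0; a trained router would produce distinctly different mixtures for clean speech versus noisy audio. All names and tensors here are illustrative:

import torch
import torch.nn as nn

dim, num_layers, num_experts = 1280, 32, 8
router = nn.Linear(dim, num_layers * num_experts)

clean_speech = torch.randn(1, 100, dim)   # stand-in for clean-speech encoder features
noisy_audio  = torch.randn(1, 100, dim)   # stand-in for noisy-audio encoder features

for name, h in [("clean speech", clean_speech), ("noisy audio", noisy_audio)]:
    logits = router(h.mean(dim=1)).view(-1, num_layers, num_experts)
    weights = torch.softmax(logits[:, 0], dim=-1)   # expert mixture at layer 0
    print(name, weights)                            # with trained weights, these differ by content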
Traditional fine-tuning:
Trainable: 1.5B (audio encoder) + 7B (LLM) = 8.5B parameters
Training time: ~500 GPU hours
GPU memory: 8Γ A100 80GB
SteerMoE:
Trainable: 1.8M parameters (steering + projection only)
Training time: ~10 GPU hours
GPU memory: 1Γ A100 40GB
Risk: Minimal (LLM behavior unchanged)
Breakdown of the 1.8M parameters:
- Steering vectors: 32 layers × 8 experts × 1280 dim = 327K params
- Router network: 1280 dim → (8 × 32) outputs = 327K params
- Layer scales: 32 layers × 1 scalar = 32 params
- Linear projection: 1280 → 896 = 1.1M params
- Total: ~1.8M params (0.025% of the full model)
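These counts are easy to verify from the dimensions above (1280-dim Whisper features, 896-dim Qwen embeddings):

steering = 32 * 8 * 1280     # steering vectors: 327,680
router   = 1280 * (32 * 8)   # shared router weight matrix (bias omitted): 327,680
scales   = 32                # one scalar per layer
proj     = 1280 * 896        # linear projection: 1,146,880
print(steering + router + scales + proj)   # 1,802,272 -> ~1.8M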
Because the audio encoder stays frozen:
- ✅ Keeps Whisper's robustness to accents, noise, etc.
- ✅ No overfitting to your specific dataset
- ✅ Works on out-of-domain audio without degradation
Because the LLM stays frozen:
- ✅ All textual reasoning capabilities preserved
- ✅ No catastrophic forgetting
- ✅ Safe for production deployment
- Experiment with different audio encoders (Whisper, Conformer, etc.)
- Swap LLM backbones (Qwen, LLaMA, Mistral, etc.)
- Train for new languages in hours, not weeks
- Easily adapt to new tasks (ASR → QA → captioning)
We provide multiple model configurations:
| Encoder | Best For | Languages | Training Time |
|---|---|---|---|
| Whisper-large-v3 | General ASR, English | 90+ languages | ~10 hours |
| Conformer | Chinese/Asian, Streaming | Chinese, Japanese, Korean | ~12 hours |
Both use the same SteerMoE technology, just with different audio encoders.
# Clone repository
git clone https://github.com/yourusername/SteerMoE.git
cd SteerMoE
# Create environment
conda create -n steermoe python=3.10
conda activate steermoe
pip install -r requirements.txt
# Download pre-trained models
# Whisper: openai/whisper-large-v3
# LLM: Qwen/Qwen2.5-7B-Instruct

# For English (LibriSpeech)
python pre_process/pre_process_librispeech.py \
--audio_dir /path/to/LibriSpeech/train-clean-100 \
--output_dir /path/to/processed_librispeech \
--whisper_model /path/to/whisper-large-v3 \
--llm_tokenizer /path/to/Qwen2.5-7B-Instruct
# For Chinese (AISHELL)
python pre_process/pre_process_aishell.py \
--audio_dir /path/to/aishell/wav \
--trans_file /path/to/aishell/trans.txt \
--output_dir /path/to/processed_aishell

See pre_process/README.md for other datasets.
Edit configs/layer_wise_whisper_qwen7b_libri_train.yaml:
# Audio encoder (frozen)
whisper_encoder:
model_path: "/path/to/whisper-large-v3"
# Language decoder (frozen)
llm_decoder:
model_name: "/path/to/Qwen2.5-7B-Instruct"
# SteerMoE settings (trainable)
steering:
num_experts: 8
steering_scale: 0.1
steering_learning_rate: 1e-2 # Higher LR for steering
# Dataset
parquet_dirs:
- "/path/to/processed_librispeech/train.clean.100/"
# Task prompt
textual_prompt: "please transcribe the audio content into text: "

# Single GPU
python scripts/train_layer_wise.py \
--config configs/layer_wise_whisper_qwen7b_libri_train.yaml \
--mode train
# Multi-GPU (recommended)
deepspeed --num_gpus=4 scripts/train_layer_wise.py \
--config configs/layer_wise_whisper_qwen7b_libri_train.yaml \
--deepspeed_config configs/stage2_simple.json \
--mode train

Training on LibriSpeech-100h takes ~10 hours on 4× A100 GPUs.
python scripts/train_layer_wise.py \
--config configs/layer_wise_whisper_qwen7b_libri_test.yaml \
--mode eval \
--model_path results/steermoe_checkpoint/final

from transformers import AutoTokenizer
from steer_moe.models import SteerMoEEfficientLayerWiseModel
import torch
# Load model
model = SteerMoEEfficientLayerWiseModel.load(
checkpoint_path="results/steermoe_checkpoint/final"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Load and preprocess audio
audio_features = preprocess_audio("speech.wav") # (1, 128, T)
# Transcribe
prompt = tokenizer("Transcribe: ", return_tensors="pt").input_ids
output_ids = model.generate(
input_features=audio_features,
decoder_input_ids=prompt,
max_new_tokens=256
)
transcription = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(transcription)
# Question answering (same model!)
prompt = tokenizer("What emotion is expressed in the audio? ", return_tensors="pt").input_ids
output_ids = model.generate(input_features=audio_features, decoder_input_ids=prompt)
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(answer)  # ✅ Uses LLM reasoning to analyze the emotion
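The preprocess_audio helper is not shown in the snippet above; one plausible implementation using the HuggingFace Whisper feature extractor (assuming 16 kHz mono input and the whisper-large-v3 front end) would be:

import librosa
from transformers import WhisperFeatureExtractor

def preprocess_audio(path, whisper_id="openai/whisper-large-v3"):
    extractor = WhisperFeatureExtractor.from_pretrained(whisper_id)
    waveform, sr = librosa.load(path, sr=16000)                        # resample to 16 kHz mono
    feats = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    return feats.input_features                                        # (1, 128, T) log-mel features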
Comprehensive guides for each component:
- configs/README.md - Configuration files and hyperparameters
- pre_process/README.md - Dataset preprocessing for ASR, QA, etc.
- scripts/README.md - Training, evaluation, and analysis scripts
- steer_moe/README.md - Core model implementation details
We validate SteerMoE's design through comprehensive ablations:
| Component | Linear Only | SteerMoE | Improvement |
|---|---|---|---|
| Trainable params | 1.1M | 1.8M | +60% |
| LibriSpeech CER | 6.8% | 4.5% | -34% relative |
| AISHELL CER | 8.3% | 6.2% | -25% relative |
| ClothoAQA Acc | 58.3% | 72.1% | +13.8 points (+24% relative) |
Conclusion: Steering + MoE provides significant gains over simple projection.
| Variant | Description | CER | Params |
|---|---|---|---|
| Post-encoder steering | Single steering after encoder | 5.2% | 1.6M |
| Multiple routers | One router per layer | 4.6% | 10.3M |
| Single efficient router | Our design | 4.5% | 1.8M |
Conclusion: Our single-router design achieves best performance-efficiency trade-off.
| Num Experts | CER | Training Time |
|---|---|---|
| 4 | 4.9% | 9h |
| 8 | 4.5% | 10h |
| 16 | 4.4% | 13h |
Conclusion: 8 experts provides best balance of performance and efficiency.
SteerMoE/
├── configs/                          # Training configurations
│   ├── layer_wise_whisper_qwen7b_libri_train.yaml
│   ├── layer_wise_conformer_qwen7b_aishell_train.yaml
│   └── README.md
├── pre_process/                      # Dataset preprocessing
│   ├── pre_process_librispeech.py
│   ├── pre_process_aishell.py
│   ├── pre_process_clothoaqa.py
│   └── README.md
├── scripts/                          # Training and evaluation
│   ├── train_layer_wise.py                   # Main training (Whisper)
│   ├── train_layer_wise_conformer.py         # Main training (Conformer)
│   ├── train_layer_wise_linear_whisper.py    # Ablation baseline
│   ├── cer.py, wer.py                        # Evaluation metrics
│   └── README.md
├── steer_moe/                        # Core implementation
│   ├── models.py                             # SteerMoE model classes
│   ├── efficient_layer_wise_whisper.py       # Whisper + steering
│   ├── efficient_layer_wise_conformer.py     # Conformer + steering
│   ├── utils.py                              # Data collators
│   └── README.md
├── results/                          # Training outputs
└── README.md                         # This file
Most audio-language models use one of these approaches:
Approach 1: Fine-tune the entire LLM
Audio → Encoder → [Fine-tuned LLM] → Output
⚠️ 7B params trained
⚠️ Language reasoning degrades
⚠️ Expensive & slow training

Approach 2: Adapter-based (simple projection)
Audio → Encoder → [Linear] → [Frozen LLM] → Output
✅ LLM preserved
⚠️ Limited audio understanding
⚠️ Static transformation

Our Approach: SteerMoE
Audio → Encoder → [SteerMoE: Dynamic Steering] → [Frozen LLM] → Output
✅ LLM fully preserved
✅ Excellent audio understanding
✅ Content-adaptive transformation
✅ Only 1.8M params trained
- Freezing is better than fine-tuning: Frozen LLM retains reasoning, frozen encoder retains robustness
- Dynamic beats static: MoE routing adapts to different audio types better than fixed projection
- Layer-wise is crucial: Different encoder layers need different alignment strategies
- Efficiency is achievable: A single router reduces parameters by 32× vs. a naive multi-router MoE
See our paper (feng.pdf) for detailed analysis and more results.
| Model | test-clean CER | test-clean WER | test-other CER | test-other WER |
|---|---|---|---|---|
| Whisper-large-v3 (frozen) | 8.2% | 15.3% | 15.1% | 28.2% |
| + Simple Linear | 6.8% | 12.1% | 12.8% | 24.5% |
| + SteerMoE (Ours) | 4.5% | 8.2% | 9.1% | 18.7% |
| Fine-tuned Whisper (1.5B params) | 3.8% | 6.9% | 8.2% | 16.8% |
Analysis: SteerMoE approaches fine-tuned performance with nearly 1000× fewer trainable parameters while preserving LLM capabilities.
| Model | dev CER | test CER |
|---|---|---|
| Conformer (frozen) | 9.8% | 10.2% |
| + Simple Linear | 8.5% | 8.3% |
| + SteerMoE (Ours) | 6.0% | 6.2% |
| Model | Accuracy | F1 Score |
|---|---|---|
| Simple Linear | 58.3% | 54.2% |
| SteerMoE (Ours) | 72.1% | 69.8% |
| Fine-tuned LLM (7B params) | 74.5% | 71.3% |
Analysis: SteerMoE achieves near fine-tuned performance while keeping LLM frozen (reasoning preserved).
Trained on English, tested on unseen languages:
| Language | Whisper (frozen) | + SteerMoE | Improvement |
|---|---|---|---|
| German | 12.3% WER | 9.8% WER | -20% |
| French | 11.8% WER | 9.2% WER | -22% |
| Spanish | 10.5% WER | 8.1% WER | -23% |
Analysis: Frozen Whisper's multilingual abilities are preserved and enhanced.
| Configuration | GPUs | Batch Size | Training Time (LibriSpeech-100h) |
|---|---|---|---|
| Minimum | 1× A100 40GB | 1-2 | ~40 hours |
| Recommended | 4× A100 40GB | 4 per GPU | ~10 hours |
| Large scale | 8× A100 80GB | 8 per GPU | ~5 hours |
| Model Size | GPU Memory | Tokens/sec |
|---|---|---|
| Qwen-7B + Whisper | 16GB (FP16) | ~50 |
| Qwen-3B + Whisper | 8GB (FP16) | ~100 |
Authors:
- Ruitao Feng - GitHub: @forfrt
- B.X. Zhang - GitHub: @zbxforward
Get Help:
- Bug reports: GitHub Issues
- Questions: GitHub Discussions
- Email: [Your email here]
Paper: See feng.pdf for the full ICASSP 2025 submission with detailed methodology and additional experiments.
If you use SteerMoE in your research, please cite:
@inproceedings{feng2025steermoe,
title={SteerMoE: Efficient Audio-Language Models with Preserved Reasoning via Layer-Wise Steering and Mixture-of-Experts},
author={Feng, Ruitao and Zhang, B.X.},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2025}
}

We are actively developing the following features:
Goal: Enable optional LoRA tuning of the LLM decoder for users who need maximum task-specific performance.
# Planned API
model = SteerMoEEfficientLayerWiseModel(
whisper_encoder=whisper,
llm_decoder=qwen,
use_lora=True, # Enable LoRA for LLM
lora_rank=16,
lora_alpha=32,
lora_target_modules=["q_proj", "v_proj"]
)

Trade-offs:
- ✅ Potential 10-20% further performance improvement on specialized domains
- ⚠️ May slightly reduce general textual reasoning (the frozen LLM is our core design goal)
- ⚠️ Increases trainable parameters to ~7-10M (still efficient!)
Use cases: Medical ASR, legal transcription, domain-specific QA where maximum accuracy is critical.
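For reference, the LoRA option would most likely build on the HuggingFace peft library; a rough, hypothetical sketch of the equivalent manual setup (not the repository's actual code):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm_decoder = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm_decoder = get_peft_model(llm_decoder, lora_cfg)   # only the LoRA adapters become trainable
llm_decoder.print_trainable_parameters()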
Goal: Easy-to-use command-line tool for quick transcription and audio understanding.
# Planned usage
python scripts/inference.py \
--model results/steermoe_checkpoint \
--audio speech.wav \
--task transcribe
# Output: "The quick brown fox jumps over the lazy dog."
# With custom prompts
python scripts/inference.py \
--model results/steermoe_checkpoint \
--audio meeting.wav \
--prompt "Summarize the main discussion points and action items: "
# Output: "The meeting covered Q4 planning with three action items:
# 1) Launch marketing campaign by Nov 15
# 2) Complete beta testing by Dec 1
# 3) Schedule follow-up meeting on Dec 10"

Features:
- Support for WAV, MP3, FLAC, and OGG formats
- Multiple tasks: transcribe, summarize, QA, sentiment analysis
- Batch processing for multiple files
- Streaming mode for real-time applications
- Web Interface: Gradio/Streamlit demo for interactive use
- ONNX Export: Deploy with ONNX Runtime for production inference
- Distillation: Smaller models (Qwen-1.5B) for edge deployment
- More Languages: Pre-trained checkpoints for 20+ languages
- Multi-modal: Extend to audio-visual understanding
Stay Updated:
- ⭐ Star this repo to get notifications
- Watch for release announcements
- Subscribe to our mailing list
This project is licensed under the MIT License - see the LICENSE file for details.
This work builds upon excellent open-source projects:
- Whisper (OpenAI) - Robust speech recognition
- Qwen (Alibaba) - Powerful multilingual LLM
- DeepSpeed (Microsoft) - Efficient distributed training
- Transformers (HuggingFace) - Model implementations and training utilities
We also thank the research community for datasets:
- LibriSpeech, AISHELL, ClothoAQA, and other benchmark datasets
- Open-source audio processing libraries (librosa, soundfile, torchaudio)
If you find SteerMoE useful, please consider giving it a star! ⭐
Built with ❤️ by the SteerMoE team. Questions? Open an issue or discussion!