[Cherry-picked 0.10] Add pretrained weights from wav2vec2.0 and XLSR papers #1827
Conversation
)
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module
Pre-trained on 960 hours of *LibriSpeech* [:footcite:`7178964`] dataset, and
Does this correspond to the Wav2Vec 2.0 Large | 10 minutes entry in the table? If so, should it be fine-tuned on LibriSpeech instead of Libri-Light?
Libri-Light is a subset of LibriSpeech, so both descriptions are correct, but Libri-Light is more accurate.
Here is the description from the wav2vec 2.0 paper.
We fine-tune on five labeled data settings: 960 hours of transcribed Librispeech, the train-clean-100 subset comprising 100 hours (100 hours labeled), as well as the Libri-light limited resource training subsets originally extracted from Librispeech, these are train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled).
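For reference, once landed these weights are exposed as pipeline bundles. A minimal usage sketch, assuming the `torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M` bundle name from this PR and a hypothetical input file:

```python
import torch
import torchaudio

# Bundle added in this PR; its docstring is the one under discussion.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
model = bundle.get_model()  # downloads the pretrained weights on first use

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
# Resample to the rate the bundle expects (16 kHz for wav2vec 2.0).
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # per-frame logits over the bundle's labels
```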
WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
Pre-trained and fine-tuned for ASR on 960 hours of
I think this is switched with the WAV2VEC2_ASR_BASE_960H doc below
Good catch! Thank you!
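The fix swaps the two descriptions. A sketch of the corrected wording (the exact phrasing in the final patch may differ):

```python
WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module

Pre-trained on 960 hours of *LibriSpeech*,
fine-tuned for ASR on 100 hours ("train-clean-100" subset).
"""

WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module

Pre-trained and fine-tuned for ASR on 960 hours of *LibriSpeech*.
"""
```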
)
WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of *Libri-Light* [:footcite:`librilight`] dataset, and
From this table and your WAV2VEC2_ASR_LARGE_LV60K_100H doc below, I think this should be fine-tuned on LibriSpeech instead of Libri-Light
Thanks for spotting the error. I looked at the paper again and it turned out that LibriVox is the correct one.
The following is the relationship between these datasets.
- LibriVox: 60,000 hours audio
- LibriSpeech: 960 hours audio + transcript, subset of LibriVox
- Libri-Light (Limited Resource Training Set): subset of the LibriSpeech training subset
 
 
Add pretrained weights from https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#pre-trained-models
- Wav2Vec 2.0 Base / Large / Large (LV-60)
- XLSR-53
Co-authored-by: Caroline Chen <[email protected]>
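Since XLSR-53 ships as pretrained-only weights (no ASR head), it is typically used for feature extraction. A minimal sketch, assuming the `torchaudio.pipelines.WAV2VEC2_XLSR53` bundle name and a hypothetical input file:

```python
import torch
import torchaudio

# Pretrained-only bundle (no fine-tuned ASR head), assumed name from this PR.
bundle = torchaudio.pipelines.WAV2VEC2_XLSR53
model = bundle.get_model()

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # Returns the intermediate transformer-layer outputs, one tensor per layer.
    features, _ = model.extract_features(waveform)
```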