
Conversation

@mthrok (Contributor) commented Oct 5, 2021

Add pretrained weights from https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#pre-trained-models

  • Wav2Vec 2.0 Base / Large / Large (LV-60)
  • XLSR-53

)
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module
Pre-trained on 960 hours of *LibriSpeech* [:footcite:`7178964`] dataset, and
Contributor
Does this correspond to the Wav2Vec 2.0 Large | 10 minutes entry in the table? If so, should it be fine-tuned on LibriSpeech instead of Libri-Light?

Contributor Author
Libri-Light is a subset of LibriSpeech, so both descriptions are correct, but Libri-Light is more accurate.

Here is the description from the wav2vec 2.0 paper.

We fine-tune on five labeled data settings: 960 hours of transcribed Librispeech, the train-clean-100 subset comprising 100 hours (100 hours labeled), as well as the Libri-light limited resource training subsets originally extracted from Librispeech, these are train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled).


WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
Pre-trained and fine-tuned for ASR on 960 hours of
Contributor
I think this is switched with the WAV2VEC2_ASR_BASE_960H doc below.

Contributor Author
Good catch! Thank you!

)
WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of *Libri-Light* [:footcite:`librilight`] dataset, and
Contributor
From this table and your WAV2VEC2_ASR_LARGE_LV60K_100H doc below, I think this should be fine-tuned on LibriSpeech instead of Libri-Light

Contributor Author (@mthrok, Oct 5, 2021)
Thanks for spotting the error. I looked at the paper again and it turned out that LibriVox is the correct one.

The following is the relationship between these datasets.

  • LibriVox: 60,000 hours audio
    • LibriSpeech: 960 hours audio + transcript, subset of LibriVox
      • Libri-Light (Limited Resource Training Set): subset of the LibriSpeech training subset

@mthrok mthrok requested a review from carolineechen October 6, 2021 14:47
@mthrok mthrok merged commit e40c9c3 into pytorch:main Oct 6, 2021
@mthrok mthrok deleted the pretrain-3 branch October 6, 2021 14:59
mthrok added a commit that referenced this pull request Oct 6, 2021
@mthrok mthrok changed the title Add pretrained weights from wav2vec2.0 and XLSR papers [Cherry-picked 0.10] Add pretrained weights from wav2vec2.0 and XLSR papers Oct 6, 2021