This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Commit b64fbfa

Author: nayef211

Added datasets that roberta was trained on

1 parent 3303894 commit b64fbfa

File tree

1 file changed: +9 -0 lines changed


torchtext/models/roberta/bundler.py

Lines changed: 9 additions & 0 deletions
@@ -220,6 +220,10 @@ def encoderConf(self) -> RobertaEncoderConf:
     training on longer sequences; and dynamically changing the masking pattern applied
     to the training data.
 
+    The RoBERTa model was pretrained on the union of five datasets: BookCorpus,
+    English Wikipedia, CC-News, OpenWebText, and STORIES. Together these datasets
+    contain over 160GB of text.
+
     Originally published by the authors of RoBERTa under MIT License
     and redistributed with the same license.
     [`License <https://github.com/pytorch/fairseq/blob/main/LICENSE>`__,
@@ -262,6 +266,11 @@ def encoderConf(self) -> RobertaEncoderConf:
     training on longer sequences; and dynamically changing the masking pattern applied
     to the training data.
 
+    The RoBERTa model was pretrained on the union of five datasets: BookCorpus,
+    English Wikipedia, CC-News, OpenWebText, and STORIES. Together these datasets
+    contain over 160GB of text.
+
+
     Originally published by the authors of RoBERTa under MIT License
     and redistributed with the same license.
     [`License <https://github.com/pytorch/fairseq/blob/main/LICENSE>`__,
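For reference, the bundler edited here exposes pre-trained RoBERTa weights as model bundles. Below is a minimal usage sketch based on the torchtext 0.12-era public API (ROBERTA_BASE_ENCODER, get_model(), transform(), and torchtext.functional.to_tensor); it illustrates the bundle this docstring documents rather than anything added by this commit, and exact names may vary between releases.

import torch
import torchtext
from torchtext.functional import to_tensor

# Load the pre-trained RoBERTa base encoder bundle whose docstring is edited above.
bundle = torchtext.models.ROBERTA_BASE_ENCODER
model = bundle.get_model()
model.eval()

# The bundled transform converts raw strings into token-index sequences.
transform = bundle.transform()
batch = ["Hello world", "How are you!"]

# Pad the variable-length sequences into one tensor (RoBERTa's pad index is 1).
model_input = to_tensor(transform(batch), padding_value=1)

with torch.no_grad():
    features = model(model_input)  # contextual features, shape (batch, seq_len, 768)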

0 commit comments