@@ -131,6 +131,7 @@ This package comprises the following classes that can be imported in Python and
 - Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
   - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
   - `OpenAIGPTConfig` - Configuration class to store the configuration of an `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
+  - `GPT2Config` - Configuration class to store the configuration of a `GPT2Model` with utilities to read and write from JSON configuration files.
   - `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.

 The repository further comprises:
@@ -461,10 +462,12 @@ Here is a detailed documentation of the classes in the package and how to use th

 | Sub-section | Description |
 |-|-|
-|[Loading Google AI's/OpenAI's pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump)| How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance |
-|[PyTorch models](#PyTorch-models)| API of the BERT, GPT, GPT-2 and Transformer-XL PyTorch model classes |
+|[Loading pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump)| How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance |
+|[Serialization best-practices](#serialization-best-practices)| How to save and reload a fine-tuned model |
+|[Configurations](#configurations)| API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL |
+|[Models](#models)| API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL |
 |[Tokenizers](#tokenizers)| API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL |
-|[Optimizers](#optimizerss)| API of the optimizers |
+|[Optimizers](#optimizers)| API of the optimizers |

 ### Loading Google AI or OpenAI pre-trained weights or PyTorch dump

@@ -524,7 +527,101 @@ model = GPT2Model.from_pretrained('gpt2')

 ```

-### PyTorch models
+### Serialization best-practices

+This section explains how you can save and reload a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).

+There are three types of files you need to save to be able to reload a fine-tuned model:

+- the model itself, which should be saved following PyTorch serialization [best practices](https://pytorch.org/docs/stable/notes/serialization.html#best-practices),
+- the configuration file of the model, which is saved as a JSON file, and
+- the vocabulary (and the merges for the BPE-based models GPT and GPT-2).

+Here is the recommended way of saving the model, configuration and vocabulary to an `output_dir` directory and reloading the model and tokenizer afterwards:

+```python
+from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME

+output_dir = "./models/"

+# Step 1: Save a model, configuration and vocabulary that you have fine-tuned

+# If we have a distributed model, save only the encapsulated model
+# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
+model_to_save = model.module if hasattr(model, 'module') else model

+# If we save using the predefined names, we can load using `from_pretrained`
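# NOTE: the diff view collapses the remaining lines of this example. The sketch below is
# not part of the commit; it shows how the save/reload step typically continues, using only
# calls documented in this README (WEIGHTS_NAME / CONFIG_NAME imported above,
# `config.to_json_file`, `tokenizer.save_vocabulary`, `from_pretrained`). The model class
# below is a hypothetical stand-in for whichever class was fine-tuned.
import os
import torch
from pytorch_pretrained_bert import BertForSequenceClassification, BertTokenizer

output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

torch.save(model_to_save.state_dict(), output_model_file)   # model weights
model_to_save.config.to_json_file(output_config_file)       # model configuration (JSON)
tokenizer.save_vocabulary(output_dir)                        # vocabulary (and merges for GPT/GPT-2)

# Step 2: Re-load the saved model and vocabulary
model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)
```
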
+Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and built from configuration classes, which contain the parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON configuration files. The respective configuration classes are:

+- `BertConfig` for `BertModel` and the other BERT classes.
+- `OpenAIGPTConfig` for `OpenAIGPTModel` and the other OpenAI GPT classes.
+- `GPT2Config` for `GPT2Model` and the other OpenAI GPT-2 classes.
+- `TransfoXLConfig` for `TransfoXLModel` and the other Transformer-XL classes.

+These configuration classes contain a few utilities to load and save configurations:

+- `from_dict(cls, json_object)`: A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
+- `from_json_file(cls, json_file)`: A class method to construct a configuration from a JSON file of parameters. Returns an instance of the configuration class.
+- `to_dict()`: Serializes an instance to a Python dictionary. Returns a dictionary.
+- `to_json_string()`: Serializes an instance to a JSON string. Returns a string.
+- `to_json_file(json_file_path)`: Saves an instance to a JSON file.

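A short usage sketch of these utilities (not part of the diff above); the file paths are hypothetical:

```python
from pytorch_pretrained_bert import BertConfig

# Load a configuration from a JSON file (hypothetical path)
config = BertConfig.from_json_file('./models/bert_config.json')

# Round-trip through a plain Python dictionary
config_dict = config.to_dict()
same_config = BertConfig.from_dict(config_dict)

# Inspect and persist the configuration
print(config.to_json_string())
config.to_json_file('./models/config.json')
```
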
+### Models

 #### 1. `BertModel`

@@ -796,8 +893,7 @@ This model *outputs*:
 - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
 - `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example; a short sketch follows below).

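A minimal incremental-decoding sketch (not part of the diff), assuming the `past` keyword argument and the `(logits, presents)` return convention used in `run_gpt2.py`; it uses `GPT2LMHeadModel` for simplicity:

```python
import torch
from pytorch_pretrained_bert import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

generated = torch.tensor([tokenizer.encode("The weather today is")])
past = None  # cached key/value states ("presents") from previous steps
with torch.no_grad():
    for _ in range(10):
        # After the first step, only the newest token needs to be fed through the model
        input_ids = generated if past is None else generated[:, -1:]
        logits, past = model(input_ids, past=past)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(tokenizer.decode(generated[0].tolist()))
```
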
-### Tokenizers:
+### Tokenizers

 #### `BertTokenizer`

@@ -816,6 +912,7 @@ and three methods:
 - `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
 - `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
 - `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
+- `save_vocabulary(directory_path)`: save the vocabulary file to `directory_path`. Returns the path to the saved vocabulary file: `vocab_file_path`. The vocabulary can be reloaded with `BertTokenizer.from_pretrained('vocab_file_path')` or `BertTokenizer.from_pretrained('directory_path')`.

 Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.

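A short usage sketch of `BertTokenizer` (not part of the diff); the output directory is hypothetical:

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tokens = tokenizer.tokenize("Who was Jim Henson?")      # ['who', 'was', 'jim', 'henson', '?']
ids = tokenizer.convert_tokens_to_ids(tokens)           # vocabulary indices (list of int)
back = tokenizer.convert_ids_to_tokens(ids)             # back to the WordPiece tokens

vocab_path = tokenizer.save_vocabulary('./models/')     # writes the vocabulary file into ./models/
reloaded = BertTokenizer.from_pretrained('./models/')   # reload from the saved vocabulary
```
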
@@ -832,18 +929,22 @@ This class has four arguments:

 and five methods:

-- `tokenize(text)`: convert a `str` in a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
+- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing BPE tokenization.
 - `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
 - `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
 - `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
+- `encode(text)`: convert a `str` into a list of `int` tokens by performing BPE encoding.
 - `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices into a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
+- `save_vocabulary(directory_path)`: save the vocabulary, merges and special tokens files to `directory_path`. Returns the paths to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `OpenAIGPTTokenizer.from_pretrained('directory_path')`.

 Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.

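A short usage sketch of `OpenAIGPTTokenizer` (not part of the diff); the special tokens and output directory are hypothetical:

```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
tokenizer.set_special_tokens(['<bos>', '<eos>'])          # hypothetical special tokens

ids = tokenizer.encode("Jim Henson was a puppeteer")      # BPE tokenize + map to indices
text = tokenizer.decode(ids, skip_special_tokens=True,
                        clean_up_tokenization_spaces=True)

paths = tokenizer.save_vocabulary('./models/')            # vocabulary, merges and special tokens files
```
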
 #### `TransfoXLTokenizer`

 `TransfoXLTokenizer` performs word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency. See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.

+The API is similar to the API of `BertTokenizer` (see above).

 Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.

 #### `GPT2Tokenizer`
@@ -858,13 +959,17 @@ This class has three arguments:

 and two methods:

+- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing byte-level BPE.
+- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
+- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
+- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
 - `encode(text)`: convert a `str` into a list of `int` tokens by performing byte-level BPE.
 - `decode(tokens)`: convert back a list of `int` tokens into a `str`.
+- `save_vocabulary(directory_path)`: save the vocabulary, merges and special tokens files to `directory_path`. Returns the paths to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `GPT2Tokenizer.from_pretrained('directory_path')`.

 Please refer to [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) for more details on the `GPT2Tokenizer`.

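A short usage sketch of `GPT2Tokenizer` (not part of the diff); the output directory is hypothetical:

```python
from pytorch_pretrained_bert import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

tokens = tokenizer.tokenize("Hello world")         # byte-level BPE string tokens
ids = tokenizer.convert_tokens_to_ids(tokens)      # indices in the vocabulary
same_ids = tokenizer.encode("Hello world")         # tokenize + convert in one call

text = tokenizer.decode(ids)                       # back to "Hello world"

paths = tokenizer.save_vocabulary('./models/')     # vocabulary, merges and special tokens files
```
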
-### Optimizers:
+### Optimizers

 #### `BertAdam`

@@ -1174,18 +1279,20 @@ To get these results we used a combination of:

 Here is the full list of hyper-parameters for this run:
 ```bash
+export SQUAD_DIR=/path/to/SQUAD

 python ./run_squad.py \
 --bert_model bert-large-uncased \
 --do_train \
 --do_predict \
 --do_lower_case \
---train_file $SQUAD_TRAIN \
---predict_file $SQUAD_EVAL \
+--train_file $SQUAD_DIR/train-v1.1.json \
+--predict_file $SQUAD_DIR/dev-v1.1.json \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
---output_dir $OUTPUT_DIR \
+--output_dir /tmp/debug_squad/ \
 --train_batch_size 24 \
 --gradient_accumulation_steps 2
 ```
@@ -1194,18 +1301,20 @@ If you have a recent GPU (starting from NVIDIA Volta series), you should try **1

 Here is an example of hyper-parameters for an FP16 run we tried: