Description
🐛 Bug
When I train the model by specifying the number of training steps instead of epochs, val_check_interval behaves strangely. Please see the following colab:
https://colab.research.google.com/drive/1I0ySRH03T9LdXHoGwCp3Q242dHEHWP_0?usp=sharing
In the code, I log the global_step on each validation.
class Model(pl.LightningModule):
    ...
    def on_validation_start(self):
        print('global_step:', self.global_step)

I set the Trainer's max_steps to 100 and val_check_interval to 10.
But when I run cells In[4] and In[5], the outputs are different.
The only difference between In[4] and In[5] is the number of samples in the dataset, which should not matter.
In[4]:
train_set = RandomDataset(1, 40)
valid_set = RandomDataset(1, 40)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)
trainer = pl.Trainer(
    gpus=1,
    max_steps=100,
    val_check_interval=10,
    num_sanity_val_steps=0,
    log_every_n_steps=10,
    progress_bar_refresh_rate=0,
)
model = Model()
trainer.fit(model, train_loader, valid_loader)

Out[4]:
global_step: 9
global_step: 19
global_step: 29
global_step: 39
global_step: 49
global_step: 59
global_step: 69
global_step: 79
global_step: 89
global_step: 99
In[5]:
train_set = RandomDataset(1, 32)
valid_set = RandomDataset(1, 32)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)
trainer = pl.Trainer(
    gpus=1,
    max_steps=100,
    val_check_interval=10,
    num_sanity_val_steps=0,
    log_every_n_steps=10,
    progress_bar_refresh_rate=0,
)
model = Model()
trainer.fit(model, train_loader, valid_loader)

Out[5]:
global_step: 9
global_step: 25
global_step: 41
global_step: 57
global_step: 73
global_step: 89
Expected behavior
Since I specify max_steps and set val_check_interval to an integer, I expect the result to be the same as Out[4] regardless of the number of samples in the dataset. The docs say that val_check_interval specifies the number of training steps between validations, so Out[5] should be the same as Out[4].
I also expect the number of times validation is performed to be the same in both cases. By the way, the x-axis in TensorBoard is also wrong; you can see that in Out[9].
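The observed step indices are consistent with a validation-check counter that resets at each epoch boundary rather than counting global steps. The following is a minimal simulation sketch of that hypothesis (it is not the actual Lightning implementation): with 20 batches per epoch, checks every 10 batches happen to align with every 10 global steps, but with 16 batches per epoch they land 16 steps apart.

```python
def validation_steps(max_steps, val_check_interval, batches_per_epoch):
    """Simulate a per-epoch batch counter driving validation checks
    (hypothetical model of the observed behavior)."""
    checks = []
    global_step = 0
    while global_step < max_steps:
        batch_idx = 0  # counter resets at each epoch boundary
        while batch_idx < batches_per_epoch and global_step < max_steps:
            global_step += 1
            batch_idx += 1
            if batch_idx % val_check_interval == 0:
                # on_validation_start sees the index of the last batch,
                # i.e. global_step - 1
                checks.append(global_step - 1)
    return checks

# 40 samples / batch_size 2 = 20 batches per epoch (In[4])
print(validation_steps(100, 10, 20))  # [9, 19, 29, ..., 99]
# 32 samples / batch_size 2 = 16 batches per epoch (In[5])
print(validation_steps(100, 10, 16))  # [9, 25, 41, 57, 73, 89]
```

This reproduces both Out[4] and Out[5] exactly, which suggests val_check_interval is being interpreted as "every N batches within an epoch" instead of "every N global steps".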
Environment
* CUDA:
- GPU:
- Tesla T4
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.5
- pyTorch_debug: False
- pyTorch_version: 1.9.0+cu102
- pytorch-lightning: 1.3.7post0
- tqdm: 4.41.1
* System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.10
- version: #1 SMP Sat Jun 5 09:50:34 PDT 2021