Extend val_check_interval support #8135

@amoshyc

Description

🐛 Bug

When I train the model for a fixed number of training steps instead of a fixed number of epochs, val_check_interval behaves strangely. Please see the following colab:

https://colab.research.google.com/drive/1I0ySRH03T9LdXHoGwCp3Q242dHEHWP_0?usp=sharing

In the code, I log the global_step at the start of each validation run:

class Model(pl.LightningModule):
    ...

    def on_validation_start(self):
        print('global_step:', self.global_step)
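
For completeness, RandomDataset and the rest of Model are not shown above; a minimal stand-in along the lines of Lightning's BoringModel (my sketch, not necessarily the exact colab code) would be:

import torch
import pytorch_lightning as pl

class RandomDataset(torch.utils.data.Dataset):
    # Hypothetical stand-in: random tensors of shape (length, size).
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Any differentiable scalar works as a dummy loss here.
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        pass  # validation only needs to run, not compute anything

    def on_validation_start(self):
        print('global_step:', self.global_step)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)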

I set the Trainer's max_steps to 100 and val_check_interval to 10.
But when I run cells In[4] and In[5], the outputs are different.
The only difference between In[4] and In[5] is the number of samples in the dataset (40 vs. 32, i.e. 20 vs. 16 batches per epoch at batch_size=2), which should not matter.

In[4]:

train_set = RandomDataset(1, 40)
valid_set = RandomDataset(1, 40)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)

trainer = pl.Trainer(
    gpus=1,
    max_steps=100,
    val_check_interval=10,
    num_sanity_val_steps=0,
    log_every_n_steps=10,
    progress_bar_refresh_rate=0,
)
model = Model()
trainer.fit(model, train_loader, valid_loader)

Out[4]:

global_step: 9
global_step: 19
global_step: 29
global_step: 39
global_step: 49
global_step: 59
global_step: 69
global_step: 79
global_step: 89
global_step: 99

In[5]:

train_set = RandomDataset(1, 32)
valid_set = RandomDataset(1, 32)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=2)

trainer = pl.Trainer(
    gpus=1,
    max_steps=100,
    val_check_interval=10,
    num_sanity_val_steps=0,
    log_every_n_steps=10,
    progress_bar_refresh_rate=0,
)
model = Model()
trainer.fit(model, train_loader, valid_loader)

Out[5]:

global_step: 9
global_step: 25
global_step: 41
global_step: 57
global_step: 73
global_step: 89
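
The pattern in Out[5] suggests that an integer val_check_interval is counted against the batch index within the current epoch rather than against the global step: with 16 batches per epoch, only batch index 9 of each epoch triggers validation, which lands on global steps 9, 25, 41, and so on. A minimal sketch (my reconstruction of the behaviour, not Lightning internals) reproduces both outputs under that assumption:

def validation_steps(num_samples, batch_size=2, max_steps=100, interval=10):
    # Global steps at which validation runs under two interpretations:
    # per-epoch (the batch counter resets each epoch) vs. global.
    batches_per_epoch = num_samples // batch_size
    per_epoch, global_ = [], []
    for step in range(max_steps):
        batch_idx = step % batches_per_epoch      # resets every epoch
        if (batch_idx + 1) % interval == 0:       # observed behaviour
            per_epoch.append(step)
        if (step + 1) % interval == 0:            # expected behaviour
            global_.append(step)
    return per_epoch, global_

print(validation_steps(40))  # both lists are [9, 19, ..., 99], matching Out[4]
print(validation_steps(32))  # per-epoch list is [9, 25, 41, 57, 73, 89], matching Out[5]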

Expected behavior

Since I specify max_steps and set val_check_interval to an integer, I expect the result to be the same as Out[4] regardless of the number of samples in the dataset. The docs say that val_check_interval specifies the number of training steps between validations, so Out[5] should be identical to Out[4].

I also expect validation to run the same number of times in both cases. By the way, the x-axis in TensorBoard is also wrong; you can see that in Out[9].

Environment

* CUDA:
	- GPU:
		- Tesla T4
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.9.0+cu102
	- pytorch-lightning: 1.3.7post0
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Sat Jun 5 09:50:34 PDT 2021

cc @Borda @tchaton

Labels: feature (is an improvement or enhancement), help wanted (open to be worked on), priority: 1 (medium priority task)