Closed
Labels
question: Further information is requested
Description
What is your question?
I have PyTorch Lightning code with checkpointing that runs well on my desktop. But when I run it on our cluster with SLURM, the checkpoints are not saved.
Code
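import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
model = Predictor(args)
# Keep only the best checkpoint, ranked by validation accuracy
check = ModelCheckpoint(save_top_k=1, verbose=True, monitor='val_acc', mode='max',
                        filepath='checks/{epoch}')
trainer = pl.Trainer(checkpoint_callback=check, max_epochs=100, gpus=1)
trainer.fit(model)
What have you tried?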
I run it on the cluster with the following command:
salloc -G 1 srun python main.py
What's your environment?
- OS: Linux
- Packaging: conda
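One possible diagnostic, offered as an assumption and not a confirmed cause: 'checks/{epoch}' is a relative path, so where it points depends on the working directory of the process that srun launches. The sketch below uses only the standard library to print where that path resolves and whether the location looks writable. The helper name describe_checkpoint_dir is hypothetical and is not part of PyTorch Lightning.

```python
import os
from pathlib import Path


def describe_checkpoint_dir(filepath_template: str) -> dict:
    """Hypothetical helper: report where a relative checkpoint path resolves.

    Only the directory part of the template is inspected; the '{epoch}'
    placeholder in the file name is left untouched.
    """
    ckpt_dir = Path(filepath_template).parent.resolve()

    # The directory may not exist yet, so test write access on the
    # nearest ancestor that does exist.
    probe = ckpt_dir
    while not probe.exists():
        probe = probe.parent

    return {
        "cwd": os.getcwd(),
        "checkpoint_dir": str(ckpt_dir),
        "exists": ckpt_dir.is_dir(),
        "writable": os.access(probe, os.W_OK),
    }


if __name__ == "__main__":
    # Same template as in the ModelCheckpoint call above.
    print(describe_checkpoint_dir("checks/{epoch}"))
```

Running this through the same salloc/srun command as main.py, and again on the desktop, would show whether the two environments resolve 'checks/' to the directory you expect.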