Closed
Labels: distributed (Generic distributed-related topic), question (Further information is requested), won't fix (This will not be worked on)
Description
As far as I understand, the DDP backend runs my training script from the beginning for every GPU I use. Is there a way to avoid each of those processes creating its own Hydra output directory? Should I somehow block every process except the one with local rank 0? In my case I save model checkpoints and a .yaml config file to the default Hydra output directory, but the config file ends up copied twice while the checkpoints are saved only once. In any case, spawning that many directories is not convenient.
What can I do?
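By "block" I mean something like the rough sketch below. It is only to illustrate the idea, not a verified fix: I'm assuming the DDP launcher exposes the process rank through a LOCAL_RANK environment variable, and output_dir is just a placeholder for the Hydra run directory.

import os
import shutil

# Assumption: the DDP launcher sets LOCAL_RANK for each spawned process
# (falling back to 0 for the single-process case).
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

if local_rank == 0:
    # Only rank 0 writes into the output directory.
    output_dir = os.getcwd()  # placeholder for the Hydra run directory
    shutil.copy2("train-config.yaml", os.path.join(output_dir, "train-config.yaml"))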
Code
import os
import shutil

import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger


@hydra.main(config_path="train-config.yaml", strict=False)
def train(config: DictConfig) -> None:
    # Remember Hydra's per-run output directory, then switch back to the
    # original working directory so relative paths resolve as usual.
    config.hydra_base_dir = os.getcwd()
    original_wd = hydra.utils.get_original_cwd()
    os.chdir(original_wd)

    # Keep the three best checkpoints (by validation loss) in the Hydra output directory.
    checkpoint_callback = ModelCheckpoint(
        filepath=config.hydra_base_dir,
        save_top_k=3,
        verbose=True,
        monitor="val_loss",
        mode="min",
    )

    # Copy the config next to the checkpoints.
    shutil.copy2("train-config.yaml", os.path.join(config.hydra_base_dir, "train-config.yaml"))

    wandb_logger = WandbLogger(offline=False)

    # MyModel is my LightningModule, defined elsewhere in the project.
    model = MyModel(config)
    trainer = pl.Trainer(
        max_epochs=config.train.max_epochs,
        gpus=config.train.n_gpu,
        auto_select_gpus=True,
        distributed_backend="ddp",
        checkpoint_callback=checkpoint_callback,
        logger=wandb_logger,
    )
    trainer.fit(model)


if __name__ == "__main__":
    train()
What's your environment?
- OS: Ubuntu 18.04
- Conda, Python 3.7.7
- hydra-core==0.11.3
- pytorch-lightning==0.8.5
- wandb==0.9.3