Exception: The wandb backend process has shutdown #10688

@morestart

🐛 Bug

Exception: The wandb backend process has shutdown

Full error output:

Traceback (most recent call last):
  File "/home/cat/PycharmProjects/torch-ocr/tools/train/det_train/train.py", line 59, in <module>
    trainer.fit(model, data)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
    self._dispatch()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in run_stage
    return self._run_train()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1312, in _run_train
    self.fit_loop.run()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 232, in advance
    self.trainer.logger_connector.update_train_step_metrics()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 225, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 121, in log_metrics
    self.trainer.logger.save()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 427, in save
    logger.save()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 317, in save
    self._finalize_agg_metrics()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 152, in _finalize_agg_metrics
    self.log_metrics(metrics=metrics_to_log, step=agg_step)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 370, in log_metrics
    self.experiment.log({**metrics, "trainer/global_step": step})
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 43, in experiment
    return get_experiment() or DummyExperiment()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in get_experiment
    return fn(self)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 349, in experiment
    self._experiment.define_metric("trainer/global_step")
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 2195, in define_metric
    m._commit()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_metric.py", line 117, in _commit
    self._callback(m)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 933, in _metric_callback
    self._backend.interface._publish_metric(metric_record)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 309, in _publish_metric
    self._publish(rec)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 223, in _publish
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

wandb: Waiting for W&B process to finish, PID 7468... (failed 1). Press ctrl-c to abort syncing.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1671, in _atexit_cleanup
    self._on_finish()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1844, in _on_finish
    self._backend.interface._publish_telemetry(self._telemetry_obj)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 82, in _publish_telemetry
    self._publish(rec)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 223, in _publish
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1680, in _atexit_cleanup
    self._backend.cleanup()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 228, in cleanup
    self.interface.join()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 481, in join
    super(InterfaceQueue, self).join()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 591, in join
    self._communicate_shutdown()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 478, in _communicate_shutdown
    _ = self._communicate(record)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 232, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 237, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

Process finished with exit code 1
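
For readers following the traceback: every failure above ends in the same check in `interface_queue.py`, which refuses to queue a record once the wandb backend process is no longer running. The sketch below is a simplified stand-in for that pattern, not wandb's actual code. `FakeBackendProcess`, `SketchInterface`, and `demo` are hypothetical names introduced only for illustration, and the liveness check is an assumption about how the guard behaves, inferred from the error message.

```python
import queue


class FakeBackendProcess:
    """Hypothetical stand-in for the backend child process."""

    def __init__(self):
        self._alive = True

    def is_alive(self):
        return self._alive

    def kill(self):
        # Simulate the backend process exiting for whatever reason.
        self._alive = False


class SketchInterface:
    """Illustrative only: checks backend liveness before queueing a record."""

    def __init__(self, process):
        self._process = process
        self.record_q = queue.Queue()

    def publish(self, record):
        # Assumed guard: refuse to queue anything once the backend is gone.
        if not self._process.is_alive():
            raise Exception("The wandb backend process has shutdown")
        self.record_q.put(record)


def demo():
    backend = FakeBackendProcess()
    interface = SketchInterface(backend)
    interface.publish({"loss": 0.5})  # backend alive, record is queued
    backend.kill()
    try:
        interface.publish({"loss": 0.4})  # backend gone, guard raises
    except Exception as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```

If this reading is right, the exception during `trainer.fit` is a symptom rather than the cause: the backend had already exited before `define_metric` was called, so the reason for that earlier exit is what needs to be found.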

To Reproduce

Expected behavior

Environment

  • PyTorch Lightning Version: 1.5.1
  • PyTorch Version: 1.8
  • Python version: 3.8
  • OS: Ubuntu 20
  • CUDA/cuDNN version: CUDA 11.2
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): conda
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire

Labels: bug, logger: wandb