
CUDA memory leak after batch size finder #6570

@maxjeblick

Description


🐛 Bug

Using a transformers model (xlm-roberta-base) together with the AdamW optimizer and the batch size finder leaves ~2-3 GB of GPU memory allocated after
trainer.tune. This causes OOM errors on a subsequent call to trainer.fit.
I suspect the state of the AdamW optimizer is what keeps this memory alive.
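As a rough plausibility check (the parameter count below is an approximation, not taken from the issue): AdamW keeps two fp32 moment buffers per parameter, so its state alone is about twice the fp32 model size, which for xlm-roberta-base lands in the reported 2-3 GB range.

```python
# Back-of-the-envelope estimate of AdamW's optimizer state for xlm-roberta-base.
n_params = 278e6          # roughly the size of xlm-roberta-base (approximate)
bytes_per_param = 4       # fp32
# AdamW stores exp_avg and exp_avg_sq per parameter -> two extra fp32 copies.
state_bytes = 2 * n_params * bytes_per_param
print(f"~{state_bytes / 2**30:.1f} GiB of optimizer state")  # ~2.1 GiB
```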

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1cugaUmLzNvk-38OyV8zyT9M9xQY4LkfH#scrollTo=j4w0wizx5XxJ
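
For reference, a minimal sketch of the setup (not the exact Colab code; the dummy dataset, sequence length, and Trainer flags are illustrative assumptions):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from transformers import AutoModel

class RandomTokens(Dataset):
    # Stand-in for real text data: random token ids of fixed length.
    def __len__(self):
        return 512

    def __getitem__(self, idx):
        return torch.randint(0, 1000, (128,))

class TransformerModule(pl.LightningModule):
    def __init__(self, batch_size=8):
        super().__init__()
        self.batch_size = batch_size  # scaled in-place by the batch size finder
        self.backbone = AutoModel.from_pretrained("xlm-roberta-base")

    def training_step(self, batch, batch_idx):
        # Dummy loss; enough to trigger backward and populate AdamW state.
        return self.backbone(input_ids=batch).last_hidden_state.mean()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-5)

    def train_dataloader(self):
        return DataLoader(RandomTokens(), batch_size=self.batch_size)

model = TransformerModule()
trainer = pl.Trainer(gpus=1, auto_scale_batch_size=True, max_epochs=1)
trainer.tune(model)  # runs the batch size finder
print(torch.cuda.memory_allocated() / 2**30, "GiB still allocated")
trainer.fit(model)   # can OOM because ~2-3 GB was never released
```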

Expected behavior

GPU memory should be freed after the batch size finder (apart from the model itself, which may stay on the GPU).
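
A hedged way to check whether tuning actually released memory (assuming the trainer and model from the snippet above are still in scope); this is a diagnostic sketch, not a fix from the issue:

```python
import gc
import torch

trainer.tune(model)

# Drop unreachable Python objects, then return cached blocks to the driver.
gc.collect()
torch.cuda.empty_cache()

# If memory_allocated() is still several GB here, live tensors (e.g. AdamW's
# exp_avg / exp_avg_sq buffers) are being kept alive by lingering references,
# not merely held by the caching allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```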

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.0+cu101
    • pytorch-lightning: 1.2.4
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020


Labels

bug (Something isn't working), help wanted (Open to be worked on)
