Fetch GPU stats using torch.cuda.memory_stats #8780

@edward-io

Description

🚀 Feature

(as discussed in #7518)

Gather GPU stats using torch.cuda.memory_stats instead of nvidia-smi for GPUStatsMonitor.

Motivation

Some machines do not have nvidia-smi installed, so users on those machines are currently unable to gather data with the GPUStatsMonitor callback, which is useful for detecting OOMs and debugging their models.

Pitch

For users on PyTorch >= 1.8.0, use torch.cuda.memory_stats to gather memory statistics instead of invoking the nvidia-smi binary.
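As a rough sketch of what this could look like: torch.cuda.memory_stats returns a flat dict of counters (key names such as "allocated_bytes.all.current" and "num_ooms" are documented in torch.cuda.memory_stats). The helper name and the metric mapping below are hypothetical, not the actual GPUStatsMonitor implementation:

```python
def extract_memory_metrics(stats: dict) -> dict:
    """Map a torch.cuda.memory_stats()-style dict to flat logging metrics.

    `stats` is the dict returned by torch.cuda.memory_stats(device);
    byte counters are converted to MB for logging.
    """
    bytes_to_mb = 1.0 / (1024 ** 2)
    return {
        # current bytes allocated by tensors on the device
        "memory.allocated_mb": stats.get("allocated_bytes.all.current", 0) * bytes_to_mb,
        # current bytes reserved by the caching allocator
        "memory.reserved_mb": stats.get("reserved_bytes.all.current", 0) * bytes_to_mb,
        # number of out-of-memory errors seen so far (useful for OOM debugging)
        "memory.num_ooms": stats.get("num_ooms", 0),
    }
```

On a CUDA machine this would be fed with `torch.cuda.memory_stats(device_index)` inside the callback, replacing the nvidia-smi subprocess call.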

Some fields (fan_speed, temperature) that are logged by GPUStatsMonitor are not available from torch.cuda.memory_stats. We can either 1) fall back to nvidia-smi when a user requests those fields, or 2) remove those fields if they aren't used anywhere and are no longer considered useful.
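Option 1 amounts to a small backend-selection step. A minimal sketch, assuming a hypothetical `choose_backend` helper (the field names and the version cutoff are taken from this proposal; the function itself is illustrative, not existing Lightning code):

```python
import shutil

# Fields this proposal notes are NOT exposed by torch.cuda.memory_stats
NON_MEMORY_FIELDS = {"fan_speed", "temperature"}


def choose_backend(torch_version: str, requested_fields: set) -> str:
    """Pick the stats source: torch.cuda.memory_stats when it suffices,
    otherwise fall back to the nvidia-smi binary if it is installed."""
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    # torch.cuda.memory_stats covers the request only for memory-style fields
    if (major, minor) >= (1, 8) and not (requested_fields & NON_MEMORY_FIELDS):
        return "torch"
    # fan speed / temperature (or an old torch) still need nvidia-smi
    if shutil.which("nvidia-smi") is not None:
        return "nvidia-smi"
    raise RuntimeError(
        "Requested GPU stats require nvidia-smi, which is not installed"
    )
```

Under this scheme, machines without nvidia-smi still get memory stats (and OOM counts) as long as they run torch >= 1.8.0 and don't ask for fan_speed or temperature.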

Alternatives

Additional context



Metadata

Labels

feature (Is an improvement or enhancement), help wanted (Open to be worked on)
