🚀 Feature
(as discussed in #7518)
Gather GPU stats using torch.cuda.memory_stats instead of nvidia-smi for GPUStatsMonitor.
Motivation
Some machines do not have nvidia-smi installed, so they are currently unable to gather data with the GPUStatsMonitor callback, which is useful for detecting OOMs and debugging models.
Pitch
For users on PyTorch >= 1.8.0, use torch.cuda.memory_stats to gather memory data instead of invoking the nvidia-smi binary.
Some fields logged by GPUStatsMonitor (fan_speed, temperature) are not available from torch.cuda.memory_stats. We can either 1) fall back to nvidia-smi when a user requests those fields, or 2) remove those fields if they aren't used anywhere and are no longer considered useful.
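A minimal sketch of what the memory_stats-based collection could look like is below; the callback name CudaMemoryStatsMonitor, the hook used, and the logged metric keys are illustrative assumptions, not the existing GPUStatsMonitor implementation:

```python
import torch
from pytorch_lightning.callbacks import Callback


class CudaMemoryStatsMonitor(Callback):
    """Hypothetical callback: log per-GPU memory figures from torch.cuda.memory_stats."""

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if not torch.cuda.is_available() or trainer.logger is None:
            return
        metrics = {}
        for idx in range(torch.cuda.device_count()):
            stats = torch.cuda.memory_stats(device=idx)
            # The caching allocator exposes counters such as
            # "allocated_bytes.all.current" and "reserved_bytes.all.current" (bytes);
            # fan speed and temperature are not available here.
            metrics[f"gpu_{idx}/memory.allocated_mib"] = stats.get("allocated_bytes.all.current", 0) / 2**20
            metrics[f"gpu_{idx}/memory.reserved_mib"] = stats.get("reserved_bytes.all.current", 0) / 2**20
        trainer.logger.log_metrics(metrics, step=trainer.global_step)
```

It would be registered like any other callback, e.g. `Trainer(callbacks=[CudaMemoryStatsMonitor()])`, avoiding any dependency on the nvidia-smi binary for the memory fields.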
Alternatives
Additional context
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.
- Bolts: Pretrained SOTA deep learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.