docs/source-pytorch/accelerators/hpu_intermediate.rst (+31 lines: 31 additions & 0 deletions)
@@ -66,3 +66,34 @@ This enables advanced users to provide their own BF16 and FP32 operator list ins
     trainer.fit(model, datamodule=dm)
 
 For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`__.
+
+----
+
+Enabling DeviceStatsMonitor with HPUs
+----------------------------------------
+
+:class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is a callback that automatically monitors and logs device stats during the training stage.
+This callback can be passed to the Trainer when training with HPUs. It returns a map of the following metrics, with their values in bytes of type ``uint64``:
+
+- **Limit**: total amount of memory on the HPU device.
+- **InUse**: amount of memory allocated at any instant.
+- **MaxInUse**: total amount of active memory allocated.
+- **NumAllocs**: number of allocations.
+- **NumFrees**: number of freed chunks.
+- **ActiveAllocs**: number of active allocations.
+- **MaxAllocSize**: maximum allocated size.
+- **TotalSystemAllocs**: total number of system allocations.
+- **TotalSystemFrees**: total number of system frees.
+- **TotalActiveAllocs**: total number of active allocations.
+
+The snippet below shows how ``DeviceStatsMonitor`` can be enabled:
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.callbacks import DeviceStatsMonitor
+
+    device_stats = DeviceStatsMonitor()
+    trainer = Trainer(accelerator="hpu", callbacks=[device_stats])
+
+For more details, please refer to `Memory Stats APIs <https://docs.habana.ai/en/v1.5.0/PyTorch/PyTorch_User_Guide/Python_Packages.html#memory-stats-apis>`__.
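To make the metric map above concrete, here is a small sketch of how such stats could be consumed once logged. ``fake_hpu_memory_stats`` is a hypothetical stand-in (not a real Habana or Lightning API); on actual hardware these values come from the device via the Memory Stats APIs linked above.

```python
# Sketch: working with the HPU memory-stats map described above.
# `fake_hpu_memory_stats` is a hypothetical stand-in for the device query;
# the keys and byte-valued semantics mirror the documented metric map.


def fake_hpu_memory_stats():
    # All values are in bytes (uint64 on the device).
    return {
        "Limit": 32 * 1024**3,  # total memory on the HPU device
        "InUse": 8 * 1024**3,  # memory allocated at this instant
        "MaxInUse": 12 * 1024**3,  # active memory allocated
        "NumAllocs": 120,
        "NumFrees": 100,
        "ActiveAllocs": 20,
    }


def memory_utilization(stats):
    """Fraction of total device memory currently in use."""
    return stats["InUse"] / stats["Limit"]


stats = fake_hpu_memory_stats()
print(f"HPU memory utilization: {memory_utilization(stats):.0%}")  # → 25%
```

A monitoring dashboard would typically plot ``InUse`` against ``Limit`` over time and alert when the ratio approaches 1.0.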
src/pytorch_lightning/CHANGELOG.md (+3 lines: 3 additions & 0 deletions)
@@ -111,6 +111,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Added support for async checkpointing ([#13658](https://github.com/Lightning-AI/lightning/pull/13658))
 
+- Added support for HPU Device stats monitor ([#13819](https://github.com/Lightning-AI/lightning/pull/13819))
+
+
 ### Changed
 
 - `accelerator="gpu"` now automatically selects an available GPU backend (CUDA and MPS currently) ([#13642](https://github.com/Lightning-AI/lightning/pull/13642))