From 34ae3dca703d10bc845f9e0f2b610cc30a3db43b Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:47:35 -0700
Subject: [PATCH 01/10] add memory_profiler readme

---
 README.md               |  1 +
 docs/memory_profiler.md | 10 ++++++++++
 2 files changed, 11 insertions(+)
 create mode 100644 docs/memory_profiler.md

diff --git a/README.md b/README.md
index b8119a3085..434e1bc32e 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots.
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
new file mode 100644
index 0000000000..3e36d2c535
--- /dev/null
+++ b/docs/memory_profiler.md
@@ -0,0 +1,10 @@
+## Enable Memory Profiling
+
+Launch training job with the following command (or alternatively set configs in toml files)
+```
+CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
+```
+* `--profiling.enable_memory_snapshot`: enable memory snapshot
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output foloder, default to be `memory_snapshot`.
+ + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.

From 78b68db6d88c4cfdc0fc6adb676df3655a1c56be Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:49:22 -0700
Subject: [PATCH 02/10] add memory_profiler to README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 434e1bc32e..2d36c794c4 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots.
+13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
From 8879836cf5058459d700fe7c8f1882a37d532777 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:52:55 -0700
Subject: [PATCH 03/10] typo

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 3e36d2c535..a7e5056ac3 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,6 +5,6 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output foloder, default to be `memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.

From 44c5be14444b4dbb8100d82ae7219a74905af8d9 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:15:03 -0700
Subject: [PATCH 04/10] move memory profiler introduction, add readme on visualization

---
 README.md               | 3 +--
 docs/memory_profiler.md | 3 +++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 2d36c794c4..034fec0d35 100644
--- a/README.md
+++ b/README.md
@@ -35,14 +35,13 @@ You may want to see how the model is defined or how parallelism techniques are a
 3. Selective layer and operator activation checkpointing
 4. Distributed checkpointing (including async checkpointing)
 5. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
-6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard)
+6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard), dump [memory snapshots](docs/memory_profiler.md)
 7. Learning rate scheduler, meta-init, optional Fused RMSNorm
 8. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 9. `torch.compile` support
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index a7e5056ac3..d7e4275ce2 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -8,3 +8,6 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
 * `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
+
+Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
+To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From 9186158aaa946459d56ef264ec4596a9ed7bbc3c Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:26:33 -0700
Subject: [PATCH 05/10] combine debugging tools together in README

---
 README.md               | 3 ++-
 docs/memory_profiler.md | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 034fec0d35..00c6842d68 100644
--- a/README.md
+++ b/README.md
@@ -35,13 +35,14 @@ You may want to see how the model is defined or how parallelism techniques are a
 3. Selective layer and operator activation checkpointing
 4. Distributed checkpointing (including async checkpointing)
 5. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
-6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard), dump [memory snapshots](docs/memory_profiler.md)
+6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard)
 7. Learning rate scheduler, meta-init, optional Fused RMSNorm
 8. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 9. `torch.compile` support
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), flight recorder(#debugging)
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index d7e4275ce2..1bf620ea87 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,7 +5,7 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `./outputs/memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
 

From 3a3ac97be71d8422492793cfc5ed91009c8d0e09 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:28:14 -0700
Subject: [PATCH 06/10] fix a link bug

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 00c6842d68..94b41cb1d6 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), flight recorder(#debugging)
+13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [flight recorder](#debugging)
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
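At this point in the series, the doc describes both kinds of dumps: a snapshot on every `profile_freq`-th iteration, and one on OOM. Under the hood these come from PyTorch's allocator history recorder, described in the tutorial the doc links. Below is a minimal sketch of that mechanism — not torchtitan's actual implementation; the constants, loop, and file name are illustrative assumptions:

```python
import os

import torch

# Illustrative stand-ins for torchtitan's profiling config; the real values
# come from the CLI flags / toml file discussed in the patches above.
PROFILE_FREQ = 10                           # like profiling.profile_freq
SNAPSHOT_DIR = "./outputs/memory_snapshot"  # like save_memory_snapshot_folder
MAX_ENTRIES = 100_000                       # alloc/free events to retain

# Start recording allocator events (private PyTorch API shown in the
# "Understanding GPU Memory" tutorial linked in the doc).
torch.cuda.memory._record_memory_history(max_entries=MAX_ENTRIES)

def dump_snapshot(step: int, on_oom: bool = False) -> None:
    """Write the recorded history into a per-iteration subfolder."""
    name = f"iteration_{step}_exit" if on_oom else f"iteration_{step}"
    folder = os.path.join(SNAPSHOT_DIR, name)
    os.makedirs(folder, exist_ok=True)
    torch.cuda.memory._dump_snapshot(os.path.join(folder, "memory_snapshot.pickle"))

for step in range(1, 31):
    try:
        ...  # forward / backward / optimizer step goes here
        if step % PROFILE_FREQ == 0:
            dump_snapshot(step)           # regular snapshot: iteration_x
    except torch.cuda.OutOfMemoryError:
        dump_snapshot(step, on_oom=True)  # OOM snapshot: iteration_x_exit
        raise
```

As for the "(or alternatively set configs in toml files)" remark in the doc: assuming torchtitan's usual `--section.key` flag-to-toml mapping, the same options would be set as `enable_memory_snapshot = true` and `save_memory_snapshot_folder = "memory_snapshot"` under the `[profiling]` table of the training config.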
From 3403ba9e9e6a61325c99cd803b8aa518aeb5fd7d Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:30:54 -0700
Subject: [PATCH 07/10] add details in the dump folder

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 1bf620ea87..b81b083a89 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,7 +5,7 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `./outputs/memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
 

From 8d1b93af664002530365b3b8030289ba6c5c4b8b Mon Sep 17 00:00:00 2001
From: mori360
Date: Wed, 9 Oct 2024 16:31:51 -0700
Subject: [PATCH 08/10] typo

---
 docs/memory_profiler.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index b81b083a89..dc596e886c 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -6,8 +6,8 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
 * `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
- + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
- + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
+ + If in case of OOMs, output folder is `memory_snapshot/iteration_x_exit`.
+ + If regularly according to `profile_freq`, output folder is `memory_snapshot/iteration_x`.
 
 Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
 To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From 6865efd3be0a24580ac215ed4fe0ded746f95df7 Mon Sep 17 00:00:00 2001
From: mori360
Date: Wed, 9 Oct 2024 18:34:09 -0700
Subject: [PATCH 09/10] polish expression

---
 README.md               |  2 +-
 docs/memory_profiler.md | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 94b41cb1d6..48ac1b2582 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [flight recorder](#debugging)
+13. Debugging tools including CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [Flight Recorder](#debugging), etc.
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index dc596e886c..9107d3d1f4 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -2,12 +2,12 @@
 
 Launch training job with the following command (or alternatively set configs in toml files)
 ```
-CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
+CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder memory_snapshot
 ```
-* `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
- + If in case of OOMs, output folder is `memory_snapshot/iteration_x_exit`.
- + If regularly according to `profile_freq`, output folder is `memory_snapshot/iteration_x`.
+* `--profiling.enable_memory_snapshot`: to enable memory profiling
+* `--profiling.save_memory_snapshot_folder`: configures the folder which memory snapshots are dumped into (`./outputs/memory_snapshot/` by default)
+ + In case of OOMs, the snapshots will be in `./outputs/memory_snapshot/iteration_x_exit`.
+ + Regular snapshots (taken every `profiling.profile_freq` iterations) will be in `memory_snapshot/iteration_x`.
 
 Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
-To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).
+To visualize a snapshot file, you can drag and drop it to <https://pytorch.org/memory_viz>. To learn more details on memory profiling, please visit this [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From b2c5813ed1f4970ef94a362c1e62e0f16de3bab0 Mon Sep 17 00:00:00 2001
From: mori360
Date: Thu, 10 Oct 2024 11:35:39 -0700
Subject: [PATCH 10/10] remove duplicated expression

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 9107d3d1f4..d73ecaf97f 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -9,5 +9,5 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
  + In case of OOMs, the snapshots will be in `./outputs/memory_snapshot/iteration_x_exit`.
  + Regular snapshots (taken every `profiling.profile_freq` iterations) will be in `memory_snapshot/iteration_x`.
 
-Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
+You can find the saved pickle files in your output folder.
 To visualize a snapshot file, you can drag and drop it to <https://pytorch.org/memory_viz>. To learn more details on memory profiling, please visit this [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).
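To round out the visualization step the final doc describes: besides dragging a pickle onto <https://pytorch.org/memory_viz>, the linked tutorial also shows rendering the HTML offline with PyTorch's private `torch.cuda._memory_viz` helper. A minimal sketch, assuming a snapshot produced by a run like the one above (the path and file name are hypothetical, and the `trace_plot` helper is a private API that may change between PyTorch versions):

```python
import pickle

from torch.cuda._memory_viz import trace_plot  # private helper shipped with PyTorch

# Hypothetical path -- substitute the pickle file your own run produced.
snapshot_path = "./outputs/memory_snapshot/iteration_10/memory_snapshot.pickle"

with open(snapshot_path, "rb") as f:
    snapshot = pickle.load(f)

# trace_plot returns a self-contained HTML page showing the allocator
# timeline, the same view https://pytorch.org/memory_viz renders in-browser.
with open("memory_trace.html", "w") as f:
    f.write(trace_plot(snapshot))
```

Open `memory_trace.html` in any browser to inspect the allocation timeline without uploading the snapshot anywhere.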