From 34ae3dca703d10bc845f9e0f2b610cc30a3db43b Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:47:35 -0700
Subject: [PATCH 01/10] add memory_profiler readme

---
 README.md               |  1 +
 docs/memory_profiler.md | 10 ++++++++++
 2 files changed, 11 insertions(+)
 create mode 100644 docs/memory_profiler.md

diff --git a/README.md b/README.md
index b8119a3085..434e1bc32e 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots.
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
new file mode 100644
index 0000000000..3e36d2c535
--- /dev/null
+++ b/docs/memory_profiler.md
@@ -0,0 +1,10 @@
+## Enable Memory Profiling
+
+Launch training job with the following command (or alternatively set configs in toml files)
+```
+CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
+```
+* `--profiling.enable_memory_snapshot`: enable memory snapshot
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output foloder, default to be `memory_snapshot`.
+ + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
+ + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.

From 78b68db6d88c4cfdc0fc6adb676df3655a1c56be Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:49:22 -0700
Subject: [PATCH 02/10] add memory_profiler to README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 434e1bc32e..2d36c794c4 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots.
+13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
From 8879836cf5058459d700fe7c8f1882a37d532777 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 10:52:55 -0700
Subject: [PATCH 03/10] typo

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 3e36d2c535..a7e5056ac3 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,6 +5,6 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output foloder, default to be `memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.

From 44c5be14444b4dbb8100d82ae7219a74905af8d9 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:15:03 -0700
Subject: [PATCH 04/10] move memory profiler introduction, add readme on visualization

---
 README.md               | 3 +--
 docs/memory_profiler.md | 3 +++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 2d36c794c4..034fec0d35 100644
--- a/README.md
+++ b/README.md
@@ -35,14 +35,13 @@ You may want to see how the model is defined or how parallelism techniques are a
 3. Selective layer and operator activation checkpointing
 4. Distributed checkpointing (including async checkpointing)
 5. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
-6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard)
+6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard), dump [memory snapshots](docs/memory_profiler.md)
 7. Learning rate scheduler, meta-init, optional Fused RMSNorm
 8. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 9. `torch.compile` support
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. [Memory profiler](docs/memory_profiler.md) dumps memory snapshots
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index a7e5056ac3..d7e4275ce2 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -8,3 +8,6 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
 * `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
+
+Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
+To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From 9186158aaa946459d56ef264ec4596a9ed7bbc3c Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:26:33 -0700
Subject: [PATCH 05/10] combine debugging tools together in README

---
 README.md               | 3 ++-
 docs/memory_profiler.md | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 034fec0d35..00c6842d68 100644
--- a/README.md
+++ b/README.md
@@ -35,13 +35,14 @@ You may want to see how the model is defined or how parallelism techniques are a
 3. Selective layer and operator activation checkpointing
 4. Distributed checkpointing (including async checkpointing)
 5. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
-6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard), dump [memory snapshots](docs/memory_profiler.md)
+6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via [TensorBoard](#tensorboard)
 7. Learning rate scheduler, meta-init, optional Fused RMSNorm
 8. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 9. `torch.compile` support
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), flight recorder(#debugging)
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index d7e4275ce2..1bf620ea87 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,7 +5,7 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `./outputs/memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
 

From 3a3ac97be71d8422492793cfc5ed91009c8d0e09 Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:28:14 -0700
Subject: [PATCH 06/10] fix a link bug

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 00c6842d68..94b41cb1d6 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), flight recorder(#debugging)
+13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [flight recorder](#debugging)
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
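At this point in the series, the doc describes both kinds of dumps: a snapshot on every `profile_freq`-th iteration, and one on OOM. Under the hood these come from PyTorch's allocator history recorder, described in the tutorial the doc links. Below is a minimal sketch of that mechanism — not torchtitan's actual implementation; the constants, loop, and file name are illustrative assumptions:

```python
import os

import torch

# Illustrative stand-ins for torchtitan's profiling config; the real values
# come from the CLI flags / toml file discussed in the patches above.
PROFILE_FREQ = 10                           # like profiling.profile_freq
SNAPSHOT_DIR = "./outputs/memory_snapshot"  # like save_memory_snapshot_folder
MAX_ENTRIES = 100_000                       # alloc/free events to retain

# Start recording allocator events (private PyTorch API shown in the
# "Understanding GPU Memory" tutorial linked in the doc).
torch.cuda.memory._record_memory_history(max_entries=MAX_ENTRIES)

def dump_snapshot(step: int, on_oom: bool = False) -> None:
    """Write the recorded history into a per-iteration subfolder."""
    name = f"iteration_{step}_exit" if on_oom else f"iteration_{step}"
    folder = os.path.join(SNAPSHOT_DIR, name)
    os.makedirs(folder, exist_ok=True)
    torch.cuda.memory._dump_snapshot(os.path.join(folder, "memory_snapshot.pickle"))

for step in range(1, 31):
    try:
        ...  # forward / backward / optimizer step goes here
        if step % PROFILE_FREQ == 0:
            dump_snapshot(step)           # regular snapshot: iteration_x
    except torch.cuda.OutOfMemoryError:
        dump_snapshot(step, on_oom=True)  # OOM snapshot: iteration_x_exit
        raise
```

As for the "(or alternatively set configs in toml files)" remark in the doc: assuming torchtitan's usual `--section.key` flag-to-toml mapping, the same options would be set as `enable_memory_snapshot = true` and `save_memory_snapshot_folder = "memory_snapshot"` under the `[profiling]` table of the training config.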
From 3403ba9e9e6a61325c99cd803b8aa518aeb5fd7d Mon Sep 17 00:00:00 2001
From: mori360
Date: Tue, 8 Oct 2024 14:30:54 -0700
Subject: [PATCH 07/10] add details in the dump folder

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 1bf620ea87..b81b083a89 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -5,7 +5,7 @@ Launch training job with the following command (or alternatively set configs in
 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default to be `./outputs/memory_snapshot`.
+* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
  + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
  + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
 

From 8d1b93af664002530365b3b8030289ba6c5c4b8b Mon Sep 17 00:00:00 2001
From: mori360
Date: Wed, 9 Oct 2024 16:31:51 -0700
Subject: [PATCH 08/10] typo

---
 docs/memory_profiler.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index b81b083a89..dc596e886c 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -6,8 +6,8 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
 ```
 * `--profiling.enable_memory_snapshot`: enable memory snapshot
 * `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
- + If in case of OOMs. output folder is `memory_snapshot/iteration_x_exit`.
- + If regularly according to `profile_freq`. output folder is `memory_snapshot/iteration_x`.
+ + If in case of OOMs, output folder is `memory_snapshot/iteration_x_exit`.
+ + If regularly according to `profile_freq`, output folder is `memory_snapshot/iteration_x`.
 
 Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
 To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From 6865efd3be0a24580ac215ed4fe0ded746f95df7 Mon Sep 17 00:00:00 2001
From: mori360
Date: Wed, 9 Oct 2024 18:34:09 -0700
Subject: [PATCH 09/10] polish expression

---
 README.md               |  2 +-
 docs/memory_profiler.md | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 94b41cb1d6..48ac1b2582 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ You may want to see how the model is defined or how parallelism techniques are a
 10. DDP and HSDP
 11. All options easily configured via [toml files](train_configs/)
 12. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
-13. CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [flight recorder](#debugging)
+13. Debugging tools including CPU/GPU profiling, [memory profiling](docs/memory_profiler.md), [Flight Recorder](#debugging), etc.
 
 We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
 
diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index dc596e886c..9107d3d1f4 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -2,12 +2,12 @@
 
 Launch training job with the following command (or alternatively set configs in toml files)
 ```
-CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder output_folder
+CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.enable_memory_snapshot --profiling.save_memory_snapshot_folder memory_snapshot
 ```
-* `--profiling.enable_memory_snapshot`: enable memory snapshot
-* `--profiling.save_memory_snapshot_folder`: dump memory snapshots in to output folder, default under your output folder to be `./outputs/memory_snapshot`.
- + If in case of OOMs, output folder is `memory_snapshot/iteration_x_exit`.
- + If regularly according to `profile_freq`, output folder is `memory_snapshot/iteration_x`.
+* `--profiling.enable_memory_snapshot`: to enable memory profiling
+* `--profiling.save_memory_snapshot_folder`: configures the folder which memory snapshots are dumped into (`./outputs/memory_snapshot/` by default)
+ + In case of OOMs, the snapshots will be in `./outputs/memory_snapshot/iteration_x_exit`.
+ + Regular snapshots (taken every `profiling.profile_freq` iterations) will be in `memory_snapshot/iteration_x`.
 
 Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
-To visualize the snapshot file, you can utilize the `memory_viz` tool, by either dragging and dropping the snapshot into your browser or generating its HTML file, following the [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).
+To visualize a snapshot file, you can drag and drop it to <https://pytorch.org/memory_viz>. To learn more details on memory profiling, please visit this [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).

From b2c5813ed1f4970ef94a362c1e62e0f16de3bab0 Mon Sep 17 00:00:00 2001
From: mori360
Date: Thu, 10 Oct 2024 11:35:39 -0700
Subject: [PATCH 10/10] remove duplicated expression

---
 docs/memory_profiler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/memory_profiler.md b/docs/memory_profiler.md
index 9107d3d1f4..d73ecaf97f 100644
--- a/docs/memory_profiler.md
+++ b/docs/memory_profiler.md
@@ -9,5 +9,5 @@ CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --profiling.
  + In case of OOMs, the snapshots will be in `./outputs/memory_snapshot/iteration_x_exit`.
  + Regular snapshots (taken every `profiling.profile_freq` iterations) will be in `memory_snapshot/iteration_x`.
 
-Once you have dumped the memory profiler, you will find the saved pickle files in your output folder.
+You can find the saved pickle files in your output folder.
 To visualize a snapshot file, you can drag and drop it to <https://pytorch.org/memory_viz>. To learn more details on memory profiling, please visit this [tutorial](https://pytorch.org/blog/understanding-gpu-memory-1/).
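To round out the visualization step the final doc describes: besides dragging a pickle onto <https://pytorch.org/memory_viz>, the linked tutorial also shows rendering the HTML offline with PyTorch's private `torch.cuda._memory_viz` helper. A minimal sketch, assuming a snapshot produced by a run like the one above (the path and file name are hypothetical, and the `trace_plot` helper is a private API that may change between PyTorch versions):

```python
import pickle

from torch.cuda._memory_viz import trace_plot  # private helper shipped with PyTorch

# Hypothetical path -- substitute the pickle file your own run produced.
snapshot_path = "./outputs/memory_snapshot/iteration_10/memory_snapshot.pickle"

with open(snapshot_path, "rb") as f:
    snapshot = pickle.load(f)

# trace_plot returns a self-contained HTML page showing the allocator
# timeline, the same view https://pytorch.org/memory_viz renders in-browser.
with open("memory_trace.html", "w") as f:
    f.write(trace_plot(snapshot))
```

Open `memory_trace.html` in any browser to inspect the allocation timeline without uploading the snapshot anywhere.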