@@ -32,21 +32,21 @@ run the llama debug model locally to verify the setup is correct:

# TensorBoard

- To visualize training metrics on TensorBoard:
+ To visualize TensorBoard metrics of models trained on a remote server via a local web browser:

- 1. (by default) set `enable_tensorboard = true` in `torchtrain/train_configs/train_config.toml`
+ 1. Make sure `metrics.enable_tensorboard` option is set to true in model training (either from a .toml file or from CLI).

- 2. set up SSH tunneling
+ 2. Set up SSH tunneling, by running the following from local CLI
```
ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
```

- 3. then in the torchtrain repo
+ 3. Inside the SSH tunnel that logged into the remote server, go to the torchtrain repo, and start the TensorBoard backend
```
tensorboard --logdir=./torchtrain/outputs/tb
```

- 4. go to the URL it provides OR to http://localhost:6006/
+ 4. In the local web browser, go to the URL it provides OR to http://localhost:6006/.

## Multi-Node Training
For training on ParallelCluster/Slurm type configurations, you can use the multinode_trainer.slurm file to submit your sbatch job.</br>
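
Note on the TensorBoard steps in the hunk above: steps 2 to 4 only work together if the forwarded port, the port TensorBoard listens on, and the port in the browser URL all agree (6006 is TensorBoard's default). The following is a minimal sketch of that relationship, not part of the change itself. The variable names `REMOTE_USER`, `REMOTE_HOST` and `TB_PORT` are hypothetical placeholders, and the script only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own login, host, and port.
REMOTE_USER="username"
REMOTE_HOST="hostname"
TB_PORT="6006"   # TensorBoard's default port

# Step 2 (local machine): forward local TB_PORT to the same port on the remote host.
tunnel_cmd="ssh -L ${TB_PORT}:127.0.0.1:${TB_PORT} ${REMOTE_USER}@${REMOTE_HOST}"

# Step 3 (remote shell, inside the torchtrain repo): serve the logs on that port.
tb_cmd="tensorboard --logdir=./torchtrain/outputs/tb --port=${TB_PORT}"

# Step 4 (local browser): the tunnel makes the remote server reachable here.
url="http://localhost:${TB_PORT}/"

echo "local : ${tunnel_cmd}"
echo "remote: ${tb_cmd}"
echo "open  : ${url}"
```

If 6006 is already taken on either machine, changing `TB_PORT` in one place keeps the three steps consistent.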