Conversation

@tianyu-l tianyu-l commented Feb 16, 2024

Stack from ghstack (oldest at bottom):

[Image: Screenshot 2024-02-15 at 5 19 09 PM]

tianyu-l added a commit that referenced this pull request Feb 16, 2024
ghstack-source-id: 4cf9b3a
Pull Request resolved: #60
@facebook-github-bot added the CLA Signed label Feb 16, 2024
This was linked to issues Feb 16, 2024
train.py Outdated
- "global_avg_loss": global_avg_loss,
- "global_max_loss": global_max_loss,
+ "loss/global_avg": global_avg_loss,
+ "loss/global_max": global_max_loss,
Contributor

Nit: using the / here is confusing to me. I read it as the loss divided by the global avg, and likewise for max.
Maybe consider just an _ or : or even :: as the separator, e.g. loss:global_avg, loss::global_max, memory_current_active?

Contributor Author

Oh, good point. I use [tag]/[metric] here because TensorBoard (TB) places plots that share the same [tag] together in a row, so that they form a visual group. As in the picture in the PR summary, memory metrics are grouped under memory_current and memory_peak. I'll look for a way to keep this grouping without the ambiguity for losses.
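
The grouping described above can be sketched without TensorBoard itself. This is only an illustration: it assumes the dashboard groups scalar tags by the text before the first /, and group_by_tag is a hypothetical helper that is neither part of this PR nor of TensorBoard. The memory names are made up for the example.

```python
from collections import defaultdict


def group_by_tag(names):
    """Group metric names by the text before the first "/" (assumed TB behaviour)."""
    groups = defaultdict(list)
    for name in names:
        # A name without a "/" ends up in a group of its own.
        tag, _, _metric = name.partition("/")
        groups[tag].append(name)
    return dict(groups)


# Illustrative names only; the memory entries are assumptions for this sketch.
names = [
    "loss/global_avg",
    "loss/global_max",
    "memory_current/active",
    "memory_peak/active",
]
grouped = group_by_tag(names)
for tag, members in grouped.items():
    print(tag, "->", members)
```

Under that assumption, both loss plots land in one group, while a flat name such as global_avg_loss would form a group by itself.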

Contributor Author

@tianyu-l tianyu-l Feb 17, 2024

I did some exploration, e.g. I tried putting related metrics into a single plot. The options we have are add_scalars and add_custom_scalars, and neither seems ideal. I'm changing loss/global_avg to loss_metrics/global_avg for now to make it less ambiguous.
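
A minimal sketch of the rename, for illustration only: with_group is a hypothetical helper (not in this PR) that prefixes each metric name with a group tag, and the loss values are placeholders. The resulting keys are the kind of string that would be passed as the tag argument of a call such as SummaryWriter.add_scalar.

```python
def with_group(group, metrics):
    """Return a copy of `metrics` whose keys are prefixed as "<group>/<name>"."""
    return {f"{group}/{name}": value for name, value in metrics.items()}


# Placeholder values; in train.py these would be the reduced losses.
losses = {"global_avg": 2.5, "global_max": 3.0}

# "loss/..." can be read as a division; "loss_metrics/..." keeps the grouping
# while making it clearer that the prefix is a group name.
tagged = with_group("loss_metrics", losses)
print(tagged)
```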

Contributor

@lessw2020 lessw2020 left a comment

Looks great, thanks for integrating these stats!
One very minor nit: the / in the labels could be mistaken for division.

tianyu-l added a commit that referenced this pull request Feb 17, 2024
ghstack-source-id: da7e02b
Pull Request resolved: #60
@tianyu-l tianyu-l merged commit b77c89f into gh/tianyu-l/1/base Feb 17, 2024
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch February 17, 2024 01:42
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: da7e02b
Pull Request resolved: #60
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: da7e02b
Pull Request resolved: pytorch#60
Labels

CLA Signed This label is managed by the Meta Open Source bot.

Development

Successfully merging this pull request may close these issues.

Add Tensorboard
Add metrics to collect during training

4 participants