
Conversation

@wanchaol
Collaborator

As titled, we don't want to allow the steps == -1 case, as it would blow up the lr scheduler.
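For context, the failure mode is roughly the following (a minimal sketch assuming a linear warmup-then-decay schedule derived from the total step count; the function name and schedule shape are assumptions, not the actual torchtitan code):

```python
import torch

def build_lr_lambda(training_steps: int, warmup_steps: int):
    # Guard against steps == -1 (or any non-positive value): the warmup
    # fraction and decay window below become meaningless otherwise.
    if training_steps <= 0:
        raise ValueError(f"training_steps must be positive, got {training_steps}")
    decay_steps = max(1, training_steps - warmup_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return (step + 1) / warmup_steps                     # linear warmup to 1.0
        return max(0.0, (training_steps - step) / decay_steps)   # linear decay to 0

    return lr_lambda

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, build_lr_lambda(training_steps=10, warmup_steps=2)
)
```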

facebook-github-bot added the CLA Signed label on Feb 28, 2024
batch_size = 8
seq_len = 2048
warmup_pct = 0.20 # lr scheduler warm up
warmup_steps = 5 # lr scheduler warm up
Contributor

nit - don't we want 2 here if the default steps is 10 (i.e. 20%, rather than making it 50%)?

Contributor

Shall we add a comment here suggesting the pct should be around 20%? A user might have no idea how many steps to warm up when changing the total number of training steps.

Collaborator Author

Sounds good, I'll update. Is 20% already the rule of thumb here? I thought this was more of a tuning thing?

Contributor

It's definitely a tuning thing - I did 20% to try to be safe because we are doing such short runs by default (10 steps!) while still showing some reasonable learning gains.
A more rigorous approach would be 2/(1−β2) training iterations of warmup with the current linear scheduling, which is 2000 iterations assuming default AdamW (see https://arxiv.org/abs/1910.04209).

The issue here is that we want to show a reasonable curve in the short term, and doing that in 10 steps or even 100 is very different from the more rigorous approach above (which simplifies to 2k warmup iterations). We could even plug it into the lr schedule directly if we wanted to get fancy and be more exact.
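For concreteness, with AdamW's default β2 = 0.999 the rule of thumb works out to 2 / (1 − 0.999) = 2000 warmup iterations, versus the 2 steps that 20% of a 10-step run gives. A quick sketch of both calculations (the helper names are made up for illustration):

```python
def warmup_steps_from_pct(training_steps: int, warmup_pct: float) -> int:
    # percentage-based warmup, as used for the short default runs
    return max(1, int(training_steps * warmup_pct))

def warmup_steps_from_beta2(beta2: float) -> int:
    # rule of thumb from https://arxiv.org/abs/1910.04209: ~2 / (1 - beta2) iterations
    return int(2 / (1 - beta2))

print(warmup_steps_from_pct(10, 0.20))   # 2    -> what the default config implies
print(warmup_steps_from_beta2(0.999))    # 2000 -> rule of thumb for default AdamW
```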

Contributor

@tianyu-l left a comment

Please fix the CI errors:

  1. apply pre-commit
  2. modify default training_steps in the test file
  3. the GPU test failed because HF is down and the dataset cannot be downloaded ... I guess we don't expect this to happen often? But for tests it may make sense to use a fake dataset, e.g. something like the sketch below.
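For illustration, a fake dataset along those lines could be as simple as random token ids, so CI never touches the network (a rough sketch; FakeTokenDataset and the shapes here are hypothetical, not torchtitan's actual test utilities):

```python
import torch
from torch.utils.data import Dataset

class FakeTokenDataset(Dataset):
    """Random token ids as a stand-in for the HF dataset in tests."""

    def __init__(self, vocab_size: int = 32000, seq_len: int = 2048, size: int = 64):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, idx: int):
        tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
        return tokens[:-1], tokens[1:]  # inputs and next-token labels
```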


wanchaol merged commit 6d5a71e into main Feb 29, 2024
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024