@torronen
Contributor

@torronen torronen commented Jan 30, 2022

Suggested changes so that LightGBM results through ML.NET match those obtained through Python:

  • keep LightGBM default seed if seed has not been set
  • add mapping from NumberOfIterations to num_iterations
  • add NumberOfIterations to parameters array for LightGBM
  • change sigmoid default value to match LightGBM
  • Default Evaluation Metric to None per LightGBM default
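
To make the intent of the list concrete, here is a minimal Python sketch of the proposed defaulting behavior. It is not the ML.NET implementation; `build_lightgbm_params` and its argument names are hypothetical. The defaults shown (sigmoid 1.0, empty metric, 100 iterations) are the values discussed in this PR.

```python
from typing import Any, Dict, Optional


def build_lightgbm_params(
    seed: Optional[int] = None,
    number_of_iterations: int = 100,
    sigmoid: float = 1.0,
    metric: str = "",
) -> Dict[str, Any]:
    """Hypothetical helper: build a LightGBM-style parameter dictionary."""
    params: Dict[str, Any] = {
        # NumberOfIterations is mapped to LightGBM's num_iterations.
        "num_iterations": number_of_iterations,
        # Sigmoid follows the LightGBM default (1.0) instead of 0.5.
        "sigmoid": sigmoid,
        # An empty metric lets LightGBM pick its own default behavior.
        "metric": metric,
    }
    # Only pass a seed when the caller set one, so that LightGBM's
    # own default seed is kept otherwise.
    if seed is not None:
        params["seed"] = seed
    return params
```

Calling `build_lightgbm_params()` returns a dictionary without a `seed` key, while `build_lightgbm_params(seed=42)` includes it.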

A project for comparing LightGBM in Python with Microsoft.ML.LightGBM, and ModelBuilder with Python FLAML: https://github.com/torronen/lightgbm-comparison

Rationale: microsoft/FLAML#409 (comment)
Reasons for changes explained in the issues:

I suggest that results should be equal through Python and ML.NET so that developers can discuss and share best practices about hyperparameters. It also makes it possible to reuse hyperparameter tuning done in Python.

The sigmoid value change has been proposed before but was not implemented.
It may need more consideration: #667

This PR is for comments and discussion for now. Results are not yet equal through ML.NET and Python.

 keep lightgbm default seed if it has not been specified in Seed
LightGBM: map NumberOfIterations to num_trees
LightGBM: Sigmoid to default of LightGBM (0.5 => 1)
@dnfadmin

dnfadmin commented Jan 30, 2022

CLA assistant check
All CLA requirements met.

@torronen
Contributor Author

torronen commented Feb 1, 2022

These names are valid aliases. Defaults have not been reviewed yet, but at least metric should default to "" rather than logloss, although it might not matter much.

Missing:

[Images: lgbm1, lgbm2]

@torronen
Contributor Author

torronen commented Feb 1, 2022

Is there any reason to use aliases? If not, I suggest we update:

  • min_split_gain = min_gain_to_split
  • min_sum_hessian_in_leaf = min_child_weight
  • bagging_freq = subsample_freq
  • bagging_fraction = subsample
  • lambda_l2 = reg_lambda
  • lambda_l1 = reg_alpha
  • boosting = boosting_type
  • verbosity = verbose
  • unbalanced_sets = is_unbalance
  • min_data_in_leaf = min_data_per_leaf

TODO: Check that these names are valid for 2.3.1; the list above is from the current documentation.
2.3.1 source: https://github.com/microsoft/LightGBM/blob/v2.3.1/src/io/config_auto.cpp
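
As an illustration of what replacing aliases with primary names could look like, here is a small Python sketch that rewrites aliased keys in a parameter dictionary. The helper `canonicalize` and the table name are hypothetical. The alias-to-primary pairs follow the current LightGBM parameter documentation and, as noted in the TODO, have not been checked against 2.3.1.

```python
from typing import Any, Dict

# Alias -> primary name, per the current LightGBM parameter documentation.
# Assumption: not verified against LightGBM 2.3.1.
ALIAS_TO_PRIMARY: Dict[str, str] = {
    "min_split_gain": "min_gain_to_split",
    "min_child_weight": "min_sum_hessian_in_leaf",
    "subsample_freq": "bagging_freq",
    "subsample": "bagging_fraction",
    "reg_lambda": "lambda_l2",
    "reg_alpha": "lambda_l1",
    "boosting_type": "boosting",
    "verbose": "verbosity",
    "unbalanced_sets": "is_unbalance",
    "min_data_per_leaf": "min_data_in_leaf",
    "num_trees": "num_iterations",
}


def canonicalize(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with known aliases replaced by primary names."""
    result: Dict[str, Any] = {}
    for name, value in params.items():
        primary = ALIAS_TO_PRIMARY.get(name, name)
        # Refuse ambiguous input where an alias and its primary name both appear.
        if primary in result:
            raise ValueError(f"parameter {primary!r} was given more than once")
        result[primary] = value
    return result
```

For example, `canonicalize({"reg_alpha": 0.1, "subsample": 0.8})` yields a dictionary keyed by `lambda_l1` and `bagging_fraction`, which makes it easier to compare settings between Python and ML.NET runs.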

@torronen
Contributor Author

torronen commented Feb 1, 2022

Iterations actually seem to be run inside .NET, so the iteration count does not need to be passed to LightGBM as a parameter.

Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null, int numIteration = 100,

I will close this PR, as it is better not to change the defaults if doing so does not improve performance. Changing defaults may be a breaking change for some developers.

However, for some reason Python seems to provide better speed and accuracy for LightGBM (and therefore, for many applications) at least on a few datasets I've tried.

@torronen torronen closed this Feb 1, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022