[design doc] XGBoost on SQLFlow #753
We need to merge #754 first to keep the commit history of the ant-xgboost design.
doc/xgboost_on_sqlflow_design.md (outdated)
``` sql
SELECT * FROM train_table
TRAIN XGBoost
```
TRAIN XGBoost.someModel?
XGBoost uses the `objective` parameter to specify the training objective, for example `objective=binary:logistic`; see https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters
Maybe we can use `TRAIN XGBoost` in the train statement and specify the objective in the WITH clause: `model.objective=binary:logistic`?
I still prefer to put the objective in the TRAIN clause, it seems quite similar to tf.estimator.*.
Updated the doc: it would use `TRAIN xgboost.multi.softmax` to fill the objective parameter.
> I still prefer to put the `objective` in the `TRAIN` clause, it seems quite similar to `tf.estimator.*`.
@Yancey1989 @typhoonzero `objective` corresponds to the loss function of a model, so it shouldn't be in the model name. The different model types are gbtree, gblinear and dart, as listed here.
@tonyyang-svail Thanks, you are right. I'll update the doc to put `objective` in the attributes.
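To make the outcome of this thread concrete, here is a minimal sketch of how the model type and the objective could end up as two separate XGBoost parameters. `booster` and `objective` are real XGBoost parameter names; the helper `build_params` and the idea that the name after `xgboost.` maps to `booster` are assumptions made for illustration, not something the design doc specifies.

```python
import json

# Booster types named in this thread (gbtree, gblinear, dart).
KNOWN_BOOSTERS = {"gbtree", "gblinear", "dart"}


def build_params(model_name, attrs):
    """Hypothetical helper: combine a TRAIN model name with WITH attributes.

    model_name is assumed to look like "xgboost.<booster>"; attrs is a plain
    dict of attribute names to values, standing in for the parsed WITH clause.
    """
    prefix, _, booster = model_name.partition(".")
    if prefix.lower() != "xgboost" or booster not in KNOWN_BOOSTERS:
        raise ValueError(f"unsupported model name: {model_name!r}")
    params = {"booster": booster}
    # The objective (the loss function) arrives as an attribute,
    # not as part of the model name.
    params.update(attrs)
    return params


params = build_params("xgboost.gbtree", {"objective": "binary:logistic"})
print(json.dumps(params, sort_keys=True))
# {"booster": "gbtree", "objective": "binary:logistic"}
```

Under this reading, `TRAIN xgboost.multi.softmax` would be rejected, because `multi.softmax` names an objective rather than a model type.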
doc/xgboost_on_sqlflow_design.md (outdated)
```
0.77 4.0 2.6 2 3
```
`codegen_xgboost.go` would write down the `train.txt.group` file like:
Why do we need to write a file?
XGBoost uses DMatrix as its input dataset, and the text file format seems to be popular in XGBoost:
> XGBoost currently supports two text formats for ingesting data: LibSVM and CSV
ref: https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html
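As a rough illustration of the text-file route described here, the sketch below writes rows in LibSVM format (`<label> <index>:<value> ...`) using only the standard library. The function names, the sample rows, and the choice of 1-based feature indices are assumptions for illustration; the design doc does not say how the generated program formats its file.

```python
import os
import tempfile


def to_libsvm_line(label, features, start_index=1):
    """Format one row as a LibSVM line, omitting zero-valued features."""
    pairs = [
        f"{i}:{v}"
        for i, v in enumerate(features, start=start_index)
        if v != 0
    ]
    return " ".join([str(label)] + pairs)


def write_libsvm(path, rows):
    """Write (label, features) rows to a text file, one LibSVM line per row."""
    with open(path, "w") as f:
        for label, features in rows:
            f.write(to_libsvm_line(label, features) + "\n")


# Hypothetical rows, standing in for the result of the SELECT statement.
rows = [(1, [0.5, 0.0, 2.0]), (0, [0.0, 1.5, 0.0])]
path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_libsvm(path, rows)

with open(path) as f:
    print(f.read(), end="")
# 1 1:0.5 3:2.0
# 0 2:1.5
```

A file written this way could then be loaded through the DMatrix interface, which is what makes the intermediate file convenient despite the extra I/O.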
…flow into xgboost_design
typhoonzero
left a comment
LGTM, just one line needs to be fixed
doc/xgboost_on_sqlflow_design.md (outdated)
    SELECT * FROM train_table
    TRAIN xgboost.multi.softmax
    WITH
    train.objective="multi:softmax",
Should we remove `train.objective` now?
Hi @wangkuiyi, thanks for correcting the grammar. I updated the design and removed some detailed paragraphs:
The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features:
1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
1. Parse and resolve the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters.
I think parsing the WITH clause is the parser's work, not the submitter's. Am I right?
The parser parses the WITH clause into a general `attrs` struct, which is a Go `map[string]*expr`, and each generator resolves `attrs` into program parameters. For example, the XGBoost generator would convert the attrs as follows:
- keys with the `train.` prefix become `xgboost.train` arguments.
- keys without any prefix become XGBoost Parameters, which are in JSON format.
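The prefix rule described in this reply can be sketched as follows. The sketch is in Python for brevity, although the actual generator is written in Go; `resolve_attrs` is a hypothetical name, and plain Python values stand in for the parsed `*expr` values. `num_boost_round` is an `xgboost.train` argument, and `objective` and `num_class` are XGBoost parameters.

```python
import json

TRAIN_PREFIX = "train."


def resolve_attrs(attrs):
    """Hypothetical resolver: split WITH attributes by prefix.

    Keys starting with "train." become keyword arguments for xgboost.train
    (with the prefix stripped); all other keys are collected as XGBoost
    parameters and serialized to JSON.
    """
    train_args = {}
    params = {}
    for key, value in attrs.items():
        if key.startswith(TRAIN_PREFIX):
            train_args[key[len(TRAIN_PREFIX):]] = value
        else:
            params[key] = value
    return train_args, json.dumps(params, sort_keys=True)


train_args, params_json = resolve_attrs({
    "train.num_boost_round": 30,
    "objective": "multi:softmax",
    "num_class": 3,
})
print(train_args)   # {'num_boost_round': 30}
print(params_json)  # {"num_class": 3, "objective": "multi:softmax"}
```

Keeping this step in the generator rather than the parser means the parser stays model-agnostic: it only produces the generic attribute map, and each generator decides what its own prefixes mean.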
Part of the work for #749.