Yancey0623 commented Aug 31, 2019

Part of the work for #749.

Yancey0623 commented Aug 31, 2019

We need to merge #754 first to keep the commit history of the ant-xgboost design.


``` sql
SELECT * FROM train_table
TRAIN XGBoost
Collaborator

TRAIN XGBoost.someModel?

Collaborator Author

XGBoost uses the objective parameter to specify the training objective, such as objective=binary:logistic; see https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters

Maybe we can use TRAIN XGBoost in the train statement and specify the objective in the WITH clause: model.objective=binary:logistic?

Collaborator

I still prefer to put the objective in the TRAIN clause; it seems quite similar to tf.estimator.*.

Collaborator Author

I updated the doc; it now uses TRAIN xgboost.multi.softmax to fill the objective parameter.

Collaborator

> I still prefer to put the objective in the TRAIN clause, it seems quite similar to tf.estimator.*.

@Yancey1989 @typhoonzero objective corresponds to the loss function of a model, so it shouldn't be in the model name. For different types of models, there are gbtree, gblinear, and dart, as listed here.

Collaborator

@tonyyang-svail Thanks, you are right. I'll update the doc to put objective in the attributes.

0.77 4.0 2.6 2 3
```
`codegen_xgboost.go` would write down the `train.txt.group` file like:
Collaborator

Why do we need to write a file?

Collaborator Author

@Yancey0623 Yancey0623 Sep 2, 2019

XGBoost uses DMatrix as the input dataset, and the text file format seems to be popular in XGBoost:

> XGBoost currently supports two text formats for ingesting data: LibSVM and CSV

ref: https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html
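
To make the file-based approach concrete, here is a minimal Python sketch of how rows fetched from the SELECT statement might be written as a LibSVM text file plus a companion `.group` file. The `(label, group_id, features)` row layout and the function names are assumptions made for illustration only; they are not taken from `codegen_xgboost.go`. The sketch also assumes rows of the same group are adjacent and that the group file holds one instance count per line, as the linked tutorial describes.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# Hypothetical row layout: (label, group_id, feature values).
Row = Tuple[float, int, Sequence[float]]


def libsvm_line(label: float, features: Sequence[float]) -> str:
    """Format one row as '<label> <index>:<value> ...' (indices start at 0 here)."""
    cols = " ".join(f"{i}:{v}" for i, v in enumerate(features))
    return f"{label} {cols}"


def write_training_files(rows: List[Row], path: str) -> None:
    """Write `path` in LibSVM format and `path + '.group'` with per-group counts."""
    with open(path, "w") as f:
        for label, _, features in rows:
            f.write(libsvm_line(label, features) + "\n")
    # Rows of the same group are assumed to be adjacent, so groupby
    # yields one run per group; each run's length is one line of the group file.
    with open(path + ".group", "w") as f:
        for _, members in groupby(rows, key=lambda r: r[1]):
            f.write(f"{sum(1 for _ in members)}\n")
```

Writing plain text keeps the generated program independent of any in-memory format; the trade-off is an extra pass over the data on disk before training starts.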

Yancey0623 mentioned this pull request Sep 2, 2019
Yancey0623 changed the title from "xgboost design" to "[design doc] XGBoost on SQLFlow" Sep 2, 2019
typhoonzero previously approved these changes Sep 3, 2019
Collaborator

typhoonzero left a comment

LGTM, just one line needs to be fixed

SELECT * FROM train_table
TRAIN xgboost.multi.softmax
WITH
train.objective="multi:softmax",
Collaborator

Should we remove train.objective now?

@Yancey0623
Collaborator Author

Hi @wangkuiyi, thanks for correcting the grammar. I updated the design and removed some overly detailed paragraphs:

  1. Removed the Input Data Format paragraph, as we would support the group/weight instances in the future.
  2. Removed the Learning API or Scikit-Learn API paragraph, as it's too detailed.


The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features:
1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
1. Parse and resolve the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters.
Collaborator

I think parsing the WITH clause is the parser's job, not the submitter's. Am I right?

Collaborator Author

The parser parses the WITH clause into a general attrs structure, which is a Go map[string]*expr, and each generator resolves the attrs into program parameters. For example, the XGBoost generator would convert the attrs as follows:

  • keys with the train. prefix become xgboost.train arguments.
  • keys without any prefix become XGBoost Parameters, which are in JSON format.
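
The two rules above can be sketched in a few lines. The real generator is written in Go and works on `map[string]*expr`; the Python below is only an illustration that assumes the attrs have already been evaluated to plain values, and `resolve_attrs` is a hypothetical helper name, not one from the code base.

```python
import json
from typing import Any, Dict, Tuple

TRAIN_PREFIX = "train."


def resolve_attrs(attrs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split WITH-clause attrs into (xgboost.train arguments, XGBoost Parameters)."""
    train_args: Dict[str, Any] = {}
    params: Dict[str, Any] = {}
    for key, value in attrs.items():
        if key.startswith(TRAIN_PREFIX):
            # 'train.num_boost_round' -> keyword argument 'num_boost_round'
            train_args[key[len(TRAIN_PREFIX):]] = value
        else:
            # Unprefixed keys such as 'objective' go into the parameter dict.
            params[key] = value
    return train_args, params


# Example attrs as they might come out of a WITH clause (values are illustrative).
attrs = {"train.num_boost_round": 30, "objective": "multi:softmax", "eta": 0.1}
train_args, params = resolve_attrs(attrs)

# The parameters can be serialized as JSON for the generated program, which
# could then call something like: xgboost.train(params, dtrain, **train_args)
params_json = json.dumps(params, sort_keys=True)
```

Keeping the split in the generator rather than in the parser means the parser stays model-agnostic: it only produces key/value pairs, and each generator decides what its own prefixes mean.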

Yancey0623 merged commit 3500246 into sql-machine-learning:develop Sep 5, 2019
Yancey0623 deleted the xgboost_design branch September 5, 2019 06:09