From d8fe3480f5963711a6b801e479b8cc61d54e84a3 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:31:31 +0800 Subject: [PATCH 01/20] rename antxgboost design --- ...boost_on_sqlflow_design.md => antxgboost_on_sqlflow_design.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename doc/{xgboost_on_sqlflow_design.md => antxgboost_on_sqlflow_design.md} (100%) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/antxgboost_on_sqlflow_design.md similarity index 100% rename from doc/xgboost_on_sqlflow_design.md rename to doc/antxgboost_on_sqlflow_design.md From 999c673340666086187a021d82d5351118988932 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:31:49 +0800 Subject: [PATCH 02/20] add xgboost design --- doc/xgboost_on_sqlflow_design.md | 109 +++++++++++++++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 doc/xgboost_on_sqlflow_design.md diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md new file mode 100644 index 0000000000..36b844a745 --- /dev/null +++ b/doc/xgboost_on_sqlflow_design.md @@ -0,0 +1,109 @@ +# Design Doc: XGBoost on SQLFlow + +## Introduction + +This design doc introduces how do users train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how +we implement it. + +## Design + +We prefer users to execute the SQLFlow Train/Predict SQL as follows: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + WITH + train.objective="multi:softmax", + train.num_round=2, + model.max_depth=2, + model.eta=1 + INTO my_xgb_model; + ``` + + ``` sql + SELECT * FROM test_table + PREDICT pred_table.result + USING my_xgb_model; + ``` + +where: +- `my_xgb_model` is the trained model. +- The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. 
+- The prefix `train.` in the `WITH` clause maps to the training arguments of the XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), except the `params` argument.
+- The prefix `model.` in the `WITH` clause maps to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html).
+
+`codegen_xgboost.go` would generate an XGBoost Python program according to the XGBoost SQL, including:
+- Prepare the input data.
+- Pass the arguments to the XGBoost Python program.
+- Save the trained model.
+
+### Input Format
+
+XGBoost uses `DMatrix` as its input structure. According to the [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to reuse `db_generator` and
+generate text files in LibSVM format.
+
+- For the basic input format, `db_generator` would yield `(features, label)` for each iteration.
+
+    The train table can look like:
+
+    ``` text
+    col0 | col1 | col2 | label
+    1.1    NULL   2.2    1
+    0.8    2.0    2.2    2
+    0.2    3.0    NULL   0
+    0.77   4.0    2.6    2
+    ```
+
+    `codegen_xgboost.go` would write the `train.txt` file like:
+
+    ``` text
+    1 0:1.1 2:2.2
+    2 0:0.8 1:2.0 2:2.2
+    0 0:0.2 1:3.0
+    2 0:0.77 1:4.0 2:2.6
+    ```
+
+- For the group information, users can easily specify the group column by `train.group_column` in the `WITH` clause,
+just like:
+
+    ``` sql
+    SELECT * FROM train_table
+    TRAIN XGBOOST
+    WITH
+      train.group_column=group
+    ...
+ ``` + + The group column in table can be like: + + ``` text + col1 | col2| col3 | label | group + 1.1 2.0 2.2 1 1 + 0.8 2.0 2.2 2 1 + 0.2 3.0 4.2 0 2 + 0.77 4.0 2.6 2 3 + ``` + + `codegen_xgboost.go` would write down the `train.txt.group` file like: + + ``` text + 2 + 1 + 1 + ``` + +- For the `Weight` information, users can specify the weight column like `group`: + + ``` sql + SELECT * FROM train_table + TRAIN XGBOOST + WITH + train.weight_column=weight + ``` + + `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. + +## TBD + +- Implement auto-train feature to search the parameter. +- Support the sparse data format. From eb17739e2b639ab4c3db8323885fc7ded806e062 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:33:36 +0800 Subject: [PATCH 03/20] update --- doc/xgboost_on_sqlflow_design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 36b844a745..1d2132d0b3 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -29,7 +29,7 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. - The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. -- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train) except the `params` arguments. +- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). 
- The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); `codegen_xgboost.go` would generate a XGBoost Python program accoding to the XGBoost SQL including: From c30485ac65cc9da19e626258f4a8cefeef3ae6fe Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:34:34 +0800 Subject: [PATCH 04/20] update --- doc/xgboost_on_sqlflow_design.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 1d2132d0b3..1f590937bd 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -42,7 +42,7 @@ where: XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement refuse `db.generator` and generate text files as LibSVM format. -- For the basic input format, `db_geneator` would yield `(features, label)` for each iteration +- For the **basic** input format, `db_geneator` would yield `(features, label)` for each iteration the train table can be like: @@ -63,7 +63,7 @@ generate text files as LibSVM format. 2 0:0.77 1:4.0 2:2.6 ``` -- For the group information, users can easy to specify the group column by `train.group_column` in the WITH statement +- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement , just like: ``` sql @@ -92,7 +92,7 @@ generate text files as LibSVM format. 
1 ``` -- For the `Weight` information, users can specify the weight column like `group`: +- For the **weight** input format, users can specify the weight column like `group`: ``` sql SELECT * FROM train_table From ba274b1d8a682e80eb232dcf6ac333920f66c679 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sun, 1 Sep 2019 22:50:27 +0800 Subject: [PATCH 05/20] update doc --- doc/xgboost_on_sqlflow_design.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 1f590937bd..e32595cb29 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -28,21 +28,25 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. -- The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. +- `XGBoost` is used to distinguish with the Tensorflow Model. - The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). - The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); -`codegen_xgboost.go` would generate a XGBoost Python program accoding to the XGBoost SQL including: -- Prepare the input data. -- pass the arguments to XGBoost Python program. +`codegen_xgboost.go` would generate a XGBoost Python program including: +- Generate the XGBoost input database. +- Pass the train/predict parameters to XGBoost Python program. - Save the trained model. ### Input Format -XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement refuse `db.generator` and -generate text files as LibSVM format. 
+SQLFlow implemented `db_generator` taht takes the `SELECT STATEMENT` as the input and outputs a iterator function which +yields `(features, label)` for each iteration. `codegen_xgboost` would reuse the `db_generator` to generate the XGBoost +input database. -- For the **basic** input format, `db_geneator` would yield `(features, label)` for each iteration +XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that +takes `db_generator` as the input and outputs text files with LibSVM format. + +- For the **basic** input format the train table can be like: From 320c3be1687de7dd91a129495b4515bcc39d420a Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sun, 1 Sep 2019 23:25:34 +0800 Subject: [PATCH 06/20] polish doc --- doc/xgboost_on_sqlflow_design.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index e32595cb29..797cd9cf6f 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -2,7 +2,7 @@ ## Introduction -This design doc introduces how do users train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how +This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how we implement it. ## Design @@ -17,6 +17,7 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: train.num_round=2, model.max_depth=2, model.eta=1 + LABEL class INTO my_xgb_model; ``` @@ -28,20 +29,21 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. -- `XGBoost` is used to distinguish with the Tensorflow Model. +- `XGBoost` means to train an XGBoost model instead of the TensorFlow Model. 
- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). - The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); -`codegen_xgboost.go` would generate a XGBoost Python program including: +`codegen_xgboost.go` would generate an XGBoost Python program including: - Generate the XGBoost input database. - Pass the train/predict parameters to XGBoost Python program. - Save the trained model. ### Input Format -SQLFlow implemented `db_generator` taht takes the `SELECT STATEMENT` as the input and outputs a iterator function which -yields `(features, label)` for each iteration. `codegen_xgboost` would reuse the `db_generator` to generate the XGBoost -input database. +SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator) that takes the +`SELECT STATEMENT` as the input and outputs a iterable function which +yields `(features, label)` for each iteration call. `codegen_xgboost` would reuse the `db_generator` +to generate the XGBoost input database. XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that takes `db_generator` as the input and outputs text files with LibSVM format. @@ -67,12 +69,12 @@ takes `db_generator` as the input and outputs text files with LibSVM format. 2 0:0.77 1:4.0 2:2.6 ``` -- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement -, just like: +- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement like: ``` sql SELECT * FROM train_table - TRAIN XGBOOST + TRAIN XGBoost + LABEL class WITH train.group_column=group ... 
@@ -100,7 +102,8 @@ takes `db_generator` as the input and outputs text files with LibSVM format. ``` sql SELECT * FROM train_table - TRAIN XGBOOST + TRAIN XGBoost + LABEL class WITH train.weight_column=weight ``` From e359497c94725cda8cd82c47b3103fa061adbdda Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 13:49:47 +0800 Subject: [PATCH 07/20] initialize xgboost codegen --- sql/codegen_ant_xgboost.go | 2 +- sql/codegen_xgboost.go | 116 +++++++++++++++++++++++++++++++++ sql/codegen_xgboost_test.go | 25 +++++++ sql/executor.go | 15 +++-- sql/executor_test.go | 10 +++ sql/expression_resolver_xgb.go | 64 ++++++++++++++++++ sql/template_xgboost.go | 82 +++++++++++++++++++++++ 7 files changed, 309 insertions(+), 5 deletions(-) create mode 100644 sql/codegen_xgboost.go create mode 100644 sql/codegen_xgboost_test.go create mode 100644 sql/expression_resolver_xgb.go create mode 100644 sql/template_xgboost.go diff --git a/sql/codegen_ant_xgboost.go b/sql/codegen_ant_xgboost.go index 52c57a516e..cb10777ab6 100644 --- a/sql/codegen_ant_xgboost.go +++ b/sql/codegen_ant_xgboost.go @@ -790,7 +790,7 @@ func xgCreatePredictionTable(pr *extendedSelect, r *antXGBoostFiller, db *DB) er return nil } -func genXG(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { +func genAntXGboost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { r, e := newAntXGBoostFiller(pr, ds, db) if e != nil { return e diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go new file mode 100644 index 0000000000..879a7ea68c --- /dev/null +++ b/sql/codegen_xgboost.go @@ -0,0 +1,116 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package sql + +import ( + "fmt" + "io" + "text/template" +) + +type xgbTrainConfig struct { + NumBoostRound int `json:"num_boost_round,omitempty"` + Maximize bool `json:"maximize,omitempty"` +} + +type xgbFiller struct { + IsTrain bool + TrainingDatasetSQL string + ValidationDatasetSQL string + TrainCfg *xgbTrainConfig + Features []*featureMeta + Label *featureMeta + ParamsCfgJSON string + TrainCfgJSON string + *connectionConfig +} + +func fillXGBTrainCfg(rt *resolvedXGBTrainClause) (*xgbTrainConfig, error) { + // TODO(Yancey1989): fill all the training control parameters + c := &xgbTrainConfig{ + NumBoostRound: rt.NumBoostRound, + Maximize: rt.Maximize, + } + return c, nil +} + +func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) (*xgbFiller, error) { + rt, err := resolveXGBTrainClause(&pr.trainClause) + training, validation := trainingAndValidationDataset(pr, ds) + if err != nil { + return nil, err + } + + trainCfg, err := fillXGBTrainCfg(rt) + if err != nil { + return nil, err + } + + r := &xgbFiller{ + IsTrain: pr.train, + TrainCfg: trainCfg, + TrainingDatasetSQL: training, + ValidationDatasetSQL: validation, + } + // TODO(Yancey1989): fill the train_args and parameters by WITH statment + r.TrainCfgJSON = "" + r.ParamsCfgJSON = "" + + if r.connectionConfig, err = newConnectionConfig(db); err != nil { + return nil, err + } + + for _, columns := range pr.columns { + feaCols, colSpecs, err := resolveTrainColumns(&columns) + if err != nil { + return nil, err + } + if len(colSpecs) != 0 { + return nil, 
fmt.Errorf("newFiller doesn't support DENSE/SPARSE") + } + for _, col := range feaCols { + fm := &featureMeta{ + FeatureName: col.GetKey(), + Dtype: col.GetDtype(), + Delimiter: col.GetDelimiter(), + InputShape: col.GetInputShape(), + IsSparse: false, + } + r.Features = append(r.Features, fm) + } + } + r.Label = &featureMeta{ + FeatureName: pr.label, + Dtype: "int32", + Delimiter: ",", + InputShape: "[1]", + IsSparse: false, + } + + return r, nil +} + +func genXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { + r, e := newXGBFiller(pr, ds, fts, db) + if e != nil { + return e + } + if pr.train { + fmt.Println(r.TrainCfgJSON) + return xgbTrainTemplate.Execute(w, r) + } + return fmt.Errorf("xgboost prediction codegen has not been implemented") +} + +var xgbTrainTemplate = template.Must(template.New("codegenXGBTrain").Parse(xgbTrainTemplateText)) diff --git a/sql/codegen_xgboost_test.go b/sql/codegen_xgboost_test.go new file mode 100644 index 0000000000..b1aa40351c --- /dev/null +++ b/sql/codegen_xgboost_test.go @@ -0,0 +1,25 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package sql + +const testXGBoostTrainSelectIris = ` +SELECT * +FROM iris.train +TRAIN xgb.multi.softprob +WITH + train.num_boost_round = 30 +COLUMN sepal_length, sepal_width, petal_length, petal_width +LABEL class +INTO sqlflow_models.my_xgboost_model; +` diff --git a/sql/executor.go b/sql/executor.go index 17187c56dd..ea9435eeb8 100644 --- a/sql/executor.go +++ b/sql/executor.go @@ -387,8 +387,15 @@ func train(wr *PipeWriter, tr *extendedSelect, db *DB, cwd string, modelDir stri var program bytes.Buffer if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate train pipeline for ant-xgboost to support remote mode - if e := genXG(&program, tr, ds, fts, db); e != nil { - return fmt.Errorf("genXG %v", e) + if e := genAntXGboost(&program, tr, ds, fts, db); e != nil { + return fmt.Errorf("genAntXGBoost %v", e) + } + } else if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGB.`) { + // FIXME(Yancey1989): it's a temporary solution, just for the unit test, we perfer to distinguish + // xgboost and ant-xgboost with env SQLFLOW_WITH_ANTXGBOOST, + // issue: https://github.com/sql-machine-learning/sqlflow/issues/758 + if e := genXGBoost(&program, tr, ds, fts, db); e != nil { + return fmt.Errorf("GenXGBoost %v", e) } } else { if e := genTF(&program, tr, ds, fts, db); e != nil { @@ -453,8 +460,8 @@ func pred(wr *PipeWriter, pr *extendedSelect, db *DB, cwd string, modelDir strin var buf bytes.Buffer if strings.HasPrefix(strings.ToUpper(pr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate pred pipeline for ant-xgboost to support remote mode - if e := genXG(&buf, pr, nil, fts, db); e != nil { - return fmt.Errorf("genXG %v", e) + if e := genAntXGboost(&buf, pr, nil, fts, db); e != nil { + return fmt.Errorf("genAntXGBoost %v", e) } } else { if e := genTF(&buf, pr, nil, fts, db); e != nil { diff --git a/sql/executor_test.go b/sql/executor_test.go index bf65b9d548..a360be58e8 100644 --- a/sql/executor_test.go +++ 
b/sql/executor_test.go @@ -88,6 +88,16 @@ func TestExecutorTrainAnalyzePredictAntXGBoost(t *testing.T) { }) } +func TestExecutorTrainXGBoost(t *testing.T) { + a := assert.New(t) + modelDir := "" + a.NotPanics(func() { + stream := runExtendedSQL(testXGBoostTrainSelectIris, testDB, modelDir, nil) + a.True(goodStream(stream.ReadAll())) + + }) +} + func TestExecutorTrainAndPredictDNN(t *testing.T) { a := assert.New(t) modelDir := "" diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go new file mode 100644 index 0000000000..d44cbcd41e --- /dev/null +++ b/sql/expression_resolver_xgb.go @@ -0,0 +1,64 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package sql + +import ( + "fmt" + "strconv" +) + +type resolvedXGBTrainClause struct { + NumBoostRound int + Maximize bool + ParamsAttr map[string]*attribute +} + +func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { + attrs, err := resolveAttribute(&tc.trainAttrs) + if err != nil { + return nil, err + } + getIntAttr := func(key string, defaultValue int) int { + if p, ok := attrs[key]; ok { + strVal, _ := p.Value.(string) + intVal, err := strconv.Atoi(trimQuotes(strVal)) + defer delete(attrs, p.FullName) + if err == nil { + return intVal + } + fmt.Printf("ignore invalid %s=%s, default is %d", key, p.Value, defaultValue) + } + return defaultValue + } + getBoolAttr := func(key string, defaultValue bool, optional bool) bool { + if p, ok := attrs[key]; ok { + strVal, _ := p.Value.(string) + boolVal, err := strconv.ParseBool(trimQuotes(strVal)) + if !optional { + defer delete(attrs, p.FullName) + } + if err == nil { + return boolVal + } else if !optional { + fmt.Printf("ignore invalid %s=%s, default is %v", key, p.Value, defaultValue) + } + } + return defaultValue + } + return &resolvedXGBTrainClause{ + NumBoostRound: getIntAttr("train.num_boost_round", 10), + Maximize: getBoolAttr("train.maximize", false, true), + ParamsAttr: filter(attrs, "params", true), + }, nil +} diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go new file mode 100644 index 0000000000..8cdf256ccf --- /dev/null +++ b/sql/template_xgboost.go @@ -0,0 +1,82 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package sql + +const xgbTrainTemplateText = ` +import xgboost as xgb +from sqlflow_submitter.db import connect, db_generator + +driver="{{.Driver}}" + +{{if ne .Database ""}} +database="{{.Database}}" +{{else}} +database="" +{{end}} + +session_cfg = {} +{{ range $k, $v := .Session }} +session_cfg["{{$k}}"] = "{{$v}}" +{{end}} + +{{if ne .TrainCfgJSON ""}} +train_args = {{.TrainCfgJSON}} +{{else}} +train_args = {} +{{end}} + +{{if ne .ParamsCfgJSON ""}} +params = {{.ParamsCfgJSON}} +{{else}} +params = {} +{{end}} + +feature_column_names = [{{range .Features}} +"{{.FeatureName}}", +{{end}}] + +{{/* Convert go side featureSpec to python dict for input_fn */}} +feature_specs = dict() +{{ range $value := .Features }} +feature_specs["{{$value.FeatureName}}"] = { + "feature_name": "{{$value.FeatureName}}", + "dtype": "{{$value.Dtype}}", + "delimiter": "{{$value.Delimiter}}", + "shape": {{$value.InputShape}}, + "is_sparse": "{{$value.IsSparse}}" == "true" +} +{{end}} + + + +conn = connect(driver, database, user="{{.User}}", password="{{.Password}}", host="{{.Host}}", port={{.Port}}, auth="{{.Auth}}") + +def xgb_dataset(fn, dataset_sql): + gen = db_generator(driver, conn, session_cfg, dataset_sql, feature_column_names, "{{.Label.FeatureName}}", feature_specs) + with open(fn, 'w') as f: + for item in gen(): + features, label = item + row_data = [str(label[0])] + ["%d:%f" % (i, v) for i, v in enumerate(features)] + f.write("\t".join(row_data) + "\n") + # TODO(yancey1989): genearte group and weight text file if necessary + return xgb.DMatrix(fn) + +dtrain = 
xgb_dataset('train.txt', "{{.TrainingDatasetSQL}}") +dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") + +//TODO(Yancey1989): specify the eval metrics by WITH statement in SQL +train_args["evals"] = [(dtest, "auc")] +bst = xgb.train(params, dtrain, **train_args) +bst.save_model() +` From 50e703161daf2f0808aacba5c1fab0bb1fad9522 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 13:50:38 +0800 Subject: [PATCH 08/20] initialize xgboost codegen --- sql/template_xgboost.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go index 8cdf256ccf..bf1d4fc7c5 100644 --- a/sql/template_xgboost.go +++ b/sql/template_xgboost.go @@ -75,7 +75,7 @@ def xgb_dataset(fn, dataset_sql): dtrain = xgb_dataset('train.txt', "{{.TrainingDatasetSQL}}") dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") -//TODO(Yancey1989): specify the eval metrics by WITH statement in SQL +#TODO(Yancey1989): specify the eval metrics by WITH statement in SQL train_args["evals"] = [(dtest, "auc")] bst = xgb.train(params, dtrain, **train_args) bst.save_model() From e83530b431cc7ce64b5cc038632f250399c3c95b Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 14:24:34 +0800 Subject: [PATCH 09/20] init xgboost codegen --- sql/codegen_xgboost.go | 4 +++- sql/expression_resolver_xgb.go | 2 +- sql/template_xgboost.go | 2 +- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index 879a7ea68c..5fec5673f2 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -31,6 +31,7 @@ type xgbFiller struct { TrainCfg *xgbTrainConfig Features []*featureMeta Label *featureMeta + Save string ParamsCfgJSON string TrainCfgJSON string *connectionConfig @@ -62,6 +63,7 @@ func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db TrainCfg: trainCfg, TrainingDatasetSQL: training, ValidationDatasetSQL: validation, + Save: pr.save, } // 
TODO(Yancey1989): fill the train_args and parameters by WITH statment r.TrainCfgJSON = "" @@ -77,7 +79,7 @@ func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db return nil, err } if len(colSpecs) != 0 { - return nil, fmt.Errorf("newFiller doesn't support DENSE/SPARSE") + return nil, fmt.Errorf("newXGBoostFiller doesn't support DENSE/SPARSE") } for _, col := range feaCols { fm := &featureMeta{ diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go index d44cbcd41e..28b102eb9f 100644 --- a/sql/expression_resolver_xgb.go +++ b/sql/expression_resolver_xgb.go @@ -56,9 +56,9 @@ func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { } return defaultValue } + return &resolvedXGBTrainClause{ NumBoostRound: getIntAttr("train.num_boost_round", 10), Maximize: getBoolAttr("train.maximize", false, true), - ParamsAttr: filter(attrs, "params", true), }, nil } diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go index bf1d4fc7c5..7b3776f900 100644 --- a/sql/template_xgboost.go +++ b/sql/template_xgboost.go @@ -78,5 +78,5 @@ dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") #TODO(Yancey1989): specify the eval metrics by WITH statement in SQL train_args["evals"] = [(dtest, "auc")] bst = xgb.train(params, dtrain, **train_args) -bst.save_model() +bst.save_model("{{.Save}}") ` From 545645ebcb3feb4111de43285a82f0d309600427 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 14:50:56 +0800 Subject: [PATCH 10/20] fix typo --- sql/codegen_ant_xgboost.go | 2 +- sql/executor.go | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/sql/codegen_ant_xgboost.go b/sql/codegen_ant_xgboost.go index da202f0377..dcc2302da9 100644 --- a/sql/codegen_ant_xgboost.go +++ b/sql/codegen_ant_xgboost.go @@ -795,7 +795,7 @@ func xgCreatePredictionTable(pr *extendedSelect, r *antXGBoostFiller, db *DB) er return nil } -func genAntXGboost(w io.Writer, pr *extendedSelect, ds 
*trainAndValDataset, fts fieldTypes, db *DB) error { +func genAntXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { r, e := newAntXGBoostFiller(pr, ds, db) if e != nil { return e diff --git a/sql/executor.go b/sql/executor.go index ea9435eeb8..c7c7a60011 100644 --- a/sql/executor.go +++ b/sql/executor.go @@ -387,7 +387,7 @@ func train(wr *PipeWriter, tr *extendedSelect, db *DB, cwd string, modelDir stri var program bytes.Buffer if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate train pipeline for ant-xgboost to support remote mode - if e := genAntXGboost(&program, tr, ds, fts, db); e != nil { + if e := genAntXGBoost(&program, tr, ds, fts, db); e != nil { return fmt.Errorf("genAntXGBoost %v", e) } } else if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGB.`) { @@ -460,7 +460,7 @@ func pred(wr *PipeWriter, pr *extendedSelect, db *DB, cwd string, modelDir strin var buf bytes.Buffer if strings.HasPrefix(strings.ToUpper(pr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate pred pipeline for ant-xgboost to support remote mode - if e := genAntXGboost(&buf, pr, nil, fts, db); e != nil { + if e := genAntXGBoost(&buf, pr, nil, fts, db); e != nil { return fmt.Errorf("genAntXGBoost %v", e) } } else { From 7861959f2ba57f708ebae2cf3a66028204367ed5 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:22:24 +0800 Subject: [PATCH 11/20] update --- doc/xgboost_on_sqlflow_design.md | 119 +++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 doc/xgboost_on_sqlflow_design.md diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md new file mode 100644 index 0000000000..8c17b72703 --- /dev/null +++ b/doc/xgboost_on_sqlflow_design.md @@ -0,0 +1,119 @@ +# Design Doc: XGBoost on SQLFlow + +## Introduction + +This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by 
SQLFlow SQL and how
+we implement it.
+
+## Design
+
+We expect users to write the SQLFlow train/predict statements as follows:
+
+  ``` sql
+  SELECT * FROM train_table
+  TRAIN xgboost.multi.softmax
+  WITH
+    train.objective="multi:softmax",
+    train.num_round=2,
+    model.max_depth=2,
+    model.eta=1
+  LABEL class
+  INTO my_xgb_model;
+  ```
+
+  ``` sql
+  SELECT * FROM test_table
+  PREDICT pred_table.result
+  USING my_xgb_model;
+  ```
+
+where:
+- `my_xgb_model` is the trained model.
+- `xgboost.multi.softmax` specifies the model to train:
+  - The prefix `xgboost.` distinguishes XGBoost models from TensorFlow models.
+  - `multi.softmax` is the learning task; SQLFlow fills it into the [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters) as `objective=multi:softmax`.
+- The prefix `train.` in the `WITH` clause maps to the training arguments of the XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).
+- The prefix `model.` in the `WITH` clause maps to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html), except the `objective` parameter.
+
+`codegen_xgboost.go` would generate an XGBoost Python program that:
+- Generates the XGBoost input data.
+- Passes the train/predict parameters to the XGBoost Python program.
+- Saves the trained model.
+- Uses the [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of the [Scikit-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) because we prefer to explain the XGBoost model with [SHAP](https://github.com/slundberg/shap).
+
+### Input Format
+
+SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator), which takes the
+`SELECT` statement as the input and outputs an iterable function that
+yields `(features, label)` for each iteration call.
`codegen_xgboost` would reuse the `db_generator` +to generate the XGBoost input database. + +XGBoost uses `DMatrix` as the input structure. According to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that +takes `db_generator` as the input and outputs text files in the LibSVM format. + +- For the **basic** input format + + the train table can look like: + + ``` text + col0 | col1 | col2 | label + 1.1 NULL 2.2 1 + 0.8 2.0 2.2 2 + 0.2 3.0 NULL 0 + 0.77 4.0 2.6 2 + ``` + + `codegen_xgboost.go` would write the `train.txt` file like: + + ``` text + 1 0:1.1 2:2.2 + 2 0:0.8 1:2.0 2:2.2 + 0 0:0.2 1:3.0 + 2 0:0.77 1:4.0 2:2.6 + ``` + +- For the **group** input format, users can easily specify the group column via `train.group_column` in the WITH statement like: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + LABEL class + WITH + train.group_column=group + ... + ``` + + The group column in the table can look like: + + ``` text + col1 | col2 | col3 | label | group + 1.1 2.0 2.2 1 1 + 0.8 2.0 2.2 2 1 + 0.2 3.0 4.2 0 2 + 0.77 4.0 2.6 2 3 + ``` + + `codegen_xgboost.go` would write the `train.txt.group` file like: + + ``` text + 2 + 1 + 1 + ``` + +- For the **weight** input format, users can specify the weight column like `group`: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + LABEL class + WITH + train.weight_column=weight + ``` + + `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. + +## TBD + +- Implement the auto-train feature to search for the best parameters. +- Support the sparse data format.
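The LibSVM conversion described above can be sketched in a few lines of Python. Note this is an illustrative sketch, not actual SQLFlow code: the names `libsvm_line` and `write_libsvm` are hypothetical, and the input is a plain list standing in for what `db_generator` would yield.

``` python
import os
import tempfile

def libsvm_line(features, label):
    # NULL (None) feature values are omitted, which yields a sparse LibSVM row.
    cols = ["%d:%s" % (i, v) for i, v in enumerate(features) if v is not None]
    return "%s %s" % (label, " ".join(cols))

def write_libsvm(rows, path):
    # rows: any iterable yielding (features, label) pairs, as db_generator does.
    with open(path, "w") as f:
        for features, label in rows:
            f.write(libsvm_line(features, label) + "\n")

# The example train_table from this document:
rows = [((1.1, None, 2.2), 1), ((0.8, 2.0, 2.2), 2),
        ((0.2, 3.0, None), 0), ((0.77, 4.0, 2.6), 2)]
train_txt = os.path.join(tempfile.mkdtemp(), "train.txt")
write_libsvm(rows, train_txt)
```

Group and weight files would follow the same pattern, emitting one group size or one weight per line instead of feature columns.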
From 9238117b8e0fa06381cc7c23155d2bc8c5d5240e Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:23:09 +0800 Subject: [PATCH 12/20] model to params --- doc/xgboost_on_sqlflow_design.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 8c17b72703..a4bc049436 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -15,8 +15,8 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: WITH train.objective="multi:softmax", train.num_round=2, - model.max_depth=2, - model.eta=1 + params.max_depth=2, + params.eta=1 LABEL class INTO my_xgb_model; ``` @@ -33,7 +33,7 @@ where: - The prefix `xgboost.` is used to distinguish with Tensorflow model. - `multi.softmax` is the learning task, SQLFlow would fill it to [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters): `objective=multi:softmax`. - The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). -- The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. +- The prefix `params.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. `codegen_xgboost.go` would generate an XGBoost Python program including: - Generate the XGBoost input database. 
From 4d4f867742fcb9a6a47b9b7850d7e6840b1f139f Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:24:59 +0800 Subject: [PATCH 13/20] remove conflict file --- doc/antxgboost_on_sqlflow_design.md | 152 ---------------------------- 1 file changed, 152 deletions(-) delete mode 100644 doc/antxgboost_on_sqlflow_design.md diff --git a/doc/antxgboost_on_sqlflow_design.md b/doc/antxgboost_on_sqlflow_design.md deleted file mode 100644 index b8bdd9d0b1..0000000000 --- a/doc/antxgboost_on_sqlflow_design.md +++ /dev/null @@ -1,152 +0,0 @@ -# _Design:_ xgboost on sqlflow - -## Overview - -This is a design doc about why and how to support running xgboost via sqlflow as a machine learning estimator. - -We propose to build a lightweight python template for xgboost on basis of `xgblauncher`, -an incubating xgboost wrapper in [ant-xgboost](https://github.com/alipay/ant-xgboost). - -## Context - -Gradient boosting machine (GBM) is a widely used (supervised) machine learning method, -which trains a bunch of weak learners, typically decision trees, -in a gradual, additive and sequential manner. -A lot of winning solutions of data mining and machine learning challenges, -such as : Kaggle, KDD cup, are based on GBM or related techniques. - -There exists a lot of GBM frameworks (implementations), we propose to use [xgboost](https://xgboost.ai/) as backend of sqlflow, -which is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable, -often regarded as one of the best GBM frameworks. - - -## _Proposed Solution:_ ant-xgboost on sqlflow - -We propose to use [ant-xgboost](https://github.com/alipay/ant-xgboost) as backend, -which is consistent with [xgboost](https://github.com/dmlc/xgboost) in kernel level. -Because in `ant-xgboost`, there exists an incubating module named [xgblauncher](https://github.com/alipay/ant-xgboost/tree/ant_master/xgboost-launcher), -an extendable, cloud-native xgboost based machine learning pipeline. 
-Comparing to python API provided by `xgboost`, it is easier to build a python code template for xgboost task launching on basis of `xgblauncher`. - -### User Experience - -In terms of sqlflow users, xgboost is an alternative `Estimator` like `TensorFlow Estimators`. -Working with xgboost is quite similar to working with TensorFlow Estimators; just change `TRAIN DNNClassifier` into `TRAIN XGBoostEstimator`. - -In addition, xgboost specific parameters can be configured in the same way as TensorFlow parameters. - -Below is a demo about training/predicting via xgboost : - -```sql -// sample clause of train -select - c1, c2, c3, c4, c5 as class -from kaggle_credit_fraud_training_data -TRAIN XGBoostEstimator -WITH - booster = "gbtree" - objective = "logistic:binary" - eval_metric = "auc" - train_eval_ratio = 0.8 -COLUMN - c1, - NUMERIC(c2, 10), - BUCKET(c3, [0, 10, 100]), - c4 -LABEL class -INTO sqlflow_models.xgboost_model_table; - -// sample clause of predict -select - c1, c2, c3, c4 -from kaggle_credit_fraud_development_data -PREDICT kaggle_credit_fraud_development_data.class -USING sqlflow_models.xgboost_model_table; -``` - -### Implementation - -As `codegen.go` generating TensorFlow code from sqlflow AST, -we will add `codegen_xgboost.go` which translate sqlflow AST into a python launcher program of xgboost. - -Since xgblauncher provide `DataSource` and `ModelSource`, abstraction of custom I/O pipeline, by which we can reuse data/model pipeline of `sqlflow_submitter`. - -The full documentation of xgblauncher will be available soon. Below, we show a demonstration of DataSource/ModelSource API. - -```python -class DataSource: - """ - DataSource API - A handler of data reading/writing, which is compatible with both single-machine and distributed runtime. 
- """ - def __init__(self, - rank: int, - num_worker: int, - column_conf: configs.ColumnFields, - source_conf): - pass - - @abstractmethod - def read(self) -> Iterator[XGBoostRecord]: - pass - - @abstractmethod - def write(self, result_iter: Iterator[XGBoostResult]): - pass - - -class ModelSource: - """ - ModelSource API - A handler by which XGBLauncher save/load model(booster) and related information. - """ - def __init__(self, source_conf): - pass - - @abstractmethod - def read_buffer(self, model_path: str) -> bytes: - pass - - @abstractmethod - def write_buffer(self, buf: bytes, model_path: str): - pass - - @abstractmethod - def read_lines(self, model_path: str) -> List[str]: - pass - - @abstractmethod - def write_lines(self, lines: List[str], model_path: str): - pass -``` - - -With the help of xgblauncher, we can launch xgboost from sqlflow AST via a lightweight python `code template` and a corrsponding `filler`. -The `code template` roughly includes components as follows: - -* `TFDataSource` that is responsible for fetching and pre-processing data via tf.feature_columns API. - Data will be fetched in mini-batch style by executing TF compute graph with mini-batch data feed by `sqlflow_submitter.db.db_generator`. - -* `DBDataSource` that is responsible for writing prediction results into specific data base. - The writing action can be implemented via `sqlflow_submitter.db.insert_values`. - -* `LocalModelSource` that is responsible for reading/writing _gboost models on local file system. - -* Configure template building and entry point of xgblauncher. - - -#### Running distributed xgboost job on k8s cluster - -Distributed training is supported in xgboost via [rabit](https://github.com/dmlc/rabit), a reliable allreduce and broadcast interface for distributed machine learning. -To run a distributed xgboost job with `rabit`, all we need to do is setup a distributed environment. 
- -For now, xgboost has been bind to some popular distributed computing frameworks, such as Apache Spark, Apache Flink, Dask. -However, specific computing frameworks are not always available in production environments. -So, we propose a cloud-native approach: running xgboost directly on `k8s cluster`. - -As `xgblauncher` is scalable and docker-friendly, xgblauncher-based containers can be easily orchestrated by [xgboost operator](https://github.com/kubeflow/xgboost-operator), -a specific kubernetes controller for (distributed) xgboost jobs. -With the help of `xgboost operator`, it is easy to handle `XGBoostJob` via `kuberentes API`, a kubernetes' custom resource defined by `xgboost operator`. - -`XGBoostJob` building and tracking will be integrated to `xgblauncher` in near future. -After that, we can generate python codes with an option to decide whether running xgboost job locally or submitting it to remote k8s cluster. From 0b1d9a3a94ffbf02bb5df7a8874eecb72873a498 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:53:03 +0800 Subject: [PATCH 14/20] remove unused code --- sql/codegen_xgboost.go | 1 - 1 file changed, 1 deletion(-) diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index 5fec5673f2..acf03e11bd 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -109,7 +109,6 @@ func genXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fie return e } if pr.train { - fmt.Println(r.TrainCfgJSON) return xgbTrainTemplate.Execute(w, r) } return fmt.Errorf("xgboost prediction codegen has not been implemented") From 005b754ff5f002dd874e33534a08575b57066859 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Tue, 3 Sep 2019 21:06:39 -0700 Subject: [PATCH 15/20] Update xgboost_on_sqlflow_design.md --- doc/xgboost_on_sqlflow_design.md | 80 +++++++++++++++++--------------- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 
a4bc049436..cbf14cda35 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -2,44 +2,48 @@ ## Introduction -This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how -we implement it. - -## Design - -We prefer users to execute the SQLFlow Train/Predict SQL as follows: - - ``` sql - SELECT * FROM train_table - TRAIN xgboost.multi.softmax - WITH - train.objective="multi:softmax", - train.num_round=2, - params.max_depth=2, - params.eta=1 - LABEL class - INTO my_xgb_model; - ``` - - ``` sql - SELECT * FROM test_table - PREDICT pred_table.result - USING my_xgb_model; - ``` - -where: -- `my_xgb_model` is the trained model. -- `xgboost.multi.softmax` specify the training model: - - The prefix `xgboost.` is used to distinguish with Tensorflow model. - - `multi.softmax` is the learning task, SQLFlow would fill it to [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters): `objective=multi:softmax`. -- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). -- The prefix `params.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. - -`codegen_xgboost.go` would generate an XGBoost Python program including: -- Generate the XGBoost input database. -- Pass the train/predict parameters to XGBoost Python program. -- Save the trained model. -- Using [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of [Sckiet-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) just because we prefer explain the XGBoost model by [SHAP](https://github.com/slundberg/shap). 
+This design explains how SQLFlow calls [XGBoost](https://xgboost.ai/) for training models and prediction. + +## Usage + +To explain the benefit of integrating XGBoost with SQLFlow, let us start with an example. The following SQLFlow code snippet shows how users can train an XGBoost tree model named `my_xgb_model`. + +``` sql +SELECT * FROM train_table +TRAIN xgboost.multi.softmax +WITH + train.objective="multi:softmax", + train.num_round=2, + params.max_depth=2, + params.eta=1 +LABEL class +INTO my_xgb_model; +``` + +The following example shows how to predict using the model `my_xgb_model`. + +``` sql +SELECT * FROM test_table +PREDICT pred_table.result +USING my_xgb_model; +``` + +In the above examples, +- `my_xgb_model` names the trained model. +- `xgboost.multi.softmax` is the model spec, where + - the prefix `xgboost.` indicates that the model is an XGBoost one, not a TensorFlow model, and + - `multi.softmax` names an [XGBoost learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters). +- In the `WITH` clause, + - keys with the prefix `train.` identify parameters of the XGBoost API [`xgboost.train`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), and + - the prefix `params.` identifies [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. + +## The Code Generator + +The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: +1. Generate the XGBoost input database. +1. Pass the train/predict parameters to the XGBoost Python program. +1. Save the trained model. +1.
Use the [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of the [Scikit-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) because we prefer to explain the XGBoost model with [SHAP](https://github.com/slundberg/shap). ### Input Format From 60ca03022ba26383461a42ea700acd734af100bc Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 13:48:04 +0800 Subject: [PATCH 16/20] remove xgb resolver --- sql/codegen_xgboost.go | 21 +---------- sql/expression_resolver_xgb.go | 64 ---------------------------------- 2 files changed, 1 insertion(+), 84 deletions(-) delete mode 100644 sql/expression_resolver_xgb.go diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index acf03e11bd..3822ae7e67 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -37,30 +37,11 @@ type xgbFiller struct { *connectionConfig } -func fillXGBTrainCfg(rt *resolvedXGBTrainClause) (*xgbTrainConfig, error) { - // TODO(Yancey1989): fill all the training control parameters - c := &xgbTrainConfig{ - NumBoostRound: rt.NumBoostRound, - Maximize: rt.Maximize, - } - return c, nil -} - func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) (*xgbFiller, error) { - rt, err := resolveXGBTrainClause(&pr.trainClause) + var err error training, validation := trainingAndValidationDataset(pr, ds) - if err != nil { - return nil, err - } - - trainCfg, err := fillXGBTrainCfg(rt) - if err != nil { - return nil, err - } - r := &xgbFiller{ IsTrain: pr.train, - TrainCfg: trainCfg, TrainingDatasetSQL: training, ValidationDatasetSQL: validation, Save: pr.save, diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go deleted file mode 100644 index 28b102eb9f..0000000000 --- a/sql/expression_resolver_xgb.go +++ /dev/null @@ -1,64 +0,0 @@ -// Copyright 2019 The SQLFlow Authors. All rights reserved.
-// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package sql - -import ( - "fmt" - "strconv" -) - -type resolvedXGBTrainClause struct { - NumBoostRound int - Maximize bool - ParamsAttr map[string]*attribute -} - -func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { - attrs, err := resolveAttribute(&tc.trainAttrs) - if err != nil { - return nil, err - } - getIntAttr := func(key string, defaultValue int) int { - if p, ok := attrs[key]; ok { - strVal, _ := p.Value.(string) - intVal, err := strconv.Atoi(trimQuotes(strVal)) - defer delete(attrs, p.FullName) - if err == nil { - return intVal - } - fmt.Printf("ignore invalid %s=%s, default is %d", key, p.Value, defaultValue) - } - return defaultValue - } - getBoolAttr := func(key string, defaultValue bool, optional bool) bool { - if p, ok := attrs[key]; ok { - strVal, _ := p.Value.(string) - boolVal, err := strconv.ParseBool(trimQuotes(strVal)) - if !optional { - defer delete(attrs, p.FullName) - } - if err == nil { - return boolVal - } else if !optional { - fmt.Printf("ignore invalid %s=%s, default is %v", key, p.Value, defaultValue) - } - } - return defaultValue - } - - return &resolvedXGBTrainClause{ - NumBoostRound: getIntAttr("train.num_boost_round", 10), - Maximize: getBoolAttr("train.maximize", false, true), - }, nil -} From f71b5c37edcbe7c547aba330852f87d25c568b3b Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 16:32:48 +0800 Subject: [PATCH 17/20] update --- 
doc/xgboost_on_sqlflow_design.md | 92 +++----------------------------- 1 file changed, 7 insertions(+), 85 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index cbf14cda35..aa9d9209ba 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -12,10 +12,9 @@ To explain the benefit of integrating XGBoost with SQLFlow, let us start with an SELECT * FROM train_table TRAIN xgboost.multi.softmax WITH - train.objective="multi:softmax", train.num_round=2, - params.max_depth=2, - params.eta=1 + max_depth=2, + eta=1 LABEL class INTO my_xgb_model; @@ -35,89 +34,12 @@ In the above examples, - `multi.softmax` names an [XGBoost learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters). - In the `WITH` clause, - keys with the prefix `train.` identifies parameters of XGBoost API [`xgboost.train`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), and - - the prefix `params.` identifies [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. + - keys without any prefix identify [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Generate the XGBoost input database. -1. Pass the train/predict parameters to XGBoost Python program. -1. Save the trained model. -1.
Using [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of [Sckiet-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) just because we prefer explain the XGBoost model by [SHAP](https://github.com/slundberg/shap). - -### Input Format - -SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator) that takes the -`SELECT STATEMENT` as the input and outputs a iterable function which -yields `(features, label)` for each iteration call. `codegen_xgboost` would reuse the `db_generator` -to generate the XGBoost input database. - -XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that -takes `db_generator` as the input and outputs text files with LibSVM format. - -- For the **basic** input format - - the train table can be like: - - ``` text - col0 | col1 | col2 | label - 1.1 NULL 2.2 1 - 0.8 2.0 2.2 2 - 0.2 3.0 NULL 0 - 0.77 4.0 2.6 2 - ``` - - `codegen_xgboost.go` would write down the `train.txt` file like: - - ``` text - 1 0:1.1 2:2.2 - 2 0:0.8 1:2.0 3:2.2 - 0 0:0.2 1:3.0 - 2 0:0.77 1:4.0 2:2.6 - ``` - -- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement like: - - ``` sql - SELECT * FROM train_table - TRAIN XGBoost - LABEL class - WITH - train.group_column=group - ... 
- ``` - - The group column in table can be like: - - ``` text - col1 | col2| col3 | label | group - 1.1 2.0 2.2 1 1 - 0.8 2.0 2.2 2 1 - 0.2 3.0 4.2 0 2 - 0.77 4.0 2.6 2 3 - ``` - - `codegen_xgboost.go` would write down the `train.txt.group` file like: - - ``` text - 2 - 1 - 1 - ``` - -- For the **weight** input format, users can specify the weight column like `group`: - - ``` sql - SELECT * FROM train_table - TRAIN XGBoost - LABEL class - WITH - train.weight_column=weight - ``` - - `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. - -## TBD - -- Implement auto-train feature to search the parameter. -- Support the sparse data format. +1. Transport the user-typed **SELECT STATEMENT** into [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) which is the Data Matrix used in XGBoost. +1. Fill the training control arguments and xgboost parameters according to the user-typed **WITH STATEMENT**. +1. Save the trained model on disk. +1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and `dtest` which is a DMatrix object generated from **PREDICT SELECT STATEMENT** to output the prediction result. 
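The generated training program that this patch describes (load the SELECT result as a DMatrix, fill the `xgboost.train` arguments, save the model) might be produced by a template-filling step. The sketch below is illustrative only: the template text and the names `TRAIN_TEMPLATE`, `train.txt`, and `my_xgb_model` are assumptions, not the actual contents of `codegen_xgboost.go`, and `num_class` is added here because `multi:softmax` requires it.

``` python
TRAIN_TEMPLATE = """import xgboost as xgb

dtrain = xgb.DMatrix("{train_file}")  # text file written from the SELECT result
bst = xgb.train({params}, dtrain, num_boost_round={num_round})
bst.save_model("{model_path}")        # kept on disk for later PREDICT statements
"""

# Fill the template with the values resolved from the example statement.
program = TRAIN_TEMPLATE.format(
    train_file="train.txt",
    params={"objective": "multi:softmax", "max_depth": 2, "eta": 1, "num_class": 3},
    num_round=2,
    model_path="my_xgb_model",
)
```

The emitted `program` string is what the submitter would hand to a Python interpreter.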
From bce4a2df7f3e3e893586f44730b42105a2c53f79 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 16:35:30 +0800 Subject: [PATCH 18/20] fix conflict --- sql/executor_test.go | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/sql/executor_test.go b/sql/executor_test.go index d62c288aaa..7914df072d 100644 --- a/sql/executor_test.go +++ b/sql/executor_test.go @@ -113,16 +113,6 @@ func TestExecutorTrainXGBoost(t *testing.T) { }) } -func TestExecutorTrainXGBoost(t *testing.T) { - a := assert.New(t) - modelDir := "" - a.NotPanics(func() { - stream := runExtendedSQL(testXGBoostTrainSelectIris, testDB, modelDir, nil) - a.True(goodStream(stream.ReadAll())) - - }) -} - func TestExecutorTrainAndPredictDNN(t *testing.T) { a := assert.New(t) modelDir := "" From 1366feecc5d8674b191edd4d2bf6f4c347dd1e46 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 17:12:49 +0800 Subject: [PATCH 19/20] remove some details section --- doc/xgboost_on_sqlflow_design.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index aa9d9209ba..e5a966ee6c 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -39,7 +39,9 @@ The the above examples, ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Transport the user-typed **SELECT STATEMENT** into [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) which is the Data Matrix used in XGBoost. -1. Fill the training control arguments and xgboost parameters according to the user-typed **WITH STATEMENT**. +1. 
Execute the user-typed **SELECT STATEMENT** to retrieve the training data from the SQL engine, then convert it to +[XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) +which is the Data Matrix used in XGBoost. +1. Parse and resolve the **WITH** clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. -1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and `dtest` which is a DMatrix object generated from **PREDICT SELECT STATEMENT** to output the prediction result. +1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and test data to output the prediction result to a SQL engine. From bdd68bec2841c0a3b932ce9149dd17b661a50740 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 23:29:20 +0800 Subject: [PATCH 20/20] update follows the comment --- doc/xgboost_on_sqlflow_design.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index e5a966ee6c..d9f5841170 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -39,9 +39,7 @@ In the above examples, ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Execute the user-typed **SELECT STATEMENT** to retrieve the training data from the SQL engine, then convert it to -[XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) -which is the Data Matrix used in XGBoost. -1. Parse and resolve the **WITH** clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. -1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and test data to output the prediction result to a SQL engine. +1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
+1. Parse and resolve the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. +1. For the PREDICT clause, it loads the trained model and test data and then outputs the prediction result to a SQL engine.
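For completeness, the predict-side program described in the final version above can be sketched the same way. As before, the template text and the names `PREDICT_TEMPLATE`, `test.txt`, and `my_xgb_model` are illustrative assumptions rather than the real generated code; `pred_table.result` follows the earlier example statements.

``` python
PREDICT_TEMPLATE = """import xgboost as xgb

bst = xgb.Booster(model_file="{model_path}")  # model saved by the TRAIN statement
dtest = xgb.DMatrix("{test_file}")            # test data written from the PREDICT SELECT statement
preds = bst.predict(dtest)
# write `preds` back into {result_table} through the SQL engine
"""

program = PREDICT_TEMPLATE.format(
    model_path="my_xgb_model",
    test_file="test.txt",
    result_table="pred_table.result",
)
```

Writing the predictions back closes the loop: the result lands in a table that further SQL statements can query directly.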