From d8fe3480f5963711a6b801e479b8cc61d54e84a3 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:31:31 +0800 Subject: [PATCH 01/20] rename antxgboost design --- ...boost_on_sqlflow_design.md => antxgboost_on_sqlflow_design.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename doc/{xgboost_on_sqlflow_design.md => antxgboost_on_sqlflow_design.md} (100%) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/antxgboost_on_sqlflow_design.md similarity index 100% rename from doc/xgboost_on_sqlflow_design.md rename to doc/antxgboost_on_sqlflow_design.md From 999c673340666086187a021d82d5351118988932 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:31:49 +0800 Subject: [PATCH 02/20] add xgboost design --- doc/xgboost_on_sqlflow_design.md | 109 +++++++++++++++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 doc/xgboost_on_sqlflow_design.md diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md new file mode 100644 index 0000000000..36b844a745 --- /dev/null +++ b/doc/xgboost_on_sqlflow_design.md @@ -0,0 +1,109 @@ +# Design Doc: XGBoost on SQLFlow + +## Introduction + +This design doc introduces how do users train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how +we implement it. + +## Design + +We prefer users to execute the SQLFlow Train/Predict SQL as follows: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + WITH + train.objective="multi:softmax", + train.num_round=2, + model.max_depth=2, + model.eta=1 + INTO my_xgb_model; + ``` + + ``` sql + SELECT * FROM test_table + PREDICT pred_table.result + USING my_xgb_model; + ``` + +where: +- `my_xgb_model` is the trained model. +- The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. 
+- The prefix `train.` in the `WITH` clause maps to the training arguments of the XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), except the `params` argument.
+- The prefix `model.` in the `WITH` clause maps to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html).
+
+`codegen_xgboost.go` would generate an XGBoost Python program according to the XGBoost SQL, including:
+- Prepare the input data.
+- Pass the arguments to the XGBoost Python program.
+- Save the trained model.
+
+### Input Format
+
+XGBoost uses `DMatrix` as its input structure. According to the [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to reuse `db_generator` and
+generate text files in LibSVM format.
+
+- For the basic input format, `db_generator` would yield `(features, label)` for each iteration.
+
+    The train table can look like:
+
+    ``` text
+    col0 | col1 | col2 | label
+    1.1    NULL   2.2    1
+    0.8    2.0    2.2    2
+    0.2    3.0    NULL   0
+    0.77   4.0    2.6    2
+    ```
+
+    `codegen_xgboost.go` would write the `train.txt` file like:
+
+    ``` text
+    1 0:1.1 2:2.2
+    2 0:0.8 1:2.0 2:2.2
+    0 0:0.2 1:3.0
+    2 0:0.77 1:4.0 2:2.6
+    ```
+
+- For the group information, users can easily specify the group column by `train.group_column` in the `WITH` clause,
+just like:
+
+    ``` sql
+    SELECT * FROM train_table
+    TRAIN XGBOOST
+    WITH
+      train.group_column=group
+    ...
+ ``` + + The group column in table can be like: + + ``` text + col1 | col2| col3 | label | group + 1.1 2.0 2.2 1 1 + 0.8 2.0 2.2 2 1 + 0.2 3.0 4.2 0 2 + 0.77 4.0 2.6 2 3 + ``` + + `codegen_xgboost.go` would write down the `train.txt.group` file like: + + ``` text + 2 + 1 + 1 + ``` + +- For the `Weight` information, users can specify the weight column like `group`: + + ``` sql + SELECT * FROM train_table + TRAIN XGBOOST + WITH + train.weight_column=weight + ``` + + `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. + +## TBD + +- Implement auto-train feature to search the parameter. +- Support the sparse data format. From eb17739e2b639ab4c3db8323885fc7ded806e062 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:33:36 +0800 Subject: [PATCH 03/20] update --- doc/xgboost_on_sqlflow_design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 36b844a745..1d2132d0b3 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -29,7 +29,7 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. - The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. -- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train) except the `params` arguments. +- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). 
- The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); `codegen_xgboost.go` would generate a XGBoost Python program accoding to the XGBoost SQL including: From c30485ac65cc9da19e626258f4a8cefeef3ae6fe Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sat, 31 Aug 2019 17:34:34 +0800 Subject: [PATCH 04/20] update --- doc/xgboost_on_sqlflow_design.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 1d2132d0b3..1f590937bd 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -42,7 +42,7 @@ where: XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement refuse `db.generator` and generate text files as LibSVM format. -- For the basic input format, `db_geneator` would yield `(features, label)` for each iteration +- For the **basic** input format, `db_geneator` would yield `(features, label)` for each iteration the train table can be like: @@ -63,7 +63,7 @@ generate text files as LibSVM format. 2 0:0.77 1:4.0 2:2.6 ``` -- For the group information, users can easy to specify the group column by `train.group_column` in the WITH statement +- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement , just like: ``` sql @@ -92,7 +92,7 @@ generate text files as LibSVM format. 
1 ``` -- For the `Weight` information, users can specify the weight column like `group`: +- For the **weight** input format, users can specify the weight column like `group`: ``` sql SELECT * FROM train_table From ba274b1d8a682e80eb232dcf6ac333920f66c679 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sun, 1 Sep 2019 22:50:27 +0800 Subject: [PATCH 05/20] update doc --- doc/xgboost_on_sqlflow_design.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 1f590937bd..e32595cb29 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -28,21 +28,25 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. -- The keyword `XGBOOST` is used to distinguish with the Tensorflow Model. +- `XGBoost` is used to distinguish with the Tensorflow Model. - The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). - The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); -`codegen_xgboost.go` would generate a XGBoost Python program accoding to the XGBoost SQL including: -- Prepare the input data. -- pass the arguments to XGBoost Python program. +`codegen_xgboost.go` would generate a XGBoost Python program including: +- Generate the XGBoost input database. +- Pass the train/predict parameters to XGBoost Python program. - Save the trained model. ### Input Format -XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement refuse `db.generator` and -generate text files as LibSVM format. 
+SQLFlow implemented `db_generator` taht takes the `SELECT STATEMENT` as the input and outputs a iterator function which +yields `(features, label)` for each iteration. `codegen_xgboost` would reuse the `db_generator` to generate the XGBoost +input database. -- For the **basic** input format, `db_geneator` would yield `(features, label)` for each iteration +XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that +takes `db_generator` as the input and outputs text files with LibSVM format. + +- For the **basic** input format the train table can be like: From 320c3be1687de7dd91a129495b4515bcc39d420a Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Sun, 1 Sep 2019 23:25:34 +0800 Subject: [PATCH 06/20] polish doc --- doc/xgboost_on_sqlflow_design.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index e32595cb29..797cd9cf6f 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -2,7 +2,7 @@ ## Introduction -This design doc introduces how do users train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how +This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how we implement it. ## Design @@ -17,6 +17,7 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: train.num_round=2, model.max_depth=2, model.eta=1 + LABEL class INTO my_xgb_model; ``` @@ -28,20 +29,21 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: where: - `my_xgb_model` is the trained model. -- `XGBoost` is used to distinguish with the Tensorflow Model. +- `XGBoost` means to train an XGBoost model instead of the TensorFlow Model. 
- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). - The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html); -`codegen_xgboost.go` would generate a XGBoost Python program including: +`codegen_xgboost.go` would generate an XGBoost Python program including: - Generate the XGBoost input database. - Pass the train/predict parameters to XGBoost Python program. - Save the trained model. ### Input Format -SQLFlow implemented `db_generator` taht takes the `SELECT STATEMENT` as the input and outputs a iterator function which -yields `(features, label)` for each iteration. `codegen_xgboost` would reuse the `db_generator` to generate the XGBoost -input database. +SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator) that takes the +`SELECT STATEMENT` as the input and outputs a iterable function which +yields `(features, label)` for each iteration call. `codegen_xgboost` would reuse the `db_generator` +to generate the XGBoost input database. XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that takes `db_generator` as the input and outputs text files with LibSVM format. @@ -67,12 +69,12 @@ takes `db_generator` as the input and outputs text files with LibSVM format. 2 0:0.77 1:4.0 2:2.6 ``` -- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement -, just like: +- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement like: ``` sql SELECT * FROM train_table - TRAIN XGBOOST + TRAIN XGBoost + LABEL class WITH train.group_column=group ... 
@@ -100,7 +102,8 @@ takes `db_generator` as the input and outputs text files with LibSVM format. ``` sql SELECT * FROM train_table - TRAIN XGBOOST + TRAIN XGBoost + LABEL class WITH train.weight_column=weight ``` From e359497c94725cda8cd82c47b3103fa061adbdda Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 13:49:47 +0800 Subject: [PATCH 07/20] initialize xgboost codegen --- sql/codegen_ant_xgboost.go | 2 +- sql/codegen_xgboost.go | 116 +++++++++++++++++++++++++++++++++ sql/codegen_xgboost_test.go | 25 +++++++ sql/executor.go | 15 +++-- sql/executor_test.go | 10 +++ sql/expression_resolver_xgb.go | 64 ++++++++++++++++++ sql/template_xgboost.go | 82 +++++++++++++++++++++++ 7 files changed, 309 insertions(+), 5 deletions(-) create mode 100644 sql/codegen_xgboost.go create mode 100644 sql/codegen_xgboost_test.go create mode 100644 sql/expression_resolver_xgb.go create mode 100644 sql/template_xgboost.go diff --git a/sql/codegen_ant_xgboost.go b/sql/codegen_ant_xgboost.go index 52c57a516e..cb10777ab6 100644 --- a/sql/codegen_ant_xgboost.go +++ b/sql/codegen_ant_xgboost.go @@ -790,7 +790,7 @@ func xgCreatePredictionTable(pr *extendedSelect, r *antXGBoostFiller, db *DB) er return nil } -func genXG(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { +func genAntXGboost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { r, e := newAntXGBoostFiller(pr, ds, db) if e != nil { return e diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go new file mode 100644 index 0000000000..879a7ea68c --- /dev/null +++ b/sql/codegen_xgboost.go @@ -0,0 +1,116 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package sql + +import ( + "fmt" + "io" + "text/template" +) + +type xgbTrainConfig struct { + NumBoostRound int `json:"num_boost_round,omitempty"` + Maximize bool `json:"maximize,omitempty"` +} + +type xgbFiller struct { + IsTrain bool + TrainingDatasetSQL string + ValidationDatasetSQL string + TrainCfg *xgbTrainConfig + Features []*featureMeta + Label *featureMeta + ParamsCfgJSON string + TrainCfgJSON string + *connectionConfig +} + +func fillXGBTrainCfg(rt *resolvedXGBTrainClause) (*xgbTrainConfig, error) { + // TODO(Yancey1989): fill all the training control parameters + c := &xgbTrainConfig{ + NumBoostRound: rt.NumBoostRound, + Maximize: rt.Maximize, + } + return c, nil +} + +func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) (*xgbFiller, error) { + rt, err := resolveXGBTrainClause(&pr.trainClause) + training, validation := trainingAndValidationDataset(pr, ds) + if err != nil { + return nil, err + } + + trainCfg, err := fillXGBTrainCfg(rt) + if err != nil { + return nil, err + } + + r := &xgbFiller{ + IsTrain: pr.train, + TrainCfg: trainCfg, + TrainingDatasetSQL: training, + ValidationDatasetSQL: validation, + } + // TODO(Yancey1989): fill the train_args and parameters by WITH statment + r.TrainCfgJSON = "" + r.ParamsCfgJSON = "" + + if r.connectionConfig, err = newConnectionConfig(db); err != nil { + return nil, err + } + + for _, columns := range pr.columns { + feaCols, colSpecs, err := resolveTrainColumns(&columns) + if err != nil { + return nil, err + } + if len(colSpecs) != 0 { + return nil, 
fmt.Errorf("newFiller doesn't support DENSE/SPARSE") + } + for _, col := range feaCols { + fm := &featureMeta{ + FeatureName: col.GetKey(), + Dtype: col.GetDtype(), + Delimiter: col.GetDelimiter(), + InputShape: col.GetInputShape(), + IsSparse: false, + } + r.Features = append(r.Features, fm) + } + } + r.Label = &featureMeta{ + FeatureName: pr.label, + Dtype: "int32", + Delimiter: ",", + InputShape: "[1]", + IsSparse: false, + } + + return r, nil +} + +func genXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { + r, e := newXGBFiller(pr, ds, fts, db) + if e != nil { + return e + } + if pr.train { + fmt.Println(r.TrainCfgJSON) + return xgbTrainTemplate.Execute(w, r) + } + return fmt.Errorf("xgboost prediction codegen has not been implemented") +} + +var xgbTrainTemplate = template.Must(template.New("codegenXGBTrain").Parse(xgbTrainTemplateText)) diff --git a/sql/codegen_xgboost_test.go b/sql/codegen_xgboost_test.go new file mode 100644 index 0000000000..b1aa40351c --- /dev/null +++ b/sql/codegen_xgboost_test.go @@ -0,0 +1,25 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package sql + +const testXGBoostTrainSelectIris = ` +SELECT * +FROM iris.train +TRAIN xgb.multi.softprob +WITH + train.num_boost_round = 30 +COLUMN sepal_length, sepal_width, petal_length, petal_width +LABEL class +INTO sqlflow_models.my_xgboost_model; +` diff --git a/sql/executor.go b/sql/executor.go index 17187c56dd..ea9435eeb8 100644 --- a/sql/executor.go +++ b/sql/executor.go @@ -387,8 +387,15 @@ func train(wr *PipeWriter, tr *extendedSelect, db *DB, cwd string, modelDir stri var program bytes.Buffer if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate train pipeline for ant-xgboost to support remote mode - if e := genXG(&program, tr, ds, fts, db); e != nil { - return fmt.Errorf("genXG %v", e) + if e := genAntXGboost(&program, tr, ds, fts, db); e != nil { + return fmt.Errorf("genAntXGBoost %v", e) + } + } else if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGB.`) { + // FIXME(Yancey1989): it's a temporary solution, just for the unit test, we perfer to distinguish + // xgboost and ant-xgboost with env SQLFLOW_WITH_ANTXGBOOST, + // issue: https://github.com/sql-machine-learning/sqlflow/issues/758 + if e := genXGBoost(&program, tr, ds, fts, db); e != nil { + return fmt.Errorf("GenXGBoost %v", e) } } else { if e := genTF(&program, tr, ds, fts, db); e != nil { @@ -453,8 +460,8 @@ func pred(wr *PipeWriter, pr *extendedSelect, db *DB, cwd string, modelDir strin var buf bytes.Buffer if strings.HasPrefix(strings.ToUpper(pr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate pred pipeline for ant-xgboost to support remote mode - if e := genXG(&buf, pr, nil, fts, db); e != nil { - return fmt.Errorf("genXG %v", e) + if e := genAntXGboost(&buf, pr, nil, fts, db); e != nil { + return fmt.Errorf("genAntXGBoost %v", e) } } else { if e := genTF(&buf, pr, nil, fts, db); e != nil { diff --git a/sql/executor_test.go b/sql/executor_test.go index bf65b9d548..a360be58e8 100644 --- a/sql/executor_test.go +++ 
b/sql/executor_test.go @@ -88,6 +88,16 @@ func TestExecutorTrainAnalyzePredictAntXGBoost(t *testing.T) { }) } +func TestExecutorTrainXGBoost(t *testing.T) { + a := assert.New(t) + modelDir := "" + a.NotPanics(func() { + stream := runExtendedSQL(testXGBoostTrainSelectIris, testDB, modelDir, nil) + a.True(goodStream(stream.ReadAll())) + + }) +} + func TestExecutorTrainAndPredictDNN(t *testing.T) { a := assert.New(t) modelDir := "" diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go new file mode 100644 index 0000000000..d44cbcd41e --- /dev/null +++ b/sql/expression_resolver_xgb.go @@ -0,0 +1,64 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +package sql + +import ( + "fmt" + "strconv" +) + +type resolvedXGBTrainClause struct { + NumBoostRound int + Maximize bool + ParamsAttr map[string]*attribute +} + +func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { + attrs, err := resolveAttribute(&tc.trainAttrs) + if err != nil { + return nil, err + } + getIntAttr := func(key string, defaultValue int) int { + if p, ok := attrs[key]; ok { + strVal, _ := p.Value.(string) + intVal, err := strconv.Atoi(trimQuotes(strVal)) + defer delete(attrs, p.FullName) + if err == nil { + return intVal + } + fmt.Printf("ignore invalid %s=%s, default is %d", key, p.Value, defaultValue) + } + return defaultValue + } + getBoolAttr := func(key string, defaultValue bool, optional bool) bool { + if p, ok := attrs[key]; ok { + strVal, _ := p.Value.(string) + boolVal, err := strconv.ParseBool(trimQuotes(strVal)) + if !optional { + defer delete(attrs, p.FullName) + } + if err == nil { + return boolVal + } else if !optional { + fmt.Printf("ignore invalid %s=%s, default is %v", key, p.Value, defaultValue) + } + } + return defaultValue + } + return &resolvedXGBTrainClause{ + NumBoostRound: getIntAttr("train.num_boost_round", 10), + Maximize: getBoolAttr("train.maximize", false, true), + ParamsAttr: filter(attrs, "params", true), + }, nil +} diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go new file mode 100644 index 0000000000..8cdf256ccf --- /dev/null +++ b/sql/template_xgboost.go @@ -0,0 +1,82 @@ +// Copyright 2019 The SQLFlow Authors. All rights reserved. +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package sql + +const xgbTrainTemplateText = ` +import xgboost as xgb +from sqlflow_submitter.db import connect, db_generator + +driver="{{.Driver}}" + +{{if ne .Database ""}} +database="{{.Database}}" +{{else}} +database="" +{{end}} + +session_cfg = {} +{{ range $k, $v := .Session }} +session_cfg["{{$k}}"] = "{{$v}}" +{{end}} + +{{if ne .TrainCfgJSON ""}} +train_args = {{.TrainCfgJSON}} +{{else}} +train_args = {} +{{end}} + +{{if ne .ParamsCfgJSON ""}} +params = {{.ParamsCfgJSON}} +{{else}} +params = {} +{{end}} + +feature_column_names = [{{range .Features}} +"{{.FeatureName}}", +{{end}}] + +{{/* Convert go side featureSpec to python dict for input_fn */}} +feature_specs = dict() +{{ range $value := .Features }} +feature_specs["{{$value.FeatureName}}"] = { + "feature_name": "{{$value.FeatureName}}", + "dtype": "{{$value.Dtype}}", + "delimiter": "{{$value.Delimiter}}", + "shape": {{$value.InputShape}}, + "is_sparse": "{{$value.IsSparse}}" == "true" +} +{{end}} + + + +conn = connect(driver, database, user="{{.User}}", password="{{.Password}}", host="{{.Host}}", port={{.Port}}, auth="{{.Auth}}") + +def xgb_dataset(fn, dataset_sql): + gen = db_generator(driver, conn, session_cfg, dataset_sql, feature_column_names, "{{.Label.FeatureName}}", feature_specs) + with open(fn, 'w') as f: + for item in gen(): + features, label = item + row_data = [str(label[0])] + ["%d:%f" % (i, v) for i, v in enumerate(features)] + f.write("\t".join(row_data) + "\n") + # TODO(yancey1989): genearte group and weight text file if necessary + return xgb.DMatrix(fn) + +dtrain = 
xgb_dataset('train.txt', "{{.TrainingDatasetSQL}}") +dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") + +//TODO(Yancey1989): specify the eval metrics by WITH statement in SQL +train_args["evals"] = [(dtest, "auc")] +bst = xgb.train(params, dtrain, **train_args) +bst.save_model() +` From 50e703161daf2f0808aacba5c1fab0bb1fad9522 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 13:50:38 +0800 Subject: [PATCH 08/20] initialize xgboost codegen --- sql/template_xgboost.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go index 8cdf256ccf..bf1d4fc7c5 100644 --- a/sql/template_xgboost.go +++ b/sql/template_xgboost.go @@ -75,7 +75,7 @@ def xgb_dataset(fn, dataset_sql): dtrain = xgb_dataset('train.txt', "{{.TrainingDatasetSQL}}") dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") -//TODO(Yancey1989): specify the eval metrics by WITH statement in SQL +#TODO(Yancey1989): specify the eval metrics by WITH statement in SQL train_args["evals"] = [(dtest, "auc")] bst = xgb.train(params, dtrain, **train_args) bst.save_model() From e83530b431cc7ce64b5cc038632f250399c3c95b Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 14:24:34 +0800 Subject: [PATCH 09/20] init xgboost codegen --- sql/codegen_xgboost.go | 4 +++- sql/expression_resolver_xgb.go | 2 +- sql/template_xgboost.go | 2 +- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index 879a7ea68c..5fec5673f2 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -31,6 +31,7 @@ type xgbFiller struct { TrainCfg *xgbTrainConfig Features []*featureMeta Label *featureMeta + Save string ParamsCfgJSON string TrainCfgJSON string *connectionConfig @@ -62,6 +63,7 @@ func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db TrainCfg: trainCfg, TrainingDatasetSQL: training, ValidationDatasetSQL: validation, + Save: pr.save, } // 
TODO(Yancey1989): fill the train_args and parameters by WITH statment r.TrainCfgJSON = "" @@ -77,7 +79,7 @@ func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db return nil, err } if len(colSpecs) != 0 { - return nil, fmt.Errorf("newFiller doesn't support DENSE/SPARSE") + return nil, fmt.Errorf("newXGBoostFiller doesn't support DENSE/SPARSE") } for _, col := range feaCols { fm := &featureMeta{ diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go index d44cbcd41e..28b102eb9f 100644 --- a/sql/expression_resolver_xgb.go +++ b/sql/expression_resolver_xgb.go @@ -56,9 +56,9 @@ func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { } return defaultValue } + return &resolvedXGBTrainClause{ NumBoostRound: getIntAttr("train.num_boost_round", 10), Maximize: getBoolAttr("train.maximize", false, true), - ParamsAttr: filter(attrs, "params", true), }, nil } diff --git a/sql/template_xgboost.go b/sql/template_xgboost.go index bf1d4fc7c5..7b3776f900 100644 --- a/sql/template_xgboost.go +++ b/sql/template_xgboost.go @@ -78,5 +78,5 @@ dtest = xgb_dataset('test.txt', "{{.ValidationDatasetSQL}}") #TODO(Yancey1989): specify the eval metrics by WITH statement in SQL train_args["evals"] = [(dtest, "auc")] bst = xgb.train(params, dtrain, **train_args) -bst.save_model() +bst.save_model("{{.Save}}") ` From 545645ebcb3feb4111de43285a82f0d309600427 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 14:50:56 +0800 Subject: [PATCH 10/20] fix typo --- sql/codegen_ant_xgboost.go | 2 +- sql/executor.go | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/sql/codegen_ant_xgboost.go b/sql/codegen_ant_xgboost.go index da202f0377..dcc2302da9 100644 --- a/sql/codegen_ant_xgboost.go +++ b/sql/codegen_ant_xgboost.go @@ -795,7 +795,7 @@ func xgCreatePredictionTable(pr *extendedSelect, r *antXGBoostFiller, db *DB) er return nil } -func genAntXGboost(w io.Writer, pr *extendedSelect, ds 
*trainAndValDataset, fts fieldTypes, db *DB) error { +func genAntXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) error { r, e := newAntXGBoostFiller(pr, ds, db) if e != nil { return e diff --git a/sql/executor.go b/sql/executor.go index ea9435eeb8..c7c7a60011 100644 --- a/sql/executor.go +++ b/sql/executor.go @@ -387,7 +387,7 @@ func train(wr *PipeWriter, tr *extendedSelect, db *DB, cwd string, modelDir stri var program bytes.Buffer if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate train pipeline for ant-xgboost to support remote mode - if e := genAntXGboost(&program, tr, ds, fts, db); e != nil { + if e := genAntXGBoost(&program, tr, ds, fts, db); e != nil { return fmt.Errorf("genAntXGBoost %v", e) } } else if strings.HasPrefix(strings.ToUpper(tr.estimator), `XGB.`) { @@ -460,7 +460,7 @@ func pred(wr *PipeWriter, pr *extendedSelect, db *DB, cwd string, modelDir strin var buf bytes.Buffer if strings.HasPrefix(strings.ToUpper(pr.estimator), `XGBOOST.`) { // TODO(sperlingxx): write a separate pred pipeline for ant-xgboost to support remote mode - if e := genAntXGboost(&buf, pr, nil, fts, db); e != nil { + if e := genAntXGBoost(&buf, pr, nil, fts, db); e != nil { return fmt.Errorf("genAntXGBoost %v", e) } } else { From 7861959f2ba57f708ebae2cf3a66028204367ed5 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:22:24 +0800 Subject: [PATCH 11/20] update --- doc/xgboost_on_sqlflow_design.md | 119 +++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 doc/xgboost_on_sqlflow_design.md diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md new file mode 100644 index 0000000000..8c17b72703 --- /dev/null +++ b/doc/xgboost_on_sqlflow_design.md @@ -0,0 +1,119 @@ +# Design Doc: XGBoost on SQLFlow + +## Introduction + +This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by 
SQLFlow SQL and how
+we implement it.
+
+## Design
+
+We expect users to write the SQLFlow train/predict statements as follows:
+
+  ``` sql
+  SELECT * FROM train_table
+  TRAIN xgboost.multi.softmax
+  WITH
+    train.objective="multi:softmax",
+    train.num_round=2,
+    model.max_depth=2,
+    model.eta=1
+  LABEL class
+  INTO my_xgb_model;
+  ```
+
+  ``` sql
+  SELECT * FROM test_table
+  PREDICT pred_table.result
+  USING my_xgb_model;
+  ```
+
+where:
+- `my_xgb_model` is the trained model.
+- `xgboost.multi.softmax` specifies the model to train:
+  - The prefix `xgboost.` distinguishes XGBoost models from TensorFlow models.
+  - `multi.softmax` is the learning task; SQLFlow fills it into the [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters) as `objective=multi:softmax`.
+- The prefix `train.` in the `WITH` clause maps to the training arguments of the XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).
+- The prefix `model.` in the `WITH` clause maps to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html), except the `objective` parameter.
+
+`codegen_xgboost.go` would generate an XGBoost Python program that:
+- Generates the XGBoost input data.
+- Passes the train/predict parameters to the XGBoost Python program.
+- Saves the trained model.
+- Uses the [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of the [Scikit-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) because we prefer to explain the XGBoost model with [SHAP](https://github.com/slundberg/shap).
+
+### Input Format
+
+SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator), which takes the
+`SELECT` statement as the input and outputs an iterable function that
+yields `(features, label)` for each iteration call.
`codegen_xgboost` would reuse the `db_generator` +to generate the XGBoost input database. + +XGBoost uses `DMatrix` as the input structure. According to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that +takes `db_generator` as the input and outputs text files in the LibSVM format. + +- For the **basic** input format + + the train table can look like: + + ``` text + col0 | col1 | col2 | label + 1.1 NULL 2.2 1 + 0.8 2.0 2.2 2 + 0.2 3.0 NULL 0 + 0.77 4.0 2.6 2 + ``` + + `codegen_xgboost.go` would write the `train.txt` file like: + + ``` text + 1 0:1.1 2:2.2 + 2 0:0.8 1:2.0 2:2.2 + 0 0:0.2 1:3.0 + 2 0:0.77 1:4.0 2:2.6 + ``` + +- For the **group** input format, users can easily specify the group column via `train.group_column` in the WITH statement like: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + LABEL class + WITH + train.group_column=group + ... + ``` + + The group column in the table can look like: + + ``` text + col1 | col2 | col3 | label | group + 1.1 2.0 2.2 1 1 + 0.8 2.0 2.2 2 1 + 0.2 3.0 4.2 0 2 + 0.77 4.0 2.6 2 3 + ``` + + `codegen_xgboost.go` would write the `train.txt.group` file like: + + ``` text + 2 + 1 + 1 + ``` + +- For the **weight** input format, users can specify the weight column like `group`: + + ``` sql + SELECT * FROM train_table + TRAIN XGBoost + LABEL class + WITH + train.weight_column=weight + ``` + + `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. + +## TBD + +- Implement the auto-train feature to search for the best parameters. +- Support the sparse data format.
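The LibSVM conversion described above can be sketched in a few lines of Python. Note this is an illustrative sketch, not actual SQLFlow code: the names `libsvm_line` and `write_libsvm` are hypothetical, and the input is a plain list standing in for what `db_generator` would yield.

``` python
import os
import tempfile

def libsvm_line(features, label):
    # NULL (None) feature values are omitted, which yields a sparse LibSVM row.
    cols = ["%d:%s" % (i, v) for i, v in enumerate(features) if v is not None]
    return "%s %s" % (label, " ".join(cols))

def write_libsvm(rows, path):
    # rows: any iterable yielding (features, label) pairs, as db_generator does.
    with open(path, "w") as f:
        for features, label in rows:
            f.write(libsvm_line(features, label) + "\n")

# The example train_table from this document:
rows = [((1.1, None, 2.2), 1), ((0.8, 2.0, 2.2), 2),
        ((0.2, 3.0, None), 0), ((0.77, 4.0, 2.6), 2)]
train_txt = os.path.join(tempfile.mkdtemp(), "train.txt")
write_libsvm(rows, train_txt)
```

Group and weight files would follow the same pattern, emitting one group size or one weight per line instead of feature columns.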
From 9238117b8e0fa06381cc7c23155d2bc8c5d5240e Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:23:09 +0800 Subject: [PATCH 12/20] model to params --- doc/xgboost_on_sqlflow_design.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 8c17b72703..a4bc049436 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -15,8 +15,8 @@ We prefer users to execute the SQLFlow Train/Predict SQL as follows: WITH train.objective="multi:softmax", train.num_round=2, - model.max_depth=2, - model.eta=1 + params.max_depth=2, + params.eta=1 LABEL class INTO my_xgb_model; ``` @@ -33,7 +33,7 @@ where: - The prefix `xgboost.` is used to distinguish with Tensorflow model. - `multi.softmax` is the learning task, SQLFlow would fill it to [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters): `objective=multi:softmax`. - The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). -- The prefix `model.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. +- The prefix `params.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. `codegen_xgboost.go` would generate an XGBoost Python program including: - Generate the XGBoost input database. 
From 4d4f867742fcb9a6a47b9b7850d7e6840b1f139f Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:24:59 +0800 Subject: [PATCH 13/20] remove conflict file --- doc/antxgboost_on_sqlflow_design.md | 152 ---------------------------- 1 file changed, 152 deletions(-) delete mode 100644 doc/antxgboost_on_sqlflow_design.md diff --git a/doc/antxgboost_on_sqlflow_design.md b/doc/antxgboost_on_sqlflow_design.md deleted file mode 100644 index b8bdd9d0b1..0000000000 --- a/doc/antxgboost_on_sqlflow_design.md +++ /dev/null @@ -1,152 +0,0 @@ -# _Design:_ xgboost on sqlflow - -## Overview - -This is a design doc about why and how to support running xgboost via sqlflow as a machine learning estimator. - -We propose to build a lightweight python template for xgboost on basis of `xgblauncher`, -an incubating xgboost wrapper in [ant-xgboost](https://github.com/alipay/ant-xgboost). - -## Context - -Gradient boosting machine (GBM) is a widely used (supervised) machine learning method, -which trains a bunch of weak learners, typically decision trees, -in a gradual, additive and sequential manner. -A lot of winning solutions of data mining and machine learning challenges, -such as : Kaggle, KDD cup, are based on GBM or related techniques. - -There exists a lot of GBM frameworks (implementations), we propose to use [xgboost](https://xgboost.ai/) as backend of sqlflow, -which is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable, -often regarded as one of the best GBM frameworks. - - -## _Proposed Solution:_ ant-xgboost on sqlflow - -We propose to use [ant-xgboost](https://github.com/alipay/ant-xgboost) as backend, -which is consistent with [xgboost](https://github.com/dmlc/xgboost) in kernel level. -Because in `ant-xgboost`, there exists an incubating module named [xgblauncher](https://github.com/alipay/ant-xgboost/tree/ant_master/xgboost-launcher), -an extendable, cloud-native xgboost based machine learning pipeline. 
-Comparing to python API provided by `xgboost`, it is easier to build a python code template for xgboost task launching on basis of `xgblauncher`. - -### User Experience - -In terms of sqlflow users, xgboost is an alternative `Estimator` like `TensorFlow Estimators`. -Working with xgboost is quite similar to working with TensorFlow Estimators; just change `TRAIN DNNClassifier` into `TRAIN XGBoostEstimator`. - -In addition, xgboost specific parameters can be configured in the same way as TensorFlow parameters. - -Below is a demo about training/predicting via xgboost : - -```sql -// sample clause of train -select - c1, c2, c3, c4, c5 as class -from kaggle_credit_fraud_training_data -TRAIN XGBoostEstimator -WITH - booster = "gbtree" - objective = "logistic:binary" - eval_metric = "auc" - train_eval_ratio = 0.8 -COLUMN - c1, - NUMERIC(c2, 10), - BUCKET(c3, [0, 10, 100]), - c4 -LABEL class -INTO sqlflow_models.xgboost_model_table; - -// sample clause of predict -select - c1, c2, c3, c4 -from kaggle_credit_fraud_development_data -PREDICT kaggle_credit_fraud_development_data.class -USING sqlflow_models.xgboost_model_table; -``` - -### Implementation - -As `codegen.go` generating TensorFlow code from sqlflow AST, -we will add `codegen_xgboost.go` which translate sqlflow AST into a python launcher program of xgboost. - -Since xgblauncher provide `DataSource` and `ModelSource`, abstraction of custom I/O pipeline, by which we can reuse data/model pipeline of `sqlflow_submitter`. - -The full documentation of xgblauncher will be available soon. Below, we show a demonstration of DataSource/ModelSource API. - -```python -class DataSource: - """ - DataSource API - A handler of data reading/writing, which is compatible with both single-machine and distributed runtime. 
- """ - def __init__(self, - rank: int, - num_worker: int, - column_conf: configs.ColumnFields, - source_conf): - pass - - @abstractmethod - def read(self) -> Iterator[XGBoostRecord]: - pass - - @abstractmethod - def write(self, result_iter: Iterator[XGBoostResult]): - pass - - -class ModelSource: - """ - ModelSource API - A handler by which XGBLauncher save/load model(booster) and related information. - """ - def __init__(self, source_conf): - pass - - @abstractmethod - def read_buffer(self, model_path: str) -> bytes: - pass - - @abstractmethod - def write_buffer(self, buf: bytes, model_path: str): - pass - - @abstractmethod - def read_lines(self, model_path: str) -> List[str]: - pass - - @abstractmethod - def write_lines(self, lines: List[str], model_path: str): - pass -``` - - -With the help of xgblauncher, we can launch xgboost from sqlflow AST via a lightweight python `code template` and a corrsponding `filler`. -The `code template` roughly includes components as follows: - -* `TFDataSource` that is responsible for fetching and pre-processing data via tf.feature_columns API. - Data will be fetched in mini-batch style by executing TF compute graph with mini-batch data feed by `sqlflow_submitter.db.db_generator`. - -* `DBDataSource` that is responsible for writing prediction results into specific data base. - The writing action can be implemented via `sqlflow_submitter.db.insert_values`. - -* `LocalModelSource` that is responsible for reading/writing _gboost models on local file system. - -* Configure template building and entry point of xgblauncher. - - -#### Running distributed xgboost job on k8s cluster - -Distributed training is supported in xgboost via [rabit](https://github.com/dmlc/rabit), a reliable allreduce and broadcast interface for distributed machine learning. -To run a distributed xgboost job with `rabit`, all we need to do is setup a distributed environment. 
- -For now, xgboost has been bind to some popular distributed computing frameworks, such as Apache Spark, Apache Flink, Dask. -However, specific computing frameworks are not always available in production environments. -So, we propose a cloud-native approach: running xgboost directly on `k8s cluster`. - -As `xgblauncher` is scalable and docker-friendly, xgblauncher-based containers can be easily orchestrated by [xgboost operator](https://github.com/kubeflow/xgboost-operator), -a specific kubernetes controller for (distributed) xgboost jobs. -With the help of `xgboost operator`, it is easy to handle `XGBoostJob` via `kuberentes API`, a kubernetes' custom resource defined by `xgboost operator`. - -`XGBoostJob` building and tracking will be integrated to `xgblauncher` in near future. -After that, we can generate python codes with an option to decide whether running xgboost job locally or submitting it to remote k8s cluster. From 0b1d9a3a94ffbf02bb5df7a8874eecb72873a498 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Tue, 3 Sep 2019 19:53:03 +0800 Subject: [PATCH 14/20] remove unused code --- sql/codegen_xgboost.go | 1 - 1 file changed, 1 deletion(-) diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index 5fec5673f2..acf03e11bd 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -109,7 +109,6 @@ func genXGBoost(w io.Writer, pr *extendedSelect, ds *trainAndValDataset, fts fie return e } if pr.train { - fmt.Println(r.TrainCfgJSON) return xgbTrainTemplate.Execute(w, r) } return fmt.Errorf("xgboost prediction codegen has not been implemented") From 005b754ff5f002dd874e33534a08575b57066859 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Tue, 3 Sep 2019 21:06:39 -0700 Subject: [PATCH 15/20] Update xgboost_on_sqlflow_design.md --- doc/xgboost_on_sqlflow_design.md | 80 +++++++++++++++++--------------- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index 
a4bc049436..cbf14cda35 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -2,44 +2,48 @@ ## Introduction -This design doc introduces how users can train/predict the [XGBoost](https://xgboost.ai/) model by SQLFlow SQL and how -we implement it. - -## Design - -We prefer users to execute the SQLFlow Train/Predict SQL as follows: - - ``` sql - SELECT * FROM train_table - TRAIN xgboost.multi.softmax - WITH - train.objective="multi:softmax", - train.num_round=2, - params.max_depth=2, - params.eta=1 - LABEL class - INTO my_xgb_model; - ``` - - ``` sql - SELECT * FROM test_table - PREDICT pred_table.result - USING my_xgb_model; - ``` - -where: -- `my_xgb_model` is the trained model. -- `xgboost.multi.softmax` specify the training model: - - The prefix `xgboost.` is used to distinguish with Tensorflow model. - - `multi.softmax` is the learning task, SQLFlow would fill it to [XGBoost objective parameter](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters): `objective=multi:softmax`. -- The prefix `train.` in `WITH` statement mappings to the training arguments of XGBoost [train function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). -- The prefix `params.` in `WITH` statement mappings to the [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter. - -`codegen_xgboost.go` would generate an XGBoost Python program including: -- Generate the XGBoost input database. -- Pass the train/predict parameters to XGBoost Python program. -- Save the trained model. -- Using [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of [Sckiet-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) just because we prefer explain the XGBoost model by [SHAP](https://github.com/slundberg/shap). 
+This design explains how SQLFlow calls [XGBoost](https://xgboost.ai/) for training models and prediction. + +## Usage + +To explain the benefit of integrating XGBoost with SQLFlow, let us start with an example. The following SQLFlow code snippet shows how users can train an XGBoost tree model named `my_xgb_model`. + +``` sql +SELECT * FROM train_table +TRAIN xgboost.multi.softmax +WITH + train.objective="multi:softmax", + train.num_round=2, + params.max_depth=2, + params.eta=1 +LABEL class +INTO my_xgb_model; +``` + +The following example shows how to predict using the model `my_xgb_model`. + +``` sql +SELECT * FROM test_table +PREDICT pred_table.result +USING my_xgb_model; +``` + +In the above examples, +- `my_xgb_model` names the trained model. +- `xgboost.multi.softmax` is the model spec, where + - the prefix `xgboost.` indicates that the model is an XGBoost one, not a TensorFlow model, and + - `multi.softmax` names an [XGBoost learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters). +- In the `WITH` clause, + - keys with the prefix `train.` identify parameters of the XGBoost API [`xgboost.train`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), and + - the prefix `params.` identifies [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. + +## The Code Generator + +The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: +1. Generate the XGBoost input database. +1. Pass the train/predict parameters to the XGBoost Python program. +1. Save the trained model. +1.
Use the [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of the [Scikit-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) because we prefer to explain the XGBoost model with [SHAP](https://github.com/slundberg/shap). ### Input Format From 60ca03022ba26383461a42ea700acd734af100bc Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 13:48:04 +0800 Subject: [PATCH 16/20] remove xgb resolver --- sql/codegen_xgboost.go | 21 +---------- sql/expression_resolver_xgb.go | 64 ---------------------------------- 2 files changed, 1 insertion(+), 84 deletions(-) delete mode 100644 sql/expression_resolver_xgb.go diff --git a/sql/codegen_xgboost.go b/sql/codegen_xgboost.go index acf03e11bd..3822ae7e67 100644 --- a/sql/codegen_xgboost.go +++ b/sql/codegen_xgboost.go @@ -37,30 +37,11 @@ type xgbFiller struct { *connectionConfig } -func fillXGBTrainCfg(rt *resolvedXGBTrainClause) (*xgbTrainConfig, error) { - // TODO(Yancey1989): fill all the training control parameters - c := &xgbTrainConfig{ - NumBoostRound: rt.NumBoostRound, - Maximize: rt.Maximize, - } - return c, nil -} - func newXGBFiller(pr *extendedSelect, ds *trainAndValDataset, fts fieldTypes, db *DB) (*xgbFiller, error) { - rt, err := resolveXGBTrainClause(&pr.trainClause) + var err error training, validation := trainingAndValidationDataset(pr, ds) - if err != nil { - return nil, err - } - - trainCfg, err := fillXGBTrainCfg(rt) - if err != nil { - return nil, err - } - r := &xgbFiller{ IsTrain: pr.train, - TrainCfg: trainCfg, TrainingDatasetSQL: training, ValidationDatasetSQL: validation, Save: pr.save, diff --git a/sql/expression_resolver_xgb.go b/sql/expression_resolver_xgb.go deleted file mode 100644 index 28b102eb9f..0000000000 --- a/sql/expression_resolver_xgb.go +++ /dev/null @@ -1,64 +0,0 @@ -// Copyright 2019 The SQLFlow Authors. All rights reserved.
-// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -package sql - -import ( - "fmt" - "strconv" -) - -type resolvedXGBTrainClause struct { - NumBoostRound int - Maximize bool - ParamsAttr map[string]*attribute -} - -func resolveXGBTrainClause(tc *trainClause) (*resolvedXGBTrainClause, error) { - attrs, err := resolveAttribute(&tc.trainAttrs) - if err != nil { - return nil, err - } - getIntAttr := func(key string, defaultValue int) int { - if p, ok := attrs[key]; ok { - strVal, _ := p.Value.(string) - intVal, err := strconv.Atoi(trimQuotes(strVal)) - defer delete(attrs, p.FullName) - if err == nil { - return intVal - } - fmt.Printf("ignore invalid %s=%s, default is %d", key, p.Value, defaultValue) - } - return defaultValue - } - getBoolAttr := func(key string, defaultValue bool, optional bool) bool { - if p, ok := attrs[key]; ok { - strVal, _ := p.Value.(string) - boolVal, err := strconv.ParseBool(trimQuotes(strVal)) - if !optional { - defer delete(attrs, p.FullName) - } - if err == nil { - return boolVal - } else if !optional { - fmt.Printf("ignore invalid %s=%s, default is %v", key, p.Value, defaultValue) - } - } - return defaultValue - } - - return &resolvedXGBTrainClause{ - NumBoostRound: getIntAttr("train.num_boost_round", 10), - Maximize: getBoolAttr("train.maximize", false, true), - }, nil -} From f71b5c37edcbe7c547aba330852f87d25c568b3b Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 16:32:48 +0800 Subject: [PATCH 17/20] update --- 
doc/xgboost_on_sqlflow_design.md | 92 +++----------------------------- 1 file changed, 7 insertions(+), 85 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index cbf14cda35..aa9d9209ba 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -12,10 +12,9 @@ To explain the benefit of integrating XGBoost with SQLFlow, let us start with an SELECT * FROM train_table TRAIN xgboost.multi.softmax WITH - train.objective="multi:softmax", train.num_round=2, - params.max_depth=2, - params.eta=1 + max_depth=2, + eta=1 LABEL class INTO my_xgb_model; @@ -35,89 +34,12 @@ In the above examples, - `multi.softmax` names an [XGBoost learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters). - In the `WITH` clause, - keys with the prefix `train.` identifies parameters of XGBoost API [`xgboost.train`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), and - - the prefix `params.` identifies [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. + - keys without any prefix identify [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which was specified by the identifier after the keyword `TRAIN`, as explained above. ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Generate the XGBoost input database. -1. Pass the train/predict parameters to XGBoost Python program. -1. Save the trained model. -1.
Using [Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training) instead of [Sckiet-Learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) just because we prefer explain the XGBoost model by [SHAP](https://github.com/slundberg/shap). - -### Input Format - -SQLFlow implements [db_generator](/sql/python/sqlflow_submitter/db.py#db_generator) that takes the -`SELECT STATEMENT` as the input and outputs a iterable function which -yields `(features, label)` for each iteration call. `codegen_xgboost` would reuse the `db_generator` -to generate the XGBoost input database. - -XGBoost using `DMatrix` as the input structure, according to [Text Input Format of DMatrix](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html), we prefer to implement `XGBoostDatabase` that -takes `db_generator` as the input and outputs text files with LibSVM format. - -- For the **basic** input format - - the train table can be like: - - ``` text - col0 | col1 | col2 | label - 1.1 NULL 2.2 1 - 0.8 2.0 2.2 2 - 0.2 3.0 NULL 0 - 0.77 4.0 2.6 2 - ``` - - `codegen_xgboost.go` would write down the `train.txt` file like: - - ``` text - 1 0:1.1 2:2.2 - 2 0:0.8 1:2.0 3:2.2 - 0 0:0.2 1:3.0 - 2 0:0.77 1:4.0 2:2.6 - ``` - -- For the **group** input format, users can easy to specify the group column by `train.group_column` in the WITH statement like: - - ``` sql - SELECT * FROM train_table - TRAIN XGBoost - LABEL class - WITH - train.group_column=group - ... 
- ``` - - The group column in table can be like: - - ``` text - col1 | col2| col3 | label | group - 1.1 2.0 2.2 1 1 - 0.8 2.0 2.2 2 1 - 0.2 3.0 4.2 0 2 - 0.77 4.0 2.6 2 3 - ``` - - `codegen_xgboost.go` would write down the `train.txt.group` file like: - - ``` text - 2 - 1 - 1 - ``` - -- For the **weight** input format, users can specify the weight column like `group`: - - ``` sql - SELECT * FROM train_table - TRAIN XGBoost - LABEL class - WITH - train.weight_column=weight - ``` - - `codegen_xgboost.go` would also write the `train.txt.weight` file on the disk. - -## TBD - -- Implement auto-train feature to search the parameter. -- Support the sparse data format. +1. Transport the user-typed **SELECT STATEMENT** into [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) which is the Data Matrix used in XGBoost. +1. Fill the training control arguments and xgboost parameters according to the user-typed **WITH STATEMENT**. +1. Save the trained model on disk. +1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and `dtest` which is a DMatrix object generated from **PREDICT SELECT STATEMENT** to output the prediction result. 
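The generated training program that this patch describes (load the SELECT result as a DMatrix, fill the `xgboost.train` arguments, save the model) might be produced by a template-filling step. The sketch below is illustrative only: the template text and the names `TRAIN_TEMPLATE`, `train.txt`, and `my_xgb_model` are assumptions, not the actual contents of `codegen_xgboost.go`, and `num_class` is added here because `multi:softmax` requires it.

``` python
TRAIN_TEMPLATE = """import xgboost as xgb

dtrain = xgb.DMatrix("{train_file}")  # text file written from the SELECT result
bst = xgb.train({params}, dtrain, num_boost_round={num_round})
bst.save_model("{model_path}")        # kept on disk for later PREDICT statements
"""

# Fill the template with the values resolved from the example statement.
program = TRAIN_TEMPLATE.format(
    train_file="train.txt",
    params={"objective": "multi:softmax", "max_depth": 2, "eta": 1, "num_class": 3},
    num_round=2,
    model_path="my_xgb_model",
)
```

The emitted `program` string is what the submitter would hand to a Python interpreter.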
From bce4a2df7f3e3e893586f44730b42105a2c53f79 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 16:35:30 +0800 Subject: [PATCH 18/20] fix conflict --- sql/executor_test.go | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/sql/executor_test.go b/sql/executor_test.go index d62c288aaa..7914df072d 100644 --- a/sql/executor_test.go +++ b/sql/executor_test.go @@ -113,16 +113,6 @@ func TestExecutorTrainXGBoost(t *testing.T) { }) } -func TestExecutorTrainXGBoost(t *testing.T) { - a := assert.New(t) - modelDir := "" - a.NotPanics(func() { - stream := runExtendedSQL(testXGBoostTrainSelectIris, testDB, modelDir, nil) - a.True(goodStream(stream.ReadAll())) - - }) -} - func TestExecutorTrainAndPredictDNN(t *testing.T) { a := assert.New(t) modelDir := "" From 1366feecc5d8674b191edd4d2bf6f4c347dd1e46 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 17:12:49 +0800 Subject: [PATCH 19/20] remove some details section --- doc/xgboost_on_sqlflow_design.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index aa9d9209ba..e5a966ee6c 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -39,7 +39,9 @@ The the above examples, ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Transport the user-typed **SELECT STATEMENT** into [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) which is the Data Matrix used in XGBoost. -1. Fill the training control arguments and xgboost parameters according to the user-typed **WITH STATEMENT**. +1. 
Execute the user-typed **SELECT STATEMENT** to retrieve the training data from the SQL engine, then convert it to +[XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) +which is the Data Matrix used in XGBoost. +1. Parse and resolve the **WITH** clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. -1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and `dtest` which is a DMatrix object generated from **PREDICT SELECT STATEMENT** to output the prediction result. +1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and test data to output the prediction result to a SQL engine. From bdd68bec2841c0a3b932ce9149dd17b661a50740 Mon Sep 17 00:00:00 2001 From: Yancey1989 Date: Wed, 4 Sep 2019 23:29:20 +0800 Subject: [PATCH 20/20] update follows the comment --- doc/xgboost_on_sqlflow_design.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/doc/xgboost_on_sqlflow_design.md b/doc/xgboost_on_sqlflow_design.md index e5a966ee6c..d9f5841170 100644 --- a/doc/xgboost_on_sqlflow_design.md +++ b/doc/xgboost_on_sqlflow_design.md @@ -39,9 +39,7 @@ In the above examples, ## The Code Generator The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It contains the following features: -1. Execute the user-typed **SELECT STATEMENT** to retrieve the training data from the SQL engine, then convert it to -[XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dmatrix#xgboost.DMatrix) -which is the Data Matrix used in XGBoost. -1. Parse and resolve the **WITH** clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. -1. For the **PREDICT STATEMENT**, the submitter Python program would load the trained model and test data to output the prediction result to a SQL engine. +1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which could be loaded by XGBoost using the DMatrix interface.
+1. Parse and resolve the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters. 1. Save the trained model on disk. +1. For the PREDICT clause, it loads the trained model and test data and then outputs the prediction result to a SQL engine.
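For completeness, the predict-side program described in the final version above can be sketched the same way. As before, the template text and the names `PREDICT_TEMPLATE`, `test.txt`, and `my_xgb_model` are illustrative assumptions rather than the real generated code; `pred_table.result` follows the earlier example statements.

``` python
PREDICT_TEMPLATE = """import xgboost as xgb

bst = xgb.Booster(model_file="{model_path}")  # model saved by the TRAIN statement
dtest = xgb.DMatrix("{test_file}")            # test data written from the PREDICT SELECT statement
preds = bst.predict(dtest)
# write `preds` back into {result_table} through the SQL engine
"""

program = PREDICT_TEMPLATE.format(
    model_path="my_xgb_model",
    test_file="test.txt",
    result_table="pred_table.result",
)
```

Writing the predictions back closes the loop: the result lands in a table that further SQL statements can query directly.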