Merged
45 changes: 45 additions & 0 deletions doc/xgboost_on_sqlflow_design.md
@@ -0,0 +1,45 @@
# Design Doc: XGBoost on SQLFlow

## Introduction

This design explains how SQLFlow calls [XGBoost](https://xgboost.ai/) for model training and prediction.

## Usage

To explain the benefit of integrating XGBoost with SQLFlow, let us start with an example. The following SQLFlow code snippet shows how users can train an XGBoost tree model named `my_xgb_model`.

``` sql
SELECT * FROM train_table
TRAIN xgboost.multi.softmax
WITH
train.num_round=2,
max_depth=2,
eta=1
LABEL class
INTO my_xgb_model;
```

The following example shows how to predict using the model `my_xgb_model`.

``` sql
SELECT * FROM test_table
PREDICT pred_table.result
USING my_xgb_model;
```

In the above examples,
- `my_xgb_model` names the trained model.
- `xgboost.multi.softmax` is the model spec, where
  - the prefix `xgboost.` indicates that the model is an XGBoost model rather than a TensorFlow model, and
  - `multi.softmax` names an [XGBoost learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters).
- In the `WITH` clause,
  - keys with the prefix `train.` identify parameters of the XGBoost API [`xgboost.train`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), and
  - keys without any prefix identify [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) except the `objective` parameter, which is specified by the identifier after the keyword `TRAIN`, as explained above.
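
The key-resolution rules above can be sketched in a few lines of Python. This is only an illustration, not the actual SQLFlow implementation: the function name `resolve_with_clause` is hypothetical, and the sketch assumes that the dotted task name after `xgboost.` maps to XGBoost's colon-separated objective name (for example `multi.softmax` to `multi:softmax`).

```python
def resolve_with_clause(model_spec, attrs):
    """Split WITH-clause attributes into xgboost.train arguments and XGBoost parameters.

    Hypothetical helper for illustration; not part of SQLFlow or XGBoost.
    """
    prefix = "xgboost."
    if not model_spec.startswith(prefix):
        raise ValueError(f"not an XGBoost model spec: {model_spec}")

    # Assumption: "multi.softmax" in the model spec corresponds to the
    # XGBoost objective "multi:softmax".
    objective = model_spec[len(prefix):].replace(".", ":", 1)

    train_args, params = {}, {}
    for key, value in attrs.items():
        if key.startswith("train."):
            # Keys with the "train." prefix become xgboost.train arguments.
            train_args[key[len("train."):]] = value
        else:
            # Keys without a prefix become XGBoost parameters.
            params[key] = value

    # The objective comes from the TRAIN clause, never from the WITH clause.
    if "objective" in params:
        raise ValueError("objective must be given in the TRAIN clause")
    params["objective"] = objective
    return train_args, params


train_args, params = resolve_with_clause(
    "xgboost.multi.softmax",
    {"train.num_round": 2, "max_depth": 2, "eta": 1},
)
print(train_args)  # {'num_round': 2}
print(params)      # {'max_depth': 2, 'eta': 1, 'objective': 'multi:softmax'}
```

Rejecting an `objective` key in the `WITH` clause is one possible design choice here; it keeps the `TRAIN` clause as the single source of the learning task.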

## The Code Generator

The code generator `codegen_xgboost.go` outputs an XGBoost program in Python. It has the following responsibilities:
1. It tells the SQL engine to run the SELECT statement and retrieve the training/test data. It saves the data into a text file, which XGBoost can load using the DMatrix interface.
1. It parses and resolves the WITH clause to fill the `xgboost.train` arguments and the XGBoost Parameters.
Collaborator

I think parsing the WITH clause is the parser's job, not the submitter's. Am I right?

Collaborator Author

The parser can parse the WITH clause into a general `attrs` struct, which is a Go `map[string]*expr`, and each generator then resolves the `attrs` into program parameters. For example, the XGBoost generator would convert the `attrs` as follows:

  • keys with the `train.` prefix become `xgboost.train` arguments.
  • keys without any prefix become XGBoost Parameters, which are in JSON format.

1. It saves the trained model on disk.
1. For the PREDICT clause, it loads the trained model and the test data, and then writes the prediction result to the SQL engine.
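
To make the training steps above more concrete, here is a minimal Python sketch of how a generator might render such a program from a template. The real generator is written in Go and its template is not shown in this document, so `TRAIN_TEMPLATE`, `render_train_program`, and the file names are all hypothetical. The sketch also assumes the training data was dumped in LIBSVM text format, and it passes `num_boost_round`, which is the name `xgboost.train` uses for the number of rounds; how `train.num_round` would map onto it is not specified here.

```python
from string import Template

# Hypothetical template for the generated training program.
# The actual template in codegen_xgboost.go may look different.
TRAIN_TEMPLATE = Template('''\
import xgboost as xgb

# Training data previously dumped from the SQL engine to a text file
# (assumed here to be in LIBSVM format).
dtrain = xgb.DMatrix($data_path)
params = $params
train_args = $train_args
bst = xgb.train(params, dtrain, **train_args)
bst.save_model($model_path)
''')


def render_train_program(data_path, model_path, params, train_args):
    """Fill the template with Python literals built from the resolved attributes."""
    return TRAIN_TEMPLATE.substitute(
        data_path=repr(data_path),
        model_path=repr(model_path),
        params=repr(params),
        train_args=repr(train_args),
    )


program = render_train_program(
    data_path="train.txt?format=libsvm",
    model_path="my_xgb_model",
    params={"max_depth": 2, "eta": 1, "objective": "multi:softmax"},
    train_args={"num_boost_round": 2},
)

# Only checks that the rendered text is valid Python; it does not run XGBoost.
compile(program, "<generated>", "exec")
print(program)
```

The rendered text is only syntax-checked in this sketch. To actually train with the `multi:softmax` objective, XGBoost also needs the `num_class` parameter, which a user would supply in the `WITH` clause.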