Merged
199 changes: 127 additions & 72 deletions doc/user_guide.md → doc/language_guide.md
@@ -1,6 +1,8 @@
# SQLFlow User Guide
# SQLFlow Language Guide

SQLFlow is a bridge that connects a SQL engine (e.g. MySQL, Hive, or MaxCompute) and TensorFlow and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training and inference.
SQLFlow is a bridge that connects a SQL engine (e.g., MySQL, Hive, or MaxCompute) and TensorFlow and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training, prediction, and analysis.

This language guide describes SQLFlow's extended syntax and feature column API. For specific examples, please refer to [the tutorial](/doc/tutorial).

## Overview

@@ -40,11 +42,11 @@ Let's assume [iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_da
</tr>
</table>

Let's train a `DNNClassifier`, which has 2 hidden layers where each layer has 10 hidden units, and then save the trained model into table `sqlflow_models.my_dnn_model` for making predictions later on.
Let's train a `DNNClassifier`, which has two hidden layers where each layer has ten hidden units, and then save the trained model into table `sqlflow_models.my_dnn_model` for making predictions later on.

Instead of writing a Python program with a lot of boilerplate code, this can be achieved easily via the following statement in SQLFlow.

```SQL
```
SELECT * FROM iris.train
TRAIN DNNClassifier
WITH hidden_units = [10, 10], n_classes = 3, EPOCHS = 10
@@ -57,15 +59,31 @@ SQLFlow will then parse the above statement and translate it to an equivalent Py

![](figures/user_overview.png)

## Syntax
## Training Syntax

A SQLFlow training statement consists of a sequence of select, train, column, label, and into clauses.

A SQLFlow training statement consists of a sequence of select, train, column, label and into clauses.
```
SELECT select_expr [, select_expr ...]
FROM table_references
[WHERE where_condition]
[LIMIT row_count]
TRAIN model_identifier
[WITH
model_attr_expr [, model_attr_expr ...]
[, train_attr_expr ...]]
COLUMN column_expr [, column_expr ...]
| COLUMN column_expr [, column_expr ...] FOR column_name
[COLUMN column_expr [, column_expr ...] FOR column_name ...]
[LABEL label_expr]
INTO table_references;
```
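
The clause order in the grammar above can be illustrated with a small Python helper that assembles a training statement from its parts. This is only a sketch for building statement text: `build_train_statement` and `render_value` are hypothetical names, not part of SQLFlow, and the attribute names are taken from the examples in this guide.

```python
def render_value(value):
    """Render a Python value the way the guide's examples write attributes."""
    if isinstance(value, str):
        return "'" + value + "'"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(render_value(v) for v in value) + "]"
    return str(value)


def build_train_statement(select, model, attrs, columns, into, label=None):
    """Hypothetical helper: join the clauses in the order the grammar lists them."""
    lines = [select, "TRAIN " + model]
    if attrs:  # the WITH clause is optional
        rendered = [f"{name} = {render_value(value)}" for name, value in attrs.items()]
        lines.append("WITH " + ", ".join(rendered))
    lines.append("COLUMN " + ", ".join(columns))
    if label is not None:  # the LABEL clause is optional
        lines.append("LABEL " + label)
    lines.append("INTO " + into + ";")
    return "\n".join(lines)


statement = build_train_statement(
    select="SELECT * FROM iris.train",
    model="DNNClassifier",
    attrs={"model.hidden_units": [10, 10], "model.n_classes": 3, "train.epoch": 10},
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    label="class",
    into="sqlflow_models.my_dnn_model",
)
print(statement)
```

Leaving out `label` produces a statement without a label clause, which matches the optional `[LABEL label_expr]` in the grammar.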

### Select clause
### Select Clause

The *select clause* describes the data retrieved from a particular table, e.g. `SELECT * FROM iris.train`.
The *select clause* describes the data retrieved from a particular table, e.g., `SELECT * FROM iris.train`.

```SQL
```
SELECT select_expr [, select_expr ...]
FROM table_references
[WHERE where_condition]
@@ -81,19 +99,19 @@ Equivalent to [ANSI SQL Standards](https://www.whoishostingthis.com/resources/an
For example, if you want to quickly prototype a binary classifier on a subset of the sample data, you can write
the following statement:

```SQL
```
SELECT *
FROM iris.train
WHERE class = 0 OR class = 1
LIMIT 1000
TRAIN ...
```

### Train clause
### Train Clause

The *train clause* describes the specific model type and the way the model is trained, e.g. `TRAIN DNNClassifier WITH hidden_units = [10, 10], n_classes = 3, EPOCHS = 10`.

```SQL
```
TRAIN model_identifier
WITH
model_attr_expr [, model_attr_expr ...]
@@ -104,9 +122,9 @@ WITH
- *model_attr_expr* indicates a model attribute, e.g. `model.n_classes = 3`. Please refer to [Models](#models) for details.
- *train_attr_expr* indicates a training attribute, e.g. `train.epoch = 10`. Please refer to [Hyperparameters](#hyperparameters) for details.

For example, if you want to train a `DNNClassifier`, which has 2 hidden layers where each layer has 10 hidden units, with 10 epochs, you can write the following statement:
For example, if you want to train a `DNNClassifier`, which has two hidden layers where each layer has ten hidden units, with ten epochs, you can write the following statement:

```SQL
```
SELECT ...
TRAIN DNNClassifier
WITH
@@ -116,11 +134,11 @@ WITH
...
```

### Column clause
### Column Clause

The *column clause* indicates the field name to be used as training features, along with their optional preprocessing methods, e.g. `COLUMN sepal_length, sepal_width, petal_length, petal_width`.
The *column clause* indicates the field names to use as training features, along with their optional pre-processing methods, e.g. `COLUMN sepal_length, sepal_width, petal_length, petal_width`.

```SQL
```
COLUMN column_expr [, column_expr ...]
| COLUMN column_expr [, column_expr ...] FOR column_name
[COLUMN column_expr [, column_expr ...] FOR column_name ...]
@@ -129,43 +147,42 @@ COLUMN column_expr [, column_expr ...]
- *column_expr* indicates the field name and the preprocessing method applied to the field content, e.g. `sepal_length`, `NUMERIC(dense, 3)`. Please refer to [Feature columns](#feature-columns) for preprocessing details.
- *column_name* indicates the feature column names for the model inputs. Some models, such as [DNNLinearCombinedClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier), take `linear_feature_columns` and `dnn_feature_columns` as feature column inputs.

For example, if you want to use fields `sepal_length`, `sepal_width`, `petal_length`, and `petal_width` as the features
without any preprocessing, you can write the following statement:
For example, if you want to use fields `sepal_length`, `sepal_width`, `petal_length`, and `petal_width` as the features without any pre-processing, you can write the following statement:

```SQL
```
SELECT ...
TRAIN ...
COLUMN sepal_length, sepal_width, petal_length, petal_width
...
```

### Label clause
### Label Clause

The *label clause* indicates the field name to be used as the training label, along with their optional preprocessing methods, e.g. `LABEL class`.
The *label clause* indicates the field name to use as the training label, along with its optional pre-processing method, e.g. `LABEL class`.

```SQL
```
LABEL label_expr
```

- *label_expr* indicates the field name and the preprocessing method on the field content. e.g. `class`.
- *label_expr* indicates the field name and the pre-processing method applied to the field content, e.g. `class`. For an unsupervised learning job, skip the label clause.

Note: some field names may look like SQLFlow keywords. For example, the table may contain a field named "label". You can use double quotes around the name `LABEL "label"` to work around the parsing error.

### Into clause
### Into Clause

The *into clause* indicates the table name to save the trained model into:

```SQL
```
INTO table_references
```

- *table_references* indicates the table in which to save the trained model, e.g. `sqlflow_model.my_dnn_model`.

Note: SQLFlow team is actively working on supporting saving model to third-party storage services such as AWS S3, Google Storage and Alibaba OSS.
Note: The SQLFlow team is actively working on support for saving models to third-party storage services such as AWS S3, Google Storage, and Alibaba OSS.

## Feature columns
### Feature Columns

SQLFlow supports various feature columns to preprocess raw data. Below is the currently supported feature columns:
SQLFlow supports specifying various feature columns in the column clause and label clause. Below are the currently supported feature columns:

<table>
<tr>
@@ -206,9 +223,7 @@ SQLFlow supports various feature columns to preprocess raw data. Below is the cu
</tr>
</table>

### NUMERIC

```SQL
```
NUMERIC(field, n[, delimiter=comma])
/*
NUMERIC converts a delimiter-separated string to an n-dimensional Tensor
@@ -229,13 +244,8 @@ Error:
Invalid field type. field type has to be string/varchar[n]
Invalid dimension. e.g. convert "0.2,1.7,0.6" to dimension 2.
*/
```

### CATEGORY_ID

Implements [tf.feature_column.categorical_column_with_identity](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_identity).

```SQL
CATEGORY_ID(field, n[, delimiter=comma])
/*
CATEGORY_ID splits the input field by delimiter and returns identity values
@@ -255,13 +265,8 @@ Example:
Error:
Invalid field type. field type has to be string/varchar[n]
*/
```

### SEQ_CATEGORY_ID

Implements [tf.feature_column.sequence_categorical_column_with_identity](https://www.tensorflow.org/api_docs/python/tf/feature_column/sequence_categorical_column_with_identity).

```SQL
SEQ_CATEGORY_ID(field, n[, delimiter=comma])
/*
SEQ_CATEGORY_ID splits the input field by delimiter and returns identity values
@@ -281,13 +286,8 @@ Example:
Error:
Invalid field type. field type has to be string/varchar[n]
*/
```

### EMBEDDING

Implements [tf.feature_column.embedding_column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column).

```SQL
EMBEDDING(category_column, n[, combiner])
/*
EMBEDDING converts a delimiter separated string to an n-dimensional Tensor
@@ -305,38 +305,93 @@ Example:
*/
```
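
To make the parsing behaviour of `NUMERIC` and `CATEGORY_ID` more concrete, here is a minimal Python sketch of what the descriptions above suggest. It is not SQLFlow's implementation: the function names are hypothetical, it returns plain lists rather than Tensors, and it assumes that the `n` argument of `CATEGORY_ID` plays the role of `num_buckets` in `tf.feature_column.categorical_column_with_identity`.

```python
def numeric(field, n, delimiter=","):
    """Parse a delimiter-separated string into a list of n floats."""
    if not isinstance(field, str):
        raise TypeError("Invalid field type. field type has to be string")
    values = [float(part) for part in field.split(delimiter)]
    if len(values) != n:
        raise ValueError(f"Invalid dimension. got {len(values)} values, expected {n}")
    return values


def category_id(field, n, delimiter=","):
    """Parse a delimiter-separated string into integer ids in [0, n).

    Assumption: n is the number of identity buckets, as with num_buckets
    in tf.feature_column.categorical_column_with_identity.
    """
    if not isinstance(field, str):
        raise TypeError("Invalid field type. field type has to be string")
    ids = [int(part) for part in field.split(delimiter)]
    for value in ids:
        if not 0 <= value < n:
            raise ValueError(f"id {value} is outside [0, {n})")
    return ids


print(numeric("0.2,1.7,0.6", 3))  # [0.2, 1.7, 0.6]
print(category_id("1,3,5", 10))   # [1, 3, 5]
```

Calling `numeric("0.2,1.7,0.6", 2)` raises `ValueError` in this sketch, mirroring the "Invalid dimension" error listed in the `NUMERIC` description.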

## Models
## Prediction Syntax

A SQLFlow prediction statement consists of a sequence of select, predict, and using clauses.

```
SELECT select_expr [, select_expr ...]
FROM table_references
[WHERE where_condition]
[LIMIT row_count]
PREDICT result_table_reference
[WITH
attr_expr [, attr_expr ...]]
USING model_table_reference;
```

SQLFlow supports various TensorFlow premade estimators.
### Select Clause

### DNNClassifer
The [select clause](#select-clause) syntax is the same as the select clause syntax in the training syntax. SQLFlow uses the column names to guarantee the prediction data has the same order as the training data. For example, if we have used the `c1`, `c2`, `c3`, and `label` columns to train a model, the select clause in the prediction job should also retrieve columns with exactly the same names.
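
Matching by column name rather than by position can be sketched in a few lines of Python. This is an illustration of the idea only, not SQLFlow's code: `align_to_training_order` is a hypothetical helper, rows are modelled as plain dictionaries, and the sketch only looks at the feature columns `c1`, `c2`, and `c3`.

```python
def align_to_training_order(row, training_columns):
    """Return the row's values in the order of the training feature columns.

    row maps column names to values, as retrieved by the select clause.
    """
    missing = [name for name in training_columns if name not in row]
    if missing:
        raise KeyError(f"prediction data is missing columns: {missing}")
    return [row[name] for name in training_columns]


training_columns = ["c1", "c2", "c3"]
# The prediction query returns the same names in a different order.
prediction_row = {"c3": 0.3, "c1": 0.1, "c2": 0.2}
print(align_to_training_order(prediction_row, training_columns))  # [0.1, 0.2, 0.3]
```

A row that lacks one of the training column names fails early in this sketch, which is why the prediction query needs to retrieve columns with the same names.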

### Predict and Using Clause

The *predict clause* describes the result table that a prediction job should write to, the table a prediction job should load the model from, and necessary configuration attributes for a prediction job.

```SQL
TRAIN DNNClassifier
WITH
model.hidden_units=[10,10],
model.n_classes=2,
model.batch_norm=False
```
PREDICT result_table_reference
[WITH
attr_expr [, attr_expr ...]]
USING model_table_reference;
```

### DNNLinearCombinedClassifier
- *result_table_reference* indicates the table to store the prediction result. Please be aware that all the data retrieved by the select clause plus the prediction result will be stored.
- *attr_expr* indicates the configuration attributes, e.g. `predict.batch_size = 1`.
- *model_table_reference* indicates the table a prediction job should load the model from.
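
As a rough illustration of the first bullet, the Python sketch below splits a result reference such as `iris.predict.class` (the example used later in this section) into a table and a result column, and shows a stored row as the selected data plus the prediction. Both helper names are hypothetical, and the sketch assumes the reference always ends with `.column`, which this guide shows only by example.

```python
def split_result_reference(reference):
    """Split 'database.table.column' into ('database.table', 'column').

    Assumption: the text after the last dot names the result column.
    """
    table, _, column = reference.rpartition(".")
    if not table or not column:
        raise ValueError(f"expected table.column, got {reference!r}")
    return table, column


def result_row(selected_row, column, prediction):
    """Combine the selected data with the prediction, as stored in the result table."""
    row = dict(selected_row)  # copy so the input row is left untouched
    row[column] = prediction
    return row


table, column = split_result_reference("iris.predict.class")
print(table, column)  # iris.predict class
print(result_row({"sepal_length": 5.1, "sepal_width": 3.5}, column, 0))
```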

```SQL
TRAIN DNNLinearCombinedClassifier
WITH
model.linear_optimizer='Ftrl',
model.dnn_optimizer='Adagrad',
model.dnn_hidden_units=None,
model.n_classes=2,
model.batch_norm=False,
model.linear_sparse_combiner='sum'
COLUMN ... FOR linear_feature_columns
COLUMN ... FOR dnn_feature_columns
For example, to save the predicted result into table `iris.predict` at column `class` using the model stored at `sqlflow.my_dnn_model`, we can write the following statement:

```
SELECT ...
PREDICT iris.predict.class
USING sqlflow.my_dnn_model;
```

## Hyperparameters
## Analysis Syntax

SQLFlow supports various configurable training hyperparameters.
A SQLFlow analysis statement consists of a sequence of select, analyze, and using clauses.

```
SELECT select_expr [, select_expr ...]
FROM table_references
[WHERE where_condition]
[LIMIT row_count]
ANALYZE model_table_reference
[WITH
attr_expr [, attr_expr ...]]
USING explainer;
```

### Select Clause

The [select clause](#select-clause) syntax is the same as the select clause syntax in the training syntax. SQLFlow uses the column names to guarantee the analysis data has the same order as the training data. For example, if we have used the `c1`, `c2`, `c3`, and `label` columns to train a model, the select clause in the analysis job should also retrieve columns with the same names.

### Analyze and Using Clause

The *analyze clause* describes the table an analysis job should load the model from, necessary configuration attributes, and the explainer for analysis.

```
ANALYZE model_table_reference
[WITH
attr_expr [, attr_expr ...]]
USING explainer;
```

- *model_table_reference* indicates the table an analysis job should load the model from.
- *attr_expr* indicates the configuration attributes, e.g. `shap_summary.plot_type="bar"`.
- *explainer* indicates the type of the explainer, e.g. `TreeExplainer`.

For example, to analyze the model stored at `sqlflow_models.my_xgb_regression_model` using the tree explainer and plot the analysis results in sorted order, we can write the following statement:

```
SELECT *
FROM boston.train
ANALYZE sqlflow_models.my_xgb_regression_model
WITH
shap_summary.sort=True
USING TreeExplainer;
```

## Models

1. `train.batch_size`. Default 1.
1. `train.epoch` . Default 1.
SQLFlow supports various TensorFlow pre-made estimators, Keras customized models, and XGBoost models. A full list of supported parameters is under active construction; for now, please refer to [the tutorial](/doc/tutorial) for example usage.