diff --git a/doc/user_guide.md b/doc/language_guide.md
similarity index 56%
rename from doc/user_guide.md
rename to doc/language_guide.md
index 9cb8c3d227..32f39a7246 100644
--- a/doc/user_guide.md
+++ b/doc/language_guide.md
@@ -1,6 +1,8 @@
-# SQLFlow User Guide
+# SQLFlow Language Guide
 
-SQLFlow is a bridge that connects a SQL engine (e.g. MySQL, Hive, or MaxCompute) and TensorFlow and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training and inference.
+SQLFlow is a bridge that connects a SQL engine (e.g., MySQL, Hive, or MaxCompute) and TensorFlow and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training, prediction, and analysis.
+
+This language guide elaborates SQLFlow extended syntax and feature column API. For specific examples, please refer to [the tutorial](/doc/tutorial).
 
 ## Overview
 
@@ -40,11 +42,11 @@ Let's assume [iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_da
 
-Let's train a `DNNClassifier`, which has 2 hidden layers where each layer has 10 hidden units, and then save the trained model into table `sqlflow_models.my_dnn_model` for making predictions later on.
+Let's train a `DNNClassifier`, which has two hidden layers where each layer has ten hidden units, and then save the trained model into table `sqlflow_models.my_dnn_model` for making predictions later on.
 
 Instead of writing a Python program with a lot of boilerplate code, this can be achieved easily via the following statement in SQLFlow.
 
-```SQL
+```
 SELECT *
 FROM iris.train
 TRAIN DNNClassifer
 WITH hidden_units = [10, 10], n_classes = 3, EPOCHS = 10
@@ -57,15 +59,31 @@ SQLFlow will then parse the above statement and translate it to an equivalent Py
 
-## Syntax
+## Training Syntax
+
+A SQLFlow training statement consists of a sequence of select, train, column, label, and into clauses.
 
-A SQLFlow training statement consists of a sequence of select, train, column, label and into clauses.
+```
+SELECT select_expr [, select_expr ...]
+FROM table_references
+  [WHERE where_condition]
+  [LIMIT row_count]
+TRAIN model_identifier
+[WITH
+  model_attr_expr [, model_attr_expr ...]
+  [, train_attr_expr ...]]
+COLUMN column_expr [, column_expr ...]
+  | COLUMN column_expr [, column_expr ...] FOR column_name
+    [COLUMN column_expr [, column_expr ...] FOR column_name ...]
+[LABEL label_expr]
+INTO table_references;
+```
 
-### Select clause
+### Select Clause
 
-The *select clause* describes the data retrieved from a particular table, e.g. `SELECT * FROM iris.train`.
+The *select clause* describes the data retrieved from a particular table, e.g., `SELECT * FROM iris.train`.
 
-```SQL
+```
 SELECT select_expr [, select_expr ...]
 FROM table_references
   [WHERE where_condition]
@@ -81,7 +99,7 @@ Equivalent to [ANSI SQL Standards](https://www.whoishostingthis.com/resources/an
 
 For example, if you want to quickly prototype a binary classifier on a subset of the sample data, you can write the following statement:
 
-```SQL
+```
 SELECT *
 FROM iris.train
 WHERE class = 0 OR class = 1
@@ -89,11 +107,11 @@ LIMIT 1000
 TRAIN ...
 ```
 
-### Train clause
+### Train Clause
 
 The *train clause* describes the specific model type and the way the model is trained, e.g. `TRAIN DNNClassifer WITH hidden_units = [10, 10], n_classes = 3, EPOCHS = 10`.
 
-```SQL
+```
 TRAIN model_identifier
 WITH
   model_attr_expr [, model_attr_expr ...]
@@ -104,9 +122,9 @@ WITH
 - *model_attr_expr* indicates the model attribute. e.g. `model.n_classes = 3`. Please refer to [Models](#models) for details.
 - *train_attr_expr* indicates the training attribute. e.g. `train.epoch = 10`. Please refer to [Hyperparameters](#hyperparameters) for details.
 
-For example, if you want to train a `DNNClassifier`, which has 2 hidden layers where each layer has 10 hidden units, with 10 epochs, you can write the following statement:
+For example, if you want to train a `DNNClassifier`, which has two hidden layers where each layer has ten hidden units, with ten epochs, you can write the following statement:
 
-```SQL
+```
 SELECT ...
 TRAIN DNNClassifer
 WITH
@@ -116,11 +134,11 @@ WITH
 ...
 ```
 
-### Column clause
+### Column Clause
 
-The *column clause* indicates the field name to be used as training features, along with their optional preprocessing methods, e.g. `COLUMN sepal_length, sepal_width, petal_length, petal_width`.
+The *column clause* indicates the field name for training features, along with their optional pre-processing methods, e.g. `COLUMN sepal_length, sepal_width, petal_length, petal_width`.
 
-```SQL
+```
 COLUMN column_expr [, column_expr ...]
   | COLUMN column_expr [, column_expr ...] FOR column_name
     [COLUMN column_expr [, column_expr ...] FOR column_name ...]
@@ -129,43 +147,42 @@ COLUMN column_expr [, column_expr ...]
 - *column_expr* indicates the field name and the preprocessing method on the field content. e.g. `sepal_length`, `NUMERIC(dense, 3)`. Please refer to [Feature columns](#feature-columns) for preprocessing details.
 - *column_name* indicates the feature column names for the model inputs. Some models such as [DNNLinearCombinedClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier) have`linear_feature_columns` and `dnn_feature_columns` as feature column input.
 
-For example, if you want to use fields `sepal_length`, `sepal_width`, `petal_length`, and `petal_width` as the features
-without any preprocessing, you can write the following statement:
+For example, if you want to use fields `sepal_length`, `sepal_width`, `petal_length`, and `petal_width` as the features without any pre-processing, you can write the following statement:
 
-```SQL
+```
 SELECT ...
 TRAIN ...
 COLUMN sepal_length, sepal_width, petal_length, petal_width
 ...
 ```
 
-### Label clause
+### Label Clause
 
-The *label clause* indicates the field name to be used as the training label, along with their optional preprocessing methods, e.g. `LABEL class`.
+The *label clause* indicates the field name for the training label, along with their optional pre-processing methods, e.g. `LABEL class`.
 
-```SQL
+```
 LABEL label_expr
 ```
 
-- *label_expr* indicates the field name and the preprocessing method on the field content. e.g. `class`.
+- *label_expr* indicates the field name and the pre-processing method on the field content, e.g. `class`.
 
 For unsupervised learning job, we should skip the label clause.
 
 Note: some field names may look like SQLFlow keywords. For example, the table may contain a field named "label". You can use double quotes around the name `LABEL "label"` to work around the parsing error.
 
-### Into clause
+### Into Clause
 
 The *into clause* indicates the table name to save the trained model into:
 
-```SQL
+```
 INTO table_references
 ```
 
 - *table_references* indicates the table to save the trained model. e.g. `sqlflow_model.my_dnn_model`.
 
-Note: SQLFlow team is actively working on supporting saving model to third-party storage services such as AWS S3, Google Storage and Alibaba OSS.
+Note: SQLFlow team is actively working on supporting saving model to third-party storage services such as AWS S3, Google Storage, and Alibaba OSS.
 
-## Feature columns
+### Feature Columns
 
-SQLFlow supports various feature columns to preprocess raw data. Below is the currently supported feature columns:
+SQLFlow supports specifying various feature columns in the column clause and label clause. Below are the currently supported feature columns: