From 3f6f824d18503e1a160292daa49773d7b16852f5 Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Thu, 5 Sep 2019 17:55:42 -0700 Subject: [PATCH 1/6] [Design] Intermediate Representation --- doc/design_intermediate_representation.md | 81 +++++++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 doc/design_intermediate_representation.md diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md new file mode 100644 index 0000000000..225c012c61 --- /dev/null +++ b/doc/design_intermediate_representation.md @@ -0,0 +1,81 @@ +# _Design:_ Intermediate Representation + +## Overview + +As SQLFlow is supporting more and more machine learning toolkits, their corresponding code generation logics are better being orgnized as separate packages. An intermediate representation(IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package. + +The core `sql` package should include the following functionalities: +1. The entry point of running extended SQL statements. +1. The [parsing](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/sql_parser.md) of extended SQL statements. +1. The verification of extended SQL statements, including verifying the syntax, the existence of the selected fields. +1. The [feature derivation](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/feature_derivation.md), including name, type, shape, and preprocessing method of the select fields. +1. The [training data and validation data split](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/training_and_validation.md). + +With these functionalities, the `sql` package can translate user typed extended SQL statements to an IR as an exposed Go struct. The codegen package takes the IR and returns a generated Python program for the `sql` package to execute. 
+ +## Code Structure + +We propose the following code structures. + +``` +sql/ + ... + codegen/ + tensorflow/ + train.go + predict.go + analyze.go + xgboost/ + ... +``` + +The `tensorflow` package will expose the function `func Train(ir sql.TrainIR) (string, error)`, which takes the `sql` package's `TrainIR` and returns a generated Python program. + +## Intermediate Representation + +We propose the following struct as the IR for code generation. + +```go +package sql + +// FieldMeta contains the meta information for decoding +type FieldMeta struct { + DType string // e.g. "float", "int32" + Delimiter string // e.g. "," + Shape []int // e.g. [1], [1 2 3] + IsSparse bool // e.g. false +} + +// TrainIR is the intermediate representation for code generation of a training job +type TrainIR struct { + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential + Select string // e.g. "select * from iris.train" + ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"} + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} +} + +// PredictIR is the intermediate representation for code generation of a prediction job +type PredictIR struct { + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential + Select string // e.g. "select * from iris.train" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} + Feature map[string]FieldMeta // e.g. 
{"sepal_length": {"float", "", [1], false}, ...} +} + +// AnalyzeIR is the intermediate representation for code generation of a analysis job +type AnalyzeIR struct { + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential + Select string // e.g. "select * from iris.train" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"analyze.plot_type": "bar"} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} +} +``` From 9952f30c28af7b577cff6b6425c064e8a0fd180f Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Thu, 5 Sep 2019 18:50:11 -0700 Subject: [PATCH 2/6] Update design_intermediate_representation.md --- doc/design_intermediate_representation.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md index 225c012c61..4ba8031ef4 100644 --- a/doc/design_intermediate_representation.md +++ b/doc/design_intermediate_representation.md @@ -79,3 +79,13 @@ type AnalyzeIR struct { Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} } ``` + +Please be aware that all the IR excludes the information of working directory. This information belongs to the `executor` in `sql` package. +- For training job + - If `executor` runs the generated program in a temporary directory, it should serialize the directory to a table for later use. + - If `executor` runs the generated program in a local directory, it should make sure the prediction and analyze job sees the same directory. +- For prediction and analyze job, the `executor` should recover everything produced by the training job. + +Please be aware that `TrainIR` excludes the saving table name. 
This information belongs to the `executor` in `sql` package. +- For a local training job, the result of the generated program contains the trained model. And `executor` is re +- For a distributed training job, the generated program should garantee that the local directory contains enough information, such as OSS bucket name. So that later on the prediction job find it. From 09f3fe598a462a478fa2c5d5a90bd5edfd1ee046 Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Fri, 6 Sep 2019 11:16:03 -0700 Subject: [PATCH 3/6] follow comments --- doc/design_intermediate_representation.md | 62 +++++++++++++---------- 1 file changed, 36 insertions(+), 26 deletions(-) diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md index 4ba8031ef4..21fc86a574 100644 --- a/doc/design_intermediate_representation.md +++ b/doc/design_intermediate_representation.md @@ -38,45 +38,55 @@ We propose the following struct as the IR for code generation. ```go package sql -// FieldMeta contains the meta information for decoding +import ( + "github.com/sql-machine-learning/sqlflow/sql/columns" +) + +type FieldType int + +const ( + Int FieldType = iota + Float + String +) + +// FieldMeta contains the meta information for decoding and feature columns type FieldMeta struct { - DType string // e.g. "float", "int32" - Delimiter string // e.g. "," - Shape []int // e.g. [1], [1 2 3] - IsSparse bool // e.g. false + DType FieldType // e.g. "float", "int32" + Delimiter string // e.g. "," + Shape []int // e.g. [1], [1 2 3] + IsSparse bool // e.g. false + FeatureColumn []columns.FeatureColumn // e.g. [EmbeddingColumn, CategoryIDColumn] } // TrainIR is the intermediate representation for code generation of a training job type TrainIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential - Select string // e.g. 
"select * from iris.train" - ExtraSelect map[string]string // e.g. {"validation": "select * from iris.val;"} - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} - Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.train" + ValidationSelect string // e.g. "select * from iris.val;" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} } // PredictIR is the intermediate representation for code generation of a prediction job type PredictIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - ExtraConfig string // Extra configuration in JSON format. e.g. OSS credential - Select string // e.g. "select * from iris.train" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.train" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} } // AnalyzeIR is the intermediate representation for code generation of a analysis job type AnalyzeIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - ExtraConfig string // Extra configuration in JSON format. e.g. 
OSS credential - Select string // e.g. "select * from iris.train" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"analyze.plot_type": "bar"} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} - Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.train" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"analyze.plot_type": "bar"} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} } ``` From 1f04ed6b9d9f9be559de18f36c77fd2362c4e796 Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Fri, 6 Sep 2019 11:41:31 -0700 Subject: [PATCH 4/6] Update design_intermediate_representation.md --- doc/design_intermediate_representation.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md index 21fc86a574..6e890209a1 100644 --- a/doc/design_intermediate_representation.md +++ b/doc/design_intermediate_representation.md @@ -72,11 +72,13 @@ type TrainIR struct { // PredictIR is the intermediate representation for code generation of a prediction job type PredictIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - Select string // e.g. "select * from iris.train" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.test" + Estimator string // e.g. 
"DNNClassifier" + Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} + Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + ReusltTable string // e.g. "iris.predict" } // AnalyzeIR is the intermediate representation for code generation of a analysis job From ed49d4b975b84524b5b1c298c99f53b6604fcbf3 Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Fri, 6 Sep 2019 12:26:15 -0700 Subject: [PATCH 5/6] Update design_intermediate_representation.md --- doc/design_intermediate_representation.md | 40 +++++++++++------------ 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md index 6e890209a1..aed045889a 100644 --- a/doc/design_intermediate_representation.md +++ b/doc/design_intermediate_representation.md @@ -61,34 +61,34 @@ type FieldMeta struct { // TrainIR is the intermediate representation for code generation of a training job type TrainIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - Select string // e.g. "select * from iris.train" - ValidationSelect string // e.g. "select * from iris.val;" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} - Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.train" + ValidationSelect string // e.g. "select * from iris.val;" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]} + Feature map[string]map[string]FieldMeta // e.g. 
{"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} } // PredictIR is the intermediate representation for code generation of a prediction job type PredictIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - Select string // e.g. "select * from iris.test" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} - Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} - ReusltTable string // e.g. "iris.predict" + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.test" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"predict.batch_size": 32} + Feature map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + ReusltTable string // e.g. "iris.predict" } // AnalyzeIR is the intermediate representation for code generation of a analysis job type AnalyzeIR struct { - DataSource string // e.g. "hive://root:root@localhost:10000/churn" - Select string // e.g. "select * from iris.train" - Estimator string // e.g. "DNNClassifier" - Attribute map[string]interface{} // e.g. {"analyze.plot_type": "bar"} - Feature map[string]FieldMeta // e.g. {"sepal_length": {"float", "", [1], false}, ...} - Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} + DataSource string // e.g. "hive://root:root@localhost:10000/churn" + Select string // e.g. "select * from iris.train" + Estimator string // e.g. "DNNClassifier" + Attribute map[string]interface{} // e.g. {"analyze.plot_type": "bar"} + Feature map[string]map[string]FieldMeta // e.g. 
{"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}} + Label map[string]FieldMeta // e.g. {"class": {"int32", "", [1], false}} } ``` From 445dc5689ab566221bfa4ee5c5a3b3f66e35901a Mon Sep 17 00:00:00 2001 From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com> Date: Fri, 6 Sep 2019 12:37:22 -0700 Subject: [PATCH 6/6] polish --- doc/design_intermediate_representation.md | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md index aed045889a..86fec3269a 100644 --- a/doc/design_intermediate_representation.md +++ b/doc/design_intermediate_representation.md @@ -2,7 +2,7 @@ ## Overview -As SQLFlow is supporting more and more machine learning toolkits, their corresponding code generation logics are better being orgnized as separate packages. An intermediate representation(IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package. +As SQLFlow is supporting more and more machine learning toolkits, the corresponding code generation logics are better being organized as separate packages. An intermediate representation(IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package. The core `sql` package should include the following functionalities: 1. The entry point of running extended SQL statements. @@ -92,12 +92,6 @@ type AnalyzeIR struct { } ``` -Please be aware that all the IR excludes the information of working directory. This information belongs to the `executor` in `sql` package. -- For training job - - If `executor` runs the generated program in a temporary directory, it should serialize the directory to a table for later use. - - If `executor` runs the generated program in a local directory, it should make sure the prediction and analyze job sees the same directory. 
-- For prediction and analyze job, the `executor` should recover everything produced by the training job. +Please be aware that none of the IR structs includes the current working directory. This information belongs to the `executor` in the `sql` package. For a prediction or analyze job, the `executor` should recover everything produced by the training job. Please be aware that `TrainIR` excludes the saving table name. This information belongs to the `executor` in `sql` package. -- For a local training job, the result of the generated program contains the trained model. And `executor` is re -- For a distributed training job, the generated program should garantee that the local directory contains enough information, such as OSS bucket name. So that later on the prediction job find it.