From 3f6f824d18503e1a160292daa49773d7b16852f5 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Thu, 5 Sep 2019 17:55:42 -0700
Subject: [PATCH 1/6] [Design] Intermediate Representation

---
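Reviewer note (outside the commit message): a minimal sketch of how a codegen
package might consume the IR and return program text. The trimmed `TrainIR`
copy, the template, and the generated lines are all assumptions made for
illustration; they are not the proposed implementation or real SQLFlow output.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TrainIR is a trimmed, local stand-in for the proposed sql.TrainIR.
// Only the fields this sketch needs are kept (assumption).
type TrainIR struct {
	DataSource string
	Select     string
	Estimator  string
}

// trainTemplate is a hypothetical template; a real generated program
// would contain the full training logic.
var trainTemplate = template.Must(template.New("train").Parse(
	`datasource = {{printf "%q" .DataSource}}
select = {{printf "%q" .Select}}
estimator = {{printf "%q" .Estimator}}
`))

// Train mirrors the proposed shape func Train(ir sql.TrainIR) (string, error):
// it takes the IR and returns the text of a generated Python program.
func Train(ir TrainIR) (string, error) {
	if ir.Estimator == "" {
		return "", fmt.Errorf("codegen: TrainIR.Estimator is empty")
	}
	var buf bytes.Buffer
	if err := trainTemplate.Execute(&buf, ir); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	program, err := Train(TrainIR{
		DataSource: "hive://root:root@localhost:10000/churn",
		Select:     "select * from iris.train",
		Estimator:  "DNNClassifier",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(program)
}
```

Keeping `Train` a pure function from IR to text would let each toolkit package
be tested without a database or a Python runtime.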
 doc/design_intermediate_representation.md | 81 +++++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 doc/design_intermediate_representation.md

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
new file mode 100644
index 0000000000..225c012c61
--- /dev/null
+++ b/doc/design_intermediate_representation.md
@@ -0,0 +1,81 @@
+# _Design:_ Intermediate Representation
+
+## Overview
+
+As SQLFlow is supporting more and more machine learning toolkits, their corresponding code generation logics are better being orgnized as separate packages. An intermediate representation(IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.
+
+The core `sql` package should include the following functionalities:
+1. The entry point of running extended SQL statements.
+1. The [parsing](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/sql_parser.md) of extended SQL statements.
+1. The verification of extended SQL statements, including the syntax and the existence of the selected fields.
+1. The [feature derivation](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/feature_derivation.md), including name, type, shape, and preprocessing method of the selected fields.
+1. The [training data and validation data split](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/training_and_validation.md).
+
+With these functionalities, the `sql` package can translate user-typed extended SQL statements to an IR as an exposed Go struct. The codegen package takes the IR and returns a generated Python program for the `sql` package to execute.
+
+## Code Structure
+
+We propose the following code structures.
+
+```
+sql/
+  ...
+  codegen/
+    tensorflow/
+      train.go
+      predict.go
+      analyze.go
+    xgboost/
+      ...
+```
+
+The `tensorflow` package will expose the function `func Train(ir sql.TrainIR) (string, error)`, which takes the `sql` package's `TrainIR` and returns a generated Python program.
+
+## Intermediate Representation
+
+We propose the following struct as the IR for code generation.
+
+```go
+package sql
+
+// FieldMeta contains the meta information for decoding
+type FieldMeta struct {
+	DType     string // e.g. "float", "int32"
+	Delimiter string // e.g. ","
+	Shape     []int  // e.g. [1], [1 2 3]
+	IsSparse  bool   // e.g. false
+}
+
+// TrainIR is the intermediate representation for code generation of a training job
+type TrainIR struct {
+	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
+	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
+	Select      string                 // e.g. "select * from iris.train"
+	ExtraSelect map[string]string      // e.g. {"validation": "select * from iris.val;"}
+	Estimator   string                 // e.g. "DNNClassifier"
+	Attribute   map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
+	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+}
+
+// PredictIR is the intermediate representation for code generation of a prediction job
+type PredictIR struct {
+	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
+	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
+	Select      string                 // e.g. "select * from iris.train"
+	Estimator   string                 // e.g. "DNNClassifier"
+	Attribute   map[string]interface{} // e.g. {"predict.batch_size": 32}
+	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+}
+
+// AnalyzeIR is the intermediate representation for code generation of a analysis job
+type AnalyzeIR struct {
+	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
+	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
+	Select      string                 // e.g. "select * from iris.train"
+	Estimator   string                 // e.g. "DNNClassifier"
+	Attribute   map[string]interface{} // e.g. {"analyze.plot_type": "bar"}
+	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+}
+```

From 9952f30c28af7b577cff6b6425c064e8a0fd180f Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Thu, 5 Sep 2019 18:50:11 -0700
Subject: [PATCH 2/6] Update design_intermediate_representation.md

---
 doc/design_intermediate_representation.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
index 225c012c61..4ba8031ef4 100644
--- a/doc/design_intermediate_representation.md
+++ b/doc/design_intermediate_representation.md
@@ -79,3 +79,13 @@ type AnalyzeIR struct {
 	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
 }
 ```
+
+Please be aware that all the IR excludes the information of working directory. This information belongs to the `executor` in `sql` package.
+- For training job
+  - If `executor` runs the generated program in a temporary directory, it should serialize the directory to a table for later use.
+  - If `executor` runs the generated program in a local directory, it should make sure the prediction and analyze job sees the same directory.
+- For prediction and analyze job, the `executor` should recover everything produced by the training job.
+
+Please be aware that `TrainIR` excludes the saving table name. This information belongs to the `executor` in `sql` package.
+- For a local training job, the result of the generated program contains the trained model. And `executor` is re
+- For a distributed training job, the generated program should garantee that the local directory contains enough information, such as OSS bucket name. So that later on the prediction job find it.

From 09f3fe598a462a478fa2c5d5a90bd5edfd1ee046 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Fri, 6 Sep 2019 11:16:03 -0700
Subject: [PATCH 3/6] follow comments

---
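Reviewer note (outside the commit message): with `DType` becoming a `FieldType`
enum, a codegen package will probably need to map it back to a type name when
emitting Python. A minimal sketch follows; the names "int64", "float32", and
"string" are assumptions chosen for illustration and are not part of this design.

```go
package main

import "fmt"

// FieldType repeats the enum proposed in this patch.
type FieldType int

const (
	Int FieldType = iota
	Float
	String
)

// String maps a FieldType to a hypothetical dtype name for generated code.
// Unknown values are rendered explicitly rather than silently defaulted.
func (t FieldType) String() string {
	switch t {
	case Int:
		return "int64"
	case Float:
		return "float32"
	case String:
		return "string"
	default:
		return fmt.Sprintf("FieldType(%d)", int(t))
	}
}

func main() {
	// fmt uses the String method because FieldType satisfies fmt.Stringer.
	fmt.Println(Int, Float, String)
}
```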
 doc/design_intermediate_representation.md | 62 +++++++++++++----------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
index 4ba8031ef4..21fc86a574 100644
--- a/doc/design_intermediate_representation.md
+++ b/doc/design_intermediate_representation.md
@@ -38,45 +38,55 @@ We propose the following struct as the IR for code generation.
 ```go
 package sql
 
-// FieldMeta contains the meta information for decoding
+import (
+	"github.com/sql-machine-learning/sqlflow/sql/columns"
+)
+
+type FieldType int
+
+const (
+	Int FieldType = iota
+	Float
+	String
+)
+
+// FieldMeta contains the meta information for decoding and feature columns
 type FieldMeta struct {
-	DType     string // e.g. "float", "int32"
-	Delimiter string // e.g. ","
-	Shape     []int  // e.g. [1], [1 2 3]
-	IsSparse  bool   // e.g. false
+	DType         FieldType               // e.g. Float, Int
+	Delimiter     string                  // e.g. ","
+	Shape         []int                   // e.g. [1], [1 2 3]
+	IsSparse      bool                    // e.g. false
+	FeatureColumn []columns.FeatureColumn // e.g. [EmbeddingColumn, CategoryIDColumn]
 }
 
 // TrainIR is the intermediate representation for code generation of a training job
 type TrainIR struct {
-	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
-	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
-	Select      string                 // e.g. "select * from iris.train"
-	ExtraSelect map[string]string      // e.g. {"validation": "select * from iris.val;"}
-	Estimator   string                 // e.g. "DNNClassifier"
-	Attribute   map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
-	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
-	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+	DataSource       string                 // e.g. "hive://root:root@localhost:10000/churn"
+	Select           string                 // e.g. "select * from iris.train"
+	ValidationSelect string                 // e.g. "select * from iris.val;"
+	Estimator        string                 // e.g. "DNNClassifier"
+	Attribute        map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
+	Feature          map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	Label            map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
 }
 
 // PredictIR is the intermediate representation for code generation of a prediction job
 type PredictIR struct {
-	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
-	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
-	Select      string                 // e.g. "select * from iris.train"
-	Estimator   string                 // e.g. "DNNClassifier"
-	Attribute   map[string]interface{} // e.g. {"predict.batch_size": 32}
-	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	DataSource string                 // e.g. "hive://root:root@localhost:10000/churn"
+	Select     string                 // e.g. "select * from iris.train"
+	Estimator  string                 // e.g. "DNNClassifier"
+	Attribute  map[string]interface{} // e.g. {"predict.batch_size": 32}
+	Feature    map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
 }
 
 // AnalyzeIR is the intermediate representation for code generation of a analysis job
 type AnalyzeIR struct {
-	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
-	ExtraConfig string                 // Extra configuration in JSON format. e.g. OSS credential
-	Select      string                 // e.g. "select * from iris.train"
-	Estimator   string                 // e.g. "DNNClassifier"
-	Attribute   map[string]interface{} // e.g. {"analyze.plot_type": "bar"}
-	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
-	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+	DataSource string                 // e.g. "hive://root:root@localhost:10000/churn"
+	Select     string                 // e.g. "select * from iris.train"
+	Estimator  string                 // e.g. "DNNClassifier"
+	Attribute  map[string]interface{} // e.g. {"analyze.plot_type": "bar"}
+	Feature    map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	Label      map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
 }
 ```
 

From 1f04ed6b9d9f9be559de18f36c77fd2362c4e796 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Fri, 6 Sep 2019 11:41:31 -0700
Subject: [PATCH 4/6] Update design_intermediate_representation.md

---
 doc/design_intermediate_representation.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
index 21fc86a574..6e890209a1 100644
--- a/doc/design_intermediate_representation.md
+++ b/doc/design_intermediate_representation.md
@@ -72,11 +72,13 @@ type TrainIR struct {
 
 // PredictIR is the intermediate representation for code generation of a prediction job
 type PredictIR struct {
-	DataSource string                 // e.g. "hive://root:root@localhost:10000/churn"
-	Select     string                 // e.g. "select * from iris.train"
-	Estimator  string                 // e.g. "DNNClassifier"
-	Attribute  map[string]interface{} // e.g. {"predict.batch_size": 32}
-	Feature    map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
+	Select      string                 // e.g. "select * from iris.test"
+	Estimator   string                 // e.g. "DNNClassifier"
+	Attribute   map[string]interface{} // e.g. {"predict.batch_size": 32}
+	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
+	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+	ReusltTable string                 // e.g. "iris.predict"
 }
 
 // AnalyzeIR is the intermediate representation for code generation of a analysis job

From ed49d4b975b84524b5b1c298c99f53b6604fcbf3 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Fri, 6 Sep 2019 12:26:15 -0700
Subject: [PATCH 5/6] Update design_intermediate_representation.md

---
 doc/design_intermediate_representation.md | 40 +++++++++++------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
index 6e890209a1..aed045889a 100644
--- a/doc/design_intermediate_representation.md
+++ b/doc/design_intermediate_representation.md
@@ -61,34 +61,34 @@ type FieldMeta struct {
 
 // TrainIR is the intermediate representation for code generation of a training job
 type TrainIR struct {
-	DataSource       string                 // e.g. "hive://root:root@localhost:10000/churn"
-	Select           string                 // e.g. "select * from iris.train"
-	ValidationSelect string                 // e.g. "select * from iris.val;"
-	Estimator        string                 // e.g. "DNNClassifier"
-	Attribute        map[string]interface{} // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
-	Feature          map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
-	Label            map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+	DataSource       string                          // e.g. "hive://root:root@localhost:10000/churn"
+	Select           string                          // e.g. "select * from iris.train"
+	ValidationSelect string                          // e.g. "select * from iris.val;"
+	Estimator        string                          // e.g. "DNNClassifier"
+	Attribute        map[string]interface{}          // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
+	Feature          map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
+	Label            map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
 }
 
 // PredictIR is the intermediate representation for code generation of a prediction job
 type PredictIR struct {
-	DataSource  string                 // e.g. "hive://root:root@localhost:10000/churn"
-	Select      string                 // e.g. "select * from iris.test"
-	Estimator   string                 // e.g. "DNNClassifier"
-	Attribute   map[string]interface{} // e.g. {"predict.batch_size": 32}
-	Feature     map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
-	Label       map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
-	ReusltTable string                 // e.g. "iris.predict"
+	DataSource  string                          // e.g. "hive://root:root@localhost:10000/churn"
+	Select      string                          // e.g. "select * from iris.test"
+	Estimator   string                          // e.g. "DNNClassifier"
+	Attribute   map[string]interface{}          // e.g. {"predict.batch_size": 32}
+	Feature     map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
+	Label       map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
+	ResultTable string                          // e.g. "iris.predict"
 }
 
 // AnalyzeIR is the intermediate representation for code generation of a analysis job
 type AnalyzeIR struct {
-	DataSource string                 // e.g. "hive://root:root@localhost:10000/churn"
-	Select     string                 // e.g. "select * from iris.train"
-	Estimator  string                 // e.g. "DNNClassifier"
-	Attribute  map[string]interface{} // e.g. {"analyze.plot_type": "bar"}
-	Feature    map[string]FieldMeta   // e.g. {"sepal_length": {"float", "", [1], false}, ...}
-	Label      map[string]FieldMeta   // e.g. {"class": {"int32", "", [1], false}}
+	DataSource string                          // e.g. "hive://root:root@localhost:10000/churn"
+	Select     string                          // e.g. "select * from iris.train"
+	Estimator  string                          // e.g. "DNNClassifier"
+	Attribute  map[string]interface{}          // e.g. {"analyze.plot_type": "bar"}
+	Feature    map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
+	Label      map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
 }
 ```
 

From 445dc5689ab566221bfa4ee5c5a3b3f66e35901a Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)" <29932814+tonyyang-svail@users.noreply.github.com>
Date: Fri, 6 Sep 2019 12:37:22 -0700
Subject: [PATCH 6/6] polish

---
 doc/design_intermediate_representation.md | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/doc/design_intermediate_representation.md b/doc/design_intermediate_representation.md
index aed045889a..86fec3269a 100644
--- a/doc/design_intermediate_representation.md
+++ b/doc/design_intermediate_representation.md
@@ -2,7 +2,7 @@
 
 ## Overview
 
-As SQLFlow is supporting more and more machine learning toolkits, their corresponding code generation logics are better being orgnized as separate packages. An intermediate representation(IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.
+As SQLFlow supports more and more machine learning toolkits, the corresponding code generation logic is better organized as separate packages. An intermediate representation (IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.
 
 The core `sql` package should include the following functionalities:
 1. The entry point of running extended SQL statements.
@@ -92,12 +92,6 @@ type AnalyzeIR struct {
 }
 ```
 
-Please be aware that all the IR excludes the information of working directory. This information belongs to the `executor` in `sql` package.
-- For training job
-  - If `executor` runs the generated program in a temporary directory, it should serialize the directory to a table for later use.
-  - If `executor` runs the generated program in a local directory, it should make sure the prediction and analyze job sees the same directory.
-- For prediction and analyze job, the `executor` should recover everything produced by the training job.
+Please be aware that none of the IRs includes the current working directory. This information belongs to the `executor` in the `sql` package. For a prediction or analysis job, the `executor` should recover everything produced by the training job.
 
 Please be aware that `TrainIR` excludes the saving table name. This information belongs to the `executor` in `sql` package.
-- For a local training job, the result of the generated program contains the trained model. And `executor` is re
-- For a distributed training job, the generated program should garantee that the local directory contains enough information, such as OSS bucket name. So that later on the prediction job find it.