[Design] Intermediate Representation #785 (merged)
Commits (6):
- `3f6f824` [Design] Intermediate Representation (tonyyang-svail)
- `9952f30` Update design_intermediate_representation.md (tonyyang-svail)
- `09f3fe5` follow comments (tonyyang-svail)
- `1f04ed6` Update design_intermediate_representation.md (tonyyang-svail)
- `ed49d4b` Update design_intermediate_representation.md (tonyyang-svail)
- `445dc56` polish (tonyyang-svail)
# _Design:_ Intermediate Representation

## Overview

As SQLFlow supports more and more machine learning toolkits, the corresponding code generation logic is better organized as separate packages. An intermediate representation (IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.

The core `sql` package should include the following functionalities:
1. The entry point of running extended SQL statements.
1. The [parsing](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/sql_parser.md) of extended SQL statements.
1. The verification of extended SQL statements, including the syntax and the existence of the selected fields.
1. The [feature derivation](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/feature_derivation.md), including the name, type, shape, and preprocessing method of the selected fields.
1. The [training data and validation data split](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/training_and_validation.md).

With these functionalities, the `sql` package can translate user-typed extended SQL statements into an IR exposed as a Go struct. A codegen package takes the IR and returns a generated Python program for the `sql` package to execute.
## Code Structure

We propose the following code structure.

```
sql/
  ...
  codegen/
    tensorflow/
      train.go
      predict.go
      analyze.go
    xgboost/
      ...
```

The `tensorflow` package will expose a function `func Train(ir sql.TrainIR) (string, error)`, which takes the `sql` package's `TrainIR` and returns a generated Python program.
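To make the shape of such a generator concrete, here is a minimal sketch of what a `codegen/tensorflow` entry point might look like using `text/template`. The `TrainIR` struct below is a trimmed, self-contained stand-in for the proposed `sql.TrainIR` (only three fields), and the rendered Python is illustrative, not the actual SQLFlow template:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TrainIR is a trimmed stand-in for the proposed sql.TrainIR,
// carrying just enough fields for this sketch.
type TrainIR struct {
	DataSource string
	Select     string
	Estimator  string
}

// trainTmpl renders a minimal Python training stub from the IR fields.
var trainTmpl = template.Must(template.New("train").Parse(
	`import tensorflow as tf
# data source: {{.DataSource}}
# training data: {{.Select}}
estimator = tf.estimator.{{.Estimator}}()
`))

// Train mirrors the proposed codegen entry point: IR in, Python source out.
func Train(ir TrainIR) (string, error) {
	var buf bytes.Buffer
	if err := trainTmpl.Execute(&buf, ir); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	code, err := Train(TrainIR{
		DataSource: "hive://root:root@localhost:10000/churn",
		Select:     "select * from iris.train",
		Estimator:  "DNNClassifier",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(code)
}
```

Because the generator is a pure function from IR to source text, it can be unit-tested without a database or a TensorFlow installation.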
## Intermediate Representation

We propose the following structs as the IR for code generation.
```go
package sql

import (
	"github.com/sql-machine-learning/sqlflow/sql/columns"
)

type FieldType int

const (
	Int FieldType = iota
	Float
	String
)

// FieldMeta contains the meta information for decoding and feature columns
type FieldMeta struct {
	DType         FieldType               // e.g. Int, Float
	Delimiter     string                  // e.g. ","
	Shape         []int                   // e.g. [1], [1 2 3]
	IsSparse      bool                    // e.g. false
	FeatureColumn []columns.FeatureColumn // e.g. [EmbeddingColumn, CategoryIDColumn]
}

// TrainIR is the intermediate representation for code generation of a training job
type TrainIR struct {
	DataSource       string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select           string                          // e.g. "select * from iris.train"
	ValidationSelect string                          // e.g. "select * from iris.val;"
	Estimator        string                          // e.g. "DNNClassifier"
	Attribute        map[string]interface{}          // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
	Feature          map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {Float, "", [1], false}, ...}}
	Label            map[string]FieldMeta            // e.g. {"class": {Int, "", [1], false}}
}

// PredictIR is the intermediate representation for code generation of a prediction job
type PredictIR struct {
	DataSource  string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select      string                          // e.g. "select * from iris.test"
	Estimator   string                          // e.g. "DNNClassifier"
	Attribute   map[string]interface{}          // e.g. {"predict.batch_size": 32}
	Feature     map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {Float, "", [1], false}, ...}}
	Label       map[string]FieldMeta            // e.g. {"class": {Int, "", [1], false}}
	ResultTable string                          // e.g. "iris.predict"
}

// AnalyzeIR is the intermediate representation for code generation of an analysis job
type AnalyzeIR struct {
	DataSource string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select     string                          // e.g. "select * from iris.train"
	Estimator  string                          // e.g. "DNNClassifier"
	Attribute  map[string]interface{}          // e.g. {"analyze.plot_type": "bar"}
	Feature    map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {Float, "", [1], false}, ...}}
	Label      map[string]FieldMeta            // e.g. {"class": {Int, "", [1], false}}
}
```
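To show how the `sql` package would populate such an IR, here is a self-contained example constructing a `TrainIR` for the iris training job used in the comments above. The `FieldMeta` and `TrainIR` types are local copies of the structs proposed here, with the `columns.FeatureColumn` field omitted to keep the example runnable on its own:

```go
package main

import "fmt"

// FieldType and FieldMeta mirror the proposed IR types
// (FeatureColumn is omitted to keep this example self-contained).
type FieldType int

const (
	Int FieldType = iota
	Float
	String
)

type FieldMeta struct {
	DType     FieldType
	Delimiter string
	Shape     []int
	IsSparse  bool
}

type TrainIR struct {
	DataSource       string
	Select           string
	ValidationSelect string
	Estimator        string
	Attribute        map[string]interface{}
	Feature          map[string]map[string]FieldMeta
	Label            map[string]FieldMeta
}

func main() {
	// The iris training job from the examples above, expressed as a TrainIR.
	ir := TrainIR{
		DataSource:       "hive://root:root@localhost:10000/churn",
		Select:           "select * from iris.train",
		ValidationSelect: "select * from iris.val",
		Estimator:        "DNNClassifier",
		Attribute: map[string]interface{}{
			"train.epoch":        1000,
			"model.hidden_units": []int{10, 10},
		},
		Feature: map[string]map[string]FieldMeta{
			"feature_columns": {
				"sepal_length": {DType: Float, Shape: []int{1}},
			},
		},
		Label: map[string]FieldMeta{
			"class": {DType: Int, Shape: []int{1}},
		},
	}
	fmt.Println(ir.Estimator, len(ir.Feature["feature_columns"])) // prints: DNNClassifier 1
}
```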
Please be aware that none of the IRs include the current working directory. This information belongs to the `executor` in the `sql` package. For a prediction/analysis job, the `executor` should recover everything produced by the training job.

Please be aware that `TrainIR` excludes the name of the table the trained model is saved to. This information also belongs to the `executor` in the `sql` package.
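One way to picture the executor's role is a dispatch on the IR type: the executor holds job-level state (working directory, saving table) and hands each IR to the matching codegen entry point. This is only a sketch under that assumption; `run` and the trimmed IR structs are illustrative names, not SQLFlow's actual API:

```go
package main

import "fmt"

// Trimmed stand-ins for the proposed IR structs.
type TrainIR struct{ Estimator string }
type PredictIR struct{ Estimator string }

// run dispatches an already-built IR to the matching codegen entry point.
// Working directory and saving-table handling would live here, in the
// executor, rather than inside the IRs themselves.
func run(ir interface{}) (string, error) {
	switch v := ir.(type) {
	case TrainIR:
		return "generated python for training " + v.Estimator, nil
	case PredictIR:
		return "generated python for predicting with " + v.Estimator, nil
	default:
		return "", fmt.Errorf("unknown IR type %T", ir)
	}
}

func main() {
	out, err := run(TrainIR{Estimator: "DNNClassifier"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```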
---

**Comment:** Maybe we don't need to implement an IR for each job; how about simplifying it into a single `ClauseIR`? Each generator can extend the `ClauseIR` as needed.

**Reply:** @Yancey1989 Thanks for the suggestion. Combining all three IRs into a single `ClauseIR` does save some code. However, I still advocate using separate IRs for different job types. Here is my reasoning.

1. `xgboost.Predict` would be confused by the `ValidationSelect` field in `ClauseIR`. Also, as we add more features to SQLFlow, more fields would be added to `ClauseIR`, and the confusion would increase.
1. The job type has to be distinguished either in `sql` or in `codegen`. However, there are many `codegen`s and only one `sql`. Distinguishing the job type in `sql` saves work in all the `codegen`s.