# _Design:_ Intermediate Representation

## Overview

As SQLFlow supports more and more machine learning toolkits, the corresponding code generation logic is better organized as separate packages. An intermediate representation (IR) of the SQL jobs becomes necessary to connect these separate packages with the core `sql` package.

The core `sql` package should include the following functionalities:
1. The entry point of running extended SQL statements.
1. The [parsing](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/sql_parser.md) of extended SQL statements.
1. The verification of extended SQL statements, including checking the syntax and the existence of the selected fields.
1. The [feature derivation](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/feature_derivation.md), including the name, type, shape, and preprocessing method of the selected fields.
1. The [training data and validation data split](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/training_and_validation.md).

With these functionalities, the `sql` package can translate user-typed extended SQL statements into an IR exposed as a Go struct. The codegen package takes the IR and returns a generated Python program for the `sql` package to execute.

## Code Structure

We propose the following code structure.

```
sql/
...
codegen/
tensorflow/
train.go
predict.go
analyze.go
xgboost/
...
```

The `tensorflow` package will expose a function `func Train(ir sql.TrainIR) (string, error)`, which takes `sql`'s `TrainIR` and returns a generated Python program.
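As an illustration only, a minimal sketch of such a generator might look like the following; the template body, and the assumption that the generator is template-driven, are hypothetical rather than part of this proposal, and `sql.TrainIR` refers to the struct defined in the next section.

```go
package tensorflow

import (
	"bytes"
	"text/template"

	"github.com/sql-machine-learning/sqlflow/sql"
)

// trainTemplate is a placeholder; a real generator would render the complete
// training program from the fields of sql.TrainIR.
var trainTemplate = template.Must(template.New("train").Parse(`# Python program generated by SQLFlow (illustrative only)
import tensorflow as tf

estimator = tf.estimator.{{.Estimator}}
`))

// Train takes sql's TrainIR and returns the generated Python training program.
func Train(ir sql.TrainIR) (string, error) {
	var program bytes.Buffer
	if err := trainTemplate.Execute(&program, ir); err != nil {
		return "", err
	}
	return program.String(), nil
}
```

The `sql` package would call `tensorflow.Train(ir)` after building the IR and then execute the returned Python program.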

## Intermediate Representation

We propose the following structs as the IR for code generation.

```go
package sql

import (
	"github.com/sql-machine-learning/sqlflow/sql/columns"
)

type FieldType int

const (
	Int FieldType = iota
	Float
	String
)

// FieldMeta contains the meta information for decoding and feature columns
type FieldMeta struct {
	DType         FieldType               // e.g. "float", "int32"
	Delimiter     string                  // e.g. ","
	Shape         []int                   // e.g. [1], [1 2 3]
	IsSparse      bool                    // e.g. false
	FeatureColumn []columns.FeatureColumn // e.g. [EmbeddingColumn, CategoryIDColumn]
}
```

> **Reviewer comment (Collaborator):** Maybe we don't need to implement an IR for each job. How about simplifying it like:
>
> ```go
> type FeatureMeta struct {
>     DType     string
>     Delimiter string
>     ...
> }
> type DBConn struct {
>     Driver string
>     User   string
>     ...
> }
> type ClauseIR struct {
>     Estimator    string
>     SelectClause string
>     Attributes   map[string]interface{}
>     DBConn       DBConn
>     Features     map[string]FeatureMeta
>     ...
> }
> ```
>
> Each generator can extend the `ClauseIR` as needed.

> **Author reply:** @Yancey1989 Thanks for the suggestion. Combining all three IRs into a single `ClauseIR` does save some code. However, I still advocate using separate IRs for different job types. Here is my reasoning.
>
> - Avoid confusion. The developer of `xgboost.Predict` would be confused by the `ValidationSelect` field in `ClauseIR`. Also, as we add more features to SQLFlow, more fields would be added to `ClauseIR`, and the confusion would increase.
> - Less work. We either distinguish the job type in `sql` or in `codegen`. However, there are many codegens and only one `sql`. Distinguishing the job type in `sql` saves work in all codegens.

```go
// TrainIR is the intermediate representation for code generation of a training job
type TrainIR struct {
	DataSource       string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select           string                          // e.g. "select * from iris.train"
	ValidationSelect string                          // e.g. "select * from iris.val;"
	Estimator        string                          // e.g. "DNNClassifier"
	Attribute        map[string]interface{}          // e.g. {"train.epoch": 1000, "model.hidden_units": [10 10]}
	Feature          map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label            map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
}

// PredictIR is the intermediate representation for code generation of a prediction job
type PredictIR struct {
	DataSource  string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select      string                          // e.g. "select * from iris.test"
	Estimator   string                          // e.g. "DNNClassifier"
	Attribute   map[string]interface{}          // e.g. {"predict.batch_size": 32}
	Feature     map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label       map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
	ResultTable string                          // e.g. "iris.predict"
}

// AnalyzeIR is the intermediate representation for code generation of an analysis job
type AnalyzeIR struct {
	DataSource string                          // e.g. "hive://root:root@localhost:10000/churn"
	Select     string                          // e.g. "select * from iris.train"
	Estimator  string                          // e.g. "DNNClassifier"
	Attribute  map[string]interface{}          // e.g. {"analyze.plot_type": "bar"}
	Feature    map[string]map[string]FieldMeta // e.g. {"feature_columns": {"sepal_length": {"float", "", [1], false}, ...}}
	Label      map[string]FieldMeta            // e.g. {"class": {"int32", "", [1], false}}
}
```
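
For example (an illustrative sketch, not part of the proposal), a statement that trains a `DNNClassifier` on the Iris data might be translated by the `sql` package into a `TrainIR` value like the following, reusing the example values from the comments above:

```go
// A hypothetical TrainIR value assembled by the sql package; all concrete
// values are illustrative and mirror the examples in the struct comments.
ir := sql.TrainIR{
	DataSource:       "hive://root:root@localhost:10000/churn",
	Select:           "select * from iris.train",
	ValidationSelect: "select * from iris.val;",
	Estimator:        "DNNClassifier",
	Attribute: map[string]interface{}{
		"train.epoch":        1000,
		"model.hidden_units": []int{10, 10},
	},
	Feature: map[string]map[string]sql.FieldMeta{
		"feature_columns": {
			"sepal_length": {DType: sql.Float, Delimiter: "", Shape: []int{1}, IsSparse: false},
			"sepal_width":  {DType: sql.Float, Delimiter: "", Shape: []int{1}, IsSparse: false},
		},
	},
	Label: map[string]sql.FieldMeta{
		"class": {DType: sql.Int, Delimiter: "", Shape: []int{1}, IsSparse: false},
	},
}
```

A codegen package such as `tensorflow` receives this value and renders the corresponding Python training program.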

Please be aware that all of the IRs exclude information about the current working directory. This information belongs to the `executor` in the `sql` package. For a prediction/analysis job, the `executor` should recover everything produced by the training job.

Please also be aware that `TrainIR` excludes the name of the table into which the trained model is saved. This information, too, belongs to the `executor` in the `sql` package.