[SPARK-18801][SQL][FOLLOWUP] Alias the view with its child #16561
Test build #71255 has finished for PR 16561 at commit
}
val newOutput = output.zip(child.output).map {
  case (attr, originAttr) =>
    if (attr.dataType != originAttr.dataType) {
Can you check Hive's behavior? Maybe we can use UpCast here.
Seems that Hive supports UpCast between child output and view output, for example:
hive> create table testtable as select 1 a, 2L b;
hive> create view testview as select * from testtable;
hive> select * from testview;
OK
1 2
Time taken: 0.11 seconds, Fetched: 1 row(s)
hive> alter table testtable change column a a bigint;
hive> alter table testtable change column b b string;
hive> desc testtable;
OK
a bigint
b string
Time taken: 0.15 seconds, Fetched: 2 row(s)
hive> desc testview;
OK
a int
b bigint
Time taken: 0.038 seconds, Fetched: 2 row(s)
hive> select * from testview;
OK
1 2
Time taken: 0.172 seconds, Fetched: 1 row(s)
What should we set for the walkedTypePath here?
It sounds like Hive just forcefully casts it.
hive> explain extended select * from testview;
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
testview
TOK_INSERT
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
TOK_ALLCOLREF
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: testtable
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: a (type: bigint), b (type: tinyint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: NONE
ListSink
Note the line "expressions: a (type: bigint), b (type: tinyint)". I tried altering the columns in the underlying table to different types, and the view columns are always cast to the altered types.
cc @yhuai
Test build #71400 has finished for PR 16561 at commit
/**
 * Return the output column names of the query that creates a view, the column names are used to
 * resolve a view, should be None if the CatalogTable is not a View or created by older versions
should be Nil
object CatalogTable {
  val VIEW_DEFAULT_DATABASE = "view.default.database"
  val VIEW_QUERY_OUTPUT_PREFIX = "view.query.out."
  val VIEW_QUERY_OUTPUT_COLUMN_NUM = VIEW_QUERY_OUTPUT_PREFIX + "numCols"
nit: xxx_NUM_COLUMNS
 */
def viewQueryColumnNames: Seq[String] = {
  for {
    numCols <- properties.get(VIEW_QUERY_OUTPUT_COLUMN_NUM).toSeq
.toSeq is not needed
It is needed to generate the correct output.
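To make the `.toSeq` point concrete, here is a minimal, self-contained sketch of the property lookup. It is not the code in this PR: the `properties` map is passed in as a plain `Map[String, String]`, the per-column key prefix `COLUMN_NAME_PREFIX` is a hypothetical name (the diff excerpt does not show it), and `IllegalStateException` stands in for whatever exception the real code throws. Converting the `Option` to a `Seq` first makes both generators sequences, so the comprehension as a whole is typed as `Seq[String]`.

```scala
object ViewPropsSketch {
  // Keys copied from the diff excerpt under review.
  val VIEW_QUERY_OUTPUT_PREFIX = "view.query.out."
  val VIEW_QUERY_OUTPUT_COLUMN_NUM = VIEW_QUERY_OUTPUT_PREFIX + "numCols"
  // Hypothetical per-column key prefix; an assumption made for this sketch.
  val COLUMN_NAME_PREFIX = VIEW_QUERY_OUTPUT_PREFIX + "col."

  def viewQueryColumnNames(properties: Map[String, String]): Seq[String] = {
    for {
      // Option[String] -> Seq[String]: empty when the property is absent.
      numCols <- properties.get(VIEW_QUERY_OUTPUT_COLUMN_NUM).toSeq
      index <- 0 until numCols.toInt
    } yield properties.getOrElse(
      s"$COLUMN_NAME_PREFIX$index",
      // Stand-in exception type for this sketch.
      throw new IllegalStateException(s"Missing name for view query column $index"))
  }
}
```

When the `numCols` property is absent, the result is an empty sequence, which matches the "should be Nil" remark above.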
val queryColumnNames = desc.viewQueryColumnNames
// If the view output doesn't have the same number of columns either with the child output,
// or with the query column names, throw an AnalysisException.
if (output.length != child.output.length && output.length != queryColumnNames.length) {
The comment says "or" but the code uses &&?
Test build #71415 has finished for PR 16561 at commit
Test build #71416 has finished for PR 16561 at commit
 * child by:
 * 1. Generate the `queryOutput` by:
 *    1.1. If the query column names are defined, map the column names to attributes in the child
 *         output by name;
Should we mention that this is mostly for SELECT * ...?
val queryColumnNames = desc.viewQueryColumnNames
// If the view output doesn't have the same number of columns with the child output and the
// query column names, throw an AnalysisException.
if (output.length != child.output.length && output.length != queryColumnNames.length) {
This condition doesn't look very clear to me. How about if (queryColumnNames.nonEmpty && output.length != queryColumnNames.length)? When queryColumnNames is empty, it means the view was created prior to Spark 2.2, and we don't need to check anything.
}
// If the child output is the same with the view output, we don't need to generate the query
// output again.
val queryOutput = if (queryColumnNames.nonEmpty && output != child.output) {
output != child.output will always be true, right?
For a nested view, the inner view operator may already have been resolved; in that case the output is the same as child.output.
I have changed the test case SQLViewSuite.test("correctly resolve a nested view") to cover this case.
Shall we put this condition after the case, as a guard? e.g. case v @ View(desc, output, child) if child.resolved && output != child.output
Oh I think that's better!
  }
} else {
  child.output
}
How about:
val queryOutput = if (queryColumnNames.nonEmpty) {
  if (output.length != queryColumnNames.length) throw ...
  desc.viewQueryColumnNames.map { colName =>
    findAttributeByName(colName, child.output, resolver)
  }
} else {
  // For view created before Spark 2.1, the view text is already fully qualified, the plan output is view output.
  child.output
}
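A runnable version of this suggestion, under simplifying assumptions, might look like the sketch below. `Attr` and `Resolver` are simplified stand-ins for the Catalyst types rather than the real classes, the body of `findAttributeByName` is hypothetical (only its call appears in the suggestion), and `IllegalArgumentException` stands in for `AnalysisException`.

```scala
object QueryOutputSketch {
  // Simplified stand-in for a Catalyst attribute; not the real class.
  final case class Attr(name: String, dataType: String)

  // Stand-in for the analyzer's name resolver.
  type Resolver = (String, String) => Boolean
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical helper body; only the call site appears in the review.
  def findAttributeByName(name: String, candidates: Seq[Attr], resolver: Resolver): Attr =
    candidates.find(a => resolver(a.name, name)).getOrElse(
      throw new IllegalArgumentException(
        s"Attribute '$name' not found among (${candidates.map(_.name).mkString(", ")})"))

  def queryOutput(
      viewOutput: Seq[Attr],
      queryColumnNames: Seq[String],
      childOutput: Seq[Attr],
      resolver: Resolver): Seq[Attr] = {
    if (queryColumnNames.nonEmpty) {
      // Only views that recorded their query column names need the length check.
      if (viewOutput.length != queryColumnNames.length) {
        throw new IllegalArgumentException(
          s"View has ${viewOutput.length} columns but ${queryColumnNames.length} query column names")
      }
      // Resolve by name, so a reordered or widened child output still lines up.
      queryColumnNames.map(findAttributeByName(_, childOutput, resolver))
    } else {
      // No recorded names: fall back to the child output as-is.
      childOutput
    }
  }
}
```

Resolving by name is what lets the view tolerate a child whose columns were reordered or extended after the view was created, which is the SELECT * case raised earlier in the thread.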
 * Return false iff we may truncate during casting `from` type to `to` type. e.g. long -> int,
 * timestamp -> date.
 */
def canUpCast(from: DataType, to: DataType): Boolean = (from, to) match {
How about def mayTruncate? canUpCast is not accurate: we may not be able to cast even if canUpCast returns true.
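As a rough illustration of the proposed naming, the sketch below defines a `mayTruncate` over a tiny, made-up type hierarchy. `SimpleType` and its members are assumptions for this example, not Spark's `DataType`, and the rule set is deliberately small and conservative; it is not the list of rules the PR implements. The two examples from the doc comment (long -> int, timestamp -> date) both report possible truncation.

```scala
object UpCastSketch {
  // Made-up miniature type hierarchy; not Spark's DataType.
  sealed trait SimpleType
  case object ByteT extends SimpleType
  case object ShortT extends SimpleType
  case object IntT extends SimpleType
  case object LongT extends SimpleType
  case object StringT extends SimpleType
  case object DateT extends SimpleType
  case object TimestampT extends SimpleType

  // Integral types ordered from narrowest to widest.
  private val integralPrecedence: Seq[SimpleType] = Seq(ByteT, ShortT, IntT, LongT)

  /** Returns true iff casting `from` to `to` may lose information (sketch rules only). */
  def mayTruncate(from: SimpleType, to: SimpleType): Boolean = (from, to) match {
    case (f, t) if f == t => false
    // Assumption: every type in this sketch has a lossless string rendering.
    case (_, StringT) => false
    case (DateT, TimestampT) => false
    case (f, t) if integralPrecedence.contains(f) && integralPrecedence.contains(t) =>
      // Narrowing (e.g. long -> int) may truncate; widening may not.
      integralPrecedence.indexOf(f) > integralPrecedence.indexOf(t)
    // Conservative default: anything not listed above is treated as unsafe.
    case _ => true
  }
}
```

The name reads as a negative guarantee: a `false` result means no truncation under these rules, while it says nothing about whether the cast itself is otherwise legal.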
Test build #71426 has started for PR 16561 at commit
LGTM
Test build #71434 has finished for PR 16561 at commit
LGTM
Test build #71437 has finished for PR 16561 at commit
thanks, merging to master!
  child.output
}
// Map the attributes in the query output to the attributes in the view output by index.
val newOutput = output.zip(queryOutput).map {
Seems we need to check the size of output and queryOutput.
For views created by older versions of Spark, the view text is fully qualified, so the child output is the same as the view output. Otherwise, we have already checked that the output has the same length as queryColumnNames. So perhaps we don't need to check the sizes of output and queryOutput here.
## What changes were proposed in this pull request?

This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299.

We try to wrap the child by:
1. Generate the `queryOutput` by:
   1.1. If the query column names are defined, map the column names to attributes in the child output by name;
   1.2. Else set the child output attributes to `queryOutput`.
2. Map the `queryOutput` to view output by index; if the corresponding attributes don't match, try to up cast and alias the attribute in `queryOutput` to the attribute in the view output.
3. Add a Project over the child, with the new output generated by the previous steps.

If the view output has the same number of columns as neither the child output nor the query column names, throw an AnalysisException.

## How was this patch tested?

Add new test cases in `SQLViewSuite`.

Author: jiangxingbo <[email protected]>

Closes apache#16561 from jiangxb1987/alias-view.
Hi, I have a question: why should we eliminate View at the start of the optimizer?
|
@QQshu1 As we have mentioned in the comment, the |
@jiangxb1987 Thanks. What is the effect if we don't eliminate View? I mean, does it affect the optimization of the tree (that is, performance) or the correctness of the results?
|
the |
What changes were proposed in this pull request?
This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299.
We try to wrap the child by:
1. Generate the `queryOutput` by:
   1.1. If the query column names are defined, map the column names to attributes in the child output by name;
   1.2. Else set the child output attributes to `queryOutput`.
2. Map the `queryOutput` to view output by index; if the corresponding attributes don't match, try to up cast and alias the attribute in `queryOutput` to the attribute in the view output.
3. Add a Project over the child, with the new output generated by the previous steps.
If the view output has the same number of columns as neither the child output nor the query column names, throw an AnalysisException.
How was this patch tested?
Add new test cases in `SQLViewSuite`.
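Steps 2 and 3 of the description can be sketched as follows, under simplifying assumptions. `Attr`, `Keep` and `CastAlias` are stand-ins invented for this example (the real code works on Catalyst attributes and expressions), `mayTruncate` is passed in as a plain function, and `IllegalArgumentException` stands in for `AnalysisException`. The explicit size check is included because `zip` silently drops the trailing elements of the longer sequence.

```scala
object AliasViewSketch {
  // Simplified stand-ins for Catalyst attributes and expressions; not the real classes.
  final case class Attr(name: String, dataType: String)

  sealed trait ProjectItem
  // The query attribute already matches the view attribute.
  final case class Keep(attr: Attr) extends ProjectItem
  // Cast the query attribute to the view type and alias it to the view name.
  final case class CastAlias(child: Attr, toType: String, alias: String) extends ProjectItem

  /** Builds the project list that would sit on top of the view's child plan. */
  def projectList(
      viewOutput: Seq[Attr],
      queryOutput: Seq[Attr],
      mayTruncate: (String, String) => Boolean): Seq[ProjectItem] = {
    // zip truncates to the shorter side, so a mismatch must be caught up front.
    require(viewOutput.length == queryOutput.length,
      s"View has ${viewOutput.length} columns but the query output has ${queryOutput.length}")
    // Step 2: pair the attributes by index.
    viewOutput.zip(queryOutput).map { case (viewAttr, queryAttr) =>
      if (viewAttr == queryAttr) {
        Keep(queryAttr)
      } else {
        // Refuse casts that could lose information (from query type to view type).
        if (mayTruncate(queryAttr.dataType, viewAttr.dataType)) {
          throw new IllegalArgumentException(
            s"Cannot up cast ${queryAttr.name} from ${queryAttr.dataType} to ${viewAttr.dataType}")
        }
        CastAlias(queryAttr, viewAttr.dataType, viewAttr.name)
      }
    }
    // Step 3 (not modelled here): wrap the child in a Project over this list.
  }
}
```

For example, a view column `a: bigint` over a query column `a: int` yields a `CastAlias`, while the reverse direction is rejected whenever the supplied `mayTruncate` reports bigint -> int as lossy.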