Conversation

@jiangxb1987 jiangxb1987 commented Jan 12, 2017

What changes were proposed in this pull request?

This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299.

We try to wrap the child by:

  1. Generate the queryOutput by:
    1.1. If the query column names are defined, map the column names to attributes in the child output by name;
    1.2. Else set the child output attributes to queryOutput.
  2. Map the queryOutput to the view output by index; if the corresponding attributes don't match, try to up-cast and alias the attribute in queryOutput to the attribute in the view output.
  3. Add a Project over the child, with the new output generated by the previous steps.
    If the view output doesn't have the same number of columns as either the child output or the query column names, throw an AnalysisException.
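The three steps above can be sketched with simplified stand-in types (Attr here is a hypothetical substitute for Spark's Attribute, and the copy call models the up-cast-and-alias step; this is not the real analyzer code):

```scala
// Toy model of the wrapping steps: Attr stands in for Spark's Attribute.
case class Attr(name: String, dataType: String)

object ViewWrapSketch {
  // Step 1: generate queryOutput, by name if the query column names are
  // defined, otherwise fall back to the child output as-is.
  def queryOutput(queryColumnNames: Seq[String], childOutput: Seq[Attr]): Seq[Attr] =
    if (queryColumnNames.nonEmpty) {
      queryColumnNames.map { n =>
        childOutput.find(_.name == n).getOrElse(
          sys.error(s"AnalysisException: column $n not found in child output"))
      }
    } else {
      childOutput
    }

  // Steps 2-3: map the query output to the view output by index; where the
  // attributes differ, "cast and alias" (modeled here by copying name/type).
  def newOutput(viewOutput: Seq[Attr], qOutput: Seq[Attr]): Seq[Attr] = {
    require(viewOutput.length == qOutput.length,
      "AnalysisException: column counts do not match")
    viewOutput.zip(qOutput).map { case (viewAttr, queryAttr) =>
      if (viewAttr == queryAttr) queryAttr
      else queryAttr.copy(name = viewAttr.name, dataType = viewAttr.dataType)
    }
  }
}
```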

How was this patch tested?

Add new test cases in SQLViewSuite.

SparkQA commented Jan 12, 2017

Test build #71255 has finished for PR 16561 at commit 0e82340.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
val newOutput = output.zip(child.output).map {
case (attr, originAttr) =>
if (attr.dataType != originAttr.dataType) {
Contributor

can you check hive's behavior? maybe we can use UpCast here

Contributor Author

Seems that Hive supports UpCast between child output and view output, for example:

hive> create table testtable as select 1 a, 2L b;
hive> create view testview as select * from testtable;
hive> select * from testview;
OK
1	2
Time taken: 0.11 seconds, Fetched: 1 row(s)
hive> alter table testtable change column a a bigint;
hive> alter table testtable change column b b string;
hive> desc testtable;
OK
a                   	bigint              	                    
b                   	string              	                    
Time taken: 0.15 seconds, Fetched: 2 row(s)
hive> desc testview;
OK
a                   	int                 	                    
b                   	bigint              	                    
Time taken: 0.038 seconds, Fetched: 2 row(s)
hive> select * from testview;
OK
1	2
Time taken: 0.172 seconds, Fetched: 1 row(s)

Contributor Author

What should we set for the walkedTypePath here?

Member

It sounds like Hive just forcefully casts it.

Member

hive> explain extended select * from testview;
OK
ABSTRACT SYNTAX TREE:
  
TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            testview
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF


STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: testtable
          Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: a (type: bigint), b (type: tinyint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: NONE
            ListSink

expressions: a (type: bigint), b (type: tinyint). I tried to alter the columns in the underlying tables to different types. I can see the types of the view columns are always cast to the same ones as the altered columns.

Member

cc @yhuai

@jiangxb1987 jiangxb1987 changed the title [SPARK-18209][SQL] Alias the view with its child by mapping the columns by index [SPARK-18209][SQL][FOLLOWUP] Alias the view with its child by mapping the columns by index Jan 12, 2017
@jiangxb1987 jiangxb1987 changed the title [SPARK-18209][SQL][FOLLOWUP] Alias the view with its child by mapping the columns by index [SPARK-18801][SQL][FOLLOWUP] Alias the view with its child Jan 13, 2017
SparkQA commented Jan 15, 2017

Test build #71400 has finished for PR 16561 at commit 7e19803.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Return the output column names of the query that creates a view, the column names are used to
* resolve a view, should be None if the CatalogTable is not a View or created by older versions
Contributor

should be Nil

object CatalogTable {
val VIEW_DEFAULT_DATABASE = "view.default.database"
val VIEW_QUERY_OUTPUT_PREFIX = "view.query.out."
val VIEW_QUERY_OUTPUT_COLUMN_NUM = VIEW_QUERY_OUTPUT_PREFIX + "numCols"
Contributor

nit: xxx_NUM_COLUMNS

*/
def viewQueryColumnNames: Seq[String] = {
for {
numCols <- properties.get(VIEW_QUERY_OUTPUT_COLUMN_NUM).toSeq
Contributor

.toSeq is not needed

Contributor Author

It is needed to generate the correct output.

val queryColumnNames = desc.viewQueryColumnNames
// If the view output doesn't have the same number of columns either with the child output,
// or with the query column names, throw an AnalysisException.
if (output.length != child.output.length && output.length != queryColumnNames.length) {
Contributor

the comment says "or" but the code uses &&?

SparkQA commented Jan 16, 2017

Test build #71415 has finished for PR 16561 at commit d6537a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 16, 2017

Test build #71416 has finished for PR 16561 at commit 16ec310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* child by:
* 1. Generate the `queryOutput` by:
* 1.1. If the query column names are defined, map the column names to attributes in the child
* output by name;
Contributor

should we mention that, this is mostly for SELECT * ...?

val queryColumnNames = desc.viewQueryColumnNames
// If the view output doesn't have the same number of columns with the child output and the
// query column names, throw an AnalysisException.
if (output.length != child.output.length && output.length != queryColumnNames.length) {
Contributor

This condition doesn't look very clear to me. How about if (queryColumnNames.nonEmpty && output.length != queryColumnNames.length)? When queryColumnNames is empty, it means this view is created prior to Spark 2.2, and we don't need to check anything.

}
// If the child output is the same with the view output, we don't need to generate the query
// output again.
val queryOutput = if (queryColumnNames.nonEmpty && output != child.output) {
Contributor

output != child.output will always be true right?

Contributor Author

@jiangxb1987 jiangxb1987 Jan 16, 2017

For a nested view, the inner view operator may have been resolved; in that case the output is the same as child.output.
I have changed the test case SQLViewSuite.test("correctly resolve a nested view") to cover this case.

Contributor

shall we put this condition after the case? e.g. case v @ View(desc, output, child) if child.resolved && output != child.output

Contributor Author

Oh I think that's better!

}
} else {
child.output
}
Contributor

how about

val queryOutput = if (queryColumnNames.nonEmpty) {
  if (output.length != queryColumnNames.length) throw ...
  desc.viewQueryColumnNames.map { colName =>
    findAttributeByName(colName, child.output, resolver)
  }
} else {
  // For view created before Spark 2.1, the view text is already fully qualified, the plan output is view output.
  child.output
}

* Return false iff we may truncate during casting `from` type to `to` type. e.g. long -> int,
* timestamp -> date.
*/
def canUpCast(from: DataType, to: DataType): Boolean = (from, to) match {
Contributor

how about def mayTruncate? canUpCast is not accurate; we may not be able to cast even if canUpCast returns true.
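The rename discussed here could look roughly like the following sketch, which inverts the canUpCast contract over a toy type encoding (ToyType and CastSketch are hypothetical names, not Spark's real Cast utilities):

```scala
// Toy DataType encoding; not Spark's real type system.
sealed trait ToyType
case object IntType extends ToyType
case object LongType extends ToyType
case object DateType extends ToyType
case object TimestampType extends ToyType

object CastSketch {
  // True when casting from -> to can lose information (the inverse of the
  // canUpCast contract quoted above).
  def mayTruncate(from: ToyType, to: ToyType): Boolean = (from, to) match {
    case (LongType, IntType)       => true // long -> int can drop high bits
    case (TimestampType, DateType) => true // timestamp -> date drops the time part
    case _                         => false
  }
}
```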

SparkQA commented Jan 16, 2017

Test build #71426 has started for PR 16561 at commit 21e63f8.

@cloud-fan
Contributor

LGTM

SparkQA commented Jan 16, 2017

Test build #71434 has finished for PR 16561 at commit c86ab48.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

SparkQA commented Jan 16, 2017

Test build #71437 has finished for PR 16561 at commit 06e8855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in e635cbb Jan 16, 2017
child.output
}
// Map the attributes in the query output to the attributes in the view output by index.
val newOutput = output.zip(queryOutput).map {
Contributor

Seems we need to check the size of output and queryOutput.

Contributor Author

For views created by older versions of Spark, the view text is fully qualified, so the output is the same as the view output. Otherwise, we have already checked that the output has the same length as queryColumnNames. So we don't need to check the sizes of output and queryOutput here.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: jiangxingbo <[email protected]>

Closes apache#16561 from jiangxb1987/alias-view.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
@jiangxb1987 jiangxb1987 deleted the alias-view branch March 16, 2017 06:43
QQshu1 commented May 4, 2017

Hi, I have a question: why should we eliminate View at the beginning of the optimizer? Thank you. @jiangxb1987

@jiangxb1987
Contributor Author

@QQshu1 As we mentioned in the comment, the View operator is kept until the end of the analysis stage so that we can better understand the analyzed logical plan. At the beginning of the optimization stage, the operator is no longer needed, so we apply the EliminateView rule.

QQshu1 commented May 4, 2017

@jiangxb1987 Thanks. What happens if we don't eliminate View? I mean, does it affect the optimization of the tree (i.e. performance), or the correctness of the results?

@cloud-fan
Contributor

the View operator doesn't have a corresponding physical operator, so we have to remove it.
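A minimal illustration of what an EliminateView-style rule does, under the assumption of toy plan nodes rather than Spark's real LogicalPlan API, is to replace each view node with its already-resolved child:

```scala
// Toy plan nodes; ToyView stands in for Spark's View operator.
sealed trait Plan
case class Relation(table: String) extends Plan
case class ToyView(name: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

object EliminateViewSketch {
  // Replace every view node with its (already resolved) child, recursively,
  // so no View remains when physical planning starts.
  def eliminateView(plan: Plan): Plan = plan match {
    case ToyView(_, child)    => eliminateView(child)
    case Project(cols, child) => Project(cols, eliminateView(child))
    case other                => other
  }
}
```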
