Skip to content

Conversation

@watermen
Copy link
Contributor

@watermen watermen commented Sep 7, 2016

What changes were proposed in this pull request?

The PR will override the sameResult in HiveTableScanExec to make ReuseExchange work in text format table.

How was this patch tested?

SQL

SELECT * FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;

Before

== Physical Plan ==
*BroadcastHashJoin [key#30], [key#34], Inner, BuildRight
:- *BroadcastHashJoin [key#30], [key#32], Inner, BuildRight
:  :- *Filter isnotnull(key#30)
:  :  +- HiveTableScan [key#30, value#31], MetastoreRelation default, src
:  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:     +- *Filter isnotnull(key#32)
:        +- HiveTableScan [key#32, value#33], MetastoreRelation default, src
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(key#34)
      +- HiveTableScan [key#34, value#35], MetastoreRelation default, src

After

== Physical Plan ==
*BroadcastHashJoin [key#2], [key#6], Inner, BuildRight
:- *BroadcastHashJoin [key#2], [key#4], Inner, BuildRight
:  :- *Filter isnotnull(key#2)
:  :  +- HiveTableScan [key#2, value#3], MetastoreRelation default, src
:  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
:     +- *Filter isnotnull(key#4)
:        +- HiveTableScan [key#4, value#5], MetastoreRelation default, src
+- ReusedExchange [key#6, value#7], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))

cc: @davies @cloud-fan

@watermen watermen changed the title [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table [SPARK-17425][SQL] Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table Sep 7, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the default one doesn't work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left.cleanArgs == right.cleanArgs in defalut sameResult return false, because equals in MetastoreRelation compare the output(AttributeReference) and exprIds are diff. We need to erase the exprId.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I think all leaf nodes suffer this problem, can you follow the way how they fix it? e.g.

  override def sameResult(plan: LogicalPlan): Boolean = {
    plan.canonicalized match {
      case LocalRelation(otherOutput, otherData) =>
        otherOutput.map(_.dataType) == output.map(_.dataType) && otherData == data
      case _ => false
    }
  }

Copy link
Contributor Author

@watermen watermen Sep 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReuseExchange work in parquet/orc format, because FileSourceScanExec has override the sameResult.

@SparkQA
Copy link

SparkQA commented Sep 7, 2016

Test build #65017 has finished for PR 14988 at commit 8e537a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 7, 2016

Test build #65018 has finished for PR 14988 at commit d9ba28d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment doesn't match the code. Can you explain more about why the default cleanExpression doesn't work?

Copy link
Contributor Author

@watermen watermen Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, see the pic below, left.cleanArgs == right.cleanArgs in defalut sameResult return false, because equals in MetastoreRelation compare the output(AttributeReference) and exprIds are diff. cleanExpression can't clean the exprId in AttributeReference. So I think we need to override the sameResult in HiveTableScanExec like FileSourceScanExec. Let me know if I don't explain clearly.
sameresult

@SparkQA
Copy link

SparkQA commented Sep 15, 2016

Test build #65441 has finished for PR 14988 at commit d9ba28d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you follow the existing workaround?

override def sameResult(plan: LogicalPlan): Boolean = {
    plan.canonicalized match {
      case LocalRelation(otherOutput, otherData) =>
        otherOutput.map(_.dataType) == output.map(_.dataType) && otherData == data
      case _ => false
    }
  }

then we don't need to override cleanExpression

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only override the sameReult like FileSourceScanExec. I think HiveTableScanExec is used to text format and FileSourceScanExec is used to parquet/orc format.

val result = relation.sameResult(other.relation) &&
output.length == other.output.length &&
output.zip(other.output)
.forall(p => p._1.name == p._2.name && p._1.dataType == p._2.dataType) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the name matter? I'm not quite sure, but LogicalRelation only checks data type

Copy link
Contributor Author

@watermen watermen Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the full output of table src is (key: Int, value: Int), and output1 is (key: Int), output2 is (value: Int), their(output1, output2) dataType are same, but they are diff and can't be resued.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Sep 20, 2016

Test build #65646 has finished for PR 14988 at commit e410d14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in cb324f6 Sep 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants