Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Apr 19, 2017

What changes were proposed in this pull request?

The output of InMemoryTableScanExec can be pruned and mismatch with InMemoryRelation and its child plan's output. This causes wrong output partitioning and ordering.

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

@SparkQA
Copy link

SparkQA commented Apr 19, 2017

Test build #75927 has finished for PR 17679 at commit 17e1f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2017

Test build #75926 has finished for PR 17679 at commit b6e42b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

@viirya Thank you for a quick fix. The change looks good to me. I have a question. Before the fix, we changed the output partitioning of relation's child. But how come it was not reflected on the plan ? If it was reflected on the plan then we could quickly figure out whats wrong ? Here is the plan before this fix.

*HashAggregate(keys=[item#245], functions=[count(1)], output=[item#245, count#279L])
+- *HashAggregate(keys=[item#245], functions=[partial_count(1)], output=[item#245, count#295L])
   +- InMemoryTableScan [item#245]
         +- InMemoryRelation [id#237, item#245], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- *HashAggregate(keys=[id#237, item#245], functions=[], output=[id#237, item#245])
                  +- Exchange hashpartitioning(id#237, item#245, 200)
                     +- *HashAggregate(keys=[id#237, item#245], functions=[], output=[id#237, item#245])
                        +- *Project [id#237, group#227 AS item#245]
                           +- *BroadcastHashJoin [item#226], [item#236], Inner, BuildRight
                              :- *Project [_1#223 AS item#226, _2#224 AS group#227]
                              :  +- *Filter isnotnull(_1#223)
                              :     +- LocalTableScan [_1#223, _2#224]
                              +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
                                 +- *Project [_1#233 AS item#236, _2#234 AS id#237]
                                    +- *Filter isnotnull(_1#233)
                                       +- LocalTableScan [_1#233, _2#234]

Is there any indication on the plan that partitioning info got changed ? Just want to learn :-)

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

@dilipbiswal As outputPartitioning/Ordering is not one of arguments of query plan, it won't be shown in the string representation.

@dilipbiswal
Copy link
Contributor

@viirya Ok.. thank you.

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

Btw, you can sense there might be a problem since the difference of output between InMemoryTableScan [item#245] and InMemoryRelation [id#237, item#245]...

@dilipbiswal
Copy link
Contributor

@viirya Isn't that a normal thing simon due to column pruning ? Is that stuff tied to partitioning somehow ?

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

Oh, as the outputPartitioning/Ordering is strongly related to output, so when the output is changed, it quite indicates the partitioning/ordering can be wrong.

@dilipbiswal
Copy link
Contributor

@viirya i see. Thanks :-)

@cloud-fan
Copy link
Contributor

will we return invalid outputPartitioning? Think about a InMemoryTableScanExec that only read column a, and the outputPartitioning may be a, b, is that expected?

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

@cloud-fan I've raised similar question before in a PR. I remember I got an answer that an invalid outputPartitioning like this won't cause problem.

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

A similar example is ProjectExec which takes child.outputPartitioning as its outputPartitioning. But a projection can also change child's output like [a, b] -> [a].

asfgit pushed a commit that referenced this pull request Apr 19, 2017
…utput partitioning and ordering

## What changes were proposed in this pull request?

The output of `InMemoryTableScanExec` can be pruned and mismatch with `InMemoryRelation` and its child plan's output. This causes wrong output partitioning and ordering.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <[email protected]>

Closes #17679 from viirya/SPARK-20356.

(cherry picked from commit 773754b)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2!

@viirya
Copy link
Member Author

viirya commented Apr 19, 2017

Thanks! @cloud-fan

@asfgit asfgit closed this in 773754b Apr 19, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…utput partitioning and ordering

## What changes were proposed in this pull request?

The output of `InMemoryTableScanExec` can be pruned and mismatch with `InMemoryRelation` and its child plan's output. This causes wrong output partitioning and ordering.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#17679 from viirya/SPARK-20356.
@viirya viirya deleted the SPARK-20356 branch December 27, 2023 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants