[SPARK-20356][SQL] Pruned InMemoryTableScanExec should have correct output partitioning and ordering #17679

viirya · 2017-04-19T03:30:18Z

What changes were proposed in this pull request?

The output of InMemoryTableScanExec can be pruned, so it may no longer match the output of InMemoryRelation and its child plan. This mismatch results in incorrect output partitioning and ordering.
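To make the mismatch concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's code: `Attr`, `HashPart`, and `remap` are hypothetical names, and the positional remapping of partitioning keys is an assumption about how such a mismatch can arise. The attribute names are taken from the plan quoted later in this thread.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's classes.
final case class Attr(name: String, id: Int)
final case class HashPart(keys: Seq[Attr])

object PrunedScanSketch {
  // Remap partitioning keys by pairing `from` and `to` attributes positionally.
  // Keys without a mapping are left unchanged.
  def remap(p: HashPart, from: Seq[Attr], to: Seq[Attr]): HashPart = {
    val mapping = from.zip(to).toMap
    HashPart(p.keys.map(k => mapping.getOrElse(k, k)))
  }

  val id = Attr("id", 237)
  val item = Attr("item", 245)

  val childOutput = Seq(id, item)    // output of the cached child plan
  val relationOutput = Seq(id, item) // full output of the cached relation
  val prunedOutput = Seq(item)       // columns the pruned scan actually reads

  val childPartitioning = HashPart(Seq(id, item))

  // Pairing with the pruned output misaligns the columns: id#237 -> item#245.
  val wrong = remap(childPartitioning, childOutput, prunedOutput)
  // Pairing with the full relation output keeps each key on its own column.
  val right = remap(childPartitioning, childOutput, relationOutput)
}
```

In this model `wrong` ends up hashed on (item#245, item#245), while `right` stays hashed on (id#237, item#245), which is how the data was really distributed.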

How was this patch tested?

Jenkins tests.


viirya · 2017-04-19T03:32:27Z

cc @cloud-fan @hvanhovell @dilipbiswal

SparkQA · 2017-04-19T05:41:08Z

Test build #75927 has finished for PR 17679 at commit 17e1f9e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-19T05:42:22Z

Test build #75926 has finished for PR 17679 at commit b6e42b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2017-04-19T05:58:04Z

@viirya Thank you for the quick fix. The change looks good to me. I have a question: before the fix, we changed the output partitioning of the relation's child, but why was that not reflected in the plan? If it had been, we could have figured out quickly what was wrong. Here is the plan before this fix.

*HashAggregate(keys=[item#245], functions=[count(1)], output=[item#245, count#279L])
+- *HashAggregate(keys=[item#245], functions=[partial_count(1)], output=[item#245, count#295L])
   +- InMemoryTableScan [item#245]
         +- InMemoryRelation [id#237, item#245], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- *HashAggregate(keys=[id#237, item#245], functions=[], output=[id#237, item#245])
                  +- Exchange hashpartitioning(id#237, item#245, 200)
                     +- *HashAggregate(keys=[id#237, item#245], functions=[], output=[id#237, item#245])
                        +- *Project [id#237, group#227 AS item#245]
                           +- *BroadcastHashJoin [item#226], [item#236], Inner, BuildRight
                              :- *Project [_1#223 AS item#226, _2#224 AS group#227]
                              :  +- *Filter isnotnull(_1#223)
                              :     +- LocalTableScan [_1#223, _2#224]
                              +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
                                 +- *Project [_1#233 AS item#236, _2#234 AS id#237]
                                    +- *Filter isnotnull(_1#233)
                                       +- LocalTableScan [_1#233, _2#234]

Is there any indication in the plan that the partitioning info changed? Just want to learn :-)

viirya · 2017-04-19T06:15:08Z

@dilipbiswal outputPartitioning/Ordering is not one of the arguments of a query plan node, so it is not shown in the string representation.
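A small sketch of the same effect in plain Scala. `ToyScan` is a hypothetical stand-in, not a Spark operator: its string form is built from its constructor arguments, so a property derived inside the node never appears in it.

```scala
// Toy model only: `ToyScan` is a hypothetical stand-in, not a Spark operator.
final case class ToyScan(output: Seq[String]) {
  // Derived property: computed by the node, not passed in as an argument.
  def outputPartitioning: String =
    s"hashpartitioning(${output.mkString(", ")})"
}

object PlanStringSketch {
  val scan = ToyScan(Seq("item#245"))
  // A case class's toString is built from its constructor arguments only,
  // so the derived partitioning is absent from the rendered text.
  val rendered: String = scan.toString
}
```

In Spark itself the property can be read directly from a physical plan node instead of from its printed form, for example `df.queryExecution.executedPlan.outputPartitioning` for the root node of a hypothetical DataFrame `df`.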

dilipbiswal · 2017-04-19T06:39:07Z

@viirya Ok.. thank you.

viirya · 2017-04-19T06:40:29Z

Btw, you can sense there might be a problem from the difference in output between InMemoryTableScan [item#245] and InMemoryRelation [id#237, item#245]...

dilipbiswal · 2017-04-19T06:48:26Z

@viirya Isn't that a normal thing, simply due to column pruning? Is it tied to partitioning somehow?

viirya · 2017-04-19T06:51:17Z

Oh, outputPartitioning/Ordering is strongly tied to the output, so when the output changes, it is a strong hint that the partitioning/ordering may be wrong.

dilipbiswal · 2017-04-19T06:52:38Z

@viirya I see. Thanks :-)

cloud-fan · 2017-04-19T07:42:32Z

Will we return an invalid outputPartitioning? Consider an InMemoryTableScanExec that only reads column a while its outputPartitioning is on a, b. Is that expected?

viirya · 2017-04-19T07:47:18Z

@cloud-fan I raised a similar question before in another PR. As I remember, the answer was that an invalid outputPartitioning like this won't cause problems.

viirya · 2017-04-19T07:51:34Z

A similar example is ProjectExec, which takes child.outputPartitioning as its outputPartitioning, even though a projection can also change the child's output, e.g. [a, b] -> [a].
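One possible reading of why a stale partitioning such as hash(a, b) over output [a] is harmless, sketched in plain Scala. The `satisfies` rule below is an assumed simplification of a planner's "is a shuffle still needed" check, and all names are hypothetical, not Spark's API.

```scala
// Toy model only: a simplified "does this partitioning satisfy the required
// clustering" check. Names and the rule itself are assumptions.
object StalePartitioningSketch {
  // Assumed rule: hash partitioning on `keys` satisfies clustering on
  // `required` only when every partitioning key is a required column.
  def satisfies(keys: Seq[String], required: Seq[String]): Boolean =
    keys.nonEmpty && keys.forall(k => required.contains(k))

  // Projection [a, b] -> [a] that still reports hash(a, b): the check fails,
  // so a shuffle would still be planned. Conservative, but correct.
  val staleSatisfies: Boolean = satisfies(Seq("a", "b"), Seq("a"))

  // A partitioning wrongly reported as hash(item, item) for data that was
  // really hashed on (id, item): the check passes and the shuffle is skipped.
  val misalignedSatisfies: Boolean =
    satisfies(Seq("item", "item"), Seq("item"))
}
```

Under this assumed rule, a partitioning that mentions extra pruned columns only costs an additional shuffle, whereas a partitioning that names the wrong columns can cause a required shuffle to be dropped.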

…utput partitioning and ordering

## What changes were proposed in this pull request?

The output of `InMemoryTableScanExec` can be pruned and mismatch with `InMemoryRelation` and its child plan's output. This causes wrong output partitioning and ordering.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh
Closes #17679 from viirya/SPARK-20356.
(cherry picked from commit 773754b)
Signed-off-by: Wenchen Fan

cloud-fan · 2017-04-19T08:02:17Z

thanks, merging to master/2.2!

viirya · 2017-04-19T08:03:23Z

Thanks! @cloud-fan


viirya force-pushed the SPARK-20356 branch from 43144c0 to b6e42b5 Compare April 19, 2017 03:31

17e1f9e Pruned InMemoryTableScanExec should have correct output partitioning and ordering.

viirya force-pushed the SPARK-20356 branch from b6e42b5 to 17e1f9e Compare April 19, 2017 03:35

asfgit closed this in 773754b Apr 19, 2017

viirya deleted the SPARK-20356 branch December 27, 2023 18:20
