SPARK-5049: ParquetTableScan always prepends the values of partition col... #3870
…columns in output rows irrespective of the order of the partition columns in the original SELECT query - forming a GenericRow by inserting column values at the correct indexes
Can one of the admins verify this patch?
Jenkins this is ok to test
Test build #24989 has started for PR 3870 at commit
Test build #24989 has finished for PR 3870 at commit
Test FAILed.
…columns in output rows irrespective of the order of the partition columns in the original SELECT query - passing newOutput(correct sequence of attributes) in OutputFaker
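To make the reported behaviour concrete, here is a minimal, Spark-free sketch. All names in it (`requested`, `partitionCols`, `prepend`, `placeByIndex`) are hypothetical and are not taken from ParquetTableScan; it only illustrates why always prepending partition values breaks the column order of the SELECT, and how an index mapping could restore it.

```scala
object PartitionOrderSketch {
  // Columns in the order the SELECT asked for; "year" is assumed to be a partition column.
  val requested: Seq[String] = Seq("name", "year", "score")
  val partitionCols: Set[String] = Set("year")

  // Behaviour described in the PR title: partition values always come first.
  def prepend(partitionValues: Seq[Any], dataValues: Seq[Any]): Seq[Any] =
    partitionValues ++ dataValues

  // Index-based placement: each value goes to its position in `requested`.
  def placeByIndex(partitionValues: Seq[Any], dataValues: Seq[Any]): Seq[Any] = {
    // Split the output positions into partition slots and data slots.
    val (partIdx, dataIdx) =
      requested.indices.partition(i => partitionCols.contains(requested(i)))
    val out = new Array[Any](requested.length)
    partIdx.zip(partitionValues).foreach { case (i, v) => out(i) = v }
    dataIdx.zip(dataValues).foreach { case (i, v) => out(i) = v }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Partition value first, regardless of the requested order.
    println(prepend(Seq(2014), Seq("a", 1.5)).mkString(", "))
    // Values in the requested order: name, year, score.
    println(placeByIndex(Seq(2014), Seq("a", 1.5)).mkString(", "))
  }
}
```

The index split is computed from the schema alone, so in a real scan it would be worked out once per query rather than once per row.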
Test build #25031 has started for PR 3870 at commit
Test build #25031 has finished for PR 3870 at commit
Test PASSed.
Thanks for figuring this out and proposing a solution! I guess our test cases missed this since they always perform later column reordering. I'm a little concerned about the performance impact of this part of the change though:

    // Fill outputRow with iter.next()._2 at the correct indexes using normalOutputIndexes
    iter.next()._2
      .zipWithIndex
      .foreach(nI => outputRow(normalOutputIndexes(nI._2)) = nI._1)
    new GenericRow(outputRow)

It's both functional programming (which I normally love, but try to avoid in per-tuple codepaths) and allocates an object. What do you think of the approach I took in #3990?
Thanks for reviewing. Yes, the approach you took in #3990 avoids this performance penalty.
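The performance concern can be sketched without Spark. The code below is not the change made in #3990; it is a hypothetical illustration (the names `fillRow` and `dataOrdinals`, and the plain `Array[Any]` standing in for a mutable row, are assumptions) of filling a reused output buffer with a `while` loop, so that the per-tuple path creates no closures, index tuples, or new row objects.

```scala
object PerTupleFillSketch {
  // Copies data(i) into out(dataOrdinals(i)) with an index-based loop.
  // `out` is reused across tuples, so no new row is created per call.
  def fillRow(data: Array[Any], dataOrdinals: Array[Int], out: Array[Any]): Unit = {
    var i = 0
    while (i < data.length) {
      out(dataOrdinals(i)) = data(i)
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed output layout: (name, year, score). "year" is a partition value,
    // constant for the whole partition, so it is written once up front.
    val out = new Array[Any](3)
    out(1) = 2014
    // Data column i goes to output slot dataOrdinals(i).
    val dataOrdinals = Array(0, 2)

    val tuples = Iterator(Array[Any]("a", 1.5), Array[Any]("b", 2.5))
    while (tuples.hasNext) {
      fillRow(tuples.next(), dataOrdinals, out)
      println(out.mkString(", "))
    }
  }
}
```

Reusing one buffer means a consumer that needs to keep a row must copy it first, which is the usual trade-off of this style.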
Followup to #3870. Props to rahulaggarwalguavus for identifying the issue.
Author: Michael Armbrust
Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits:
dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow
(cherry picked from commit 5d9fa55)
Signed-off-by: Michael Armbrust
Since this issue has been fixed by #3990, we can close it.
close this issue
SPARK-5049: ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query