Skip to content

Conversation

@rahulaggarwalguavus
Copy link

SPARK-5049: ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

  • now forming a GenericRow by inserting column values at correct indexes

…columns in output rows irrespective of the order of the partition columns in the original SELECT query

- forming a Generic row by inserting column values are correct indexes
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@ash211
Copy link
Contributor

ash211 commented Jan 2, 2015

Jenkins this is ok to test

@SparkQA
Copy link

SparkQA commented Jan 2, 2015

Test build #24989 has started for PR 3870 at commit 5000110.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 2, 2015

Test build #24989 has finished for PR 3870 at commit 5000110.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24989/
Test FAILed.

…columns in output rows irrespective of the order of the partition columns in the original SELECT query

- passing newOutput(correct sequence of attributes) in OutputFaker
@SparkQA
Copy link

SparkQA commented Jan 4, 2015

Test build #25031 has started for PR 3870 at commit 8253a7c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 4, 2015

Test build #25031 has finished for PR 3870 at commit 8253a7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25031/
Test PASSed.

@marmbrus
Copy link
Contributor

Thanks for figuring this out and proposing a solution! I guess our test cases missed this since they always perform later column reordering.

I'm a little concerned about the performance impact of this part of the change though:

// Fill outputRow with iter.next()._2 at the correct indexes using normalOutputIndexes
iter.next()._2
  .zipWithIndex
  .foreach(nI => outputRow(normalOutputIndexes(nI._2)) = nI._1)
  new GenericRow(outputRow) 

It's both functional programming (which I normally love, but try to avoid in per-tuple codepaths) and allocates an object.

What do you think of the approach I took in #3990?

@rahulaggarwalguavus
Copy link
Author

Thanks for reviewing. Yes, the approach you took in #3990 avoids this performance penalty.

asfgit pushed a commit that referenced this pull request Jan 12, 2015
Followup to #3870.  Props to rahulaggarwalguavus for identifying the issue.

Author: Michael Armbrust <[email protected]>

Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits:

dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow

(cherry picked from commit 5d9fa55)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 12, 2015
Followup to #3870.  Props to rahulaggarwalguavus for identifying the issue.

Author: Michael Armbrust <[email protected]>

Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits:

dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow
@yhuai
Copy link
Contributor

yhuai commented Jan 21, 2015

Since this issue has been fixed by #3990, we can close it.

@yhuai
Copy link
Contributor

yhuai commented Jan 21, 2015

close this issue

@asfgit asfgit closed this in 622ff09 Jan 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants