[SPARK-1436] In-memory columnar storage bug fixes #374

liancheng · 2014-04-09T23:01:04Z

Fixed several bugs in in-memory columnar storage to make HiveInMemoryCompatibilitySuite pass.

@rxin @marmbrus It is reasonable to include HiveInMemoryCompatibilitySuite in this PR, but I didn't, since it significantly increases test execution time. What do you think?

UPDATE: HiveCompatibilitySuite has been made to cache tables in memory. HiveInMemoryCompatibilitySuite was removed.

AmplabJenkins · 2014-04-09T23:02:23Z

Merged build triggered.

AmplabJenkins · 2014-04-09T23:02:31Z

Merged build started.

marmbrus · 2014-04-09T23:10:19Z

A few notes.

This is tracked by SPARK-1436.
We are trying to include this in 1.0. If we can't, we will need to turn off cachingTables or something similar, as caching currently results in Spark SQL returning wrong answers.
Running the Hive tests found tons of bugs, so I think we need to include them until there is better unit-test coverage for the columnar code. SPARK-1455 will make the increase in test time less of an issue.

AmplabJenkins · 2014-04-09T23:17:23Z

Merged build triggered.

AmplabJenkins · 2014-04-09T23:17:31Z

Merged build started.

AmplabJenkins · 2014-04-09T23:39:39Z

Merged build finished.

AmplabJenkins · 2014-04-09T23:39:40Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13967/

AmplabJenkins · 2014-04-09T23:55:48Z

Merged build finished.

AmplabJenkins · 2014-04-09T23:55:48Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13968/

AmplabJenkins · 2014-04-10T03:57:23Z

Merged build triggered.

AmplabJenkins · 2014-04-10T03:57:32Z

Merged build started.

AmplabJenkins · 2014-04-10T04:07:43Z

Merged build finished.

AmplabJenkins · 2014-04-10T04:07:43Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13978/

marmbrus · 2014-04-10T04:08:27Z

Jenkins, test this please.

AmplabJenkins · 2014-04-10T04:12:23Z

Merged build triggered.

AmplabJenkins · 2014-04-10T04:12:32Z

Merged build started.

AmplabJenkins · 2014-04-10T05:43:28Z

Merged build finished.

AmplabJenkins · 2014-04-10T05:43:28Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13980/

marmbrus · 2014-04-11T17:35:49Z

Okay, that's a good point. Why don't you just move the before and after functions that turn on caching into HiveCompatibilitySuite and remove the in-memory subclass?

On Apr 11, 2014 10:18 AM, "Cheng Lian" [email protected] wrote:

@marmbrus I think we'd better remove either HiveInMemoryCompatibilitySuite or HiveCompatibilitySuite. Travis complains that the build time is too long (> 50 min). I prefer removing HiveCompatibilitySuite since the in-memory version should have already covered it.
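The suggestion above (put the caching setup and teardown in the one remaining suite instead of a subclass) can be sketched roughly as follows. This is a minimal, self-contained illustration only: `TestSession`, `ComparisonSuite`, `CompatibilitySuite`, and the `cacheTables` flag are hypothetical stand-ins, not the actual Spark or ScalaTest classes.

```scala
// Hypothetical stand-ins for illustration; none of these names come from Spark.
object CachingHooksSketch {
  // Stand-in for a session-wide "cache tables in memory" switch.
  final class TestSession { var cacheTables: Boolean = false }

  // Base suite with no-op lifecycle hooks, loosely modelled on before/after-all hooks.
  abstract class ComparisonSuite(val session: TestSession) {
    def beforeAll(): Unit = ()
    def afterAll(): Unit = ()
    // Returns, for each test, the caching flag that test observed.
    def runTests(): Seq[Boolean]

    final def run(): Seq[Boolean] = {
      beforeAll()
      try runTests() finally afterAll()
    }
  }

  // The hooks live directly in the compatibility suite, so no in-memory subclass is needed.
  final class CompatibilitySuite(s: TestSession) extends ComparisonSuite(s) {
    override def beforeAll(): Unit = { session.cacheTables = true }
    override def afterAll(): Unit = { session.cacheTables = false }
    def runTests(): Seq[Boolean] = Seq.fill(3)(session.cacheTables)
  }
}
```

With this shape every test in the suite runs with caching on, and the flag is reset afterwards even if a test throws.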

liancheng · 2014-04-11T18:25:59Z

OK, merged these two giant suites. Hopefully both Travis and Jenkins are happy with this.

AmplabJenkins · 2014-04-11T18:28:11Z

Merged build triggered.

AmplabJenkins · 2014-04-11T18:28:20Z

Merged build started.

AmplabJenkins · 2014-04-11T18:29:12Z

Merged build finished.

AmplabJenkins · 2014-04-11T18:29:12Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14055/

marmbrus · 2014-04-11T18:55:16Z

Hmm, this failure seems to be related to the new Python Spark SQL API (which isn't even part of this PR). I will investigate.

pwendell · 2014-04-11T20:14:50Z

@marmbrus thanks for catching this. I've submitted #393 to fix it.

marmbrus · 2014-04-14T18:46:40Z

Jenkins, test this please.

AmplabJenkins · 2014-04-14T18:48:13Z

Merged build triggered.

AmplabJenkins · 2014-04-14T18:48:19Z

Merged build started.

AmplabJenkins · 2014-04-14T20:04:28Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-14T20:04:28Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14115/

marmbrus · 2014-04-14T21:04:32Z

@pwendell this is ready to merge.

pwendell · 2014-04-14T22:23:14Z

Thanks, merged into master and 1.0.

@rxin

Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass.

@rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think?

**UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed.

Author: Cheng Lian <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #374 from liancheng/inMemBugFix and squashes the following commits:

6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite
5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression
882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan
32cc9ce [Cheng Lian] Code style cleanup
99382bf [Cheng Lian] Enable compression by default
4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest
d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway?
ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax.
1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.
e36cdd0 [Michael Armbrust] Spelling.
2d0e168 [Michael Armbrust] Run Hive tests in-memory too.
6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan
c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan
9c8fc40 [Cheng Lian] Disable compression by default
e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize
8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument
036cd09 [Cheng Lian] Clean up unused imports
44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account
052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values
95b3301 [Cheng Lian] Fixed bugs in IntegralDelta

(cherry picked from commit 7dbca68)
Signed-off-by: Patrick Wendell <[email protected]>
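The commit list mentions an incorrect byte order when computing a column header size but does not say how it arose. One common way such a bug appears on the JVM is that `ByteBuffer.duplicate()` returns a view whose byte order is big-endian regardless of the source buffer's order. The sketch below illustrates only that general hazard; the two-field header layout and all function names are assumptions made for the example, not the layout or code used in Spark.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object HeaderByteOrderSketch {
  // Hypothetical header layout for illustration: [type id: Int][null count: Int].
  def writeHeader(typeId: Int, nullCount: Int): ByteBuffer = {
    val buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
    buf.putInt(typeId).putInt(nullCount)
    buf.rewind()
    buf
  }

  // Buggy: the duplicate is a BIG_ENDIAN view, so the little-endian Int is misread.
  def nullCountBuggy(column: ByteBuffer): Int =
    column.duplicate().getInt(4)

  // Fixed: restore the writer's byte order on the duplicate before reading.
  def nullCountFixed(column: ByteBuffer): Int =
    column.duplicate().order(ByteOrder.LITTLE_ENDIAN).getInt(4)
}
```

For a header written with a null count of 2, the fixed reader returns 2 while the buggy reader returns 33554432 (the same four bytes interpreted big-endian).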

…#374)

### What changes were proposed in this pull request?

The rule `ExtractGenerator` does not define any trigger condition when rewriting generator functions in `Project`, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules. Two bugs I've found so far:

1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generate function (the other is still unresolved), then rewrite it. The other one gets resolved later and gets rewritten later.
2. When a generator function is put after `SELECT *`, it's possible that `*` is not expanded yet when we enter `ExtractGenerator`. The rule rewrites the generator function: insert a `Generate` operator below, and add a new column to the projectList for the generator function output. Then we expand `*` to the child plan output which is `Generate`, we end up with two identical columns for the generate function output.

This PR fixes it by adding a trigger condition when rewriting generator functions in `Project`: the projectList should be resolved or a generator function. This is the same trigger condition we used for `Aggregate`.

To avoid breaking changes, this PR also allows multiple generator functions in `Project`, which works totally fine.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

Yes, now multiple generator functions are allowed in `Project`. And there won't be duplicated columns for generator function output.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45350 from cloud-fan/generate.

Lead-authored-by: Wenchen Fan <[email protected]>
(cherry picked from commit 51f4cfa)
Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
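The trigger condition described in that commit message can be pictured with a toy model. The sketch below is an assumption-laden illustration and not Catalyst code: `Expr`, `Column`, `Generator`, `Star`, and `shouldRewrite` are hypothetical names, and the condition is read here as "every item in the project list is either resolved or a generator".

```scala
object GeneratorTriggerSketch {
  // Toy expression model; these types are illustrative, not Catalyst classes.
  sealed trait Expr { def resolved: Boolean }
  final case class Column(name: String) extends Expr { val resolved = true }
  final case class Generator(name: String, resolved: Boolean = true) extends Expr
  // Stands for a `*` that the analyzer has not expanded yet.
  case object Star extends Expr { val resolved = false }

  // Rewrite only when there is a generator to extract and every item in the
  // project list is either resolved or itself a generator.
  def shouldRewrite(projectList: Seq[Expr]): Boolean =
    projectList.exists(_.isInstanceOf[Generator]) &&
      projectList.forall(e => e.resolved || e.isInstanceOf[Generator])
}
```

Under this reading, a project list that still contains an unexpanded `*` is skipped until a later analyzer pass, which avoids the duplicated generator output column, while a list holding several generators qualifies in a single pass.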

liancheng and others added 12 commits April 11, 2014 10:00

Fixed bugs in IntegralDelta

95b3301

Bug fix: should only gather compressibility info for non-null values

052bf41

Bug fix: NullableColumnAccessor.hasNext must take nulls into account

44591a5

Clean up unused imports

036cd09

Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument

8426ddc

Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize

e619995

Disable compression by default

9c8fc40

Let InsertIntoTable support InMemoryColumnarTableScan

c9b0f6f

Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan

6360723

Run Hive tests in-memory too.

2d0e168

Spelling.

e36cdd0

Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows.

1965123

Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite

6ad6d9b
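Commit 1965123 in the list above says that gathering data with coalesce does not work correctly with mutable rows, without giving details. The general hazard with reused mutable rows can be shown in isolation: if a producer hands out the same row object for every element, anything that buffers references without copying ends up with many pointers to the last value. The sketch below assumes a hypothetical single-column `MutableRow` class and is not Spark's implementation.

```scala
object MutableRowSketch {
  // Hypothetical single-column mutable row; not Spark's row classes.
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // Producer that reuses one row object for every element to avoid allocation.
  def scan(values: Seq[Int]): Iterator[MutableRow] = {
    val row = new MutableRow(0)
    values.iterator.map { v => row.value = v; row }
  }

  // Buffering references keeps several pointers to the same object,
  // so every buffered entry shows the last value written.
  def gatherWithoutCopy(values: Seq[Int]): Seq[Int] =
    scan(values).toList.map(_.value)

  // Copying each row before buffering preserves the individual values.
  def gatherWithCopy(values: Seq[Int]): Seq[Int] =
    scan(values).map(_.copy()).toList.map(_.value)
}
```

For the input `Seq(1, 2, 3)`, the non-copying version yields `List(3, 3, 3)` and the copying version yields `List(1, 2, 3)`.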

asfgit closed this in 7dbca68 Apr 14, 2014

liancheng changed the title ~~[BUGFIX] In-memory columnar storage bug fixes~~ [SPARK-1436] In-memory columnar storage bug fixes Apr 15, 2014

liancheng deleted the inMemBugFix branch July 3, 2014 21:27

tangzhankun pushed a commit to tangzhankun/spark that referenced this pull request Jul 25, 2017

Add implicit conversions to imports. (apache#374)

8c35d81

Otherwise we can get a Scalastyle error when building from SBT.

erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017

Add implicit conversions to imports. (apache#374)

f46443e

Otherwise we can get a Scalastyle error when building from SBT.

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Merge pull request apache#374 from liu-sheng/138

1ec7ad1

Use different properties and existed id to test confliction cases

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

PIC-34: Rename default configmap name to be consistent with mapr-kubernetes (apache#374)

112649b
