-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-1436] In-memory columnar storage bug fixes #374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Merged build triggered. |
|
Merged build started. |
|
A few notes.
|
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13967/ |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13968/ |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13978/ |
|
Jenkins, test this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13980/ |
…the attributes argument
…t does not work correctly with mutable rows.
|
Okay, that's a good point. Why don't you just move the before and after
|
|
OK, merged these two giant suites, hope both Travis and Jenkins are happy with this. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14055/ |
|
Hmm, this failure seems to be related to the new Python Spark SQL API (which isn't even part of this PR). I will investigate. |
|
Jenkins, test this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
@pwendell this is ready to merge. |
|
Thanks merged into master and 1.0 |
Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass. @rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think? **UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed. Author: Cheng Lian <[email protected]> Author: Michael Armbrust <[email protected]> Closes #374 from liancheng/inMemBugFix and squashes the following commits: 6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite 5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression 882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan 32cc9ce [Cheng Lian] Code style cleanup 99382bf [Cheng Lian] Enable compression by default 4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway? ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax. 1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows. e36cdd0 [Michael Armbrust] Spelling. 2d0e168 [Michael Armbrust] Run Hive tests in-memory too. 6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan 9c8fc40 [Cheng Lian] Disable compression by default e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize 8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument 036cd09 [Cheng Lian] Clean up unused imports 44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account 052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values 95b3301 [Cheng Lian] Fixed bugs in IntegralDelta (cherry picked from commit 7dbca68) Signed-off-by: Patrick Wendell <[email protected]>
Fixed several bugs of in-memory columnar storage to make `HiveInMemoryCompatibilitySuite` pass. @rxin @marmbrus It is reasonable to include `HiveInMemoryCompatibilitySuite` in this PR, but I didn't, since it significantly increases test execution time. What do you think? **UPDATE** `HiveCompatibilitySuite` has been made to cache tables in memory. `HiveInMemoryCompatibilitySuite` was removed. Author: Cheng Lian <[email protected]> Author: Michael Armbrust <[email protected]> Closes apache#374 from liancheng/inMemBugFix and squashes the following commits: 6ad6d9b [Cheng Lian] Merged HiveCompatibilitySuite and HiveInMemoryCompatibilitySuite 5bdbfe7 [Cheng Lian] Revert 882c538 & 8426ddc, which introduced regression 882c538 [Cheng Lian] Remove attributes field from InMemoryColumnarTableScan 32cc9ce [Cheng Lian] Code style cleanup 99382bf [Cheng Lian] Enable compression by default 4390bcc [Cheng Lian] Report error for any Throwable in HiveComparisonTest d1df4fd [Michael Armbrust] Remove test tables that might always get created anyway? ab9e807 [Michael Armbrust] Fix the logged console version of failed test cases to use the new syntax. 1965123 [Michael Armbrust] Don't use coalesce for gathering all data to a single partition, as it does not work correctly with mutable rows. e36cdd0 [Michael Armbrust] Spelling. 2d0e168 [Michael Armbrust] Run Hive tests in-memory too. 6360723 [Cheng Lian] Made PreInsertionCasts support SparkLogicalPlan and InMemoryColumnarTableScan c9b0f6f [Cheng Lian] Let InsertIntoTable support InMemoryColumnarTableScan 9c8fc40 [Cheng Lian] Disable compression by default e619995 [Cheng Lian] Bug fix: incorrect byte order in CompressionScheme.columnHeaderSize 8426ddc [Cheng Lian] Bug fix: InMemoryColumnarTableScan should cache columns specified by the attributes argument 036cd09 [Cheng Lian] Clean up unused imports 44591a5 [Cheng Lian] Bug fix: NullableColumnAccessor.hasNext must take nulls into account 052bf41 [Cheng Lian] Bug fix: should only gather compressibility info for non-null values 95b3301 [Cheng Lian] Fixed bugs in IntegralDelta
Otherwise we can get a Scalastyle error when building from SBT.
Otherwise we can get a Scalastyle error when building from SBT.
Use different properties and existed id to test confliction cases
…#374) ### What changes were proposed in this pull request? The rule `ExtractGenerator` does not define any trigger condition when rewriting generator functions in `Project`, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules. Two bugs I've found so far: 1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generate function (the other is still unresolved), then rewrite it. The other one gets resolved later and gets rewritten later. 2. When a generator function is put after `SELECT *`, it's possible that `*` is not expanded yet when we enter `ExtractGenerator`. The rule rewrites the generator function: insert a `Generate` operator below, and add a new column to the projectList for the generator function output. Then we expand `*` to the child plan output which is `Generate`, we end up with two identical columns for the generate function output. This PR fixes it by adding a trigger condition when rewriting generator functions in `Project`: the projectList should be resolved or a generator function. This is the same trigger condition we used for `Aggregate`. To avoid breaking changes, this PR also allows multiple generator functions in `Project`, which works totally fine. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now multiple generator functions are allowed in `Project`. And there won't be duplicated columns for generator function output. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45350 from cloud-fan/generate. Lead-authored-by: Wenchen Fan <[email protected]> (cherry picked from commit 51f4cfa) Signed-off-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]>
Fixed several bugs of in-memory columnar storage to make
HiveInMemoryCompatibilitySuitepass.@rxin @marmbrus It is reasonable to include
HiveInMemoryCompatibilitySuitein this PR, but I didn't, since it significantly increases test execution time. What do you think?UPDATE
HiveCompatibilitySuitehas been made to cache tables in memory.HiveInMemoryCompatibilitySuitewas removed.