
Conversation

@dtenedor (Contributor) commented May 25, 2022

What changes were proposed in this pull request?

Support vectorized Parquet scans when the table schema has associated DEFAULT column values.

Example:

```
create table t(i int) using parquet;
insert into t values(42);
alter table t add column s string default concat('abc', 'def');
select * from t;
> 42, 'abcdef'
```
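
To give some intuition for what the scan has to do, here is a minimal, hypothetical sketch in plain Scala (not the actual columnar reader code; the names `fillMissingColumns`, `requiredSchema`, `fileColumns`, and `existenceDefaults` are made up for illustration): when a column requested by the table schema is absent from an older Parquet file, the scan fills it with that column's stored DEFAULT value instead of NULL.

```scala
// Conceptual sketch only: the real change operates on columnar batches inside
// the vectorized Parquet reader, not on per-row Maps.
object DefaultFillSketch {
  def fillMissingColumns(
      requiredSchema: Seq[String],          // columns requested by the table schema
      fileColumns: Set[String],             // columns physically present in the Parquet file
      rowFromFile: Map[String, Any],        // values decoded from the file
      existenceDefaults: Map[String, Any]   // per-column DEFAULT values, where declared
    ): Seq[Any] = {
    requiredSchema.map { col =>
      if (fileColumns.contains(col)) rowFromFile(col)
      else existenceDefaults.getOrElse(col, null) // no DEFAULT declared -> NULL, as before
    }
  }
}
```

For the example above, `fillMissingColumns(Seq("i", "s"), Set("i"), Map("i" -> 42), Map("s" -> "abcdef"))` would yield `Seq(42, "abcdef")`.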

Why are the changes needed?

This change makes it easier to build, query, and maintain tables backed by Parquet data.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

This PR includes new test coverage.

@dtenedor (Contributor, Author)

Synced the latest changes from master; this PR no longer depends on any other unmerged PRs.

@HyukjinKwon (Member)

cc @sadikovi too FYI

@AmplabJenkins

Can one of the admins verify this patch?

@sadikovi (Contributor)

Also, can we add a test to check that the DEFAULT values work? Thanks.

@dtenedor (Contributor, Author) commented Jun 1, 2022

> Also, can we add a test to check that the DEFAULT values work? Thanks.

@sadikovi Sure, this is done in InsertSuite by adding a new configuration to the test case covering Parquet files (previously it only covered the non-vectorized case; with Config(None) it now also runs over the vectorized case):

```
TestCase(
  dataSource = "parquet",
  Seq(
    Config(None),
    Config(
      Some(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false"),
      insertNullsToStorage = false)))
```
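
For completeness, here is a self-contained sketch of what such a round-trip test could look like, assuming a suite extending QueryTest with SharedSparkSession (the suite and test names below are hypothetical, and depending on the Spark version the DEFAULT column feature flag may also need to be enabled):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite name; exercises the DEFAULT behavior under both the
// vectorized and non-vectorized Parquet reader paths.
class DefaultColumnParquetSketchSuite extends QueryTest with SharedSparkSession {
  test("DEFAULT values are returned for rows written before ALTER TABLE ADD COLUMN") {
    Seq("true", "false").foreach { vectorized =>
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
        withTable("t") {
          sql("create table t(i int) using parquet")
          sql("insert into t values (42)")
          sql("alter table t add column s string default concat('abc', 'def')")
          // Rows written before the ALTER should surface the default value.
          checkAnswer(sql("select i, s from t"), Row(42, "abcdef"))
        }
      }
    }
  }
}
```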

@sadikovi (Contributor) left a comment

LGTM. Thanks for updating the test.

@HyukjinKwon (Member) commented Jun 2, 2022

The test log is a bit messy, so I'm just copying and pasting the error I saw:

```
2022-06-02T02:22:55 [info] - SPARK-38336 INSERT INTO statements with tables with default columns: negative tests *** FAILED *** (13 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Failed to execute CREATE TABLE command because the destination table column s has a DEFAULT value of badvalue which fails to resolve as a valid expression: [MISSING_COLUMN] Column 'badvalue' does not exist. Did you mean one of the following? []; line 1 pos 0
[info]   at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.analyze(ResolveDefaultColumnsUtil.scala:141)
[info]   at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$constantFoldCurrentDefaultsToExistDefaults$1(ResolveDefaultColumnsUtil.scala:96)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
[info]   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
[info]   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
[info]   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
[info]   at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.constantFoldCurrentDefaultsToExistDefaults(ResolveDefaultColumnsUtil.scala:94)
[info]   at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:153)
[info]   at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:148)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:76)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:75)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:148)
[info]   at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:64)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
```

@dtenedor (Contributor, Author) commented Jun 2, 2022

@HyukjinKwon the CI passes now :)

@dtenedor (Contributor, Author) commented Jun 3, 2022

@gengliangwang @HyukjinKwon @cloud-fan could someone please merge this, or leave further review comments if another pass is desired?

@gengliangwang (Member) left a comment

Thanks for the work!

gengliangwang pushed a commit that referenced this pull request Jun 3, 2022
### What changes were proposed in this pull request?

Support vectorized Orc scans when the table schema has associated DEFAULT column values.

(Note: this PR depends on #36672, which adds the same support for Parquet files.)

Example:

```
create table t(i int) using orc;
insert into t values(42);
alter table t add column s string default concat('abc', 'def');
select * from t;
> 42, 'abcdef'
```

### Why are the changes needed?

This change makes it easier to build, query, and maintain tables backed by Orc data.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

This PR includes new test coverage.

Closes #36675 from dtenedor/default-orc-vectorized.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>