[SPARK-29462][SQL] The data type of "array()" should be array<null> #27521

HyukjinKwon · 2020-02-10T10:28:51Z

What changes were proposed in this pull request?

This brings #26324 back. It was reverted mainly because of Hive compatibility and the lack of investigation into other DBMSes and the ANSI standard.

PostgreSQL seems to coerce a NULL literal to the TEXT type.
Presto seems to coerce array() + array(1) to an array of int.
Hive seems to coerce array() + array(1) to an array of strings.

These systems have made different design choices, each for its own reasons. If we pick one of the two array behaviors, coercing to an array of int makes much more sense.

A further investigation was made offline internally. ANSI SQL 2011, section 6.5 "<contextually typed value specification>", seems to state:

If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:

a) If ES simply contains ARRAY, then ET ARRAY[0].

b) If ES simply contains MULTISET, then ET MULTISET.

ES is effectively replaced by CAST ( ES AS DT )

From reading the other related context, using NullType seems to be the way to go. Given this investigation, choosing the null type seems correct, and we now have Presto as a reference. Therefore, this PR proposes to bring the change back.

Why are the changes needed?

When an empty array is created, it should be declared as array<null>.

Does this PR introduce any user-facing change?

Yes, array() creates array<null>. Now array(1) + array() can correctly create array(1) instead of array("1").
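
To make the difference concrete, here is a minimal toy model in plain Python. It is not Spark's coercion code: the type names, `wider_element_type`, and `empty_array_element_type` are hypothetical, and the promotion rules are simplified assumptions chosen to match the array("1") versus array(1) results described above.

```python
from typing import Optional

# Toy type names; "null" stands in for Spark's NullType.
NULL = "null"


def wider_element_type(left: str, right: str) -> Optional[str]:
    """Return a common element type for two arrays, or None if there is none.

    Assumed, simplified rules (not Spark's real coercion code):
    - identical types stay as they are;
    - "null" yields to whatever the other side is;
    - "int" mixed with "string" is promoted to "string".
    """
    if left == right:
        return left
    if left == NULL:
        return right
    if right == NULL:
        return left
    if {left, right} == {"int", "string"}:
        return "string"
    return None


def empty_array_element_type(legacy: bool) -> str:
    """Element type given to array() with no arguments in this model."""
    return "string" if legacy else NULL


# Old behavior: array() is array<string>, so combining with array(1) gives strings.
old = wider_element_type(empty_array_element_type(legacy=True), "int")
# New behavior: array() is array<null>, so the int element type is kept.
new = wider_element_type(empty_array_element_type(legacy=False), "int")
print(f"old: array<{old}>, new: array<{new}>")
```

In this model the null element type acts as a placeholder that never wins against a concrete type, which is why typing array() as array<null> lets the other operand decide the result type.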

How was this patch tested?

Tested manually

During creation of an array, if CreateArray does not get any children to set the data type for the array, it will create an array of null type. When an empty array is created, it should be declared as array<null>. User-facing change: No. Tested manually.

Closes apache#26324 from amanomer/29462. Authored-by: Aman Omer. Signed-off-by: HyukjinKwon.

HyukjinKwon · 2020-02-10T10:29:29Z

cc @cloud-fan, @maropu, @amanomer, @gengliangwang

SparkQA · 2020-02-10T15:20:54Z

Test build #118153 has finished for PR 27521 at commit 130f808.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-10T15:27:39Z

retest this please

SparkQA · 2020-02-10T15:44:39Z

Test build #118151 has finished for PR 27521 at commit 5b7fad9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-10T20:23:38Z

Test build #118170 has finished for PR 27521 at commit 130f808.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-11T06:47:00Z

Test build #118195 has finished for PR 27521 at commit 90b4660.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-11T08:21:38Z

Thank you @maropu.

Merged to master and branch-3.0.

Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon. Co-authored-by: Aman Omer. Signed-off-by: HyukjinKwon. (cherry picked from commit 0045be7)

cloud-fan · 2020-02-11T09:06:11Z

to be consistent, I think we should do the same thing for map. @Ngone51 can you help with it?

iRakson · 2020-02-11T13:37:35Z

@cloud-fan I will raise PR for map.

### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}. After these changes it will return map<null,null>.

### Why are the changes needed?

After changes introduced due to #27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?

Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?

UT added. Migration guide updated as well.

Closes #27542 from iRakson/SPARK-30790. Authored-by: iRakson. Signed-off-by: Wenchen Fan.

Closes #27542 from iRakson/SPARK-30790. Authored-by: iRakson. Signed-off-by: Wenchen Fan. (cherry picked from commit 926e3a1)

cloud-fan reviewed Feb 10, 2020

cloud-fan approved these changes Feb 10, 2020

Address comments

130f808

maropu approved these changes Feb 10, 2020

dongjoon-hyun added the SQL label Feb 10, 2020

dongjoon-hyun changed the title ~~[SPARK-29462] The data type of "array()" should be array<null>~~ [SPARK-29462][SQL] The data type of "array()" should be array<null> Feb 10, 2020

dongjoon-hyun reviewed Feb 10, 2020

gengliangwang reviewed Feb 10, 2020

Address comments and add a legacy configuration

90b4660

HyukjinKwon force-pushed the SPARK-29462 branch from c74331f to 90b4660 Compare February 11, 2020 02:01

maropu approved these changes Feb 11, 2020

HyukjinKwon closed this in 0045be7 Feb 11, 2020

iRakson mentioned this pull request Feb 11, 2020

[SPARK-30790][SQL] The dataType of map() should be map<null,null> #27542

Closed

HyukjinKwon deleted the SPARK-29462 branch March 3, 2020 01:16
