[SPARK-29462] The data type of "array()" should be array<null> #26324

amanomer · 2019-10-30T12:09:52Z

What changes were proposed in this pull request?

During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type .

Why are the changes needed?

When empty array is created, it should be declared as array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested manually

amanomer · 2019-10-30T13:32:53Z

cc @srowen @MaxGekk @dongjoon-hyun

srowen · 2019-10-30T14:36:16Z

To clarify, why do you think it should be null? not saying it shouldn't be, but is this in comparison to a standard or other DB?

maropu · 2019-10-30T15:05:21Z

I remember this behaivour is the same with hive array(): #18516 (comment)

amanomer · 2019-10-30T16:12:12Z

Ah... I see. That was done intentionally to match hive behavior. I will close this PR.
Thanks @srowen @maropu

We also need to update SPARK-29462

amanomer · 2019-10-30T16:12:43Z

cc @gengliangwang

gengliangwang · 2019-10-30T16:39:01Z

I think the proposal is reasonable. If no values are provided, then handling it as null type makes sense.
After #26107, we can't store string type into numeric type without explicit casting.
So, if we want to insert an empty array into a column of array<int>, simply using value array() will fail the analysis check.
Since Spark can insert null type to other types, the problem is resolved.
But the concern is that we should we do with map(), should we make it map<NullType, NullType>?

gengliangwang · 2019-10-30T17:33:01Z

ok to test

gengliangwang · 2019-10-30T17:33:30Z

@amanomer Could you add some test cases?

amanomer · 2019-10-30T17:44:04Z

Sure. I will add test cases.

SparkQA · 2019-10-30T20:32:51Z

Test build #112946 has finished for PR 26324 at commit 39ed807.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

amanomer · 2019-10-31T04:28:27Z

Test which ensures that empty array and map should be of string type is failing.

amanomer · 2019-10-31T04:29:24Z

@maropu

maropu · 2019-10-31T04:51:58Z

Can you remove the test? Anyway, I'm neutral on this change; I'm a bit worried that hive users get confused with this behaviour change. cc: @dongjoon-hyun @wangyum

amanomer · 2019-10-31T05:15:46Z

@gengliangwang @maropu kindly review

amanomer · 2019-10-31T05:42:16Z

Failure is not related. Retest this please.

SparkQA · 2019-10-31T07:05:02Z

Test build #112994 has finished for PR 26324 at commit 6fc258b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-10-31T18:43:44Z

retest this please.

SparkQA · 2019-10-31T22:32:42Z

Test build #113042 has finished for PR 26324 at commit 6fc258b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

amanomer · 2019-11-02T04:18:41Z

@gengliangwang Can you help in finding out why is this failing?
Thanks

gengliangwang · 2019-11-02T06:31:59Z

@maropu @srowen Just double confirm, are you ok with the proposal?

srowen · 2019-11-02T13:54:55Z

I agree that compatibility is a minor issue here. But it also sounds like this change allows things to work that did not before. I'm trying to think of cases where something works with the current behavior and not with the new... is there one?

maropu · 2019-11-02T23:19:17Z

yea, that's ok to change it. Because of the behaviour change as @srowen said, we might need to update the migration guide, too.

gengliangwang · 2019-11-03T06:33:14Z

I agree that compatibility is a minor issue here. But it also sounds like this change allows things to work that did not before. I'm trying to think of cases where something works with the current behavior and not with the new... is there one?

Previously array() can't be stored into array<array<?>> type since string value can't store into array type.
After the change in this PR, array() can be stored as array<array<?>> type, which also seems reasonable..

gengliangwang · 2019-11-03T06:37:25Z

yea, that's ok to change it. Because of the behaviour change as @srowen said, we might need to update the migration guide, too.

+1, @amanomer could you add migration guide?

amanomer · 2019-11-03T09:06:39Z

Yes, I will add.

srowen · 2019-11-03T12:55:57Z

I'm asking the reverse question - is there anything that works before this change but not after? I understand it makes something else work.

SparkQA · 2019-11-03T14:13:15Z

Test build #113169 has finished for PR 26324 at commit acc2864.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-11-03T20:44:30Z

I'm asking the reverse question - is there anything that works before this change but not after? I understand it makes something else work.

oh I see.. I can't think of any.

docs/sql-migration-guide.md

SparkQA · 2019-11-04T06:37:18Z

Test build #113188 has finished for PR 26324 at commit bb5e68a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

amanomer · 2019-11-04T07:04:57Z

cc @srowen @maropu

maropu · 2019-11-05T02:13:31Z

I think its better for more other reviewers (e.g., @dongjoon-hyun, @wangyum, ...) who's familiar with hive to check this before merging (I'm not sure about why hive regards array() as array<string>

HyukjinKwon · 2019-11-05T05:51:34Z

It's a tough call but ..

Since Spark can insert null type to other types, the problem is resolved.

This makes sense to me. I personally follow Hive whenever I am not sure but if we have another good coherent reason like ANSI standard, let's try to stick to that.

amanomer · 2019-11-06T09:25:16Z

@maropu @srowen @dongjoon-hyun kindly review this PR?

HyukjinKwon · 2019-11-06T09:39:22Z

Merged to master.

amanomer · 2019-11-06T10:19:39Z

Thanks all.

HyukjinKwon · 2019-11-13T04:11:00Z

Sorry, actually I think I have to revert this back for the reasons below:

per [SPARK-29462] The data type of "array()" should be array<null> #26324 (comment) we should think about what to do with map type together.
Seems like null type coercion depends and some DBMSes don't event have null types. Hive has it but Hive case was array<string>.
In case of PostgreSQL seems it coerces NULL literal to TEXT type.

Using null as default type might cause a corner case as below. Should we add void as NullType? needs some investigations.

CREATE TABLE weird_null USING JSON AS SELECT NULL AS null_col;
SHOW CREATE TABLE weird_null

CREATE TABLE `weird_null` (`null_col` NULL)
USING JSON

CREATE TABLE `weird_null_1` (`null_col` NULL)
USING JSON

Error in SQL statement: ParseException: 
DataType null is not supported.(line 1, pos 40)

== SQL ==
CREATE TABLE `weird_null_1` (`null_col` NULL)
-------------------------

This PR basically means a revert of [SPARK-21281][SQL] Use string types by default if array and map have no argument #18516 (comment) - let's don't make it complicated for now.

amanomer · 2019-11-13T05:27:52Z

@HyukjinKwon @maropu review this PR #26317 also.

cloud-fan · 2020-02-10T08:36:29Z

I don't agree we should revert it. The Hive behavior is really confusing and we shouldn't inherit it anymore. concat(array(), array(1)) returns array of string, which doesn't make sense. Presto returns array of int.

I think map() can be inferred as map<void, void>. The map key can be null type but can't be null, so this is fine as long as map<void, void> is empty.

For CTAS, I don't think it's related. If we can create table with null-type column using CTAS, why not allow array<void> as well?

cloud-fan · 2020-02-10T08:37:05Z

@Ngone51 can you help revert the revert?

HyukjinKwon · 2020-02-10T10:01:36Z

Okay, I am good to bring it back. Let me open a PR to revert of a revert.

During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type . When empty array is created, it should be declared as array<null>. No Tested manually Closes apache#26324 from amanomer/29462. Authored-by: Aman Omer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

HyukjinKwon · 2020-02-10T10:14:45Z

@cloud-fan, we didn't add void type yet BTW.

cloud-fan · 2020-02-10T10:16:08Z

null type is the same as void type, or unknown type. We can change the string representation of NullType if needed.

HyukjinKwon · 2020-02-10T10:18:32Z

Yeah, so my point was that I wonder if it's right to promote the type we're not supporting correctly.

### What changes were proposed in this pull request? This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <[email protected]> Co-authored-by: Aman Omer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

### What changes were proposed in this pull request? This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <[email protected]> Co-authored-by: Aman Omer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 0045be7) Signed-off-by: HyukjinKwon <[email protected]>

### What changes were proposed in this pull request? This brings apache#26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes apache#27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <[email protected]> Co-authored-by: Aman Omer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

amanomer closed this Oct 30, 2019

amanomer reopened this Oct 30, 2019

dongjoon-hyun added the SQL label Oct 30, 2019

amanomer added 2 commits October 31, 2019 11:33

[SPARK-29462] Initial commit

f35aac2

Updated previous test case and added new test case

b1473b9

gengliangwang reviewed Nov 3, 2019

View reviewed changes

docs/sql-migration-guide.md Outdated Show resolved Hide resolved

Handled review comments

bb5e68a

gengliangwang approved these changes Nov 4, 2019

View reviewed changes

HyukjinKwon approved these changes Nov 5, 2019

View reviewed changes

HyukjinKwon closed this in 0dcd739 Nov 6, 2019

HyukjinKwon mentioned this pull request Feb 10, 2020

[SPARK-29462][SQL] The data type of "array()" should be array<null> #27521

Closed

[SPARK-29462] The data type of "array()" should be array<null> #26324

[SPARK-29462] The data type of "array()" should be array<null> #26324

Uh oh!

Conversation

amanomer commented Oct 30, 2019

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

amanomer commented Oct 30, 2019

Uh oh!

srowen commented Oct 30, 2019

Uh oh!

maropu commented Oct 30, 2019

Uh oh!

amanomer commented Oct 30, 2019

Uh oh!

amanomer commented Oct 30, 2019

Uh oh!

gengliangwang commented Oct 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gengliangwang commented Oct 30, 2019

Uh oh!

gengliangwang commented Oct 30, 2019

Uh oh!

amanomer commented Oct 30, 2019

Uh oh!

SparkQA commented Oct 30, 2019

Uh oh!

amanomer commented Oct 31, 2019

Uh oh!

amanomer commented Oct 31, 2019

Uh oh!

maropu commented Oct 31, 2019

Uh oh!

amanomer commented Oct 31, 2019

Uh oh!

amanomer commented Oct 31, 2019

Uh oh!

SparkQA commented Oct 31, 2019

Uh oh!

gengliangwang commented Oct 31, 2019

Uh oh!

SparkQA commented Oct 31, 2019

Uh oh!

amanomer commented Nov 2, 2019

Uh oh!

gengliangwang commented Nov 2, 2019

Uh oh!

srowen commented Nov 2, 2019

Uh oh!

maropu commented Nov 2, 2019

Uh oh!

gengliangwang commented Nov 3, 2019

Uh oh!

gengliangwang commented Nov 3, 2019

Uh oh!

amanomer commented Nov 3, 2019

Uh oh!

srowen commented Nov 3, 2019

Uh oh!

SparkQA commented Nov 3, 2019

Uh oh!

gengliangwang commented Nov 3, 2019

Uh oh!

Uh oh!

SparkQA commented Nov 4, 2019

Uh oh!

amanomer commented Nov 4, 2019

Uh oh!

maropu commented Nov 5, 2019

Uh oh!

HyukjinKwon commented Nov 5, 2019

Uh oh!

amanomer commented Nov 6, 2019

Uh oh!

HyukjinKwon commented Nov 6, 2019

Uh oh!

gengliangwang commented Oct 30, 2019 •

edited

Loading

HyukjinKwon commented Nov 13, 2019 •

edited

Loading