[SPARK-29462] The data type of "array()" should be array<null> #26324
|
To clarify, why do you think it should be null? I'm not saying it shouldn't be, but is this in comparison to a standard or another DB? |
|
I remember this behavior is the same as Hive's array(): #18516 (comment) |
|
Ah... I see. That was done intentionally to match hive behavior. I will close this PR. We also need to update SPARK-29462 |
|
I think the proposal is reasonable. If no values are provided, then handling it as null type makes sense. |
|
ok to test |
|
@amanomer Could you add some test cases? |
|
Sure. I will add test cases. |
|
Test build #112946 has finished for PR 26324 at commit
|
|
The test that ensures empty arrays and maps are of string type is failing. |
|
Can you remove the test? Anyway, I'm neutral on this change; I'm a bit worried that hive users get confused with this behaviour change. cc: @dongjoon-hyun @wangyum |
|
@gengliangwang @maropu kindly review |
|
Failure is not related. Retest this please. |
|
Test build #112994 has finished for PR 26324 at commit
|
|
retest this please. |
|
Test build #113042 has finished for PR 26324 at commit
|
|
@gengliangwang Can you help find out why this is failing? |
|
I agree that compatibility is a minor issue here. But it also sounds like this change allows things to work that did not before. I'm trying to think of cases where something works with the current behavior and not with the new... is there one? |
|
yea, that's ok to change it. Because of the behaviour change as @srowen said, we might need to update the migration guide, too. |
Previously |
|
Yes, I will add. |
|
I'm asking the reverse question - is there anything that works before this change but not after? I understand it makes something else work. |
|
Test build #113169 has finished for PR 26324 at commit
|
oh I see.. I can't think of any. |
|
Test build #113188 has finished for PR 26324 at commit
|
|
I think it's better for more reviewers (e.g., @dongjoon-hyun, @wangyum, ...) who are familiar with Hive to check this before merging (I'm not sure about why hive regards |
|
It's a tough call but ..
This makes sense to me. I personally follow Hive whenever I am not sure but if we have another good coherent reason like ANSI standard, let's try to stick to that. |
|
@maropu @srowen @dongjoon-hyun kindly review this PR? |
|
Merged to master. |
|
Thanks all. |
|
Sorry, actually I think I have to revert this back for the reasons below:
|
|
@HyukjinKwon @maropu review this PR #26317 also. |
|
I don't agree we should revert it. The Hive behavior is really confusing and we shouldn't inherit it anymore. I think For CTAS, I don't think it's related. If we can create table with null-type column using CTAS, why not allow |
|
@Ngone51 can you help revert the revert? |
|
Okay, I am good to bring it back. Let me open a PR to revert the revert. |
### What changes were proposed in this pull request?

During creation of an array, if CreateArray does not get any children to set the array's data type, it will create an array of null type.

### Why are the changes needed?

When an empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Tested manually

Closes apache#26324 from amanomer/29462.

Authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
|
@cloud-fan, we didn't add |
|
null type is the same as void type, or unknown type. We can change the string representation of NullType if needed. |
|
Yeah, so my point was that I wonder if it's right to promote the type we're not supporting correctly. |
### What changes were proposed in this pull request?

This brings #26324 back. It was reverted mainly because of Hive compatibility and the lack of investigation into other DBMSes and ANSI.

- PostgreSQL seems to coerce a NULL literal to the TEXT type.
- Presto seems to coerce `array() + array(1)` to an array of int.
- Hive seems to coerce `array() + array(1)` to an array of strings.

Given that, the design choices have been made differently for some reasons. If we pick one of the two, coercing to an array of int seems to make much more sense.

Another investigation was made offline internally. ANSI SQL 2011, section 6.5 "<contextually typed value specification>", seems to state:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

Reading the other related context, this appears to correspond to `NullType`. Given the investigation made, choosing `null` seems correct, and we now have Presto as a reference. Therefore, this PR proposes to bring it back.

### Why are the changes needed?

When an empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?

Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?

Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
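The user-facing difference described in this commit message can be illustrated with a small, dependency-free Python model. This is only a sketch and not Spark's actual coercion code: the names `NULL`, `widen`, and `concat_type` are hypothetical, and the widening rules are simplified assumptions chosen to mirror the behavior reported in the thread (null gives way to the other side's type, and a string mixed with another type becomes string).

```python
from typing import Optional

NULL = "null"  # stands in for Spark's NullType in this toy model


def widen(left: str, right: str) -> Optional[str]:
    """Return a common element type for two element types, or None.

    Hypothetical rules, for illustration only:
    - identical types stay as they are;
    - the null type gives way to whatever the other side is;
    - a string mixed with another type yields string.
    """
    if left == right:
        return left
    if left == NULL:
        return right
    if right == NULL:
        return left
    if "string" in (left, right):
        return "string"
    return None


def concat_type(left_elem: str, right_elem: str) -> str:
    """Type of concatenating two arrays with the given element types."""
    common = widen(left_elem, right_elem)
    if common is None:
        raise TypeError(f"no common type for {left_elem} and {right_elem}")
    return f"array<{common}>"


# array(1) + array() when the empty array is typed array<null>
new_behavior = concat_type("int", NULL)
# array(1) + array() when the empty array is typed array<string> (Hive-compatible)
old_behavior = concat_type("int", "string")
print(new_behavior)  # array<int>
print(old_behavior)  # array<string>
```

In this model, the null element type acts as a neutral element, so the concatenation keeps the int elements. A string-typed empty array instead forces the result to strings, which matches the `array("1")` result mentioned above.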
(cherry picked from commit 0045be7) Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
During creation of an array, if CreateArray does not get any children to set the array's data type, it will create an array of null type.
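The rule stated above can be sketched in plain Python. This is a simplified model under stated assumptions, not the CreateArray implementation: `ArrayType`, `create_array_type`, and the `empty_default` parameter are hypothetical names, and the model assumes all children already share one type.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ArrayType:
    """Toy array type that only tracks its element type by name."""
    element_type: str

    def simple_string(self) -> str:
        return f"array<{self.element_type}>"


def create_array_type(child_types: Sequence[str],
                      empty_default: str = "null") -> ArrayType:
    """Pick the array type from the children's types.

    With no children there is nothing to infer the element type from,
    so the element type falls back to `empty_default` (hypothetical
    parameter; "null" models the behavior proposed in this PR).
    """
    if not child_types:
        return ArrayType(empty_default)
    first = child_types[0]
    # Assumption of this sketch: children were already coerced to one type.
    if any(t != first for t in child_types):
        raise ValueError("toy model expects children of a single type")
    return ArrayType(first)


proposed = create_array_type([]).simple_string()
previous = create_array_type([], empty_default="string").simple_string()
non_empty = create_array_type(["int", "int"]).simple_string()
print(proposed, previous, non_empty)
```

Passing `empty_default="string"` models the earlier Hive-compatible choice discussed in the conversation, so the two behaviors can be compared side by side.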
Why are the changes needed?
When an empty array is created, it should be declared as array<null>.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested manually