[SPARK-29462][SQL] The data type of "array()" should be array<null> #27521
During array creation, if `CreateArray` does not get any children from which to set the array's data type, it creates an array of null type. When an empty array is created, it should be declared as array<null>.
Does this introduce any user-facing change? No.
How was it tested? Tested manually.
Closes apache#26324 from amanomer/29462.
Authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
cc @cloud-fan, @maropu, @amanomer, @gengliangwang
Test build #118153 has finished for PR 27521 at commit
retest this please
Test build #118151 has finished for PR 27521 at commit
Test build #118170 has finished for PR 27521 at commit
Force-pushed from c74331f to 90b4660.
Test build #118195 has finished for PR 27521 at commit
Thank you @maropu. Merged to master and branch-3.0.
### What changes were proposed in this pull request?

This brings #26324 back. It was reverted mainly because of Hive compatibility and the lack of investigation into other DBMSes and ANSI.

- PostgreSQL seems to coerce a NULL literal to TEXT type.
- Presto seems to coerce `array() + array(1)` to an array of int.
- Hive seems to coerce `array() + array(1)` to an array of strings.

Given that, different systems have made different design choices for their own reasons. If we pick one of the two array behaviors, coercing to an array of int makes much more sense.

Another investigation was made offline internally. ANSI SQL 2011, section 6.5 "<contextually typed value specification>", seems to state:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, typing the element as `NullType` seems reasonable. Given the investigation made, choosing `null` seems correct, and we now have Presto as a reference. Therefore, this PR proposes to bring it back.

### Why are the changes needed?

When an empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?

Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?

Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0045be7)
Signed-off-by: HyukjinKwon <[email protected]>
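The user-facing change described above can be sketched with a toy Python model of element-type coercion when two arrays are combined. This is not Spark's implementation: the names `NULL`, `INT`, `STRING`, `widen`, and `concat_element_type` are hypothetical, and the rule that int and string fall back to string is an assumption made only to mirror the `array("1")` result mentioned in the description, which suggests that `array()` previously defaulted to a string element type.

```python
# Toy model of array element-type coercion; NOT Spark's actual code.
# All names (NULL, INT, STRING, widen, concat_element_type) are hypothetical.

NULL, INT, STRING = "null", "int", "string"


def widen(a: str, b: str) -> str:
    """Return a common element type for a and b under the assumed rules."""
    if a == b:
        return a
    # The null type carries no information, so it yields to the other side.
    if a == NULL:
        return b
    if b == NULL:
        return a
    # Assumption: mixing int and string falls back to string.
    if {a, b} == {INT, STRING}:
        return STRING
    raise TypeError(f"no common type for {a} and {b}")


def concat_element_type(left: str, right: str) -> str:
    """Element type when array<left> is combined with array<right>."""
    return widen(left, right)


# Assumed old behavior: array() is typed as array<string>.
old_result = concat_element_type(INT, STRING)
# New behavior: array() is typed as array<null>.
new_result = concat_element_type(INT, NULL)

print(old_result)  # string
print(new_result)  # int
```

In this model the null element type acts as an identity for widening, which is why an empty array no longer forces the other operand's elements to become strings.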
to be consistent, I think we should do the same thing for map. @Ngone51 can you help with it?
@cloud-fan I will raise PR for map.
### What changes were proposed in this pull request?
`spark.sql("select map()")` returns {}.
After these changes it will return map<null,null>
### Why are the changes needed?
After the changes introduced in #27521, `map()` should behave consistently with `array()`.
### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.
### How was this patch tested?
UT added. Migration guide updated as well
Closes #27542 from iRakson/SPARK-30790.
Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 926e3a1)
Signed-off-by: Wenchen Fan <[email protected]>
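As a rough illustration of the map change, the following toy Python sketch types a map constructor from the types of its arguments, so that an empty call yields `map<null,null>`. It is not Spark's implementation; `common_type` and `infer_map_type` are hypothetical helpers, and for simplicity the model requires all keys (and all values) to share one type rather than applying any widening.

```python
# Toy model of typing a map constructor; NOT Spark's actual code.
# common_type and infer_map_type are hypothetical helpers.

def common_type(types):
    """Common type of a sequence of type names; empty input gives 'null'."""
    distinct = set(types)
    if not distinct:
        # No arguments to infer from, so fall back to the null type.
        return "null"
    if len(distinct) == 1:
        return distinct.pop()
    raise TypeError(f"no common type for {sorted(distinct)}")


def infer_map_type(*arg_types):
    """Arguments alternate key type, value type, key type, value type, ..."""
    if len(arg_types) % 2 != 0:
        raise ValueError("map() expects an even number of arguments")
    key_types = arg_types[0::2]
    value_types = arg_types[1::2]
    return f"map<{common_type(key_types)},{common_type(value_types)}>"


print(infer_map_type())                 # map<null,null>
print(infer_map_type("int", "string"))  # map<int,string>
```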