Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Feb 10, 2020

What changes were proposed in this pull request?

This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

  • In case of PostgreSQL seems coercing NULL literal to TEXT type.
  • Presto seems coercing array() + array(1) -> array of int.
  • Hive seems array() + array(1) -> array of strings

Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "" states:

If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:

a) If ES simply contains ARRAY, then ET ARRAY[0].

b) If ES simply contains MULTISET, then ET MULTISET.

ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to NullType. Given the investigation made, choosing to null seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

Why are the changes needed?

When empty array is created, it should be declared as array.

Does this PR introduce any user-facing change?

Yes, array() creates array<null>. Now array(1) + array() can correctly create array(1) instead of array("1").

How was this patch tested?

Tested manually

During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type .

When empty array is created, it should be declared as array<null>.

No

Tested manually

Closes apache#26324 from amanomer/29462.

Authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Copy link
Member Author

@SparkQA
Copy link

SparkQA commented Feb 10, 2020

Test build #118153 has finished for PR 27521 at commit 130f808.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 10, 2020

Test build #118151 has finished for PR 27521 at commit 5b7fad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29462] The data type of "array()" should be array<null> [SPARK-29462][SQL] The data type of "array()" should be array<null> Feb 10, 2020
@SparkQA
Copy link

SparkQA commented Feb 10, 2020

Test build #118170 has finished for PR 27521 at commit 130f808.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 11, 2020

Test build #118195 has finished for PR 27521 at commit 90b4660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Thank you @maropu.

Merged to master and branch-3.0.

HyukjinKwon added a commit that referenced this pull request Feb 11, 2020
### What changes were proposed in this pull request?

This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems  `array() + array(1)` -> array of strings

 Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0045be7)
Signed-off-by: HyukjinKwon <[email protected]>
@cloud-fan
Copy link
Contributor

to be consistent, I think we should do the same thing for map. @Ngone51 can you help with it?

@iRakson
Copy link
Contributor

iRakson commented Feb 11, 2020

@cloud-fan I will raise PR for map.

cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes it will return map<null,null>

### Why are the changes needed?
After changes introduced due to #27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 13, 2020
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes it will return map<null,null>

### Why are the changes needed?
After changes introduced due to #27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 926e3a1)
Signed-off-by: Wenchen Fan <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-29462 branch March 3, 2020 01:16
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This brings apache#26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems  `array() + array(1)` -> array of strings

 Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes apache#27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes it will return map<null,null>

### Why are the changes needed?
After changes introduced due to apache#27521, it is important to maintain consistency while using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes apache#27542 from iRakson/SPARK-30790.

Authored-by: iRakson <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants