Skip to content

Conversation

@amanomer
Copy link
Contributor

What changes were proposed in this pull request?

During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type .

Why are the changes needed?

When empty array is created, it should be declared as array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested manually

@amanomer
Copy link
Contributor Author

cc @srowen @MaxGekk @dongjoon-hyun

@srowen
Copy link
Member

srowen commented Oct 30, 2019

To clarify, why do you think it should be null? not saying it shouldn't be, but is this in comparison to a standard or other DB?

@maropu
Copy link
Member

maropu commented Oct 30, 2019

I remember this behaivour is the same with hive array(): #18516 (comment)

@amanomer
Copy link
Contributor Author

Ah... I see. That was done intentionally to match hive behavior. I will close this PR.
Thanks @srowen @maropu

We also need to update SPARK-29462

@amanomer
Copy link
Contributor Author

cc @gengliangwang

@amanomer amanomer closed this Oct 30, 2019
@gengliangwang
Copy link
Member

gengliangwang commented Oct 30, 2019

I think the proposal is reasonable. If no values are provided, then handling it as null type makes sense.
After #26107, we can't store string type into numeric type without explicit casting.
So, if we want to insert an empty array into a column of array<int>, simply using value array() will fail the analysis check.
Since Spark can insert null type to other types, the problem is resolved.
But the concern is that we should we do with map(), should we make it map<NullType, NullType>?

@amanomer amanomer reopened this Oct 30, 2019
@gengliangwang
Copy link
Member

ok to test

@gengliangwang
Copy link
Member

@amanomer Could you add some test cases?

@amanomer
Copy link
Contributor Author

Sure. I will add test cases.

@SparkQA
Copy link

SparkQA commented Oct 30, 2019

Test build #112946 has finished for PR 26324 at commit 39ed807.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Copy link
Contributor Author

Test which ensures that empty array and map should be of string type is failing.

@amanomer
Copy link
Contributor Author

@maropu

@maropu
Copy link
Member

maropu commented Oct 31, 2019

Can you remove the test? Anyway, I'm neutral on this change; I'm a bit worried that hive users get confused with this behaviour change. cc: @dongjoon-hyun @wangyum

@amanomer
Copy link
Contributor Author

@gengliangwang @maropu kindly review

@amanomer
Copy link
Contributor Author

Failure is not related. Retest this please.

@SparkQA
Copy link

SparkQA commented Oct 31, 2019

Test build #112994 has finished for PR 26324 at commit 6fc258b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

retest this please.

@SparkQA
Copy link

SparkQA commented Oct 31, 2019

Test build #113042 has finished for PR 26324 at commit 6fc258b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Copy link
Contributor Author

amanomer commented Nov 2, 2019

@gengliangwang Can you help in finding out why is this failing?
Thanks

@gengliangwang
Copy link
Member

@maropu @srowen Just double confirm, are you ok with the proposal?

@srowen
Copy link
Member

srowen commented Nov 2, 2019

I agree that compatibility is a minor issue here. But it also sounds like this change allows things to work that did not before. I'm trying to think of cases where something works with the current behavior and not with the new... is there one?

@maropu
Copy link
Member

maropu commented Nov 2, 2019

yea, that's ok to change it. Because of the behaviour change as @srowen said, we might need to update the migration guide, too.

@gengliangwang
Copy link
Member

I agree that compatibility is a minor issue here. But it also sounds like this change allows things to work that did not before. I'm trying to think of cases where something works with the current behavior and not with the new... is there one?

Previously array() can't be stored into array<array<?>> type since string value can't store into array type.
After the change in this PR, array() can be stored as array<array<?>> type, which also seems reasonable..

@gengliangwang
Copy link
Member

yea, that's ok to change it. Because of the behaviour change as @srowen said, we might need to update the migration guide, too.

+1, @amanomer could you add migration guide?

@amanomer
Copy link
Contributor Author

amanomer commented Nov 3, 2019

Yes, I will add.

@srowen
Copy link
Member

srowen commented Nov 3, 2019

I'm asking the reverse question - is there anything that works before this change but not after? I understand it makes something else work.

@SparkQA
Copy link

SparkQA commented Nov 3, 2019

Test build #113169 has finished for PR 26324 at commit acc2864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

I'm asking the reverse question - is there anything that works before this change but not after? I understand it makes something else work.

oh I see.. I can't think of any.

@SparkQA
Copy link

SparkQA commented Nov 4, 2019

Test build #113188 has finished for PR 26324 at commit bb5e68a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer
Copy link
Contributor Author

amanomer commented Nov 4, 2019

cc @srowen @maropu

@maropu
Copy link
Member

maropu commented Nov 5, 2019

I think its better for more other reviewers (e.g., @dongjoon-hyun, @wangyum, ...) who's familiar with hive to check this before merging (I'm not sure about why hive regards array() as array<string>

@HyukjinKwon
Copy link
Member

It's a tough call but ..

Since Spark can insert null type to other types, the problem is resolved.

This makes sense to me. I personally follow Hive whenever I am not sure but if we have another good coherent reason like ANSI standard, let's try to stick to that.

@amanomer
Copy link
Contributor Author

amanomer commented Nov 6, 2019

@maropu @srowen @dongjoon-hyun kindly review this PR?

@HyukjinKwon
Copy link
Member

Merged to master.

@amanomer
Copy link
Contributor Author

amanomer commented Nov 6, 2019

Thanks all.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Nov 13, 2019

Sorry, actually I think I have to revert this back for the reasons below:

  1. per [SPARK-29462] The data type of "array()" should be array<null> #26324 (comment) we should think about what to do with map type together.

  2. Seems like null type coercion depends and some DBMSes don't event have null types. Hive has it but Hive case was array<string>.

  3. In case of PostgreSQL seems it coerces NULL literal to TEXT type.

  4. Using null as default type might cause a corner case as below. Should we add void as NullType? needs some investigations.

    CREATE TABLE weird_null USING JSON AS SELECT NULL AS null_col;
    SHOW CREATE TABLE weird_null
    CREATE TABLE `weird_null` (`null_col` NULL)
    USING JSON
    
    CREATE TABLE `weird_null_1` (`null_col` NULL)
    USING JSON
    Error in SQL statement: ParseException: 
    DataType null is not supported.(line 1, pos 40)
    
    == SQL ==
    CREATE TABLE `weird_null_1` (`null_col` NULL)
    -------------------------
    
  5. This PR basically means a revert of [SPARK-21281][SQL] Use string types by default if array and map have no argument #18516 (comment) - let's don't make it complicated for now.

@amanomer
Copy link
Contributor Author

@HyukjinKwon @maropu review this PR #26317 also.

@cloud-fan
Copy link
Contributor

I don't agree we should revert it. The Hive behavior is really confusing and we shouldn't inherit it anymore. concat(array(), array(1)) returns array of string, which doesn't make sense. Presto returns array of int.

I think map() can be inferred as map<void, void>. The map key can be null type but can't be null, so this is fine as long as map<void, void> is empty.

For CTAS, I don't think it's related. If we can create table with null-type column using CTAS, why not allow array<void> as well?

@cloud-fan
Copy link
Contributor

@Ngone51 can you help revert the revert?

@HyukjinKwon
Copy link
Member

Okay, I am good to bring it back. Let me open a PR to revert of a revert.

HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Feb 10, 2020
During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type .

When empty array is created, it should be declared as array<null>.

No

Tested manually

Closes apache#26324 from amanomer/29462.

Authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Copy link
Member

@cloud-fan, we didn't add void type yet BTW.

@cloud-fan
Copy link
Contributor

null type is the same as void type, or unknown type. We can change the string representation of NullType if needed.

@HyukjinKwon
Copy link
Member

Yeah, so my point was that I wonder if it's right to promote the type we're not supporting correctly.

HyukjinKwon added a commit that referenced this pull request Feb 11, 2020
### What changes were proposed in this pull request?

This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems  `array() + array(1)` -> array of strings

 Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Feb 11, 2020
### What changes were proposed in this pull request?

This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems  `array() + array(1)` -> array of strings

 Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0045be7)
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This brings apache#26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI.

- In case of PostgreSQL seems coercing NULL literal to TEXT type.
- Presto seems coercing `array() + array(1)` -> array of int.
- Hive seems  `array() + array(1)` -> array of strings

 Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes apache#27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants