Skip to content

Commit 0045be7

Browse files
HyukjinKwonamanomer
andcommitted
[SPARK-29462][SQL] The data type of "array()" should be array<null>
### What changes were proposed in this pull request? This brings #26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <[email protected]> Co-authored-by: Aman Omer <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
1 parent 2bc765a commit 0045be7

File tree

4 files changed

+34
-5
lines changed

4 files changed

+34
-5
lines changed

docs/sql-migration-guide.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -215,6 +215,8 @@ license: |
215215
For example `SELECT timestamp 'tomorrow';`.
216216

217217
- Since Spark 3.0, the `size` function returns `NULL` for the `NULL` input. In Spark version 2.4 and earlier, this function gives `-1` for the same input. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.sizeOfNull` to `true`.
218+
219+
- Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.
218220

219221
- Since Spark 3.0, the interval literal syntax does not allow multiple from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH'` throws parser exception.
220222

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
2323
import org.apache.spark.sql.catalyst.expressions.codegen._
2424
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
2525
import org.apache.spark.sql.catalyst.util._
26+
import org.apache.spark.sql.internal.SQLConf
2627
import org.apache.spark.sql.types._
2728
import org.apache.spark.unsafe.types.UTF8String
2829

@@ -44,10 +45,18 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
4445
TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName")
4546
}
4647

48+
private val defaultElementType: DataType = {
49+
if (SQLConf.get.getConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING)) {
50+
StringType
51+
} else {
52+
NullType
53+
}
54+
}
55+
4756
override def dataType: ArrayType = {
4857
ArrayType(
4958
TypeCoercion.findCommonTypeDifferentOnlyInNullFlags(children.map(_.dataType))
50-
.getOrElse(StringType),
59+
.getOrElse(defaultElementType),
5160
containsNull = children.exists(_.nullable))
5261
}
5362

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2007,6 +2007,15 @@ object SQLConf {
20072007
.booleanConf
20082008
.createWithDefault(false)
20092009

2010+
val LEGACY_ARRAY_DEFAULT_TO_STRING =
2011+
buildConf("spark.sql.legacy.arrayDefaultToStringType.enabled")
2012+
.internal()
2013+
.doc("When set to true, it returns an empty array of string type when the `array` " +
2014+
"function is called without any parameters. Otherwise, it returns an empty " +
2015+
"array of `NullType`")
2016+
.booleanConf
2017+
.createWithDefault(false)
2018+
20102019
val TRUNCATE_TABLE_IGNORE_PERMISSION_ACL =
20112020
buildConf("spark.sql.truncateTable.ignorePermissionAcl.enabled")
20122021
.internal()

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

Lines changed: 13 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3499,12 +3499,9 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
34993499
).foreach(assertValuesDoNotChangeAfterCoalesceOrUnion(_))
35003500
}
35013501

3502-
test("SPARK-21281 use string types by default if array and map have no argument") {
3502+
test("SPARK-21281 use string types by default if map have no argument") {
35033503
val ds = spark.range(1)
35043504
var expectedSchema = new StructType()
3505-
.add("x", ArrayType(StringType, containsNull = false), nullable = false)
3506-
assert(ds.select(array().as("x")).schema == expectedSchema)
3507-
expectedSchema = new StructType()
35083505
.add("x", MapType(StringType, StringType, valueContainsNull = false), nullable = false)
35093506
assert(ds.select(map().as("x")).schema == expectedSchema)
35103507
}
@@ -3577,6 +3574,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
35773574
}.getMessage
35783575
assert(nonFoldableError.contains("The 'escape' parameter must be a string literal"))
35793576
}
3577+
3578+
test("SPARK-29462: Empty array of NullType for array function with no arguments") {
3579+
Seq((true, StringType), (false, NullType)).foreach {
3580+
case (arrayDefaultToString, expectedType) =>
3581+
withSQLConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING.key -> arrayDefaultToString.toString) {
3582+
val schema = spark.range(1).select(array()).schema
3583+
assert(schema.nonEmpty && schema.head.dataType.isInstanceOf[ArrayType])
3584+
val actualType = schema.head.dataType.asInstanceOf[ArrayType].elementType
3585+
assert(actualType === expectedType)
3586+
}
3587+
}
3588+
}
35803589
}
35813590

35823591
object DataFrameFunctionsSuite {

0 commit comments

Comments
 (0)