Commit b39b721

sadikovi authored and MaxGekk committed
[SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs
### What changes were proposed in this pull request?

This PR fixes an issue in the `arrays_zip` function where a field index was used as a column name in the resulting schema, which was a regression from Spark 3.1. With this change, the original behaviour is restored: the corresponding struct field name is used instead of a field index.

Example:

```sql
with q as (
  select
    named_struct(
      'my_array', array(1, 2, 3),
      'my_array2', array(4, 5, 6)
    ) as my_struct
)
select arrays_zip(my_struct.my_array, my_struct.my_array2)
from q
```

would return the schema:

```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- 0: integer (nullable = true)
 |    |    |-- 1: integer (nullable = true)
```

which is somewhat inaccurate. The PR adds handling of the `GetStructField` expression so that the struct field names are returned instead:

```
root
 |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- my_array: integer (nullable = true)
 |    |    |-- my_array2: integer (nullable = true)
```

### Why are the changes needed?

The change fixes a regression from Spark 3.1 where `arrays_zip` reported positional indices instead of struct field names in its result schema.

### Does this PR introduce _any_ user-facing change?

Yes, the `arrays_zip` function now returns struct field names, as in Spark 3.1, instead of field indices. Some users might have worked around this issue, so this patch would affect them by bringing back the original behaviour.

### How was this patch tested?

Existing unit tests. A test case that reproduces the problem was also added.

Closes #37833 from sadikovi/SPARK-40292.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit 443eea9)
Signed-off-by: Max Gekk <[email protected]>
1 parent 8a882d5 commit b39b721

File tree

2 files changed (+20, -0 lines)


sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

Lines changed: 1 addition & 0 deletions

```diff
@@ -191,6 +191,7 @@ case class ArraysZip(children: Seq[Expression], names: Seq[Expression])
       case (u: UnresolvedAttribute, _) => Literal(u.nameParts.last)
       case (e: NamedExpression, _) if e.resolved => Literal(e.name)
       case (e: NamedExpression, _) => NamePlaceholder
+      case (e: GetStructField, _) => Literal(e.extractFieldName)
       case (_, idx) => Literal(idx.toString)
     })
   }
```
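The name-resolution fallback in the diff above can be sketched as a small self-contained model. Note this is a toy illustration, not Spark's actual code: the types below are simplified stand-ins for Catalyst's `Expression` hierarchy, and `extractFieldName` is modelled as a plain ordinal lookup into the struct's field names.

```scala
// Toy model of how ArraysZip picks result field names (hypothetical types,
// simplified from Catalyst's Expression hierarchy).
sealed trait Expr
case class UnresolvedAttribute(nameParts: Seq[String]) extends Expr
case class NamedExpr(name: String) extends Expr
// Models GetStructField: knows the parent struct's field names and an ordinal.
case class GetStructField(structFields: Seq[String], ordinal: Int) extends Expr {
  def extractFieldName: String = structFields(ordinal)
}
case object Other extends Expr

// Mirrors the match in the patched code: named sources win, then struct
// field names (the fix), and only then the positional index as a fallback.
def resultFieldNames(children: Seq[Expr]): Seq[String] =
  children.zipWithIndex.map {
    case (u: UnresolvedAttribute, _) => u.nameParts.last
    case (e: NamedExpr, _)           => e.name
    case (g: GetStructField, _)      => g.extractFieldName  // the SPARK-40292 fix
    case (_, idx)                    => idx.toString        // pre-fix behaviour
  }

val names = resultFieldNames(Seq(
  GetStructField(Seq("my_array", "my_array2"), 0),
  GetStructField(Seq("my_array", "my_array2"), 1)
))
println(names)  // List(my_array, my_array2)
```

Without the `GetStructField` case, both children would fall through to the last branch and the result fields would be named `0` and `1`, which is exactly the regression this commit fixes.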

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

Lines changed: 19 additions & 0 deletions

```diff
@@ -622,6 +622,25 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
     }
   }

+  test("SPARK-40292: arrays_zip should retain field names in nested structs") {
+    val df = spark.sql("""
+      select
+        named_struct(
+          'arr_1', array(named_struct('a', 1, 'b', 2)),
+          'arr_2', array(named_struct('p', 1, 'q', 2)),
+          'field', named_struct(
+            'arr_3', array(named_struct('x', 1, 'y', 2))
+          )
+        ) as obj
+      """)
+
+    val res = df.selectExpr("arrays_zip(obj.arr_1, obj.arr_2, obj.field.arr_3) as arr")
+
+    val fieldNames = res.schema.head.dataType.asInstanceOf[ArrayType]
+      .elementType.asInstanceOf[StructType].fieldNames
+    assert(fieldNames.toSeq === Seq("arr_1", "arr_2", "arr_3"))
+  }
+
   def testSizeOfMap(sizeOfNull: Any): Unit = {
     val df = Seq(
       (Map[Int, Int](1 -> 1, 2 -> 2), "x"),
```

0 commit comments
