[SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID
### What changes were proposed in this pull request?
This PR intends to support 32 or more grouping attributes for `GROUPING_ID`. In the current master, an integer overflow can occur when computing grouping IDs:
https://github.com/apache/spark/blob/e75d9afb2f282ce79c9fd8bce031287739326a4f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L613
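To make the failure mode concrete, here is a minimal sketch of the overflow (a paraphrase of the linked bitmask computation, not the exact Spark source). Scala's `Int` shift takes the shift amount modulo 32, so the "all attributes" mask collapses to 0 once there are 32 grouping attributes:
```scala
// Paraphrased sketch of the Int-based bitmask (not the exact Spark code):
val numAttributes = 32
val intMask = (1 << numAttributes) - 1   // 1 << 32 == 1 for Int (shift is mod 32), so intMask == 0
val longMask = (1L << numAttributes) - 1 // 4294967295L: correct once a 64-bit Long is used
```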
For example, the query below generates wrong grouping IDs on master:
```
scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")
// numCols = 32
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0 |3 |
|0 |3 | // Wrong Grouping ID
+-------------+------+
// numCols = 31
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0 |3 |
|1 |3 |
+-------------+------+
```
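For intuition, a grouping ID is a bit vector over the GROUP BY columns: 0 if a column is grouped, 1 if it is not, with the first column in the most significant position. The following simplified model (not Spark's implementation) reproduces the expected IDs in the `numCols = 31` output above:
```scala
// Simplified model of grouping_id(): one bit per GROUP BY column,
// 0 = grouped, 1 = not grouped; the first column is the most significant bit.
def groupingId(numCols: Int, grouped: Set[Int]): Long =
  (0 until numCols).foldLeft(0L) { (id, i) =>
    (id << 1) | (if (grouped.contains(i)) 0L else 1L)
  }

groupingId(31, (0 until 31).toSet) // grouping set (c0, ..., c30) -> 0
groupingId(31, (0 until 30).toSet) // grouping set (c0, ..., c29) -> 1
```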
To fix this issue, this PR changes the code to use long values for `GROUPING_ID` instead of int values.
### Why are the changes needed?
To support more cases in `GROUPING_ID`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes apache#26918 from maropu/FixGroupingIdIssue.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
docs/sql-migration-guide.md (3 additions, 0 deletions):
```diff
@@ -22,6 +22,9 @@ license: |
 * Table of contents
 {:toc}
 
+## Upgrading from Spark SQL 3.0 to 3.1
+- Since Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and earlier, this function returns int values. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.
+
 ## Upgrading from Spark SQL 2.4 to 3.0
 - Since Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean` are disallowed. A runtime exception will be thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and earlier, type conversions during table insertion are allowed as long as they are valid `Cast`. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of byte type, the result is 1. The behavior is controlled by the option `spark.sql.storeAssignmentPolicy`, with a default value as "ANSI". Setting the option as "Legacy" restores the previous behavior.
```
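For reference, a hedged sketch of opting back into the old behavior via the legacy flag documented above (the table and column names here are made up for illustration; only the config key comes from the migration guide entry):
```scala
// Hypothetical illustration: restoring Int-typed grouping IDs with the legacy flag.
// Assumes an existing SparkSession `spark` and a table `t` with a column `c0`.
spark.conf.set("spark.sql.legacy.integerGroupingId", "true")
spark.sql("SELECT grouping_id() FROM t GROUP BY GROUPING SETS ((c0), ())").printSchema()
// With the flag set, grouping_id() should be typed as int again; unset (the default), it is long.
```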