Commit 7a78f5e

yaooqinn authored and cloud-fan committed
[SPARK-34130][SQL] Improve performance for char varchar padding and length check with StaticInvoke
### What changes were proposed in this pull request?

This could reduce the `generate.java` size to prevent codegen fallback, which causes performance regression.

Here is a case from TPC-DS that could be fixed by this improvement: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generates 20K bytes; we are trying to reduce it to less than 8K.

### Why are the changes needed?

Performance improvement: as in the PR benchmark test, performance with codegen is 2~3x better than without codegen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Yes. It is a code refactoring, so the existing unit tests should be enough.

Cross-checked with #31012, where the TPC-DS tests shall all pass.

Benchmark compared with master:

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X


================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X
```

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6fa2fb9)
Signed-off-by: Wenchen Fan <[email protected]>
1 parent 9562629 commit 7a78f5e

4 files changed: +318 -39 lines changed

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class CharVarcharCodegenUtils {
+  private static final UTF8String SPACE = UTF8String.fromString(" ");
+
+  /**
+   * Trailing spaces do not count in the length check. We don't need to retain the trailing
+   * spaces, as we will pad char type columns/fields at read time.
+   */
+  public static UTF8String charTypeWriteSideCheck(UTF8String inputStr, int limit) {
+    if (inputStr == null) {
+      return null;
+    } else {
+      UTF8String trimmed = inputStr.trimRight();
+      if (trimmed.numChars() > limit) {
+        throw new RuntimeException("Exceeds char type length limitation: " + limit);
+      }
+      return trimmed;
+    }
+  }
+
+  public static UTF8String charTypeReadSideCheck(UTF8String inputStr, int limit) {
+    if (inputStr == null) return null;
+    if (inputStr.numChars() > limit) {
+      throw new RuntimeException("Exceeds char type length limitation: " + limit);
+    }
+    return inputStr.rpad(limit, SPACE);
+  }
+
+  public static UTF8String varcharTypeWriteSideCheck(UTF8String inputStr, int limit) {
+    if (inputStr != null && inputStr.numChars() <= limit) {
+      return inputStr;
+    } else if (inputStr != null) {
+      // Trailing spaces do not count in the length check. We need to retain the trailing spaces
+      // (truncate to length N), as there is no read-time padding for varchar type.
+      // TODO: create a special TrimRight function that can trim to a certain length.
+      UTF8String trimmed = inputStr.trimRight();
+      if (trimmed.numChars() > limit) {
+        throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
+      }
+      return inputStr.substring(0, limit);
+    } else {
+      return null;
+    }
+  }
+
+  public static UTF8String varcharTypeReadSideCheck(UTF8String inputStr, int limit) {
+    if (inputStr != null && inputStr.numChars() > limit) {
+      throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
+    }
+    return inputStr;
+  }
+}
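
The trimming, padding, and truncation rules in the helpers above can be tried out without a Spark dependency. The following is a minimal sketch on `java.lang.String`, not the committed code: `CharVarcharSketch` and its method names are hypothetical, and it assumes that `UTF8String.trimRight()` removes only trailing ASCII spaces and that `numChars()` counts code points.

```java
/** Plain-String analogue of the four checks; a sketch, not Spark code. */
final class CharVarcharSketch {
  // Assumption: only trailing U+0020 characters are stripped.
  static String trimRight(String s) {
    int end = s.length();
    while (end > 0 && s.charAt(end - 1) == ' ') end--;
    return s.substring(0, end);
  }

  // Assumption: length is measured in code points, not UTF-16 units.
  static int numChars(String s) {
    return s.codePointCount(0, s.length());
  }

  // char(N) write side: drop trailing spaces, then enforce the limit.
  static String charWrite(String in, int limit) {
    if (in == null) return null;
    String trimmed = trimRight(in);
    if (numChars(trimmed) > limit) {
      throw new RuntimeException("Exceeds char type length limitation: " + limit);
    }
    return trimmed;
  }

  // char(N) read side: enforce the limit, then right-pad with spaces to N.
  static String charRead(String in, int limit) {
    if (in == null) return null;
    int n = numChars(in);
    if (n > limit) {
      throw new RuntimeException("Exceeds char type length limitation: " + limit);
    }
    StringBuilder sb = new StringBuilder(in);
    for (int i = n; i < limit; i++) sb.append(' ');
    return sb.toString();
  }

  // varchar(N) write side: over-long input is accepted only if the excess is trailing spaces.
  static String varcharWrite(String in, int limit) {
    if (in == null) return null;
    if (numChars(in) <= limit) return in;
    if (numChars(trimRight(in)) > limit) {
      throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
    }
    // Everything past the limit is spaces, so cutting at N keeps N characters.
    return in.substring(0, in.offsetByCodePoints(0, limit));
  }

  // varchar(N) read side: only the limit is enforced, no padding.
  static String varcharRead(String in, int limit) {
    if (in != null && numChars(in) > limit) {
      throw new RuntimeException("Exceeds varchar type length limitation: " + limit);
    }
    return in;
  }

  public static void main(String[] args) {
    System.out.println("[" + charWrite("ab   ", 3) + "]");     // [ab]
    System.out.println("[" + charRead("ab", 4) + "]");         // [ab  ]
    System.out.println("[" + varcharWrite("ab    ", 4) + "]"); // [ab  ]
  }
}
```

The asymmetry is the point of interest: char values are stored trimmed and padded on read, while varchar values keep up to N characters of their trailing spaces because nothing pads them later.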

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala

Lines changed: 35 additions & 39 deletions
@@ -22,10 +22,10 @@ import scala.collection.mutable
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.UTF8String
 
 object CharVarcharUtils extends Logging {
 
@@ -161,9 +161,20 @@ object CharVarcharUtils extends Logging {
 
   private def paddingWithLengthCheck(expr: Expression, dt: DataType): Expression = dt match {
     case CharType(length) =>
-      StringRPad(stringLengthCheck(expr, dt, needTrim = false), Literal(length))
-
-    case VarcharType(_) => stringLengthCheck(expr, dt, needTrim = false)
+      StaticInvoke(
+        classOf[CharVarcharCodegenUtils],
+        StringType,
+        "charTypeReadSideCheck",
+        expr :: Literal(length) :: Nil,
+        propagateNull = false)
+
+    case VarcharType(length) =>
+      StaticInvoke(
+        classOf[CharVarcharCodegenUtils],
+        StringType,
+        "varcharTypeReadSideCheck",
+        expr :: Literal(length) :: Nil,
+        propagateNull = false)
 
     case StructType(fields) =>
       val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, i) =>
@@ -200,69 +211,54 @@ object CharVarcharUtils extends Logging {
    */
   def stringLengthCheck(expr: Expression, targetAttr: Attribute): Expression = {
     getRawType(targetAttr.metadata).map { rawType =>
-      stringLengthCheck(expr, rawType, needTrim = true)
+      stringLengthCheck(expr, rawType)
     }.getOrElse(expr)
   }
 
-  private def raiseError(typeName: String, length: Int): Expression = {
-    val errMsg = UTF8String.fromString(s"Exceeds $typeName type length limitation: $length")
-    RaiseError(Literal(errMsg, StringType), StringType)
-  }
-
-  private def stringLengthCheck(expr: Expression, dt: DataType, needTrim: Boolean): Expression = {
+  private def stringLengthCheck(expr: Expression, dt: DataType): Expression = {
     dt match {
       case CharType(length) =>
-        val trimmed = if (needTrim) StringTrimRight(expr) else expr
-        // Trailing spaces do not count in the length check. We don't need to retain the trailing
-        // spaces, as we will pad char type columns/fields at read time.
-        If(
-          GreaterThan(Length(trimmed), Literal(length)),
-          raiseError("char", length),
-          trimmed)
+        StaticInvoke(
+          classOf[CharVarcharCodegenUtils],
+          StringType,
+          "charTypeWriteSideCheck",
+          expr :: Literal(length) :: Nil,
+          propagateNull = false)
 
       case VarcharType(length) =>
-        if (needTrim) {
-          val trimmed = StringTrimRight(expr)
-          // Trailing spaces do not count in the length check. We need to retain the trailing spaces
-          // (truncate to length N), as there is no read-time padding for varchar type.
-          // TODO: create a special TrimRight function that can trim to a certain length.
-          If(
-            LessThanOrEqual(Length(expr), Literal(length)),
-            expr,
-            If(
-              GreaterThan(Length(trimmed), Literal(length)),
-              raiseError("varchar", length),
-              StringRPad(trimmed, Literal(length))))
-        } else {
-          If(GreaterThan(Length(expr), Literal(length)), raiseError("varchar", length), expr)
-        }
+        StaticInvoke(
+          classOf[CharVarcharCodegenUtils],
+          StringType,
+          "varcharTypeWriteSideCheck",
+          expr :: Literal(length) :: Nil,
+          propagateNull = false)
 
       case StructType(fields) =>
         val struct = CreateNamedStruct(fields.zipWithIndex.flatMap { case (f, i) =>
           Seq(Literal(f.name),
-            stringLengthCheck(GetStructField(expr, i, Some(f.name)), f.dataType, needTrim))
+            stringLengthCheck(GetStructField(expr, i, Some(f.name)), f.dataType))
         })
         if (expr.nullable) {
           If(IsNull(expr), Literal(null, struct.dataType), struct)
         } else {
           struct
         }
 
-      case ArrayType(et, containsNull) => stringLengthCheckInArray(expr, et, containsNull, needTrim)
+      case ArrayType(et, containsNull) => stringLengthCheckInArray(expr, et, containsNull)
 
       case MapType(kt, vt, valueContainsNull) =>
-        val newKeys = stringLengthCheckInArray(MapKeys(expr), kt, containsNull = false, needTrim)
-        val newValues = stringLengthCheckInArray(MapValues(expr), vt, valueContainsNull, needTrim)
+        val newKeys = stringLengthCheckInArray(MapKeys(expr), kt, containsNull = false)
+        val newValues = stringLengthCheckInArray(MapValues(expr), vt, valueContainsNull)
         MapFromArrays(newKeys, newValues)
 
       case _ => expr
     }
   }
 
   private def stringLengthCheckInArray(
-      arr: Expression, et: DataType, containsNull: Boolean, needTrim: Boolean): Expression = {
+      arr: Expression, et: DataType, containsNull: Boolean): Expression = {
     val param = NamedLambdaVariable("x", replaceCharVarcharWithString(et), containsNull)
-    val func = LambdaFunction(stringLengthCheck(param, et, needTrim), Seq(param))
+    val func = LambdaFunction(stringLengthCheck(param, et), Seq(param))
     ArrayTransform(arr, func)
   }
 
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+================================================================================================
+Char Varchar Read Side Perf
+================================================================================================
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+read string with length 20                         1504           1508           4         66.5          15.0       1.0X
+read char with length 20                           1680           1684           3         59.5          16.8       0.9X
+read varchar with length 20                        1659           1682          26         60.3          16.6       0.9X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+read string with length 40                         1662           1678          15         60.2          16.6       1.0X
+read char with length 40                           1721           1731           9         58.1          17.2       1.0X
+read varchar with length 40                        1694           1706          12         59.0          16.9       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+read string with length 60                         1623           1643          23         61.6          16.2       1.0X
+read char with length 60                           1644           1685          66         60.8          16.4       1.0X
+read varchar with length 60                        1660           1680          18         60.2          16.6       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+read string with length 80                         1629           1678          57         61.4          16.3       1.0X
+read char with length 80                           1630           1667          65         61.3          16.3       1.0X
+read varchar with length 80                        1664           1684          34         60.1          16.6       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+read string with length 100                        1594           1612          17         62.7          15.9       1.0X
+read char with length 100                          1631           1642          11         61.3          16.3       1.0X
+read varchar with length 100                       1635           1644          13         61.1          16.4       1.0X
+
+
+================================================================================================
+Char Varchar Write Side Perf
+================================================================================================
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+write string with length 20                        2760           2784          21          3.6         276.0       1.0X
+write char with length 20                          2898           2917          22          3.5         289.8       1.0X
+write varchar with length 20                       2876           2892          14          3.5         287.6       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+write string with length 40                        2726           2734           9          3.7         272.6       1.0X
+write char with length 40                          2885           2898          16          3.5         288.5       0.9X
+write varchar with length 40                       2844           2860          15          3.5         284.4       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+write string with length 60                        2724           2739          21          3.7         272.4       1.0X
+write char with length 60                          2868           2912          44          3.5         286.8       0.9X
+write varchar with length 60                       2870           2896          23          3.5         287.0       0.9X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+write string with length 80                        9094           9154          71          1.1         909.4       1.0X
+write char with length 80                          9471           9489          19          1.1         947.1       1.0X
+write varchar with length 80                      15099          15130          28          0.7        1509.9       0.6X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
+Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+write string with length 100                      10152          10253          94          1.0        1015.2       1.0X
+write char with length 100                        10831          10834           3          0.9        1083.1       0.9X
+write varchar with length 100                     19486          19560          73          0.5        1948.6       0.5X
+
+
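
The derived columns in the tables above follow from the best time and the row count. The sketch below reproduces them for the first read-side table; it is not Spark's benchmark code. The row counts are not stated in the results file: 100,000,000 rows for the read side (and 10,000,000 for the write side) is an inference from the reported rate and best time, and `BenchmarkColumns` with its methods is a hypothetical name.

```java
import java.util.Locale;

final class BenchmarkColumns {
  // Per Row(ns): best time converted to nanoseconds, divided by the row count.
  static double perRowNs(double bestMs, long rows) {
    return bestMs * 1_000_000.0 / rows;
  }

  // Rate(M/s): millions of rows processed per second at the best time.
  static double rateMPerSec(double bestMs, long rows) {
    return rows / (bestMs * 1000.0);
  }

  // Relative: baseline best time over this case's best time (below 1.0X means slower).
  static double relative(double baselineBestMs, double bestMs) {
    return baselineBestMs / bestMs;
  }

  public static void main(String[] args) {
    long rows = 100_000_000L; // inferred, not stated in the results file
    // "read string with length 20": best time 1504 ms
    System.out.println(String.format(Locale.ROOT, "%.1f", perRowNs(1504, rows)));    // 15.0
    System.out.println(String.format(Locale.ROOT, "%.1f", rateMPerSec(1504, rows))); // 66.5
    // "read char with length 20" (1680 ms) against the string baseline (1504 ms)
    System.out.println(String.format(Locale.ROOT, "%.1fX", relative(1504, 1680)));   // 0.9X
  }
}
```

Read this way, the 0.5X to 0.6X rows for varchar writes with `hasSpaces: true` mean that case takes roughly twice as long per row as the plain string write.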
