-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessary cast #14036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Before:
```
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
CAST((6 / 3) AS BIGINT): bigint
Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS
CAST((6 / 3) AS BIGINT)#5L]
+- OneRowRelation$
...
```
After:
```
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
(6 / 3): int
Project [(6 / 3) AS (6 / 3)apache#11]
+- OneRowRelation$
...
```
|
cc: @cloud-fan |
|
Test build #61682 has finished for PR 14036 at commit
|
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Funct ionRegistry.scala # sql/core/src/main/scala/org/apache/spark/sql/functions.scala # sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.sca la # sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.sca la
|
Test build #61686 has finished for PR 14036 at commit
|
| expression[Subtract]("-"), | ||
| expression[Multiply]("*"), | ||
| expression[Divide]("/"), | ||
| expression[IntegerDivide]("div"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does hive support this syntax? i.e. div(4, 2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan yes, hive support div and / .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lianhuiwang doing div(4,2) gives
hive> div(4, 2);
NoViableAltException(14@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1099)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:440)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1249)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1295)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1178)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1166)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:236)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:782)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:721)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:648)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:0 cannot recognize input near 'div' '(' '4'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'select 4 div 2' is the right code.
|
Test build #61689 has finished for PR 14036 at commit
|
|
Test build #61692 has finished for PR 14036 at commit
|
|
Test build #61695 has finished for PR 14036 at commit
|
|
Test build #61706 has finished for PR 14036 at commit
|
| } | ||
|
|
||
| // Used by doGenCode | ||
| def divide(eval1: ExprCode, eval2: ExprCode, javaType: String): String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mark as protected?
| * | ||
| * @group expr_ops | ||
| * @since 2.1.0 | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we just use the normal / ?
|
@techaddict why do we need to introduce a new user facing function for this? |
|
Test build #61766 has finished for PR 14036 at commit
|
|
@rxin , the In spark sql we implement |
|
OK I think it's fine to add it in SQL for compatibility but I wouldn't add functions in the DataFrame API for that. |
|
Test build #62212 has finished for PR 14036 at commit
|
|
Test build #62213 has finished for PR 14036 at commit
|
|
|
||
| @ExpressionDescription( | ||
| usage = "a _FUNC_ b - Divides a by b.", | ||
| extended = "> SELECT 3 _FUNC_ 2;\n 1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntegralDivide?
|
LGTM except 2 naming comments, thanks for working on it! |
|
@cloud-fan Done 👍 |
|
Test build #62224 has finished for PR 14036 at commit
|
| } | ||
| } | ||
|
|
||
| @ExpressionDescription( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about Divides a by b of fraction type?
|
last 2 comments. @liancheng @clockfly can you also take a look? |
| Remainder(left, right) | ||
| case SqlBaseParser.DIV => | ||
| Cast(Divide(left, right), LongType) | ||
| IntegralDivide(left, right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add SqlBaseParser.DIVIDE for '/'. BTW: SqlBaseParser.DIV for 'div' .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's ok because I find SparkSQL has SqlBaseParser.SLASH for '/' .
|
@techaddict Can you test the performance with and without your change? |
|
@yhuai sure, doing performance test using sql query or expression ? |
|
Having a query just to test this expression is good. |
|
@techaddict How is the perf test going? |
|
@HyukjinKwon didn't have bandwidth will try to finish this weekend |
## What changes were proposed in this pull request? The PR takes over #14036 and it introduces a new expression `IntegralDivide` in order to avoid the several unneded cast added previously. In order to prove the performance gain, the following benchmark has been run: ``` test("Benchmark IntegralDivide") { val r = new scala.util.Random(91) val nData = 1000000 val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt())) val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong())) val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort)) // old code val oldExprsInt = testDataInt.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) val oldExprsLong = testDataLong.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) val oldExprsShort = testDataShort.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) // new code val newExprsInt = testDataInt.map(x => IntegralDivide(x._1, x._2)) val newExprsLong = testDataLong.map(x => IntegralDivide(x._1, x._2)) val newExprsShort = testDataShort.map(x => IntegralDivide(x._1, x._2)) Seq(("Long", "old", oldExprsLong), ("Long", "new", newExprsLong), ("Int", "old", oldExprsInt), ("Int", "new", newExprsShort), ("Short", "old", oldExprsShort), ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) => val start = System.nanoTime() ds.foreach(e => e.eval(EmptyRow)) val endNoCodegen = System.nanoTime() println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms") } } ``` The results on my laptop are: ``` Running 1000000 op with old code on Long (no-codegen): 600 ms Running 1000000 op with new code on Long (no-codegen): 112 ms Running 1000000 op with old code on Int (no-codegen): 560 ms Running 1000000 op with new code on Int (no-codegen): 135 ms Running 1000000 op with old code on Short (no-codegen): 317 ms Running 1000000 op with new code on Short (no-codegen): 153 ms ``` Showing a 2-5X improvement. The benchmark doesn't include code generation as it is pretty hard to test the performance there as for such simple operations the most of the time is spent in the code generation/compilation process. ## How was this patch tested? added UTs Closes #22395 from mgaido91/SPARK-16323. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Add IntegerDivide to avoid unnecessary cast
Before:
After:
How was this patch tested?
Existing Tests and added new ones