Skip to content

Conversation

@techaddict
Copy link
Contributor

What changes were proposed in this pull request?

Add IntegerDivide to avoid unnecessary cast

Before:

scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
CAST((6 / 3) AS BIGINT): bigint
Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS
CAST((6 / 3) AS BIGINT)#5L]
+- OneRowRelation$
...

After:

scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
(6 / 3): int
Project [(6 / 3) AS (6 / 3)#11]
+- OneRowRelation$
...

How was this patch tested?

Existing Tests and added new ones

Before:
```
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
CAST((6 / 3) AS BIGINT): bigint
Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS
CAST((6 / 3) AS BIGINT)#5L]
+- OneRowRelation$
...
```

After:
```
scala> spark.sql("select 6 div 3").explain(true)
...
== Analyzed Logical Plan ==
(6 / 3): int
Project [(6 / 3) AS (6 / 3)apache#11]
+- OneRowRelation$
...
```
@techaddict
Copy link
Contributor Author

cc: @cloud-fan

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61682 has finished for PR 14036 at commit e4e42c3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MapKeys(child: Expression)
    • case class MapValues(child: Expression)

# Conflicts:
#
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Funct
ionRegistry.scala
#	sql/core/src/main/scala/org/apache/spark/sql/functions.scala
#
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.sca
la
#
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.sca
la
@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61686 has finished for PR 14036 at commit 5db17fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expression[Subtract]("-"),
expression[Multiply]("*"),
expression[Divide]("/"),
expression[IntegerDivide]("div"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does hive support this syntax? i.e. div(4, 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan yes, hive support div and / .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lianhuiwang doing div(4,2) gives

hive> div(4, 2);
NoViableAltException(14@[])
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1099)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:440)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1249)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1295)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1178)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1166)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:236)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:782)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:721)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:648)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:0 cannot recognize input near 'div' '(' '4'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'select 4 div 2' is the right code.

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61689 has finished for PR 14036 at commit faa2fd3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61692 has finished for PR 14036 at commit e30747a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61695 has finished for PR 14036 at commit 054d54e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 4, 2016

Test build #61706 has finished for PR 14036 at commit ff97457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class DivisionArithmetic extends BinaryArithmetic with NullIntolerant
    • case class Divide(left: Expression, right: Expression)

}

// Used by doGenCode
def divide(eval1: ExprCode, eval2: ExprCode, javaType: String): String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark as protected?

*
* @group expr_ops
* @since 2.1.0
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we just use the normal / ?

@rxin
Copy link
Contributor

rxin commented Jul 5, 2016

@techaddict why do we need to introduce a new user facing function for this?

@SparkQA
Copy link

SparkQA commented Jul 5, 2016

Test build #61766 has finished for PR 14036 at commit e89ffc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

@rxin , the / in hive/mysql means fraction division, e.g. 3 / 2 will get 1.5. And the div is a special operator for integral division, 3 div 2 will get 1.

In spark sql we implement div by casting the result of / to int, which is inefficient.

@rxin
Copy link
Contributor

rxin commented Jul 6, 2016

OK I think it's fine to add it in SQL for compatibility but I wouldn't add functions in the DataFrame API for that.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62212 has finished for PR 14036 at commit ab6858c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62213 has finished for PR 14036 at commit 8d9a04d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@ExpressionDescription(
usage = "a _FUNC_ b - Divides a by b.",
extended = "> SELECT 3 _FUNC_ 2;\n 1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntegralDivide?

@cloud-fan
Copy link
Contributor

LGTM except 2 naming comments, thanks for working on it!

@techaddict
Copy link
Contributor Author

@cloud-fan Done 👍

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62224 has finished for PR 14036 at commit 16eff20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Divide(left: Expression, right: Expression) extends DivideBase
    • case class IntegralDivide(left: Expression, right: Expression) extends DivideBase

}
}

@ExpressionDescription(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about Divides a by b of fraction type?

@cloud-fan
Copy link
Contributor

last 2 comments. @liancheng @clockfly can you also take a look?

Remainder(left, right)
case SqlBaseParser.DIV =>
Cast(Divide(left, right), LongType)
IntegralDivide(left, right)
Copy link
Contributor

@lianhuiwang lianhuiwang Jul 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add SqlBaseParser.DIVIDE for '/'. BTW: SqlBaseParser.DIV for 'div' .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ok because I find SparkSQL has SqlBaseParser.SLASH for '/' .

@yhuai
Copy link
Contributor

yhuai commented Jul 19, 2016

@techaddict Can you test the performance with and without your change?

@techaddict
Copy link
Contributor Author

techaddict commented Jul 21, 2016

@yhuai sure, doing performance test using sql query or expression ?

@yhuai
Copy link
Contributor

yhuai commented Jul 21, 2016

Having a query just to test this expression is good.

@HyukjinKwon
Copy link
Member

@techaddict How is the perf test going?

@techaddict
Copy link
Contributor Author

@HyukjinKwon didn't have bandwidth will try to finish this weekend

@asfgit asfgit closed this in b771fed Jun 8, 2017
asfgit pushed a commit that referenced this pull request Sep 17, 2018
## What changes were proposed in this pull request?

The PR takes over #14036 and it introduces a new expression `IntegralDivide` in order to avoid the several unneded cast added previously.

In order to prove the performance gain, the following benchmark has been run:

```
  test("Benchmark IntegralDivide") {
    val r = new scala.util.Random(91)
    val nData = 1000000
    val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt()))
    val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong()))
    val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort))

    // old code
    val oldExprsInt = testDataInt.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsLong = testDataLong.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsShort = testDataShort.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))

    // new code
    val newExprsInt = testDataInt.map(x => IntegralDivide(x._1, x._2))
    val newExprsLong = testDataLong.map(x => IntegralDivide(x._1, x._2))
    val newExprsShort = testDataShort.map(x => IntegralDivide(x._1, x._2))

    Seq(("Long", "old", oldExprsLong),
      ("Long", "new", newExprsLong),
      ("Int", "old", oldExprsInt),
      ("Int", "new", newExprsShort),
      ("Short", "old", oldExprsShort),
      ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) =>
      val start = System.nanoTime()
      ds.foreach(e => e.eval(EmptyRow))
      val endNoCodegen = System.nanoTime()
      println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms")
    }
  }
```

The results on my laptop are:

```
Running 1000000 op with old code on Long (no-codegen): 600 ms
Running 1000000 op with new code on Long (no-codegen): 112 ms
Running 1000000 op with old code on Int (no-codegen): 560 ms
Running 1000000 op with new code on Int (no-codegen): 135 ms
Running 1000000 op with old code on Short (no-codegen): 317 ms
Running 1000000 op with new code on Short (no-codegen): 153 ms
```

Showing a 2-5X improvement. The benchmark doesn't include code generation as it is pretty hard to test the performance there as for such simple operations the most of the time is spent in the code generation/compilation process.

## How was this patch tested?

added UTs

Closes #22395 from mgaido91/SPARK-16323.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants