[SPARK-28395][SQL] Division operator support integral division #25158

wangyum · 2019-07-15T07:34:07Z

What changes were proposed in this pull request?

PostgreSQL, Teradata, SQL Server, DB2 and Presto perform integral division with the / operator.
But Oracle, Vertica, Hive, MySQL and MariaDB perform fractional division with the / operator.

This PR adds a flag (spark.sql.function.preferIntegralDivision) to control whether the / operator performs integral division.
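As a rough illustration of the two behaviours the flag chooses between, here is a minimal plain-Scala sketch for the integral / integral case only. `DivisionModes`, `divide`, and the `preferIntegral` parameter are hypothetical names for this example, not Spark's implementation, and returning `None` for a zero divisor is a modelling choice made here to stand in for SQL NULL.

```scala
// Hypothetical model of the two division modes; not Spark's actual code.
object DivisionModes {
  // Left  = integral result (truncates toward zero, like JVM long division).
  // Right = fractional result (both operands widened to double first).
  // None  = stand-in for SQL NULL on a zero divisor (assumption of this sketch).
  def divide(a: Long, b: Long, preferIntegral: Boolean): Option[Either[Long, Double]] =
    if (b == 0L) None
    else if (preferIntegral) Some(Left(a / b))
    else Some(Right(a.toDouble / b.toDouble))
}
```

With these definitions, `DivisionModes.divide(10, 3, preferIntegral = true)` yields `Some(Left(3))`, matching the PostgreSQL-style result, while `preferIntegral = false` yields a double close to 3.3333, matching the Hive-style result.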

Examples:

PostgreSQL:

postgres=# select substr(version(), 0, 16), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
     substr      | ?column? |     ?column?     |    ?column?     |     ?column?
-----------------+----------+------------------+-----------------+------------------
 PostgreSQL 11.3 |        3 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)

SQL Server:

1> select cast(10 as int) / cast(3 as int), cast(10.1 as float) / cast(3 as int), cast(10 as int) / cast(3.1 as float), cast(10.1 as float)/cast(3.1 as float);
2> go

----------- ------------------------ ------------------------ ------------------------
          3       3.3666666666666667        3.225806451612903        3.258064516129032

(1 rows affected)

DB2:

[db2inst1@2f3c821d36b7 ~]$ db2 "select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double) from table (sysproc.env_get_inst_info())"

1           2                        3                        4
----------- ------------------------ ------------------------ ------------------------
          3   +3.36666666666667E+000   +3.22580645161290E+000   +3.25806451612903E+000

  1 record(s) selected.

Presto:

presto> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
 _col0 |       _col1        |       _col2       |       _col3
-------+--------------------+-------------------+-------------------
     3 | 3.3666666666666667 | 3.225806451612903 | 3.258064516129032
(1 row)

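Teradata (result posted as a screenshot): https://user-images.githubusercontent.com/5399861/61200701-e97d5380-a714-11e9-9a1d-57fd99d38c8d.png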

Oracle:

SQL> select 10 / 3 from dual;

      10/3
----------
3.33333333

Vertica:

dbadmin=> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
              version               |       ?column?       |     ?column?     |    ?column?     |     ?column?
------------------------------------+----------------------+------------------+-----------------+------------------
 Vertica Analytic Database v9.1.1-0 | 3.333333333333333333 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)

Hive:

hive> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
OK
3.3333333333333335	3.3666666666666667	3.225806451612903	3.258064516129032
Time taken: 0.143 seconds, Fetched: 1 row(s)

MariaDB:

MariaDB [(none)]> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| version()                            | cast(10 as int) / cast(3 as int) | cast(10.1 as double) / cast(3 as int) | cast(10 as int) / cast(3.1 as double) | cast(10.1 as double)/cast(3.1 as double) |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| 10.4.6-MariaDB-1:10.4.6+maria~bionic |                           3.3333 |                    3.3666666666666667 |                     3.225806451612903 |                        3.258064516129032 |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
1 row in set (0.000 sec)

MySQL:

mysql>  select version(), 10 / 3, 10 / 3.1, 10.1 / 3, 10.1 / 3.1;
+-----------+--------+----------+----------+------------+
| version() | 10 / 3 | 10 / 3.1 | 10.1 / 3 | 10.1 / 3.1 |
+-----------+--------+----------+----------+------------+
| 8.0.16    | 3.3333 |   3.2258 |  3.36667 |    3.25806 |
+-----------+--------+----------+----------+------------+
1 row in set (0.00 sec)

How was this patch tested?

unit tests

SparkQA · 2019-07-15T09:03:27Z

Test build #107673 has finished for PR 25158 at commit c1990b5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Division(conf: SQLConf) extends TypeCoercionRule

wangyum · 2019-07-15T09:14:55Z

retest this please

SparkQA · 2019-07-15T12:33:50Z

Test build #107677 has finished for PR 25158 at commit c1990b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Division(conf: SQLConf) extends TypeCoercionRule

wangyum · 2019-07-16T02:44:18Z

Benchmark and benchmark result:

cat <<EOF > sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SPARK_28395_Benchmark.scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.internal.SQLConf

/**
 * To run this benchmark:
 *   build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.SPARK_28395_Benchmark"
 */
object SPARK_28395_Benchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

    val title = "Benchmark SPARK-28395"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 6000000
        val df = spark.range(N)

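        // Materialize the input once as Parquet so both cases scan identical data.
        // id2 is 0 whenever id is a multiple of 999999, so a few rows divide by zero.
        df.selectExpr("id as id1", "cast(id % 999999 as bigint) as id2")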
          .write.mode("overwrite").parquet(dir.getCanonicalPath)

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        // One benchmark case per flag value: false = fractional, true = integral.
        Seq(false, true).foreach { integralDivision =>
          val name = if (integralDivision) "Integral division" else "Fractional division"
          benchmark.addCase(name) { _ =>
            withSQLConf(SQLConf.PREFER_INTEGRAL_DIVISION.key -> integralDivision.toString) {
              spark.read.parquet(dir.getCanonicalPath).selectExpr("id1 / id2").collect()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}

EOF

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_211-b12 on Linux 3.10.0-957.1.3.el7.x86_64
[info] Intel Core Processor (Broadwell)
[info] Benchmark SPARK-28395:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Fractional division                                3103           5975         877          1.9         517.2       1.0X
[info] Integral division                                  2084           2293         195          2.9         347.3       1.5X
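The derived columns in this table appear to follow from the Best Time column and N = 6000000 rows. The sketch below recomputes them under that assumption; `BenchmarkColumns` and its method names are hypothetical helpers for this example and are not part of Spark's `Benchmark` class.

```scala
// Hypothetical helpers that recompute the derived columns from Best Time(ms).
object BenchmarkColumns {
  val rows: Long = 6000000L // N in the benchmark above

  // Per Row(ns): best time converted to nanoseconds, divided by the row count.
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / rows

  // Rate(M/s): millions of rows processed per second at the best time.
  def rateMPerSec(bestMs: Double): Double = rows / (bestMs / 1000.0) / 1e6

  // Relative: baseline best time divided by this case's best time.
  def relative(baselineMs: Double, bestMs: Double): Double = baselineMs / bestMs
}
```

Plugging in the reported best times, 3103 ms gives about 517.2 ns per row and 1.9 M rows/s, 2084 ms gives about 347.3 ns per row and 2.9 M rows/s, and 3103 / 2084 is about 1.49, which rounds to the reported 1.5X.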

gatorsmile · 2019-07-16T06:01:12Z

cc @cloud-fan @gengliangwang

cloud-fan · 2019-07-16T07:31:30Z

thanks, merging to master!

Closes apache#25158 from wangyum/SPARK-28395.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2019-08-01T15:37:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

    .booleanConf
    .createWithDefault(false)

+  val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")


On second thought, I think we should not add a new behavior that is not in the SQL standard. If we only need it in tests, shall we make that clear in the config name and make it an internal config?

It seems that the SQL standard does not specify this case (integral / integral); different databases have different implementations.

OK so I'd like to treat it as an internal config that is only used in the ported pgsql test cases. @wangyum can you send a follow-up PR? thanks!

…ivision internal

## What changes were proposed in this pull request?

This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases.

More details: apache#25158 (comment)

## How was this patch tested?

N/A

Closes apache#25376 from wangyum/SPARK-28395-2.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

### What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.

In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:

1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.

### Why are the changes needed?

Unify the external database dialect with one configuration, instead of small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
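The boolean input rule in the `PostgreSQL` dialect description above can be sketched in plain Scala as follows. This is only an illustration of the stated rule, not Spark's cast implementation: `PgBoolean` and `parse` are hypothetical names, case-insensitive matching is an assumption the description does not state, and `None` stands in for whatever the real cast does with unrecognized input.

```scala
import java.util.Locale

// Hypothetical sketch of PostgreSQL-style boolean input parsing.
object PgBoolean {
  private val trueWords = Seq("true", "yes")
  private val falseWords = Seq("false", "no")

  // Trims the input, then accepts "1"/"0" or any non-empty prefix of the
  // listed words. No prefix is shared between the true and false words,
  // so every accepted prefix is unambiguous.
  def parse(input: String): Option[Boolean] = {
    val s = input.trim.toLowerCase(Locale.ROOT) // case folding is an assumption
    if (s.isEmpty) None
    else if (s == "1" || trueWords.exists(_.startsWith(s))) Some(true)
    else if (s == "0" || falseWords.exists(_.startsWith(s))) Some(false)
    else None
  }
}
```

For example, `PgBoolean.parse(" yes ")` and `PgBoolean.parse("tr")` both give `Some(true)`, `PgBoolean.parse("f")` gives `Some(false)`, and `PgBoolean.parse("maybe")` gives `None`.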

### What changes were proposed in this pull request?

Reprocess all PostgreSQL dialect related PRs, listed in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?

As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:

1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?

Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?

Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Division operator support integral division

c1990b5

wangyum mentioned this pull request Jul 15, 2019

[SPARK-28388][SQL][TEST] Port select_implicit.sql #25152

Closed

dongjoon-hyun added the SQL label Jul 16, 2019

cloud-fan closed this in 6926849 Jul 16, 2019

wangyum deleted the SPARK-28395 branch July 16, 2019 07:56

cloud-fan reviewed Aug 1, 2019

View reviewed changes

wangyum mentioned this pull request Aug 7, 2019

[SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal #25376

Closed

gengliangwang mentioned this pull request Sep 5, 2019

[SPARK-28997][SQL] Add spark.sql.dialect #25697

Closed

xuanyuanking mentioned this pull request Dec 4, 2019

[SPARK-30125][SQL] Remove PostgreSQL dialect #26763

Closed
