
Conversation

@wangyum
Member

@wangyum wangyum commented Aug 5, 2017

What changes were proposed in this pull request?

Add HiveInConversion and HivePromoteStrings rules to TypeCoercion.scala to be compatible with Hive.
Add a SQL configuration, spark.sql.typeCoercion.mode, to configure whether to use the hive compatibility mode or the default mode.

All differences between the default mode and the hive mode:

  • The normal text is the default mode
  • The **bold** text is the hive mode (compatible with Hive)

|  | StringType | DateType | TimestampType | NumericType |
| --- | --- | --- | --- | --- |
| StringType | None | StringType / **DateType** | StringType / **TimestampType** | NumericType / **DoubleType** |
| DateType | StringType / **DateType** | None | StringType / **TimestampType** | None |
| TimestampType | StringType / **TimestampType** | StringType / **TimestampType** | None | None / **DoubleType** |
| NumericType | NumericType / **DoubleType** | None | None / **DoubleType** | None |
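
For illustration, a minimal sketch of switching modes with the proposed config (the config name and values come from this PR's description and may differ from any merged form):

    // Hypothetical usage of the proposed spark.sql.typeCoercion.mode config.
    spark.conf.set("spark.sql.typeCoercion.mode", "hive")

    // String vs. numeric comparison: hive mode coerces both sides to DoubleType
    // (bold cell above), while default mode casts the string to the numeric type.
    spark.sql("SELECT '19157170390056971' = 19157170390056973L").show()
    // hive mode:    true  (both sides round to the same Double)
    // default mode: false (exact comparison after casting the string to long)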

The design doc:
https://issues.apache.org/jira/secure/attachment/12891695/Type_coercion_rules_to_compatible_with_Hive.pdf

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Aug 5, 2017

Test build #80286 has finished for PR 18853 at commit f59a213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Aug 7, 2017

How about casting the int values to strings in the case you described in the description, and then comparing them in lexicographical order?

@wangyum
Member Author

wangyum commented Aug 7, 2017

Casting the int values to strings can work, but filtering an int column with a string value feels terrible.
My opinion is to cast the filter value to the column type.

@wangyum
Member Author

wangyum commented Aug 7, 2017

retest this please

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80339 has finished for PR 18853 at commit f59a213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

p.makeCopy(Array(left, Cast(right, TimestampType)))

case p @ BinaryComparison(left, right)
  if left.isInstanceOf[AttributeReference] && right.isInstanceOf[Literal] =>
Member

We need to cover all the same cases, but it seems this fix can't; for example:

scala> spark.udf.register("testUdf", () => "85908509832958239058032")
scala> sql("select * from values (1) where testUdf() > 1").explain
== Physical Plan ==
*Filter (cast(UDF:testUdf() as int) > 1)
+- LocalTableScan [col1#104]

@wangyum
Member Author

wangyum commented Aug 12, 2017

Thanks @maropu. There are some problems:

spark-sql> select "20" > "100";
true
spark-sql> 

So the result of tmap.tkey < 100 is not what we expect, because the strings are compared lexicographically. Do you have any ideas?

@maropu
Member

maropu commented Aug 14, 2017

If we change this behaviour, I think we had better modify the code in findCommonTypeForBinaryComparison of TypeCoercion instead of taking your PR's approach: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L130

  val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {
    ...
    case (l: StringType, r: AtomicType) if r != StringType => Some(r)
    case (l: AtomicType, r: StringType) if l != StringType => Some(l)
    ...
  }

As another option, we could cast NumericType to a wider DecimalType. Since this change could add some runtime overhead and a behaviour change, I'm not sure it is acceptable. cc @gatorsmile @cloud-fan
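
For context on the DecimalType option, a quick REPL check (using the long values from the SPARK-17913 test quoted later in this thread) shows why DoubleType conflates large longs while a wide enough DecimalType does not:

scala> 19157170390056973L.toDouble == 19157170390056971L.toDouble
res0: Boolean = true

scala> BigDecimal(19157170390056973L) == BigDecimal(19157170390056971L)
res1: Boolean = false

Both longs round to 1.9157170390056972E16 because a Double carries only 53 mantissa bits, so a hive-style DoubleType coercion would treat them as equal.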

@gatorsmile
Member

Currently, type casting has a few issues when the types differ. So far, we do not have a good option that resolves all of them, so we are hesitant to introduce any behavior change unless it is well defined. Could you do some research to see how other systems behave? Is there any rule?

@wangyum wangyum changed the title [SPARK-21646][SQL] BinaryComparison shouldn't auto cast string to int/long [SPARK-21646][SQL] CommonType for binary comparison Sep 10, 2017
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA

SparkQA commented Sep 10, 2017

Test build #81606 has finished for PR 18853 at commit 8d37c72.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2017

Test build #81605 has finished for PR 18853 at commit cedb239.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

test("SPARK-17913: compare long and string type column may return confusing result") {
val df = Seq(123L -> "123", 19157170390056973L -> "19157170390056971").toDF("i", "j")
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(false) :: Nil)
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(true) :: Nil)
Member Author

To be compatible with Hive, MySQL and Oracle:

[screenshot: Oracle result]

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81613 has finished for PR 18853 at commit 522c4cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81638 has finished for PR 18853 at commit 3bec6a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 12, 2017

CC @gatorsmile, @cloud-fan

@wangyum
Member Author

wangyum commented Sep 15, 2017

I provide two SQL scripts to validate the different results between Spark and Hive:

| Engine | SPARK_21646_1.txt | SPARK_21646_2.txt |
| --- | --- | --- |
| Hive-2.2.0 | 0.1<br>0.6<br>100<br>1111111111111111111111111111111111111111111111111<br>2017-09-14 | |
| Spark | 100 | - |

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

Thank you for your investigation! I think we need to introduce a type inference conf for it. To avoid impacting existing Spark users, we should keep the existing behavior by default.
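
A minimal sketch of such a conf, assuming the name from the PR description and the keep-the-default requirement from this comment (not necessarily the merged definition):

    // Hypothetical conf definition; the name and doc wording are assumptions.
    val TYPE_COERCION_MODE = buildConf("spark.sql.typeCoercion.mode")
      .doc("Type coercion mode: 'default' keeps Spark's native coercion rules; " +
        "'hive' follows Hive's rules for binary comparison and IN conversion.")
      .stringConf
      .checkValues(Set("default", "hive"))
      .createWithDefault("default")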

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81871 has finished for PR 18853 at commit 844aec7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 18, 2017

retest this please.

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81876 has finished for PR 18853 at commit 844aec7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

buildConf("spark.sql.binary.comparison.compatible.with.hive")
.doc("Whether compatible with Hive when binary comparison.")
.booleanConf
.createWithDefault(true)
Member

This has to be false.

.createWithDefault(10000)

val BINARY_COMPARISON_COMPATIBLE_WITH_HIVE =
  buildConf("spark.sql.binary.comparison.compatible.with.hive")
Member

-> spark.sql.autoTypeCastingCompatibility

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84516 has finished for PR 18853 at commit 663eb35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84519 has finished for PR 18853 at commit 7802483.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The <code>default</code> type coercion mode was used in spark prior to 2.3.0, and so it
continues to be the default to avoid breaking behavior. However, it has logical
inconsistencies. The <code>hive</code> mode is preferred for most new applications, though
it may require additional manual casting.
Member

@gatorsmile gatorsmile Dec 6, 2017

Since Spark 2.3, the <code>hive</code> mode is introduced for Hive compatibility. Spark SQL has its native type coercion mode, which is enabled by default.

"and so it continues to be the default to avoid breaking behavior. " +
"However, it has logical inconsistencies. " +
"The 'hive' mode is preferred for most new applications, " +
"though it may require additional manual casting.")
Member

The same here.

} else {
  commonTypeCoercionRules :+
    InConversion :+
    PromoteStrings
Member

Rename them to NativeInConversion and NativePromoteStrings
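
For concreteness, a hedged sketch of the mode-based rule selection this suggests (HiveInConversion and HivePromoteStrings come from the PR description; the Native* names and the surrounding helper are assumptions):

    // Hypothetical sketch only; the conf accessor and names are assumptions.
    def typeCoercionRules(conf: SQLConf): List[Rule[LogicalPlan]] =
      if (conf.getConf(SQLConf.TYPE_COERCION_MODE) == "hive") {
        commonTypeCoercionRules :+ HiveInConversion :+ HivePromoteStrings
      } else {
        commonTypeCoercionRules :+ NativeInConversion :+ NativePromoteStrings
      }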

val findCommonTypeToCompatibleWithHive: (DataType, DataType) => Option[DataType] = {
  // Follow hive's binary comparison action:
  // https://github.com/apache/hive/blob/rel/storage-release-2.4.0/ql/src/java/
  // org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L781
Member

I saw the change history of this file. It sounds like Hive's type coercion rules are also evolving.
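
As a reference point, a minimal sketch of the hive-compatible mapping implied by the table in the PR description (the case list is an assumption, not the PR's actual code):

    // Hypothetical body, derived from the hive-mode (bold) cells of the
    // description's table; everything else falls through to None.
    val findCommonTypeToCompatibleWithHive: (DataType, DataType) => Option[DataType] = {
      case (StringType, DateType) | (DateType, StringType) => Some(DateType)
      case (StringType, TimestampType) | (TimestampType, StringType) => Some(TimestampType)
      case (StringType, _: NumericType) | (_: NumericType, StringType) => Some(DoubleType)
      case (DateType, TimestampType) | (TimestampType, DateType) => Some(TimestampType)
      case (TimestampType, _: NumericType) | (_: NumericType, TimestampType) => Some(DoubleType)
      case _ => None
    }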

@gatorsmile
Member

TypeCoercionModeSuite is an end-to-end test suite. We really need a test case in SQLQueryTestSuite. That means creating a .sql file that contains all the type mappings and implicit casting queries that can be run in both Hive and Spark. We can then easily verify whether we correctly cover all the type coercion compatibility issues.

Could you please do it? This must take a lot of effort, but it really helps us find all the holes. Appreciate it!

@gatorsmile
Member

Let me open an umbrella JIRA for tracking it. We can do it for both the native and the Hive compatibility modes.

@gatorsmile
Member

gatorsmile commented Dec 6, 2017

The JIRA https://issues.apache.org/jira/browse/SPARK-22722 was just opened. I will create an example and open many sub-tasks. Feel free to take them if you have bandwidth.

@gatorsmile
Member

cc @wangyum

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@SparkQA

SparkQA commented Jan 9, 2018

Test build #85840 has finished for PR 18853 at commit 97a071d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85841 has finished for PR 18853 at commit 408e889.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85846 has finished for PR 18853 at commit 408e889.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85849 has finished for PR 18853 at commit e763330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
@SparkQA

SparkQA commented Mar 31, 2018

Test build #88778 has finished for PR 18853 at commit 81067b9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Mar 31, 2018

retest this please

@SparkQA

SparkQA commented Mar 31, 2018

Test build #88780 has finished for PR 18853 at commit 81067b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented May 16, 2018

Spark vs Teradata:

[screenshot: Teradata result]

[screenshot: Spark result]

@SparkQA

SparkQA commented Jun 10, 2018

Test build #91633 has finished for PR 18853 at commit d0a2089.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2018

Test build #95733 has finished for PR 18853 at commit d0a2089.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
