
Conversation

@wangyum
Member

@wangyum wangyum commented Aug 5, 2017

What changes were proposed in this pull request?

Add HiveInConversion and HivePromoteStrings rules to TypeCoercion.scala to be compatible with Hive.
Add a SQL configuration, spark.sql.typeCoercion.mode, to configure whether to use the hive compatibility mode or the default mode.

All differences between the default mode and the hive mode:

  • The normal text is the default mode
  • The **bold** text is the hive mode (compatible with Hive)

|  | StringType | DateType | TimestampType | NumericType |
| --- | --- | --- | --- | --- |
| StringType | None | StringType / **DateType** | StringType / **TimestampType** | NumericType / **DoubleType** |
| DateType | StringType / **DateType** | None | StringType / **TimestampType** | None |
| TimestampType | StringType / **TimestampType** | StringType / **TimestampType** | None | None / **DoubleType** |
| NumericType | NumericType / **DoubleType** | None | None / **DoubleType** | None |
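
For illustration, a minimal sketch of switching modes with the proposed config (the config name and values come from this PR's description and may differ from any merged form):

    // Hypothetical usage of the proposed spark.sql.typeCoercion.mode config.
    spark.conf.set("spark.sql.typeCoercion.mode", "hive")

    // String vs. numeric comparison: hive mode coerces both sides to DoubleType
    // (bold cell above), while default mode casts the string to the numeric type.
    spark.sql("SELECT '19157170390056971' = 19157170390056973L").show()
    // hive mode:    true  (both sides round to the same Double)
    // default mode: false (exact comparison after casting the string to long)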

The design doc:
https://issues.apache.org/jira/secure/attachment/12891695/Type_coercion_rules_to_compatible_with_Hive.pdf

How was this patch tested?

unit tests

@SparkQA

SparkQA commented Aug 5, 2017

Test build #80286 has finished for PR 18853 at commit f59a213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Aug 7, 2017

How about casting the int values to strings in the case you described in the description, and then comparing them in lexicographical order?

@wangyum
Member Author

wangyum commented Aug 7, 2017

Casting the int values to strings can work, but filtering an int column with a string value feels terrible.
My opinion is to cast the filter value to the column type.

@wangyum
Member Author

wangyum commented Aug 7, 2017

retest this please

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80339 has finished for PR 18853 at commit f59a213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

p.makeCopy(Array(left, Cast(right, TimestampType)))

case p @ BinaryComparison(left, right)
  if left.isInstanceOf[AttributeReference] && right.isInstanceOf[Literal] =>
Member

We need to cover all the same cases, but it seems this fix can't; for example:

scala> spark.udf.register("testUdf", () => "85908509832958239058032")
scala> sql("select * from values (1) where testUdf() > 1").explain
== Physical Plan ==
*Filter (cast(UDF:testUdf() as int) > 1)
+- LocalTableScan [col1#104]

@wangyum
Member Author

wangyum commented Aug 12, 2017

Thanks @maropu. There are some problems:

spark-sql> select "20" > "100";
true
spark-sql> 

So the result of tmap.tkey < 100 is not what we expect, because the strings are compared lexicographically. Do you have any ideas?

@maropu
Member

maropu commented Aug 14, 2017

If we change this behaviour, I think we had better modify the code in findCommonTypeForBinaryComparison of TypeCoercion instead of taking your PR's approach: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L130

  val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {
    ...
    case (l: StringType, r: AtomicType) if r != StringType => Some(r)
    case (l: AtomicType, r: StringType) if l != StringType => Some(l)
    ...
  }

As another option, we could cast NumericType to a wider DecimalType. Since this change could add some runtime overhead and a behaviour change, I'm not sure it is acceptable. cc @gatorsmile @cloud-fan
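
For context on the DecimalType option, a quick REPL check (using the long values from the SPARK-17913 test quoted later in this thread) shows why DoubleType conflates large longs while a wide enough DecimalType does not:

scala> 19157170390056973L.toDouble == 19157170390056971L.toDouble
res0: Boolean = true

scala> BigDecimal(19157170390056973L) == BigDecimal(19157170390056971L)
res1: Boolean = false

Both longs round to 1.9157170390056972E16 because a Double carries only 53 mantissa bits, so a hive-style DoubleType coercion would treat them as equal.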

@gatorsmile
Member

Currently, type casting has a few issues when the types differ. So far, we do not have a good option that resolves all of them, so we are hesitant to introduce any behavior change unless it is well defined. Could you do some research to see how other systems behave? Is there any rule?

@wangyum wangyum changed the title [SPARK-21646][SQL] BinaryComparison shouldn't auto cast string to int/long [SPARK-21646][SQL] CommonType for binary comparison Sep 10, 2017
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA

SparkQA commented Sep 10, 2017

Test build #81606 has finished for PR 18853 at commit 8d37c72.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2017

Test build #81605 has finished for PR 18853 at commit cedb239.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

test("SPARK-17913: compare long and string type column may return confusing result") {
val df = Seq(123L -> "123", 19157170390056973L -> "19157170390056971").toDF("i", "j")
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(false) :: Nil)
checkAnswer(df.select($"i" === $"j"), Row(true) :: Row(true) :: Nil)
Member Author

To be compatible with Hive, MySQL and Oracle:

[screenshot: Oracle result]

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81613 has finished for PR 18853 at commit 522c4cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81638 has finished for PR 18853 at commit 3bec6a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 12, 2017

CC @gatorsmile, @cloud-fan

@wangyum
Member Author

wangyum commented Sep 15, 2017

I provide two SQL scripts to validate the different results between Spark and Hive:

| Engine | SPARK_21646_1.txt | SPARK_21646_2.txt |
| --- | --- | --- |
| Hive-2.2.0 | 0.1<br>0.6<br>100<br>1111111111111111111111111111111111111111111111111<br>2017-09-14 | |
| Spark | 100 | - |

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

Thank you for your investigation! I think we need to introduce a type inference conf for it. To avoid impacting existing Spark users, we should keep the existing behavior by default.
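
A minimal sketch of such a conf, assuming the name from the PR description and the keep-the-default requirement from this comment (not necessarily the merged definition):

    // Hypothetical conf definition; the name and doc wording are assumptions.
    val TYPE_COERCION_MODE = buildConf("spark.sql.typeCoercion.mode")
      .doc("Type coercion mode: 'default' keeps Spark's native coercion rules; " +
        "'hive' follows Hive's rules for binary comparison and IN conversion.")
      .stringConf
      .checkValues(Set("default", "hive"))
      .createWithDefault("default")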

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81871 has finished for PR 18853 at commit 844aec7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 18, 2017

retest this please.

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81876 has finished for PR 18853 at commit 844aec7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

buildConf("spark.sql.binary.comparison.compatible.with.hive")
.doc("Whether compatible with Hive when binary comparison.")
.booleanConf
.createWithDefault(true)
Member

This has to be false.

.createWithDefault(10000)

val BINARY_COMPARISON_COMPATIBLE_WITH_HIVE =
  buildConf("spark.sql.binary.comparison.compatible.with.hive")
Member

-> spark.sql.autoTypeCastingCompatibility

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84516 has finished for PR 18853 at commit 663eb35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84519 has finished for PR 18853 at commit 7802483.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The <code>default</code> type coercion mode was used in spark prior to 2.3.0, and so it
continues to be the default to avoid breaking behavior. However, it has logical
inconsistencies. The <code>hive</code> mode is preferred for most new applications, though
it may require additional manual casting.
Member

@gatorsmile gatorsmile Dec 6, 2017

Since Spark 2.3, the <code>hive</code> mode is introduced for Hive compatibility. Spark SQL has its native type coercion mode, which is enabled by default.

"and so it continues to be the default to avoid breaking behavior. " +
"However, it has logical inconsistencies. " +
"The 'hive' mode is preferred for most new applications, " +
"though it may require additional manual casting.")
Member

The same here.

} else {
  commonTypeCoercionRules :+
    InConversion :+
    PromoteStrings
Member

Rename them to NativeInConversion and NativePromoteStrings
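
For concreteness, a hedged sketch of the mode-based rule selection this suggests (HiveInConversion and HivePromoteStrings come from the PR description; the Native* names and the surrounding helper are assumptions):

    // Hypothetical sketch only; the conf accessor and names are assumptions.
    def typeCoercionRules(conf: SQLConf): List[Rule[LogicalPlan]] =
      if (conf.getConf(SQLConf.TYPE_COERCION_MODE) == "hive") {
        commonTypeCoercionRules :+ HiveInConversion :+ HivePromoteStrings
      } else {
        commonTypeCoercionRules :+ NativeInConversion :+ NativePromoteStrings
      }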

val findCommonTypeToCompatibleWithHive: (DataType, DataType) => Option[DataType] = {
  // Follow hive's binary comparison action:
  // https://github.com/apache/hive/blob/rel/storage-release-2.4.0/ql/src/java/
  // org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L781
Member

I saw the change history of this file. It sounds like Hive's type coercion rules are also evolving.
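
As a reference point, a minimal sketch of the hive-compatible mapping implied by the table in the PR description (the case list is an assumption, not the PR's actual code):

    // Hypothetical body, derived from the hive-mode (bold) cells of the
    // description's table; everything else falls through to None.
    val findCommonTypeToCompatibleWithHive: (DataType, DataType) => Option[DataType] = {
      case (StringType, DateType) | (DateType, StringType) => Some(DateType)
      case (StringType, TimestampType) | (TimestampType, StringType) => Some(TimestampType)
      case (StringType, _: NumericType) | (_: NumericType, StringType) => Some(DoubleType)
      case (DateType, TimestampType) | (TimestampType, DateType) => Some(TimestampType)
      case (TimestampType, _: NumericType) | (_: NumericType, TimestampType) => Some(DoubleType)
      case _ => None
    }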

@gatorsmile
Member

TypeCoercionModeSuite is an end-to-end test suite. We really need a test case in SQLQueryTestSuite. That means creating a .sql file that contains all the type mappings and implicit casting queries that can be run in both Hive and Spark. We can then easily verify whether we correctly cover all the type coercion compatibility issues.

Could you please do it? This must take a lot of effort, but it really helps us find all the holes. Appreciate it!

@gatorsmile
Member

Let me open an umbrella JIRA for tracking it. We can do it for both the native and the Hive compatibility modes.

@gatorsmile
Member

gatorsmile commented Dec 6, 2017

The JIRA https://issues.apache.org/jira/browse/SPARK-22722 was just opened. I will create an example and open many sub-tasks. Feel free to take them if you have bandwidth.

@gatorsmile
Member

cc @wangyum

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
@SparkQA

SparkQA commented Jan 9, 2018

Test build #85840 has finished for PR 18853 at commit 97a071d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85841 has finished for PR 18853 at commit 408e889.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85846 has finished for PR 18853 at commit 408e889.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85849 has finished for PR 18853 at commit e763330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
@SparkQA

SparkQA commented Mar 31, 2018

Test build #88778 has finished for PR 18853 at commit 81067b9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Mar 31, 2018

retest this please

@SparkQA

SparkQA commented Mar 31, 2018

Test build #88780 has finished for PR 18853 at commit 81067b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented May 16, 2018

Spark vs Teradata:

[screenshot: Teradata result]

[screenshot: Spark result]

@SparkQA

SparkQA commented Jun 10, 2018

Test build #91633 has finished for PR 18853 at commit d0a2089.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 6, 2018

Test build #95733 has finished for PR 18853 at commit d0a2089.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
