[SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema #9840
Conversation
Test build #46337 has finished for PR 9840 at commit

Test build #46409 has finished for PR 9840 at commit

cc @marmbrus

Test build #46424 has finished for PR 9840 at commit

Test build #46470 has finished for PR 9840 at commit

Test build #46486 has finished for PR 9840 at commit
When you say type compatibility, is it like type promotion? Have we defined such rules for type promotion in Spark? Thanks
I'm not sure if we want to automatically downcast where we could possibly truncate the values. Unlike an explicit cast, where the user is asking for it, I think this could be confusing. Consider the following:
```scala
scala> case class Data(value: Int)

scala> Seq(Int.MaxValue.toLong + 1).toDS().as[Data].collect()
res6: Array[Data] = Array(Data(-2147483648))
```

I think we at least want to warn, and probably just throw an error. If this is really what they want then they can cast explicitly.
@gatorsmile "type compatibility" means that if we do a type cast, the Cast operator can be resolved; see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L35-L85
@marmbrus I also thought about this; maybe we can create a different cast operator and define the encoder-related rules there?
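To make the idea of a separate, stricter cast operator concrete, here is a minimal sketch of the kind of check it could perform (the `SafeUpCastCheck` object and its widening table are hypothetical illustrations, not the operator this PR adds):

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: an encoder-specific cast accepts a conversion only if it
// can never lose information, and fails analysis otherwise instead of silently
// truncating values the way a plain Cast would.
object SafeUpCastCheck {
  // Conversions treated as lossless here (illustrative, not Catalyst's actual rules).
  private val safeWidenings: Map[DataType, Set[DataType]] = Map(
    ByteType    -> Set(ShortType, IntegerType, LongType),
    ShortType   -> Set(IntegerType, LongType),
    IntegerType -> Set(LongType),
    FloatType   -> Set(DoubleType)
  )

  def canUpCast(from: DataType, to: DataType): Boolean =
    from == to || safeWidenings.get(from).exists(_.contains(to))
}

// SafeUpCastCheck.canUpCast(IntegerType, LongType)  // true  -> insert a regular Cast
// SafeUpCastCheck.canUpCast(LongType, IntegerType)  // false -> raise an analysis error
```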
Test build #46569 has finished for PR 9840 at commit

It's blocked by #9928

Test build #46583 has finished for PR 9840 at commit
!(toPrecedence > 0 && fromPrecedence > toPrecedence)
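One way to read the quoted condition, assuming `fromPrecedence` and `toPrecedence` are positions in the numeric widening order (with a non-positive value meaning the type is not in that order): the cast is rejected only when both types are numeric and the target comes earlier than the source, i.e. when it would narrow. A small sketch of that reading (the `precedence` helper here is an assumption, not the code under review):

```scala
import org.apache.spark.sql.types._

// Sketch of the check being discussed, with an assumed precedence function
// modelled on a numeric widening order from narrowest to widest.
val numericPrecedence: Seq[DataType] =
  Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType)

def precedence(dt: DataType): Int = numericPrecedence.indexOf(dt) + 1  // 0 when not numeric

def notNarrowing(from: DataType, to: DataType): Boolean = {
  val (fromPrecedence, toPrecedence) = (precedence(from), precedence(to))
  !(toPrecedence > 0 && fromPrecedence > toPrecedence)
}

// notNarrowing(IntegerType, LongType)  // true: widening is fine
// notNarrowing(LongType, IntegerType)  // false: narrowing is rejected
```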
This is going to be a huge usability improvement. Thanks for working on it!
stale comment?
Should we allow this (string to numeric)?
Hmm, I'm torn. I think users might be confused by a bunch of nulls. I would say no and we can always add that later.
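For reference, the "bunch of nulls" concern comes from how a plain string-to-numeric cast behaves on unparsable input; a small illustration (assuming an existing `SparkSession` named `spark`; on 1.6 use `sqlContext.implicits._`):

```scala
import spark.implicits._
import org.apache.spark.sql.types.IntegerType

// A plain cast turns unparsable strings into null rather than failing, so an
// implicit string -> numeric cast inserted by the encoder would quietly null
// out bad rows instead of surfacing the schema mismatch.
val df = Seq("1", "2", "not a number").toDF("s")
df.select($"s".cast(IntegerType).as("s")).collect()
// Array([1], [2], [null])
```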
Test build #46666 has finished for PR 9840 at commit

Test build #46764 has finished for PR 9840 at commit
Why don't we allow this?
I'm not sure if this is intentional. Why do we ignore fractional types?
forType returns the default DecimalType for a primitive type, which could have a smaller range than the primitive itself; for example, Float has a larger range than DecimalType(24, 7).
We can't compare Float/Double with DecimalType
so it's not safe to cast Float/Double to Decimal and vice versa?
yes
That is correct. They could always truncate, so we should return this and fix the function below.
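To make the range argument concrete, a quick comparison in plain Scala (using the `DecimalType(24, 7)` figure mentioned above):

```scala
// DecimalType(24, 7) holds at most 17 digits before the decimal point, i.e.
// values just under 1e17, while Float.MaxValue is roughly 3.4e38. So a float
// can overflow the decimal, and a wide decimal can lose precision as a float,
// which is why neither direction is treated as a safe cast here.
val decimalMax = BigDecimal("9" * 17 + "." + "9" * 7)   // largest Decimal(24, 7) value
val floatMax   = BigDecimal(Float.MaxValue.toDouble)

println(decimalMax)             // 99999999999999999.9999999
println(floatMax > decimalMax)  // true
```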
retest this please.

Test build #46801 has finished for PR 9840 at commit
A little bit off topic but related: currently our casting rules (either implicit or explicit casting) are defined as full functions, so it's hard to programmatically check whether one data type can be converted to another. I think we can improve this situation by leveraging partial functions. For example:

```scala
object Cast {
  private type CastBuilder = PartialFunction[DataType, Any => Any]

  private val asInt: Any => Int = { case v: Int => v }

  private val implicitlyFromInt: CastBuilder = {
    case BooleanType => asInt andThen (_ != 0)   // non-zero maps to true
    case LongType    => asInt andThen (_.toLong)
    case FloatType   => asInt andThen (_.toFloat)
    case DoubleType  => asInt andThen (_.toDouble)
    case StringType  => _.toString
  }

  private val explicitlyFromInt: CastBuilder = {
    case ByteType  => asInt andThen (_.toByte)
    case ShortType => asInt andThen (_.toShort)
  }

  private val buildImplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntegerType => implicitlyFromInt
    // ...
  }

  private val buildExplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntegerType => explicitlyFromInt
    // ...
  }

  private val buildCast: PartialFunction[DataType, CastBuilder] =
    buildImplicitCast orElse buildExplicitCast
}
```

Then we can define:

```scala
case class Cast(child: Expression, dataType: DataType) {
  def nullSafeEval(input: Any): Any = buildCast(child.dataType)(dataType)(input)
}
```

With these at hand, it would be trivial to write some APIs for checking conversion between types, for example:

```scala
object Cast {
  def implicitlyConvertible(x: DataType, y: DataType): Boolean =
    buildImplicitCast.lift(x).exists(_.isDefinedAt(y))

  def explicitlyConvertible(x: DataType, y: DataType): Boolean =
    buildExplicitCast.lift(x).exists(_.isDefinedAt(y))

  def convertible(x: DataType, y: DataType): Boolean =
    buildCast.lift(x).exists(_.isDefinedAt(y))

  def implicitlyCompatible(x: DataType, y: DataType): Boolean =
    x == y || implicitlyConvertible(x, y) || implicitlyConvertible(y, x)
}
```

This probably makes life easier?
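If something like the sketch above were adopted, convertibility checks would reduce to ordinary partial-function queries; a short usage illustration of the helpers as sketched (only the int rules are shown above, so other branches fall through to false):

```scala
import org.apache.spark.sql.types._

// Usage of the sketched helpers above (not Catalyst's actual Cast object).
Cast.implicitlyConvertible(IntegerType, LongType)   // true: int widens to long implicitly
Cast.explicitlyConvertible(IntegerType, ShortType)  // true: narrowing only on explicit request
Cast.convertible(StringType, IntegerType)           // false: no string rules in the sketch
```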
I guess I can't find a case where this causes a problem, but it seems a little odd to have this here instead of as part of normal resolution. I guess it's fine because we are speculatively assuming that the UpCast is going to resolve into a cast that returns dataType anyway?
It seems slightly better to me to have UpCast be unresolved and fixed as we resolve the plan. However, I would not block merging the PR on this.
Other than @davies' concerns this LGTM.

Test build #46927 has finished for PR 9840 at commit

retest this please.

Test build #46937 has finished for PR 9840 at commit

Thanks, merging to master and 1.6.
[SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema

When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema. For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, we will hit a runtime error saying that we can't construct class `Data` with an int and a long, because we lost the information that `b` should be a string.

Author: Wenchen Fan <[email protected]>

Closes #9840 from cloud-fan/err-msg.

(cherry picked from commit 9df2462)
Signed-off-by: Michael Armbrust <[email protected]>
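A minimal sketch of the mismatch the description refers to (hypothetical snippet; assumes `spark.implicits._` in scope, `sqlContext.implicits._` on 1.6, and exact error messages vary by version):

```scala
import spark.implicits._

// The underlying rows carry [a: int, b: bigint], but the encoder's schema says
// `b` should be a String. Before this patch the mismatch surfaced only at
// runtime when constructing Data; with the added cast it is reconciled (or
// reported clearly) during analysis.
case class Data(a: Int, b: String)

val ds = Seq((1, 2L), (3, 4L)).toDF("a", "b").as[Data]
ds.collect()
```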