Skip to content

Conversation

@cloud-fan
Copy link
Contributor

When we build the fromRowExpression for an encoder, we set up a lot of "unresolved" stuff and lost the required data type, which may lead to runtime error if the real type doesn't match the encoder's schema.
For example, we build an encoder for case class Data(a: Int, b: String) and the real type is [a: int, b: long], then we will hit runtime error and say that we can't construct class Data with int and long, because we lost the information that b should be a string.

@SparkQA
Copy link

SparkQA commented Nov 19, 2015

Test build #46337 has finished for PR 9840 at commit 0877da3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class UnresolvedAttribute(\n * case class UnresolvedExtractValue(\n

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46409 has finished for PR 9840 at commit 2197018.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-11856][SQL][WIP] add type cast if the real type is different but compatible with encoder schema [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema Nov 20, 2015
@cloud-fan
Copy link
Contributor Author

cc @marmbrus

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46424 has finished for PR 9840 at commit e9dbd7b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2015

Test build #46470 has finished for PR 9840 at commit f4cd77e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2015

Test build #46486 has finished for PR 9840 at commit 8d6a6ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you saying type compatibility, is it like type promotion? Have we defined such rules for type promotion in Spark? Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to automatically downcast where we could possibly truncate the values. Unlike an explicit cast, where the user is asking for it, I think this could be confusing. Consider the following:

scala> case class Data(value: Int)
scala> Seq(Int.MaxValue.toLong + 1).toDS().as[Data].collect()
res6: Array[Data] = Array(Data(-2147483648))

I think we at least want to warn, and probably just throw an error. If this is really what they want then they can cast explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile "type compatibility" means if we do type cast, the Cast operator can be resolved, see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L35-L85

@marmbrus I also thought about this, maybe we can create a different cast operator and define the encoder related rules there?

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46569 has finished for PR 9840 at commit e5d963b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n

@cloud-fan
Copy link
Contributor Author

It's blocked by #9928

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46583 has finished for PR 9840 at commit 7c56223.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType) extends UnaryExpression with Unevaluable\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!(toPrecedence > 0 && fromPrecedence > toPrecedence)

@marmbrus
Copy link
Contributor

This is going to be a huge usability improvement. Thanks for working on it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stale comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we allow this? string to numeric

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm torn. I think users might be confused by a bunch of nulls. I would say no and we can always add that later.

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46666 has finished for PR 9840 at commit 6c9dc1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType) extends UnaryExpression with Unevaluable\n

@SparkQA
Copy link

SparkQA commented Nov 26, 2015

Test build #46764 has finished for PR 9840 at commit 399d812.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String])\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we allow this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is by intention. Why we ignore fraction type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forType returns the default DecimalType for primitive, that could have smaller range than it, for example, Float has larger range than DecimalType(24, 7).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't compare Float/Double with DecimalType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it's not safe to cast Float/Double to Decimal and vice versa?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. They could always truncate so we should return this and fix the function below.

@cloud-fan
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 27, 2015

Test build #46801 has finished for PR 9840 at commit 2f7370c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String])\n

@liancheng
Copy link
Contributor

A little bit off topic but related. Currently, our casting rules (either implicit or explicit casting) are defined as full functions, thus it's hard to programmatically check whether data type A is implicitly/explicitly convertible to data type B. This makes us hard code type conversion rules here and there, potentially in an inconsistent way.

I think we can improve this situation by leveraging partial functions. Take IntegerType as an example, here is a prototype:

object Cast {
  private type CastBuilder = PartialFunction[DataType, Any => Any]

  private val asInt = (_: Any) match { case v: Int => v }

  private val implicitlyFromInt: CastBuilder = {
    case BooleanType => asInt andThen (_ == 0)
    case LongType    => asInt andThen (_.toLong)
    case FloatType   => asInt andThen (_.toFloat)
    case DoubleType  => asInt andThen (_.toDouble)
    case StringType  => _.toString
  }

  private val explicitlyFromInt: CastBuilder = {
    case ByteType  => asInt andThen (_.toByte)
    case ShortType => asInt andThen (_.toShort)
  }

  private val buildImplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntType => implicitlyFromInt
    // ...
  }

  private val buildExplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntType => explicitlyFromInt
    // ...
  }

  private val buildCast: PartialFunction[DataType, CastBuilder] = {
    buildImplicitCast orElse buildExplicitCast
  }
}

Then we can define Cast.nullSafeEval as:

case class Cast(child: Expression, dataType: DataType) {
  def nullSafeEval(input: Any): Any = buildCast(child.dataType)(dataType)(input)
}

With these at hand, it would be trivial to write some APIs for checking conversion between types, for example:

object Cast {
  def implicitlyConvertible(x: DataType, y: DataType): Boolean = {
    buildImplicitCast.lift(x).exists(_.isDefinedAt(y))
  }

  def explicitlyConvertible(x: DataType, y: DataType): Boolean = {
    buildExplicitCast.lift(x).exists(_.isDefinedAt(y))
  }

  def convertible(x: DataType, y: DataType): Boolean = {
    buildCast.lift(x).exists(_.isDefinedAt(y))
  }

  def implicitlyCompatible(x: DataType, y: DataType): Boolean = {
    x == y || implicitlyConvertible(x, y) || implicitlyConvertible(y, x)
  }
}

This probably makes life easier?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I can't find a case where this causes a problem, but its seems a little odd to have this here instead of as part of normal resolution. I guess its fine because we are speculatively assuming that the UpCast is going to resolve into a cast that returns dataType anyway?

It seems slightly better to me to have UpCast be unresolved and fixed as we resolve the plan. However, I would not block merging the PR on this.

@marmbrus
Copy link
Contributor

marmbrus commented Dec 1, 2015

Other than @davies concerns this LGTM.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46927 has finished for PR 9840 at commit 57b0d7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String])\n

@cloud-fan
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46937 has finished for PR 9840 at commit 57b0d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * * For example, we build an encoder forcase class Data(a: Int, b: String)and the real type\n * case class Cast(child: Expression, dataType: DataType) extends UnaryExpression\n * case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String])\n

@marmbrus
Copy link
Contributor

marmbrus commented Dec 1, 2015

Thanks, merging to master and 1.6.

asfgit pushed a commit that referenced this pull request Dec 1, 2015
…mpatible with encoder schema

When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lost the required data type, which may lead to runtime error if the real type doesn't match the encoder's schema.
For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, then we will hit runtime error and say that we can't construct class `Data` with int and long, because we lost the information that `b` should be a string.

Author: Wenchen Fan <[email protected]>

Closes #9840 from cloud-fan/err-msg.

(cherry picked from commit 9df2462)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 9df2462 Dec 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants