[SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema #9840
Conversation
Test build #46337 has finished for PR 9840 at commit

Test build #46409 has finished for PR 9840 at commit

cc @marmbrus

Test build #46424 has finished for PR 9840 at commit

Test build #46470 has finished for PR 9840 at commit

Test build #46486 has finished for PR 9840 at commit
When you say type compatibility, is it like type promotion? Have we defined such rules for type promotion in Spark? Thanks
I'm not sure if we want to automatically downcast where we could possibly truncate the values. Unlike an explicit cast, where the user is asking for it, I think this could be confusing. Consider the following:
```scala
scala> case class Data(value: Int)

scala> Seq(Int.MaxValue.toLong + 1).toDS().as[Data].collect()
res6: Array[Data] = Array(Data(-2147483648))
```

I think we at least want to warn, and probably just throw an error. If this is really what they want then they can cast explicitly.
@gatorsmile "type compatibility" means that if we do a type cast, the Cast operator can be resolved; see https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L35-L85
@marmbrus I also thought about this; maybe we can create a different cast operator and define the encoder-related rules there?
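To make the idea of a separate, stricter cast operator concrete, here is a minimal sketch of the kind of check it could perform (the `SafeUpCastCheck` object and its widening table are hypothetical illustrations, not the operator this PR adds):

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch: an encoder-specific cast accepts a conversion only if it
// can never lose information, and fails analysis otherwise instead of silently
// truncating values the way a plain Cast would.
object SafeUpCastCheck {
  // Conversions treated as lossless here (illustrative, not Catalyst's actual rules).
  private val safeWidenings: Map[DataType, Set[DataType]] = Map(
    ByteType    -> Set(ShortType, IntegerType, LongType),
    ShortType   -> Set(IntegerType, LongType),
    IntegerType -> Set(LongType),
    FloatType   -> Set(DoubleType)
  )

  def canUpCast(from: DataType, to: DataType): Boolean =
    from == to || safeWidenings.get(from).exists(_.contains(to))
}

// SafeUpCastCheck.canUpCast(IntegerType, LongType)  // true  -> insert a regular Cast
// SafeUpCastCheck.canUpCast(LongType, IntegerType)  // false -> raise an analysis error
```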
Test build #46569 has finished for PR 9840 at commit

It's blocked by #9928

Test build #46583 has finished for PR 9840 at commit
!(toPrecedence > 0 && fromPrecedence > toPrecedence)
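One way to read the quoted condition, assuming `fromPrecedence` and `toPrecedence` are positions in the numeric widening order (with a non-positive value meaning the type is not in that order): the cast is rejected only when both types are numeric and the target comes earlier than the source, i.e. when it would narrow. A small sketch of that reading (the `precedence` helper here is an assumption, not the code under review):

```scala
import org.apache.spark.sql.types._

// Sketch of the check being discussed, with an assumed precedence function
// modelled on a numeric widening order from narrowest to widest.
val numericPrecedence: Seq[DataType] =
  Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType)

def precedence(dt: DataType): Int = numericPrecedence.indexOf(dt) + 1  // 0 when not numeric

def notNarrowing(from: DataType, to: DataType): Boolean = {
  val (fromPrecedence, toPrecedence) = (precedence(from), precedence(to))
  !(toPrecedence > 0 && fromPrecedence > toPrecedence)
}

// notNarrowing(IntegerType, LongType)  // true: widening is fine
// notNarrowing(LongType, IntegerType)  // false: narrowing is rejected
```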
This is going to be a huge usability improvement. Thanks for working on it!
stale comment?
Should we allow this (string to numeric)?
Hmm, I'm torn. I think users might be confused by a bunch of nulls. I would say no and we can always add that later.
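For reference, the "bunch of nulls" concern comes from how a plain string-to-numeric cast behaves on unparsable input; a small illustration (assuming an existing `SparkSession` named `spark`; on 1.6 use `sqlContext.implicits._`):

```scala
import spark.implicits._
import org.apache.spark.sql.types.IntegerType

// A plain cast turns unparsable strings into null rather than failing, so an
// implicit string -> numeric cast inserted by the encoder would quietly null
// out bad rows instead of surfacing the schema mismatch.
val df = Seq("1", "2", "not a number").toDF("s")
df.select($"s".cast(IntegerType).as("s")).collect()
// Array([1], [2], [null])
```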
Test build #46666 has finished for PR 9840 at commit

Test build #46764 has finished for PR 9840 at commit
Why don't we allow this?
I'm not sure if this is intentional. Why do we ignore fractional types?
forType returns the default DecimalType for a primitive type, which could have a smaller range than the primitive itself; for example, Float has a larger range than DecimalType(24, 7).
We can't compare Float/Double with DecimalType
so it's not safe to cast Float/Double to Decimal and vice versa?
yes
That is correct. They could always truncate, so we should return this and fix the function below.
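To make the range argument concrete, a quick comparison in plain Scala (using the `DecimalType(24, 7)` figure mentioned above):

```scala
// DecimalType(24, 7) holds at most 17 digits before the decimal point, i.e.
// values just under 1e17, while Float.MaxValue is roughly 3.4e38. So a float
// can overflow the decimal, and a wide decimal can lose precision as a float,
// which is why neither direction is treated as a safe cast here.
val decimalMax = BigDecimal("9" * 17 + "." + "9" * 7)   // largest Decimal(24, 7) value
val floatMax   = BigDecimal(Float.MaxValue.toDouble)

println(decimalMax)             // 99999999999999999.9999999
println(floatMax > decimalMax)  // true
```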
retest this please.

Test build #46801 has finished for PR 9840 at commit
A little bit off topic but related: currently our casting rules (either implicit or explicit casting) are defined as full functions, so it's hard to programmatically check whether one data type can be converted to another. I think we can improve this situation by leveraging partial functions. For example:

```scala
object Cast {
  private type CastBuilder = PartialFunction[DataType, Any => Any]

  private val asInt: Any => Int = { case v: Int => v }

  private val implicitlyFromInt: CastBuilder = {
    case BooleanType => asInt andThen (_ != 0)   // non-zero maps to true
    case LongType    => asInt andThen (_.toLong)
    case FloatType   => asInt andThen (_.toFloat)
    case DoubleType  => asInt andThen (_.toDouble)
    case StringType  => _.toString
  }

  private val explicitlyFromInt: CastBuilder = {
    case ByteType  => asInt andThen (_.toByte)
    case ShortType => asInt andThen (_.toShort)
  }

  private val buildImplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntegerType => implicitlyFromInt
    // ...
  }

  private val buildExplicitCast: PartialFunction[DataType, CastBuilder] = {
    // ...
    case IntegerType => explicitlyFromInt
    // ...
  }

  private val buildCast: PartialFunction[DataType, CastBuilder] =
    buildImplicitCast orElse buildExplicitCast
}
```

Then we can define:

```scala
case class Cast(child: Expression, dataType: DataType) {
  def nullSafeEval(input: Any): Any = buildCast(child.dataType)(dataType)(input)
}
```

With these at hand, it would be trivial to write some APIs for checking conversion between types, for example:

```scala
object Cast {
  def implicitlyConvertible(x: DataType, y: DataType): Boolean =
    buildImplicitCast.lift(x).exists(_.isDefinedAt(y))

  def explicitlyConvertible(x: DataType, y: DataType): Boolean =
    buildExplicitCast.lift(x).exists(_.isDefinedAt(y))

  def convertible(x: DataType, y: DataType): Boolean =
    buildCast.lift(x).exists(_.isDefinedAt(y))

  def implicitlyCompatible(x: DataType, y: DataType): Boolean =
    x == y || implicitlyConvertible(x, y) || implicitlyConvertible(y, x)
}
```

This probably makes life easier?
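If something like the sketch above were adopted, convertibility checks would reduce to ordinary partial-function queries; a short usage illustration of the helpers as sketched (only the int rules are shown above, so other branches fall through to false):

```scala
import org.apache.spark.sql.types._

// Usage of the sketched helpers above (not Catalyst's actual Cast object).
Cast.implicitlyConvertible(IntegerType, LongType)   // true: int widens to long implicitly
Cast.explicitlyConvertible(IntegerType, ShortType)  // true: narrowing only on explicit request
Cast.convertible(StringType, IntegerType)           // false: no string rules in the sketch
```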
I guess I can't find a case where this causes a problem, but it seems a little odd to have this here instead of as part of normal resolution. I guess it's fine because we are speculatively assuming that the UpCast is going to resolve into a cast that returns dataType anyway?
It seems slightly better to me to have UpCast be unresolved and fixed as we resolve the plan. However, I would not block merging the PR on this.
Other than @davies' concerns this LGTM.

Test build #46927 has finished for PR 9840 at commit

retest this please.

Test build #46937 has finished for PR 9840 at commit

Thanks, merging to master and 1.6.
[SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema

When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema. For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, we will hit a runtime error saying that we can't construct class `Data` with an int and a long, because we lost the information that `b` should be a string.

Author: Wenchen Fan <[email protected]>

Closes #9840 from cloud-fan/err-msg.

(cherry picked from commit 9df2462)
Signed-off-by: Michael Armbrust <[email protected]>
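A minimal sketch of the mismatch the description refers to (hypothetical snippet; assumes `spark.implicits._` in scope, `sqlContext.implicits._` on 1.6, and exact error messages vary by version):

```scala
import spark.implicits._

// The underlying rows carry [a: int, b: bigint], but the encoder's schema says
// `b` should be a String. Before this patch the mismatch surfaced only at
// runtime when constructing Data; with the added cast it is reconciled (or
// reported clearly) during analysis.
case class Data(a: Int, b: String)

val ds = Seq((1, 2L), (3, 4L)).toDF("a", "b").as[Data]
ds.collect()
```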