[SPARK-8772][SQL] Implement implicit type cast for expressions that define input types. #7175
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,8 +31,7 @@ import org.apache.spark.unsafe.types.UTF8String | |
| * A function that calculates an MD5 128-bit checksum and returns it as a hex string | ||
| * For input of type [[BinaryType]] | ||
| */ | ||
| case class Md5(child: Expression) | ||
| extends UnaryExpression with AutoCastInputTypes { | ||
| case class Md5(child: Expression) extends UnaryExpression with ExpectsInputTypes { | ||
|
|
||
| override def dataType: DataType = StringType | ||
|
|
||
|
|
@@ -62,12 +61,10 @@ case class Md5(child: Expression) | |
| * the hash length is not one of the permitted values, the return value is NULL. | ||
| */ | ||
| case class Sha2(left: Expression, right: Expression) | ||
| extends BinaryExpression with Serializable with AutoCastInputTypes { | ||
| extends BinaryExpression with Serializable with ExpectsInputTypes { | ||
|
|
||
| override def dataType: DataType = StringType | ||
|
|
||
| override def toString: String = s"SHA2($left, $right)" | ||
|
|
||
| override def inputTypes: Seq[DataType] = Seq(BinaryType, IntegerType) | ||
|
|
||
| override def eval(input: InternalRow): Any = { | ||
|
|
@@ -147,7 +144,7 @@ case class Sha2(left: Expression, right: Expression) | |
| * A function that calculates a sha1 hash value and returns it as a hex string | ||
| * For input of type [[BinaryType]] or [[StringType]] | ||
| */ | ||
| case class Sha1(child: Expression) extends UnaryExpression with AutoCastInputTypes { | ||
| case class Sha1(child: Expression) extends UnaryExpression with ExpectsInputTypes { | ||
|
|
||
| override def dataType: DataType = StringType | ||
|
|
||
|
|
@@ -174,8 +171,7 @@ case class Sha1(child: Expression) extends UnaryExpression with AutoCastInputTyp | |
| * A function that computes a cyclic redundancy check value and returns it as a bigint | ||
| * For input of type [[BinaryType]] | ||
| */ | ||
| case class Crc32(child: Expression) | ||
| extends UnaryExpression with AutoCastInputTypes { | ||
| case class Crc32(child: Expression) extends UnaryExpression with ExpectsInputTypes { | ||
|
Contributor
Crc32 should be able to work with StringType, but StringType cannot be implicitly cast to BinaryType, right?
Contributor
Author
I need to think about whether we should support implicit casts from string to binary. SQL Server does support that. Hive doesn't, but Hive chose to make a lot of the UDFs work against both types.
Contributor
I am not sure we can always cast a string to binary correctly, since different encoders produce different binary output. It's actually the case accept multiple We probably need another PR for this improvement.
Contributor
Author
I was thinking about having an AbstractDataType that's a TypeCollection, into which expressions can put arbitrary types. Basically similar to the Seq[Any] idea, but with better type safety.
Contributor
I think that's a good idea for this, but it probably makes things more complicated for auto casting. (Which data type should be cast to?)
Contributor
+1 for StringType -> BinaryType (UTF8 will be used)
Contributor
I mean we'd better leave the casting (StringType -> BinaryType) to be done within the UDF
Contributor
Oh, sorry, I just checked the Hive code: it does convert StringType => BinaryType (UTF8 bytes), just as the generic rule. @davies +1 |
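The encoding concern raised in this thread can be illustrated with a minimal sketch. It relies only on the JDK's `java.util.zip.CRC32` and `java.nio.charset.StandardCharsets`; the object `Crc32EncodingSketch` and its helpers `bytesOf` and `crc32Of` are hypothetical names for illustration and are not part of Spark. The point is that the same string yields different bytes depending on the charset, so an implicit StringType -> BinaryType cast has to fix one encoding (UTF-8 in the suggestion above).

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.util.zip.CRC32

// Hypothetical helpers, not Spark code.
object Crc32EncodingSketch {
  // Encode a string into bytes under an explicit charset.
  def bytesOf(s: String, charset: Charset): Array[Byte] = s.getBytes(charset)

  // Checksum the encoded bytes; the result depends on the charset chosen.
  def crc32Of(s: String, charset: Charset): Long = {
    val crc = new CRC32()
    crc.update(bytesOf(s, charset))
    crc.getValue
  }
}
```

For example, `Crc32EncodingSketch.bytesOf("\u00e9", StandardCharsets.UTF_8)` has two bytes while the ISO-8859-1 encoding of the same string has one, so a checksum computed over a string is only well defined once the encoding is fixed.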
||
|
|
||
| override def dataType: DataType = LongType | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.sql.types | ||
|
|
||
| import scala.reflect.ClassTag | ||
| import scala.reflect.runtime.universe.{TypeTag, runtimeMirror} | ||
|
|
||
| import org.apache.spark.sql.catalyst.ScalaReflectionLock | ||
| import org.apache.spark.sql.catalyst.expressions.Expression | ||
| import org.apache.spark.util.Utils | ||
|
|
||
| /** | ||
| * A non-concrete data type, reserved for internal uses. | ||
| */ | ||
| private[sql] abstract class AbstractDataType { | ||
| private[sql] def defaultConcreteType: DataType | ||
| } | ||
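As a rough illustration of what a `defaultConcreteType` can be used for, here is a toy model. Every name in it (`Expected`, `Concrete`, `AnyNumeric`, `targetType`) is hypothetical and the logic is an assumption, not the coercion rule implemented in this PR. It only shows the idea that a non-concrete expected type needs a concrete fallback when the actual input type does not already satisfy it; it deliberately ignores whether such a cast would be legal.

```scala
// Toy model only; none of these names exist in Spark.
object DefaultConcreteTypeSketch {
  // An expected input type, possibly non-concrete.
  sealed trait Expected { def defaultConcreteType: Concrete }

  // A concrete type is its own default.
  sealed trait Concrete extends Expected {
    def defaultConcreteType: Concrete = this
  }
  case object IntT extends Concrete
  case object DoubleT extends Concrete
  case object StringT extends Concrete

  // Non-concrete "any numeric type", falling back to IntT.
  case object AnyNumeric extends Expected {
    def defaultConcreteType: Concrete = IntT
  }

  // Does the actual type already satisfy the expectation?
  def accepts(expected: Expected, actual: Concrete): Boolean = expected match {
    case AnyNumeric  => actual == IntT || actual == DoubleT
    case c: Concrete => c == actual
  }

  // Type the input should end up with: unchanged if accepted,
  // otherwise the expected type's concrete default.
  def targetType(actual: Concrete, expected: Expected): Concrete =
    if (accepts(expected, actual)) actual else expected.defaultConcreteType
}
```

In this sketch, a `StringT` input to a slot expecting `AnyNumeric` would be directed to `IntT`, while a `DoubleT` input would be left alone.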
|
|
||
|
|
||
| /** | ||
| * An internal type used to represent everything that is not null, UDTs, arrays, structs, and maps. | ||
| */ | ||
| protected[sql] abstract class AtomicType extends DataType { | ||
| private[sql] type InternalType | ||
| @transient private[sql] val tag: TypeTag[InternalType] | ||
| private[sql] val ordering: Ordering[InternalType] | ||
|
|
||
| @transient private[sql] val classTag = ScalaReflectionLock.synchronized { | ||
| val mirror = runtimeMirror(Utils.getSparkClassLoader) | ||
| ClassTag[InternalType](mirror.runtimeClass(tag.tpe)) | ||
| } | ||
| } | ||
|
|
||
|
|
||
| /** | ||
| * :: DeveloperApi :: | ||
| * Numeric data types. | ||
| */ | ||
| abstract class NumericType extends AtomicType { | ||
| // Unfortunately we can't get this implicitly as that breaks Spark Serialization. In order for | ||
| // implicitly[Numeric[JvmType]] to be valid, we have to change JvmType from a type variable to a | ||
| // type parameter and add a numeric annotation (i.e., [JvmType : Numeric]). This gets | ||
| // desugared by the compiler into an argument to the object's constructor. This means there is no | ||
| // longer a no-argument constructor and thus the JVM cannot serialize the object anymore. | ||
| private[sql] val numeric: Numeric[InternalType] | ||
| } | ||
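The desugaring described in the comment above can be observed with a small standalone sketch. The classes `WithBound` and `WithMember` and the helper `ctorArity` are hypothetical and exist only for this illustration; the sketch compares constructor arity via Java reflection and does not itself exercise serialization.

```scala
// Standalone illustration; these classes are not part of Spark.
object ContextBoundSketch {
  // Context bound: the compiler adds an implicit Numeric[T] constructor parameter.
  class WithBound[T: Numeric]

  // Abstract-member style, as used in the diff: no constructor parameter is added.
  abstract class WithMember {
    type T
    val numeric: Numeric[T]
  }

  // Number of parameters of the first declared constructor of a class.
  def ctorArity(c: Class[_]): Int =
    c.getDeclaredConstructors.head.getParameterCount
}
```

Comparing `ctorArity(classOf[ContextBoundSketch.WithBound[_]])` with `ctorArity(classOf[ContextBoundSketch.WithMember])` shows the extra evidence parameter that the context bound introduces, which is what removes the no-argument constructor.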
|
|
||
|
|
||
| private[sql] object NumericType extends AbstractDataType { | ||
| /** | ||
| * Enables matching against NumericType for expressions: | ||
| * {{{ | ||
| * case Cast(child @ NumericType(), StringType) => | ||
|
Contributor
Exist: why do we add unapply for it? Is it the same as
Contributor
Sorry, didn't see that...
Contributor
Author
Btw, this is old code. It just got copied around. |
||
| * ... | ||
| * }}} | ||
| */ | ||
| def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[NumericType] | ||
|
|
||
| private[sql] override def defaultConcreteType: DataType = IntegerType | ||
| } | ||
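On the `unapply` question: a Boolean-returning `unapply` makes the companion object usable as a pattern with empty parentheses, as in the scaladoc example. The following self-contained sketch shows the mechanism with toy stand-ins; `DType`, `Expr`, `NumericLike`, and `describe` are hypothetical names, not Catalyst classes.

```scala
// Toy stand-ins for Catalyst's DataType and Expression; names are hypothetical.
object ExtractorSketch {
  sealed trait DType
  sealed trait NumericDType extends DType
  case object IntDType extends NumericDType
  case object DoubleDType extends NumericDType
  case object StringDType extends DType

  final case class Expr(name: String, dataType: DType)

  // A Boolean extractor: `NumericLike()` matches when unapply returns true
  // and binds no values.
  object NumericLike {
    def unapply(e: Expr): Boolean = e.dataType.isInstanceOf[NumericDType]
  }

  // `child @ NumericLike()` binds the whole expression when the test passes.
  def describe(e: Expr): String = e match {
    case child @ NumericLike() => s"${child.name} is numeric"
    case other                 => s"${other.name} is not numeric"
  }
}
```

This keeps the type test in one place, so rules can match on "any numeric expression" without repeating `isInstanceOf` checks at each call site.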
|
|
||
|
|
||
| private[sql] object IntegralType extends AbstractDataType { | ||
| /** | ||
| * Enables matching against IntegralType for expressions: | ||
| * {{{ | ||
| * case Cast(child @ IntegralType(), StringType) => | ||
| * ... | ||
| * }}} | ||
| */ | ||
| def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[IntegralType] | ||
|
|
||
| private[sql] override def defaultConcreteType: DataType = IntegerType | ||
| } | ||
|
|
||
|
|
||
| private[sql] abstract class IntegralType extends NumericType { | ||
| private[sql] val integral: Integral[InternalType] | ||
| } | ||
|
|
||
|
|
||
| private[sql] object FractionalType extends AbstractDataType { | ||
| /** | ||
| * Enables matching against FractionalType for expressions: | ||
| * {{{ | ||
| * case Cast(child @ FractionalType(), StringType) => | ||
| * ... | ||
| * }}} | ||
| */ | ||
| def unapply(e: Expression): Boolean = e.dataType.isInstanceOf[FractionalType] | ||
|
|
||
| private[sql] override def defaultConcreteType: DataType = DoubleType | ||
| } | ||
|
|
||
|
|
||
| private[sql] abstract class FractionalType extends NumericType { | ||
| private[sql] val fractional: Fractional[InternalType] | ||
| private[sql] val asIntegral: Integral[InternalType] | ||
| } | ||
According to Hive and the discussion in #6551, should we only allow atomic types (except boolean and binary) to be cast to string?