-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ease of estimation #17630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #75761 has started for PR 17630 at commit |
|
hm this means we will forever need to be able to read the internal format, doesn't it? |
|
@rxin We still use external format for persistence (in metastore). Sorry can you explain more about "forever read the internal format"? |
|
When we update Spark and change the internal format, we'd still need to keep the current implementation. |
|
@rxin Yes, ideally that is better. But the literals in filter conditions are in internal format, we have to do conversion work between them (internal format) and min/max values (external format). If the internal format is changed, we still can't keep the current implementation unchanged. I mean the conversion logic is always there, either we do it in estimation, or in ColumnStat. I think the later is better. |
|
Test build #75772 has finished for PR 17630 at commit
|
| * In the case min/max values are null (None), they won't appear in the map. | ||
| */ | ||
| def toMap: Map[String, String] = { | ||
| def toMap(name: String, dataType: DataType): Map[String, String] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: colName
| def toMap(name: String, dataType: DataType): Map[String, String] = { | ||
| def toExternalString(v: Any, dataType: DataType): String = { | ||
| val externalValue = dataType match { | ||
| case BooleanType => v.toString.toBoolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v.asInstanceOf[Boolean]
| def toExternalString(v: Any, dataType: DataType): String = { | ||
| val externalValue = dataType match { | ||
| case BooleanType => v.toString.toBoolean | ||
| case _: IntegralType => v.toString.toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case t: IntegralType => v.asInstanceOf[t.InternalType]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we want to convert to external format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For IntegralType, internal and external all same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but v.asInstanceOf[t.InternalType] is a a little misleading I think, it reads like we are converting to internal format.
| */ | ||
| def toMap: Map[String, String] = { | ||
| def toMap(name: String, dataType: DataType): Map[String, String] = { | ||
| def toExternalString(v: Any, dataType: DataType): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a top level method like fromExternalString?
| sizeInBytes: BigInt, | ||
| rowCount: Option[BigInt] = None, | ||
| colStats: Map[String, ColumnStat] = Map.empty) { | ||
| colStats: Map[String, (DataType, ColumnStat)] = Map.empty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add DataType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, it can be removed
| val externalValue = dataType match { | ||
| case BooleanType => v.asInstanceOf[Boolean] | ||
| case _: IntegralType => v.toString.toLong | ||
| case DateType => DateTimeUtils.toJavaDate(v.toString.toInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: v.asInstanceOf[Int]
| case BooleanType => v.asInstanceOf[Boolean] | ||
| case _: IntegralType => v.toString.toLong | ||
| case DateType => DateTimeUtils.toJavaDate(v.toString.toInt) | ||
| case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here
| case DateType => DateTimeUtils.toJavaDate(v.toString.toInt) | ||
| case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong) | ||
| case FloatType | DoubleType => v.toString.toDouble | ||
| case _: DecimalType => Decimal.fromDecimal(v).toJavaBigDecimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v.asInstanceOf[Decimal].toJavaBigDecimal
| * data type. | ||
| */ | ||
| private def toExternalString(v: Any, colName: String, dataType: DataType): String = { | ||
| val externalValue = dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we get the externalValue first and then call toString? this means for long we will do l.toString.toLong.toString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea good point. I should use asInstance to replace all these toString/toLong. Then call toString after conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should just return string in each cases of this pattern match.
| private def fromExternalString(s: String, name: String, dataType: DataType): Any = { | ||
| dataType match { | ||
| case BooleanType => s.toBoolean | ||
| case _: IntegralType => s.toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the doc https://github.com/apache/spark/pull/17630/files#diff-a4113ed6a89e8d19a39f5c27ce95658bR79 , this should be int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, seems the doc is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc is fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm.. I think it's better to keep min/max value in ColumnStat just the same as internal type, i.e. don't cast short/int to long. This causes less confusion. Besides, we'll use decimal to unify their comparison and computation logics anyway.
| case class NumericRange(min: JDecimal, max: JDecimal) extends Range { | ||
| case class NumericRange(min: Decimal, max: Decimal) extends Range { | ||
| override def contains(l: Literal): Boolean = { | ||
| val decimal = l.dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we call EstimationUtils.toDecimal here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, fixed
|
Test build #75785 has finished for PR 17630 at commit
|
|
Test build #75797 has finished for PR 17630 at commit
|
|
LGTM, merging to master! |
|
Wait - are we storing UTF8Strings directly in the catalog for statistics? That doesn't make sense ... if we are not, then we are not using internal types. In that case we should document clearly what's happening. My concern is that the internal types are specific to the physical execution path and stats/CBO are independent of that. We can in the future change the internal data types without changing CBO, and completely screw ourselves. If you take into account the future evolution of the system, we'd need some abstraction to shim the internal changes away from CBO anyway. |
@rxin By "in the catalog for statistics", do you mean statistics in metastore? We still use external type for statistics in the metastore. What this pr changed were the types of min/max in
Since literal values are internal, stats/CBO need to be consistent with them to do estimation. So it's hard for CBO to be independent of that. If the internal types are changed in the future, what we need to do is to change the conversion contract (i.e. |
|
Thanks for the explanation. |
…se of estimation ## What changes were proposed in this pull request? Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation. To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore. This pr also fixes a bug for boolean type in `IN` condition. ## How was this patch tested? The changes for ColumnStat are covered by existing tests. For bug fix, a new test for boolean type in IN condition is added Author: wangzhenhua <[email protected]> Closes apache#17630 from wzhfy/refactorColumnStat.
What changes were proposed in this pull request?
Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.
To solve this, we use Catalyst type for min/max in
ColumnStat. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.This pr also fixes a bug for boolean type in
INcondition.How was this patch tested?
The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added