Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Apr 13, 2017

What changes were proposed in this pull request?

Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.

To solve this, we use Catalyst type for min/max in ColumnStat. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.

This pr also fixes a bug for boolean type in IN condition.

How was this patch tested?

The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added

@wzhfy
Copy link
Contributor Author

wzhfy commented Apr 13, 2017

cc @cloud-fan @ron8hu

@SparkQA
Copy link

SparkQA commented Apr 13, 2017

Test build #75761 has started for PR 17630 at commit 4ef05e7.

@rxin
Copy link
Contributor

rxin commented Apr 13, 2017

hm this means we will forever need to be able to read the internal format, doesn't it?

@wzhfy
Copy link
Contributor Author

wzhfy commented Apr 13, 2017

@rxin We still use external format for persistence (in metastore). Sorry can you explain more about "forever read the internal format"?

@rxin
Copy link
Contributor

rxin commented Apr 13, 2017

When we update Spark and change the internal format, we'd still need to keep the current implementation.

@wzhfy
Copy link
Contributor Author

wzhfy commented Apr 13, 2017

@rxin Yes, ideally that is better. But the literals in filter conditions are in internal format, we have to do conversion work between them (internal format) and min/max values (external format). If the internal format is changed, we still can't keep the current implementation unchanged. I mean the conversion logic is always there, either we do it in estimation, or in ColumnStat. I think the later is better.

@SparkQA
Copy link

SparkQA commented Apr 13, 2017

Test build #75772 has finished for PR 17630 at commit 656f6a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* In the case min/max values are null (None), they won't appear in the map.
*/
def toMap: Map[String, String] = {
def toMap(name: String, dataType: DataType): Map[String, String] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: colName

def toMap(name: String, dataType: DataType): Map[String, String] = {
def toExternalString(v: Any, dataType: DataType): String = {
val externalValue = dataType match {
case BooleanType => v.toString.toBoolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v.asInstanceOf[Boolean]

def toExternalString(v: Any, dataType: DataType): String = {
val externalValue = dataType match {
case BooleanType => v.toString.toBoolean
case _: IntegralType => v.toString.toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case t: IntegralType => v.asInstanceOf[t.InternalType]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we want to convert to external format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For IntegralType, internal and external all same

Copy link
Contributor Author

@wzhfy wzhfy Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but v.asInstanceOf[t.InternalType] is a a little misleading I think, it reads like we are converting to internal format.

*/
def toMap: Map[String, String] = {
def toMap(name: String, dataType: DataType): Map[String, String] = {
def toExternalString(v: Any, dataType: DataType): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a top level method like fromExternalString?

sizeInBytes: BigInt,
rowCount: Option[BigInt] = None,
colStats: Map[String, ColumnStat] = Map.empty) {
colStats: Map[String, (DataType, ColumnStat)] = Map.empty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add DataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, it can be removed

val externalValue = dataType match {
case BooleanType => v.asInstanceOf[Boolean]
case _: IntegralType => v.toString.toLong
case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: v.asInstanceOf[Int]

case BooleanType => v.asInstanceOf[Boolean]
case _: IntegralType => v.toString.toLong
case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)
case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar here

case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)
case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong)
case FloatType | DoubleType => v.toString.toDouble
case _: DecimalType => Decimal.fromDecimal(v).toJavaBigDecimal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v.asInstanceOf[Decimal].toJavaBigDecimal

* data type.
*/
private def toExternalString(v: Any, colName: String, dataType: DataType): String = {
val externalValue = dataType match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we get the externalValue first and then call toString? this means for long we will do l.toString.toLong.toString

Copy link
Contributor Author

@wzhfy wzhfy Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea good point. I should use asInstance to replace all these toString/toLong. Then call toString after conversion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should just return string in each cases of this pattern match.

private def fromExternalString(s: String, name: String, dataType: DataType): Any = {
dataType match {
case BooleanType => s.toBoolean
case _: IntegralType => s.toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, seems the doc is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc is fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm.. I think it's better to keep min/max value in ColumnStat just the same as internal type, i.e. don't cast short/int to long. This causes less confusion. Besides, we'll use decimal to unify their comparison and computation logics anyway.

case class NumericRange(min: JDecimal, max: JDecimal) extends Range {
case class NumericRange(min: Decimal, max: Decimal) extends Range {
override def contains(l: Literal): Boolean = {
val decimal = l.dataType match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we call EstimationUtils.toDecimal here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed

@SparkQA
Copy link

SparkQA commented Apr 14, 2017

Test build #75785 has finished for PR 17630 at commit 1a5069d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 14, 2017

Test build #75797 has finished for PR 17630 at commit 195d428.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in fb036c4 Apr 14, 2017
@rxin
Copy link
Contributor

rxin commented Apr 14, 2017

Wait - are we storing UTF8Strings directly in the catalog for statistics? That doesn't make sense ... if we are not, then we are not using internal types. In that case we should document clearly what's happening.

My concern is that the internal types are specific to the physical execution path and stats/CBO are independent of that. We can in the future change the internal data types without changing CBO, and completely screw ourselves.

If you take into account the future evolution of the system, we'd need some abstraction to shim the internal changes away from CBO anyway.

@wzhfy
Copy link
Contributor Author

wzhfy commented Apr 15, 2017

are we storing UTF8Strings directly in the catalog for statistics? That doesn't make sense ... if we are not, then we are not using internal types.

@rxin By "in the catalog for statistics", do you mean statistics in metastore? We still use external type for statistics in the metastore. What this pr changed were the types of min/max in ColumnStat. So we don't have this problem here.

My concern is that the internal types are specific to the physical execution path and stats/CBO are independent of that. We can in the future change the internal data types without changing CBO.

Since literal values are internal, stats/CBO need to be consistent with them to do estimation. So it's hard for CBO to be independent of that. If the internal types are changed in the future, what we need to do is to change the conversion contract (i.e. fromMap and toMap) defined in ColumnStat based on the changes on internal types.

@rxin
Copy link
Contributor

rxin commented Apr 15, 2017

Thanks for the explanation.

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…se of estimation

## What changes were proposed in this pull request?

Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.

To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.

This pr also fixes a bug for boolean type in `IN` condition.

## How was this patch tested?

The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added

Author: wangzhenhua <[email protected]>

Closes apache#17630 from wzhfy/refactorColumnStat.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants