[SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ease of estimation #17630

wzhfy · 2017-04-13T06:54:00Z

What changes were proposed in this pull request?

Currently, when estimating predicates like col > literal or col = literal, we update min or max in the column stats based on the literal value. However, the literal value is of Catalyst type (internal type), while min/max is of external type. For the next predicate, we then have to convert types again to compare and update the column stats. This is awkward and causes many unnecessary conversions during estimation.

To solve this, we use Catalyst type for min/max in ColumnStat. Note that the persistent format in the metastore is still of external type, so there is no inconsistency for statistics in the metastore.
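A minimal sketch of the idea, not the code in this patch: if min/max are kept in the same internal representation as the literal, successive predicates can narrow the range with plain comparisons. It relies on Catalyst representing DateType values internally as Int days since 1970-01-01; DateColumnRange, toInternalDays, applyGreaterThan and applyLessThan are hypothetical names used only for illustration.

```scala
import java.sql.Date

// Hypothetical stand-in for the min/max part of a date column's statistics.
// Both bounds are held in the internal form: Int days since 1970-01-01.
final case class DateColumnRange(min: Int, max: Int)

object InternalRangeSketch {
  // External -> internal conversion, done once when the stats are loaded.
  def toInternalDays(d: Date): Int = d.toLocalDate.toEpochDay.toInt

  // Estimating `col > literal`: the literal is already internal, so the
  // lower bound can be raised without any further type conversion.
  def applyGreaterThan(range: DateColumnRange, literalDays: Int): DateColumnRange =
    range.copy(min = math.max(range.min, literalDays))

  // Estimating `col < literal`: same idea for the upper bound.
  def applyLessThan(range: DateColumnRange, literalDays: Int): DateColumnRange =
    range.copy(max = math.min(range.max, literalDays))
}
```

With external min/max, each of the two apply steps would first have to convert a java.sql.Date back to days before comparing.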

This PR also fixes a bug for boolean type in the IN condition.

How was this patch tested?

The changes for ColumnStat are covered by existing tests.
For the bug fix, a new test for boolean type in the IN condition is added.

wzhfy · 2017-04-13T06:54:55Z

cc @cloud-fan @ron8hu

SparkQA · 2017-04-13T06:57:34Z

Test build #75761 has started for PR 17630 at commit 4ef05e7.

rxin · 2017-04-13T06:58:30Z

Hm, this means we will forever need to be able to read the internal format, doesn't it?

wzhfy · 2017-04-13T07:19:36Z

@rxin We still use the external format for persistence (in the metastore). Sorry, can you explain more about "forever read the internal format"?

rxin · 2017-04-13T07:28:53Z

When we update Spark and change the internal format, we'd still need to keep the current implementation.

wzhfy · 2017-04-13T08:28:59Z

@rxin Yes, ideally that is better. But the literals in filter conditions are in internal format, so we have to convert between them (internal format) and the min/max values (external format). If the internal format is changed, we still can't keep the current implementation unchanged. I mean the conversion logic is always there: either we do it in estimation, or in ColumnStat. I think the latter is better.

SparkQA · 2017-04-13T11:25:52Z

Test build #75772 has finished for PR 17630 at commit 656f6a2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-04-13T14:19:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

   * In the case min/max values are null (None), they won't appear in the map.
   */
-  def toMap: Map[String, String] = {
+  def toMap(name: String, dataType: DataType): Map[String, String] = {


nit: colName

cloud-fan · 2017-04-13T14:20:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+  def toMap(name: String, dataType: DataType): Map[String, String] = {
+    def toExternalString(v: Any, dataType: DataType): String = {
+      val externalValue = dataType match {
+        case BooleanType => v.toString.toBoolean


v.asInstanceOf[Boolean]

cloud-fan · 2017-04-13T14:23:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+    def toExternalString(v: Any, dataType: DataType): String = {
+      val externalValue = dataType match {
+        case BooleanType => v.toString.toBoolean
+        case _: IntegralType => v.toString.toLong


case t: IntegralType => v.asInstanceOf[t.InternalType]

Here we want to convert to external format.

For IntegralType, internal and external are the same.

Yes, but v.asInstanceOf[t.InternalType] is a little misleading I think; it reads like we are converting to internal format.

cloud-fan · 2017-04-13T14:24:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

   */
-  def toMap: Map[String, String] = {
+  def toMap(name: String, dataType: DataType): Map[String, String] = {
+    def toExternalString(v: Any, dataType: DataType): String = {


Make this a top-level method like fromExternalString?

cloud-fan · 2017-04-13T14:26:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
-    colStats: Map[String, ColumnStat] = Map.empty) {
+    colStats: Map[String, (DataType, ColumnStat)] = Map.empty) {


why add DataType?

you are right, it can be removed

cloud-fan · 2017-04-14T04:24:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+    val externalValue = dataType match {
+      case BooleanType => v.asInstanceOf[Boolean]
+      case _: IntegralType => v.toString.toLong
+      case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)


nit: v.asInstanceOf[Int]

cloud-fan · 2017-04-14T04:25:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+      case BooleanType => v.asInstanceOf[Boolean]
+      case _: IntegralType => v.toString.toLong
+      case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)
+      case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong)


similar here

cloud-fan · 2017-04-14T04:25:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+      case DateType => DateTimeUtils.toJavaDate(v.toString.toInt)
+      case TimestampType => DateTimeUtils.toJavaTimestamp(v.toString.toLong)
+      case FloatType | DoubleType => v.toString.toDouble
+      case _: DecimalType => Decimal.fromDecimal(v).toJavaBigDecimal


v.asInstanceOf[Decimal].toJavaBigDecimal

cloud-fan · 2017-04-14T04:27:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+   * data type.
+   */
+  private def toExternalString(v: Any, colName: String, dataType: DataType): String = {
+    val externalValue = dataType match {


Why do we get the externalValue first and then call toString? This means that for long we will do l.toString.toLong.toString.

Yeah, good point. I should use asInstanceOf to replace all these toString/toLong calls, then call toString after conversion.

We should just return a string in each case of this pattern match.
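A minimal sketch of that suggestion, not the merged code: each branch casts the internal value directly and returns the external string, so there is no toString/toLong round trip. ColType and its case objects are hypothetical, simplified stand-ins for Catalyst's DataType hierarchy, and the date branch assumes the internal value is Int days since epoch.

```scala
import java.time.LocalDate

// Hypothetical, simplified stand-ins for Catalyst data types.
sealed trait ColType
case object BoolCol extends ColType
case object IntCol extends ColType
case object LongCol extends ColType
case object DateCol extends ColType    // internal value: Int days since epoch
case object DoubleCol extends ColType
case object BinaryCol extends ColType  // treated as having no min/max in this sketch

object ExternalStringSketch {
  // Every supported case casts the internal value and returns a String directly.
  def toExternalString(v: Any, colName: String, dataType: ColType): String =
    dataType match {
      case BoolCol   => v.asInstanceOf[Boolean].toString
      case IntCol    => v.asInstanceOf[Int].toString
      case LongCol   => v.asInstanceOf[Long].toString
      case DateCol   => LocalDate.ofEpochDay(v.asInstanceOf[Int].toLong).toString
      case DoubleCol => v.asInstanceOf[Double].toString
      case BinaryCol =>
        throw new IllegalArgumentException(
          s"Column $colName of type $dataType has no min/max in this sketch")
    }
}
```

For example, ExternalStringSketch.toExternalString(17167, "d", DateCol) yields the ISO date string for day 17167 since epoch.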

cloud-fan · 2017-04-14T04:28:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala

+  private def fromExternalString(s: String, name: String, dataType: DataType): Any = {
+    dataType match {
+      case BooleanType => s.toBoolean
+      case _: IntegralType => s.toLong


According to the doc https://github.com/apache/spark/pull/17630/files#diff-a4113ed6a89e8d19a39f5c27ce95658bR79 , this should be int.

Oh, it seems the doc is wrong.

The doc is fixed.

Hm, I think it's better to keep the min/max value in ColumnStat exactly the same as the internal type, i.e. don't cast short/int to long. This causes less confusion. Besides, we'll use decimal to unify their comparison and computation logic anyway.

cloud-fan · 2017-04-14T04:30:38Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala

-case class NumericRange(min: JDecimal, max: JDecimal) extends Range {
+case class NumericRange(min: Decimal, max: Decimal) extends Range {
  override def contains(l: Literal): Boolean = {
-    val decimal = l.dataType match {


Shall we call EstimationUtils.toDecimal here?

Good catch, fixed
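A rough sketch of what routing the literal through one shared decimal conversion looks like. This uses scala.math.BigDecimal rather than Spark's Decimal, and toDecimal here is a hypothetical stand-in, not the actual signature of EstimationUtils.toDecimal.

```scala
object DecimalSketch {
  // Hypothetical helper: widen any supported numeric value to one decimal type
  // so that all numeric comparisons share a single code path.
  def toDecimal(v: Any): BigDecimal = v match {
    case b: Byte       => BigDecimal(b.toInt)
    case s: Short      => BigDecimal(s.toInt)
    case i: Int        => BigDecimal(i)
    case l: Long       => BigDecimal(l)
    case f: Float      => BigDecimal(f.toDouble)
    case d: Double     => BigDecimal(d)
    case d: BigDecimal => d
    case other =>
      throw new IllegalArgumentException(s"Not a numeric value: $other")
  }
}

// Simplified stand-in for NumericRange: bounds are already decimals.
final case class NumericRangeSketch(min: BigDecimal, max: BigDecimal) {
  // The literal is converted by the shared helper instead of a local type match.
  def contains(literal: Any): Boolean = {
    val d = DecimalSketch.toDecimal(literal)
    min <= d && d <= max
  }
}
```

Keeping the conversion in one helper means byte, short, int and long values stay in their own internal types in the stats and are only widened at comparison time.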

SparkQA · 2017-04-14T05:21:43Z

Test build #75785 has finished for PR 17630 at commit 1a5069d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-14T09:32:19Z

Test build #75797 has finished for PR 17630 at commit 195d428.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-04-14T11:16:53Z

LGTM, merging to master!

rxin · 2017-04-14T17:25:42Z

Wait - are we storing UTF8Strings directly in the catalog for statistics? That doesn't make sense ... if we are not, then we are not using internal types. In that case we should document clearly what's happening.

My concern is that the internal types are specific to the physical execution path and stats/CBO are independent of that. We can in the future change the internal data types without changing CBO, and completely screw ourselves.

If you take into account the future evolution of the system, we'd need some abstraction to shim the internal changes away from CBO anyway.

wzhfy · 2017-04-15T03:55:31Z

"are we storing UTF8Strings directly in the catalog for statistics? That doesn't make sense ... if we are not, then we are not using internal types."

@rxin By "in the catalog for statistics", do you mean statistics in the metastore? We still use external type for statistics in the metastore. What this PR changed is the type of min/max in ColumnStat. So we don't have this problem here.

"My concern is that the internal types are specific to the physical execution path and stats/CBO are independent of that. We can in the future change the internal data types without changing CBO."

Since literal values are internal, stats/CBO need to be consistent with them to do estimation. So it's hard for CBO to be independent of that. If the internal types are changed in the future, what we need to do is change the conversion contract (i.e. fromMap and toMap) defined in ColumnStat to reflect the changes to the internal types.
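To illustrate the contract described here, a minimal sketch for a date column: the in-memory stat holds internal values, while the persisted map holds external strings, and only toMap/fromMap know about both. DateStatSketch and the "min"/"max" keys are assumptions for illustration, not the actual ColumnStat definition.

```scala
import java.time.LocalDate

// Hypothetical stand-in for a date column's stats: min/max as internal Int days.
final case class DateStatSketch(min: Option[Int], max: Option[Int]) {
  private def fmt(days: Int): String = LocalDate.ofEpochDay(days.toLong).toString

  // Internal -> external strings for persistence; absent values are left out.
  def toMap: Map[String, String] = {
    val entries =
      min.map(d => "min" -> fmt(d)).toList ++ max.map(d => "max" -> fmt(d)).toList
    entries.toMap
  }
}

object DateStatSketch {
  private def parse(s: String): Int = LocalDate.parse(s).toEpochDay.toInt

  // External strings -> internal values when the stats are read back.
  def fromMap(m: Map[String, String]): DateStatSketch =
    DateStatSketch(m.get("min").map(parse), m.get("max").map(parse))
}
```

If the internal representation of dates ever changed, only fmt and parse would need to change in this sketch; the persisted strings would stay the same.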

rxin · 2017-04-15T06:38:22Z

Thanks for the explanation.

Author: wangzhenhua <[email protected]>
Closes apache#17630 from wzhfy/refactorColumnStat.

use Catalyst type in ColumnStat for ease of estimation

4ef05e7

fix boolean type for in set condition

656f6a2

cloud-fan reviewed Apr 13, 2017

fix comments and remove datatype in CatalogStats

1a5069d

cloud-fan reviewed Apr 14, 2017

fix some conversion logic

195d428

asfgit closed this in fb036c4 Apr 14, 2017

GulajavaMinistudio mentioned this pull request Apr 15, 2017

[SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ea… GulajavaMinistudio/spark#17

Merged
