[SPARK-17410] [SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl #14971

gatorsmile · 2016-09-06T07:21:39Z

What changes were proposed in this pull request?

After adding the new field stats to CatalogTable, we should not expose Hive-specific statistics metadata to MetastoreRelation. Doing so complicates all the related code. It also introduces a bug in SHOW CREATE TABLE: the statistics-related table properties should be skipped by SHOW CREATE TABLE, since they could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

This PR also fixes the issue that Hive-generated row counts were not filled into our stats.

This PR handles Hive-specific statistics metadata in HiveClientImpl.

How was this patch tested?

Added a few test cases.

SparkQA · 2016-09-06T09:03:29Z

Test build #64979 has finished for PR 14971 at commit c9cdf44.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-06T09:41:17Z

Test build #64978 has finished for PR 14971 at commit efd879d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-09-11T06:01:35Z

Still need a test case for verifying alter table drop/add partitions

gatorsmile · 2016-09-11T06:59:27Z

Found bugs in the master and 2.0 branches when adding the test case for alter table drop/add partitions. Will try to fix them.

Update: Just realized this is part of the CBO work. See https://issues.apache.org/jira/browse/SPARK-17129. Will not fix it here and will leave it to @wzhfy . Currently, the table-level statistics do not consider whether a partition is included or not, so they do not give the right numbers for the table.

SparkQA · 2016-09-11T07:52:32Z

Test build #65220 has finished for PR 14971 at commit d3dcb56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-09-11T16:20:58Z

... Very surprised about Hive... Any ALTER TABLE SET/UNSET TBLPROPERTIES statement can invalidate the Hive-generated statistics...

hiveClient.runSqlHive(s"ANALYZE TABLE $oldName COMPUTE STATISTICS")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)

Table Parameters:        
    COLUMN_STATS_ACCURATE   true                
    numFiles                1                   
    numRows                 500                 
    rawDataSize             5312                
    spark.sql.statistics.numRows    500                 
    spark.sql.statistics.totalSize  5812                
    totalSize               5812                
    transient_lastDdlTime   1473610039

hiveClient.runSqlHive(s"ALTER TABLE $oldName SET TBLPROPERTIES ('foofoo' = 'a')")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)

Table Parameters:        
    COLUMN_STATS_ACCURATE   false               
    foofoo                  a                   
    last_modified_by        xiaoli              
    last_modified_time      1473610039          
    numFiles                1                   
    numRows                 -1                  
    rawDataSize             -1                  
    spark.sql.statistics.numRows    500                 
    spark.sql.statistics.totalSize  5812                
    totalSize               5812                
    transient_lastDdlTime   1473610039
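
The two DESCRIBE FORMATTED outputs above can be compared mechanically. The following is a minimal sketch, not code from this PR: the object and method names are hypothetical, and the key names are copied from the output above. It reports which Hive statistics held a non-negative value before the ALTER TABLE and no longer do afterwards.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not part of Spark or this PR.
object HiveStatsDiff {
  // Hive stats keys as they appear in the DESCRIBE FORMATTED output above.
  val hiveStatsKeys: Seq[String] = Seq("numFiles", "numRows", "rawDataSize", "totalSize")

  // A value counts as valid when it is present, parses as an integer, and is non-negative.
  private def isValid(props: Map[String, String], key: String): Boolean =
    props.get(key).flatMap(v => Try(BigInt(v.trim)).toOption).exists(_ >= 0)

  /** Keys that were valid in `before` but are negative, missing, or unparseable in `after`. */
  def invalidated(before: Map[String, String], after: Map[String, String]): Seq[String] =
    hiveStatsKeys.filter(k => isValid(before, k) && !isValid(after, k))
}
```

Fed with the values shown above, this flags numRows and rawDataSize, while numFiles and totalSize survive the ALTER TABLE.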

SparkQA · 2016-09-11T23:51:38Z

Test build #65232 has finished for PR 14971 at commit 9e18ba1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-09-12T00:01:51Z

cc @hvanhovell @cloud-fan Now, the code is ready for review.

gatorsmile · 2016-09-12T00:04:33Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

FYI, when we drop partitions of EXTERNAL tables, ANALYZE TABLE is unable to exclude them from statistics. This should be fixed with https://issues.apache.org/jira/browse/SPARK-17129, if my understanding is right.

gatorsmile · 2016-09-14T04:18:02Z

@hvanhovell @cloud-fan Could you help me review this PR? #15090 is changing the same code path for column-level statistics.

Thanks!

gatorsmile · 2016-09-14T04:18:19Z

retest this please

SparkQA · 2016-09-14T06:16:21Z

Test build #65350 has finished for PR 14971 at commit 9e18ba1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-09-14T13:32:06Z

Will this change break existing behaviour? The MetastoreRelation can have table statistics if the hive table is already analyzed.

BTW, I'd like to have this behaviour:

HiveClient.getTable should return a CatalogTable with stats if this table has stats properties of hive.
HiveExternalCatalog.getTable should get the table via hive client, and overwrite the stats if this table has stats properties of spark, i.e. we trust spark rather than hive.

Any ideas?

gatorsmile · 2016-09-16T06:39:11Z

It does not break the existing behavior. If the MetastoreRelation has Hive-generated table statistics, we create the statistics here. If we have Spark-generated statistics, we overwrite the Hive-generated ones in restoreTableMetadata.

Thus, the current code completely matches what you want. : )
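
The precedence rule discussed here (trust Spark's statistics over Hive's when both exist) can be modelled on a plain property map. This is a simplified sketch under stated assumptions, not the actual restoreTableMetadata logic: the TableStats case class and all method names are hypothetical, and the property keys are the ones visible in the DESCRIBE FORMATTED output earlier in this thread.

```scala
import scala.util.Try

// Hypothetical model for illustration; the real classes live in Spark's catalog code.
object StatsPrecedence {
  final case class TableStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

  // Read a non-negative integer property; anything else is treated as "unknown".
  private def parse(props: Map[String, String], key: String): Option[BigInt] =
    props.get(key).flatMap(v => Try(BigInt(v.trim)).toOption).filter(_ >= 0)

  /** Statistics built from the Hive-generated properties only. */
  def fromHive(props: Map[String, String]): Option[TableStats] =
    parse(props, "totalSize").map(size => TableStats(size, parse(props, "numRows")))

  /** Statistics built from the Spark-generated properties only. */
  def fromSpark(props: Map[String, String]): Option[TableStats] =
    parse(props, "spark.sql.statistics.totalSize")
      .map(size => TableStats(size, parse(props, "spark.sql.statistics.numRows")))

  /** Spark-generated statistics win; Hive-generated ones are the fallback. */
  def resolve(props: Map[String, String]): Option[TableStats] =
    fromSpark(props).orElse(fromHive(props))
}
```

With the post-ALTER properties shown earlier (Hive numRows = -1, Spark numRows = 500), resolve returns the Spark row count; with only Hive properties present it returns the Hive size and no row count.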

gatorsmile · 2016-09-16T06:42:12Z

Let me write a test case to ensure this correctly works and also put more comments in the code.

gatorsmile · 2016-09-16T06:46:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala

In the master branch, we do not use Hive-generated numRows... Let me fix it in this PR.

cloud-fan · 2016-09-16T07:59:47Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

how about we only handle the 2 we need? i.e. TOTAL_SIZE and RAW_DATA_SIZE. Then we don't need to do an extra getTable call in alterTable, which may cause performance regression.

Ideally the rule is, we only drop the hive properties that we moved to other places, so that we can reconstruct them without an extra getTable call.

That means we will overwrite the Hive-generated statistics TOTAL_SIZE and RAW_DATA_SIZE with our statistics. This could surprise users who run both Hive and Spark on the same data sets when they issue an alter table from Spark.

If we do not hide the other Hive-specific fields (e.g., NUM_FILES, NUM_PARTITIONS), SHOW CREATE TABLE needs to explicitly exclude them, like what we did in the PR: #14855.

Do you want me to make the changes?

ah I see. So our targets are:

recognize hive statistics, i.e. we should set the CatalogTable.stats according to hive stats properties

don't overwrite hive stats properties.

SHOW CREATE TABLE shouldn't print hive stats properties.

My proposal: In HiveClientImpl, set CatalogTable.stats by hive stats properties, and still keep them in table properties. In SHOW CREATE TABLE, hide the hive stats properties.

CatalogTable has a field unsupportedFeatures, can we extend it to hide this kind of hive specific properties which are only useful in alter table?

Using unsupportedFeatures sounds like a pretty good idea! Let me give it a try. Thanks!

gatorsmile · 2016-09-17T07:07:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

Here, we also utilize Hive-generated row counts when users have not run ANALYZE TABLE through Spark.

SparkQA · 2016-09-17T08:52:28Z

Test build #65526 has finished for PR 14971 at commit 2e4d398.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-17T09:30:53Z

Test build #65529 has finished for PR 14971 at commit 5dfa17e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-18T01:35:51Z

Test build #65546 has finished for PR 14971 at commit 2f40c7f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-18T05:16:20Z

Test build #65548 has finished for PR 14971 at commit 3376bd6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-09-18T06:27:48Z

retest this please

gatorsmile · 2016-09-18T07:57:26Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

Actually, this is a surprise to me. We did not use Hive-generated statistics. I found that the table-level statistics are missing for partitioned Hive serde tables. We need to get the statistics info from the properties of each partition and then add them up. Will submit a separate PR.
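
The "add them up" step mentioned above could look roughly like the sketch below. It is an illustration only and not the follow-up PR: the object and method names are hypothetical, partitions are modelled as plain property maps, and returning None when any partition lacks a usable totalSize is a design assumption (a partial sum would understate the table size).

```scala
import scala.util.Try

// Hypothetical helper for illustration; not the follow-up PR's implementation.
object PartitionStatsSum {
  /**
   * Sums the Hive `totalSize` property across partitions.
   * Returns None for an empty input or when any partition has no usable value.
   */
  def sumTotalSize(partitionProps: Seq[Map[String, String]]): Option[BigInt] = {
    val sizes: Seq[Option[BigInt]] = partitionProps.map { props =>
      props.get("totalSize").flatMap(v => Try(BigInt(v.trim)).toOption).filter(_ >= 0)
    }
    if (sizes.nonEmpty && sizes.forall(_.isDefined)) Some(sizes.flatten.sum) else None
  }
}
```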

SparkQA · 2016-09-18T08:17:37Z

Test build #65553 has finished for PR 14971 at commit 3376bd6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-21T07:37:03Z

Test build #65703 has finished for PR 14971 at commit 7ad08fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-09-22T08:09:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

should we give it a new name? hive stats properties are not unsupported but ignored...

or we may just add a new field

👍 Let me add a new field called ignoredProperties
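
To make the ignoredProperties idea concrete, here is a minimal sketch of splitting raw table properties into visible and ignored ones, and of rendering only the visible ones in a SHOW CREATE TABLE style clause. Everything here is an assumption for illustration: TableProps is a stand-in for CatalogTable, the key list is taken from the DESCRIBE FORMATTED output earlier in this thread rather than from the PR's code, and quoting inside values is not escaped.

```scala
// Hypothetical model for illustration; CatalogTable itself has many more fields.
object IgnoredProps {
  // Hive-generated stats keys, as seen in the DESCRIBE FORMATTED output earlier.
  val hiveStatsKeys: Set[String] =
    Set("COLUMN_STATS_ACCURATE", "numFiles", "numRows", "rawDataSize", "totalSize")

  final case class TableProps(
      properties: Map[String, String],        // shown to users
      ignoredProperties: Map[String, String]) // kept, but hidden from SHOW CREATE TABLE

  /** Moves Hive stats properties out of the user-visible property map. */
  def split(raw: Map[String, String]): TableProps = {
    val (ignored, visible) = raw.partition { case (key, _) => hiveStatsKeys.contains(key) }
    TableProps(visible, ignored)
  }

  /** Renders only the visible properties, sorted by key for stable output. */
  def tblPropertiesClause(table: TableProps): String =
    table.properties.toSeq.sortBy(_._1)
      .map { case (k, v) => s"'$k' = '$v'" }
      .mkString("TBLPROPERTIES (", ", ", ")")
}
```

Keeping the ignored entries in a separate map, instead of dropping them, means they remain available when the table is written back, which is the concern raised about alter table above.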

SparkQA · 2017-05-18T02:18:07Z

Test build #77036 has finished for PR 14971 at commit 22a2c00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-18T02:21:37Z

Test build #77037 has finished for PR 14971 at commit cce31db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-18T04:22:04Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+      // When table is external, `totalSize` is always zero, which will influence join strategy
+      // so when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
+      // return None. Later, we will use the other ways to estimate the statistics.
+      if (totalSize.isDefined && totalSize.get > 0L) {


the indentation is wrong
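
The fallback described in the code comment of this hunk (use totalSize when positive, else rawDataSize when positive, else None) can be written compactly. This is a sketch of that rule only, with a hypothetical object and method name; the rest of the method in HiveClientImpl is not shown in the hunk and is not reproduced here.

```scala
// Hypothetical standalone version of the fallback rule; not the PR's exact code.
object SizeFallback {
  /** Prefers a positive totalSize, then a positive rawDataSize, otherwise None. */
  def sizeInBytes(totalSize: Option[BigInt], rawDataSize: Option[BigInt]): Option[BigInt] =
    totalSize.filter(_ > 0).orElse(rawDataSize.filter(_ > 0))
}
```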

cloud-fan · 2017-05-18T04:25:20Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/ShowCreateTableSuite.scala

    }

+    val e = normalize(actual)
+    val m = normalize(expected)


remove this?

cloud-fan · 2017-05-18T04:25:54Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

      checkTableStats(
        textTable,
-        hasSizeInBytes = false,
+        hasSizeInBytes = true,


why is the behavior changed?

Because now we respect Hive's stats in HiveClientImpl.getTableOption.

Hive will alter totalSize after inserting data.

It sounds like Hive does online stats updates.

wzhfy · 2017-05-18T03:57:51Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+      // TODO: check if this estimate is valid for tables after partition pruning.
+      // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
+      // relatively cheap if parameters for the table are populated into the metastore.
+      // Currently, only totalSize, rawDataSize, and row_count are used to build the field `stats`


nit: rowCount

wzhfy · 2017-05-18T06:25:30Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

+        createNonPartitionedTable(tabName, analyzedByHive = true, analyzedBySpark = analyzedBySpark)
+        val fetchedStats1 = checkTableStats(
+          tabName, hasSizeInBytes = true, expectedRowCounts = Some(500))
+        sql(s"ALTER TABLE $tabName UNSET TBLPROPERTIES ('prop1')")


What's Hive's behavior if we set/unset 'totalSize'?

The prop values are not changed after set/unset in Hive 2.x

wzhfy · 2017-05-18T06:27:58Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala

-    "numRows",
-    "rawDataSize",
-    "totalSize",
-    "totalNumberFiles",


Is totalNumberFiles the same as numFiles?

TextMetaDataFormatter and JsonMetaDataFormatter insert this info based on numFiles.

I think we should keep it unchanged.

wzhfy · 2017-05-18T06:28:15Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

      withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") {
-        checkTableStats(orcTable, hasSizeInBytes = false, expectedRowCounts = None)
+        // We still can get tableSize from Hive before Analyze
+        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)


Orc table has size from Hive, while parquet table doesn't?

A good question. This is from Hive. : ( I did not investigate the root cause inside Hive.

wzhfy · 2017-05-18T06:58:48Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

+      assert(stats2.get.sizeInBytes > stats3.get.sizeInBytes)
+
+      sql(s"ALTER TABLE $managedTable ADD PARTITION (ds='2008-04-08', hr='12')")
+      assert(stats1 == stats2)


wzhfy · 2017-05-18T07:13:29Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+
+      val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
+      val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
+      def rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {


why use def?

only used once. We also can use lazy val

SparkQA · 2017-05-19T00:10:20Z

Test build #77066 has finished for PR 14971 at commit aa9a36e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-19T02:27:21Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

+
+      val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
+      val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
+      lazy val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {


I think we can just use val, no need to bother about performance here.

can be simplified to xxx.filter(_ >= 0)
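
The suggested simplification can be illustrated side by side. The body of the match expression is cut off in the hunk above, so the first method below is an assumed reconstruction (keep non-negative counts, drop the rest), chosen to be consistent with the suggested filter(_ >= 0). The object and method names are hypothetical, and the literal "numRows" stands in for StatsSetupConst.ROW_COUNT.

```scala
// Hypothetical comparison for illustration; the match body is an assumption.
object RowCountParsing {
  /** Pattern-match form: keep the row count only when it is non-negative. */
  def rowCountWithMatch(properties: Map[String, String]): Option[BigInt] =
    properties.get("numRows").map(BigInt(_)) match {
      case Some(count) if count >= 0 => Some(count)
      case _ => None
    }

  /** Equivalent form using Option.filter, as suggested in the review. */
  def rowCountWithFilter(properties: Map[String, String]): Option[BigInt] =
    properties.get("numRows").map(BigInt(_)).filter(_ >= 0)
}
```

Both forms turn Hive's -1 placeholder (seen after ALTER TABLE earlier in this thread) into None. Like the code in the hunk, they still throw on a non-numeric value.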

cloud-fan · 2017-05-19T02:31:38Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

+          tabName, hasSizeInBytes = true, expectedRowCounts = Some(500))
+        assert(fetchedStats1 == fetchedStats2)
+
+        val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client


this appeared many times, we can create a method

SparkQA · 2017-05-19T02:45:01Z

Test build #77071 has finished for PR 14971 at commit 1e4182d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-22T08:22:58Z

LGTM

SparkQA · 2017-05-22T09:31:52Z

Test build #77176 has finished for PR 14971 at commit 2048c97.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-22T14:25:06Z

seems like a valid test failure

SparkQA · 2017-05-22T19:17:43Z

Test build #77196 has started for PR 14971 at commit ea7abd4.

gatorsmile · 2017-05-22T21:17:07Z

retest this please

SparkQA · 2017-05-22T23:43:17Z

Test build #77204 has finished for PR 14971 at commit ea7abd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-23T00:28:53Z

Thanks! Merging to master.

…ntImpl ### What changes were proposed in this pull request? After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 Also fix the issue to fill Hive-generated RowCounts to our stats. This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`. ### How was this patch tested? Added a few test cases. Author: Xiao Li <[email protected]> Closes apache#14971 from gatorsmile/showCreateTableNew.

gatorsmile mentioned this pull request Sep 6, 2016

[SPARK-17284] [SQL] Remove Statistics-related Table Properties from SHOW CREATE TABLE #14855

Closed

gatorsmile reviewed Sep 12, 2016

gatorsmile commented Sep 16, 2016

cloud-fan reviewed Sep 16, 2016

gatorsmile commented Sep 17, 2016

gatorsmile mentioned this pull request Sep 18, 2016

[SPARK-17581] [SQL] Invalidate Statistics After Some ALTER TABLE Commands [WIP] #15136

Closed

gatorsmile commented Sep 18, 2016

cloud-fan reviewed Sep 22, 2016

fix.

c2d8e90

gatorsmile force-pushed the showCreateTableNew branch from 50ce04e to c2d8e90 Compare May 17, 2017 23:54

gatorsmile added 2 commits May 17, 2017 16:56

address comments.

22a2c00

address comments.

cce31db

cloud-fan reviewed May 18, 2017

wzhfy reviewed May 18, 2017

address comments.

aa9a36e

address comments.

1e4182d

cloud-fan reviewed May 19, 2017

address comments.

2048c97

fix.

ea7abd4

asfgit closed this in a2460be May 23, 2017
