Skip to content

Conversation

@gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented Sep 6, 2016

What changes were proposed in this pull request?

After we adding a new field stats into CatalogTable, we should not expose Hive-specific Stats metadata to MetastoreRelation. It complicates all the related codes. It also introduces a bug in SHOW CREATE TABLE. The statistics-related table properties should be skipped by SHOW CREATE TABLE, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

Also fix the issue to fill Hive-generated RowCounts to our stats.

This PR is to handle Hive-specific Stats metadata in HiveClientImpl.

How was this patch tested?

Added a few test cases.

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64979 has finished for PR 14971 at commit c9cdf44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 6, 2016

Test build #64978 has finished for PR 14971 at commit efd879d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

Still need a test case for verifying alter table drop/add partitions

@gatorsmile
Copy link
Member Author

gatorsmile commented Sep 11, 2016

Found bugs in the master and 2.0 branch when adding alter table drop/add partitions. Will try to fix it.

Update: Just realized this is part of CBO work. See https://issues.apache.org/jira/browse/SPARK-17129. Will not fix it here and leave it to @wzhfy . Currently, the table-level statistics does not consider whether the partition is included or not. Thus, it does not provide the right number of table statistics.

@SparkQA
Copy link

SparkQA commented Sep 11, 2016

Test build #65220 has finished for PR 14971 at commit d3dcb56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

... Very surprised about Hive... Any ALTER TABLE SET/UNSET TBLPROPERTIES statements can invalidate the Hive-generated statistics...

hiveClient.runSqlHive(s"ANALYZE TABLE $oldName COMPUTE STATISTICS")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)
Table Parameters:        
    COLUMN_STATS_ACCURATE   true                
    numFiles                1                   
    numRows                 500                 
    rawDataSize             5312                
    spark.sql.statistics.numRows    500                 
    spark.sql.statistics.totalSize  5812                
    totalSize               5812                
    transient_lastDdlTime   1473610039          
hiveClient.runSqlHive(s"ALTER TABLE $oldName SET TBLPROPERTIES ('foofoo' = 'a')")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)
Table Parameters:        
    COLUMN_STATS_ACCURATE   false               
    foofoo                  a                   
    last_modified_by        xiaoli              
    last_modified_time      1473610039          
    numFiles                1                   
    numRows                 -1                  
    rawDataSize             -1                  
    spark.sql.statistics.numRows    500                 
    spark.sql.statistics.totalSize  5812                
    totalSize               5812                
    transient_lastDdlTime   1473610039          

@SparkQA
Copy link

SparkQA commented Sep 11, 2016

Test build #65232 has finished for PR 14971 at commit 9e18ba1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

cc @hvanhovell @cloud-fan Now, the code is ready for review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, when we drop partitions of EXTERNAL tables, ANALYZE TABLE is unable to exclude them from statistics. This should be fixed with https://issues.apache.org/jira/browse/SPARK-17129, if my understanding is right.

@gatorsmile
Copy link
Member Author

@hvanhovell @cloud-fan Could you help me review this PR? #15090 is changing the same code path for column-level statistics.

Thanks!

@gatorsmile
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65350 has finished for PR 14971 at commit 9e18ba1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Will this change break existing behaviour? The MetastoreRelation can have table statistics if the hive table is already analyzed.

BTW, I'd like to have this behaviour:

  1. HiveClient.getTable should return a CatalogTable with stats if this table has stats properties of hive.
  2. HiveExternalCatalog.getTable should get the table via hive client, and overwrite the stats if this table has stats properties of spark, i.e. we trust spark rather than hive.

Any ideas?

@gatorsmile
Copy link
Member Author

It does not break the existing behavior. If the MetastoreRelation has the Hive-generated table statistics, we create a statistics here. If we have Spark-generated statistics, we overwrite the hive-generated one in restoreTableMetadata.

Thus, the current code completely matches what you wants. : )

@gatorsmile
Copy link
Member Author

Let me write a test case to ensure this correctly works and also put more comments in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the master branch, we do not use Hive-generated numRows... Let me fix it in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we only handle the 2 we need? i.e. TOTAL_SIZE and RAW_DATA_SIZE. Then we don't need to do an extra getTable call in alterTable, which may cause performance regression.

Ideally the rule is, we only drop the hive properties that we moved to other places, so that we can reconstruct them without an extra getTable call.

Copy link
Member Author

@gatorsmile gatorsmile Sep 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means, we will overwrite the Hive-generated statistics TOTAL_SIZE and RAW_DATA_SIZE by our statistics. This could be a surprise to the users who are using both Hive and Spark on the same data sets, when they issue an alter table from Spark.

If we do not hide the other Hive-specific fields (e.g., NUM_FILES, NUM_PARTITIONS), SHOW CREATE TABLE needs to explicitly exclude them, like what we did in the PR: #14855.

Do you want me to make the changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see. So our targets are:

  1. recognize hive statistics, i.e. we should set the CatalogTable.stats according to hive stats properties
  2. don't overwrite hive stats properties.
  3. SHOW CREATE TABLE shouldn't print hive stats properties.

My proposal: In HiveClientImpl, set CatalogTable.stats by hive stats properties, and still keep them in table properties. In SHOW CREATE TABLE, hide the hive stats properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CatalogTable have a field unsupportedFeatures, can we extend it to hide this kind of hive specific properties which are only useful in alter table?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unsupportedFeatures sounds a pretty good idea! Let me make a try. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we also utilize Hive-generated row counts when users have not run ANALYZE TABLE through Spark.

@SparkQA
Copy link

SparkQA commented Sep 17, 2016

Test build #65526 has finished for PR 14971 at commit 2e4d398.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2016

Test build #65529 has finished for PR 14971 at commit 5dfa17e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65546 has finished for PR 14971 at commit 2f40c7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65548 has finished for PR 14971 at commit 3376bd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

Copy link
Member Author

@gatorsmile gatorsmile Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is a surprise to me. We did not use Hive-generated statistics. I found the table-level statistics is missing for partitioned Hive serde tables. We need to get the statistics info from the properties for each partition and then add them up. Will submit a separate PR.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65553 has finished for PR 14971 at commit 3376bd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 21, 2016

Test build #65703 has finished for PR 14971 at commit 7ad08fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we give it a new name? hive stats properties are not unsupported but ignored...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we may just add a new field

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Let me add a new field called ignoredProperties

@gatorsmile gatorsmile force-pushed the showCreateTableNew branch from 50ce04e to c2d8e90 Compare May 17, 2017 23:54
@SparkQA
Copy link

SparkQA commented May 18, 2017

Test build #77036 has finished for PR 14971 at commit 22a2c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 18, 2017

Test build #77037 has finished for PR 14971 at commit cce31db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// When table is external, `totalSize` is always zero, which will influence join strategy
// so when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero,
// return None. Later, we will use the other ways to estimate the statistics.
if (totalSize.isDefined && totalSize.get > 0L) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the indention is wrong

}

val e = normalize(actual)
val m = normalize(expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this?

checkTableStats(
textTable,
hasSizeInBytes = false,
hasSizeInBytes = true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the behavior is changed?

Copy link
Contributor

@wzhfy wzhfy May 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because now we respect Hive's stats in HiveClientImpl.getTableOption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive will alter totalSize after inserting data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like Hive does online stats updates.

// TODO: check if this estimate is valid for tables after partition pruning.
// NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
// relatively cheap if parameters for the table are populated into the metastore.
// Currently, only totalSize, rawDataSize, and row_count are used to build the field `stats`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rowCount

createNonPartitionedTable(tabName, analyzedByHive = true, analyzedBySpark = analyzedBySpark)
val fetchedStats1 = checkTableStats(
tabName, hasSizeInBytes = true, expectedRowCounts = Some(500))
sql(s"ALTER TABLE $tabName UNSET TBLPROPERTIES ('prop1')")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's Hive's behavior if we set/unset 'totalSize'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prop values are not changed after set/unset in Hive 2.x

"numRows",
"rawDataSize",
"totalSize",
"totalNumberFiles",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is totalNumberFiles the same as numFiles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextMetaDataFormatter and JsonMetaDataFormatter insert these info based on numFiles.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it unchanged.

withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") {
checkTableStats(orcTable, hasSizeInBytes = false, expectedRowCounts = None)
// We still can get tableSize from Hive before Analyze
checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orc table has size from Hive, while parquet table doesn't?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good question. This is from Hive. : ( I did not investigate the root cause inside Hive.

assert(stats2.get.sizeInBytes > stats3.get.sizeInBytes)

sql(s"ALTER TABLE $managedTable ADD PARTITION (ds='2008-04-08', hr='12')")
assert(stats1 == stats2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant?


val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
def rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use def?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only used once. We also can use lazy val

@SparkQA
Copy link

SparkQA commented May 19, 2017

Test build #77066 has finished for PR 14971 at commit aa9a36e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
lazy val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think we can just use val, no need to bother about performance here.
  2. can be simplified to xxx.filter(_ >= 0)

tabName, hasSizeInBytes = true, expectedRowCounts = Some(500))
assert(fetchedStats1 == fetchedStats2)

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appeared many times, we can create a method

@SparkQA
Copy link

SparkQA commented May 19, 2017

Test build #77071 has finished for PR 14971 at commit 1e4182d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77176 has finished for PR 14971 at commit 2048c97.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

seems like a valid test failure

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77196 has started for PR 14971 at commit ea7abd4.

@gatorsmile
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented May 22, 2017

Test build #77204 has finished for PR 14971 at commit ea7abd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

Thanks! Merging to master.

@asfgit asfgit closed this in a2460be May 23, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…ntImpl

### What changes were proposed in this pull request?

After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

Also fix the issue to fill Hive-generated RowCounts to our stats.

This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`.
### How was this patch tested?

Added a few test cases.

Author: Xiao Li <[email protected]>

Closes apache#14971 from gatorsmile/showCreateTableNew.
dilipbiswal pushed a commit to dilipbiswal/spark that referenced this pull request Aug 4, 2017
…ntImpl

### What changes were proposed in this pull request?

After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

Also fix the issue to fill Hive-generated RowCounts to our stats.

This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`.
### How was this patch tested?

Added a few test cases.

Author: Xiao Li <[email protected]>

Closes apache#14971 from gatorsmile/showCreateTableNew.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants