-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17410] [SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl #14971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #64979 has finished for PR 14971 at commit
|
|
Test build #64978 has finished for PR 14971 at commit
|
|
Still need a test case for verifying alter table drop/add partitions |
|
Found bugs in the master and 2.0 branch when adding alter table drop/add partitions. Will try to fix it. Update: Just realized this is part of CBO work. See https://issues.apache.org/jira/browse/SPARK-17129. Will not fix it here and leave it to @wzhfy . Currently, the table-level statistics does not consider whether the partition is included or not. Thus, it does not provide the right number of table statistics. |
|
Test build #65220 has finished for PR 14971 at commit
|
|
... Very surprised about Hive... Any hiveClient.runSqlHive(s"ANALYZE TABLE $oldName COMPUTE STATISTICS")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)hiveClient.runSqlHive(s"ALTER TABLE $oldName SET TBLPROPERTIES ('foofoo' = 'a')")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println) |
|
Test build #65232 has finished for PR 14971 at commit
|
|
cc @hvanhovell @cloud-fan Now, the code is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, when we drop partitions of EXTERNAL tables, ANALYZE TABLE is unable to exclude them from statistics. This should be fixed with https://issues.apache.org/jira/browse/SPARK-17129, if my understanding is right.
|
@hvanhovell @cloud-fan Could you help me review this PR? #15090 is changing the same code path for column-level statistics. Thanks! |
|
retest this please |
|
Test build #65350 has finished for PR 14971 at commit
|
|
Will this change break existing behaviour? The BTW, I'd like to have this behaviour:
Any ideas? |
|
It does not break the existing behavior. If the MetastoreRelation has the Hive-generated table statistics, we create a statistics here. If we have Spark-generated statistics, we overwrite the hive-generated one in restoreTableMetadata. Thus, the current code completely matches what you wants. : ) |
|
Let me write a test case to ensure this correctly works and also put more comments in the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the master branch, we do not use Hive-generated numRows... Let me fix it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we only handle the 2 we need? i.e. TOTAL_SIZE and RAW_DATA_SIZE. Then we don't need to do an extra getTable call in alterTable, which may cause performance regression.
Ideally the rule is, we only drop the hive properties that we moved to other places, so that we can reconstruct them without an extra getTable call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means, we will overwrite the Hive-generated statistics TOTAL_SIZE and RAW_DATA_SIZE by our statistics. This could be a surprise to the users who are using both Hive and Spark on the same data sets, when they issue an alter table from Spark.
If we do not hide the other Hive-specific fields (e.g., NUM_FILES, NUM_PARTITIONS), SHOW CREATE TABLE needs to explicitly exclude them, like what we did in the PR: #14855.
Do you want me to make the changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see. So our targets are:
- recognize hive statistics, i.e. we should set the
CatalogTable.statsaccording to hive stats properties - don't overwrite hive stats properties.
- SHOW CREATE TABLE shouldn't print hive stats properties.
My proposal: In HiveClientImpl, set CatalogTable.stats by hive stats properties, and still keep them in table properties. In SHOW CREATE TABLE, hide the hive stats properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CatalogTable have a field unsupportedFeatures, can we extend it to hide this kind of hive specific properties which are only useful in alter table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using unsupportedFeatures sounds a pretty good idea! Let me make a try. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we also utilize Hive-generated row counts when users have not run ANALYZE TABLE through Spark.
|
Test build #65526 has finished for PR 14971 at commit
|
|
Test build #65529 has finished for PR 14971 at commit
|
|
Test build #65546 has finished for PR 14971 at commit
|
|
Test build #65548 has finished for PR 14971 at commit
|
|
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is a surprise to me. We did not use Hive-generated statistics. I found the table-level statistics is missing for partitioned Hive serde tables. We need to get the statistics info from the properties for each partition and then add them up. Will submit a separate PR.
|
Test build #65553 has finished for PR 14971 at commit
|
|
Test build #65703 has finished for PR 14971 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we give it a new name? hive stats properties are not unsupported but ignored...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we may just add a new field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Let me add a new field called ignoredProperties
50ce04e to
c2d8e90
Compare
|
Test build #77036 has finished for PR 14971 at commit
|
|
Test build #77037 has finished for PR 14971 at commit
|
| // When table is external, `totalSize` is always zero, which will influence join strategy | ||
| // so when `totalSize` is zero, use `rawDataSize` instead. When `rawDataSize` is also zero, | ||
| // return None. Later, we will use the other ways to estimate the statistics. | ||
| if (totalSize.isDefined && totalSize.get > 0L) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the indention is wrong
| } | ||
|
|
||
| val e = normalize(actual) | ||
| val m = normalize(expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this?
| checkTableStats( | ||
| textTable, | ||
| hasSizeInBytes = false, | ||
| hasSizeInBytes = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the behavior is changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because now we respect Hive's stats in HiveClientImpl.getTableOption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive will alter totalSize after inserting data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like Hive does online stats updates.
| // TODO: check if this estimate is valid for tables after partition pruning. | ||
| // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be | ||
| // relatively cheap if parameters for the table are populated into the metastore. | ||
| // Currently, only totalSize, rawDataSize, and row_count are used to build the field `stats` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rowCount
| createNonPartitionedTable(tabName, analyzedByHive = true, analyzedBySpark = analyzedBySpark) | ||
| val fetchedStats1 = checkTableStats( | ||
| tabName, hasSizeInBytes = true, expectedRowCounts = Some(500)) | ||
| sql(s"ALTER TABLE $tabName UNSET TBLPROPERTIES ('prop1')") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's Hive's behavior if we set/unset 'totalSize'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prop values are not changed after set/unset in Hive 2.x
| "numRows", | ||
| "rawDataSize", | ||
| "totalSize", | ||
| "totalNumberFiles", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is totalNumberFiles the same as numFiles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextMetaDataFormatter and JsonMetaDataFormatter insert these info based on numFiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep it unchanged.
| withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") { | ||
| checkTableStats(orcTable, hasSizeInBytes = false, expectedRowCounts = None) | ||
| // We still can get tableSize from Hive before Analyze | ||
| checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orc table has size from Hive, while parquet table doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good question. This is from Hive. : ( I did not investigate the root cause inside Hive.
| assert(stats2.get.sizeInBytes > stats3.get.sizeInBytes) | ||
|
|
||
| sql(s"ALTER TABLE $managedTable ADD PARTITION (ds='2008-04-08', hr='12')") | ||
| assert(stats1 == stats2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant?
|
|
||
| val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) | ||
| val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) | ||
| def rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use def?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only used once. We also can use lazy val
|
Test build #77066 has finished for PR 14971 at commit
|
|
|
||
| val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) | ||
| val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) | ||
| lazy val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think we can just use val, no need to bother about performance here.
- can be simplified to
xxx.filter(_ >= 0)
| tabName, hasSizeInBytes = true, expectedRowCounts = Some(500)) | ||
| assert(fetchedStats1 == fetchedStats2) | ||
|
|
||
| val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appeared many times, we can create a method
|
Test build #77071 has finished for PR 14971 at commit
|
|
LGTM |
|
Test build #77176 has finished for PR 14971 at commit
|
|
seems like a valid test failure |
|
Test build #77196 has started for PR 14971 at commit |
|
retest this please |
|
Test build #77204 has finished for PR 14971 at commit
|
|
Thanks! Merging to master. |
…ntImpl ### What changes were proposed in this pull request? After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 Also fix the issue to fill Hive-generated RowCounts to our stats. This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`. ### How was this patch tested? Added a few test cases. Author: Xiao Li <[email protected]> Closes apache#14971 from gatorsmile/showCreateTableNew.
…ntImpl ### What changes were proposed in this pull request? After we adding a new field `stats` into `CatalogTable`, we should not expose Hive-specific Stats metadata to `MetastoreRelation`. It complicates all the related codes. It also introduces a bug in `SHOW CREATE TABLE`. The statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 Also fix the issue to fill Hive-generated RowCounts to our stats. This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`. ### How was this patch tested? Added a few test cases. Author: Xiao Li <[email protected]> Closes apache#14971 from gatorsmile/showCreateTableNew.
What changes were proposed in this pull request?
After we adding a new field
statsintoCatalogTable, we should not expose Hive-specific Stats metadata toMetastoreRelation. It complicates all the related codes. It also introduces a bug inSHOW CREATE TABLE. The statistics-related table properties should be skipped bySHOW CREATE TABLE, since it could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792Also fix the issue to fill Hive-generated RowCounts to our stats.
This PR is to handle Hive-specific Stats metadata in
HiveClientImpl.How was this patch tested?
Added a few test cases.