[SPARK-17072] [SQL] support table-level statistics generation and storing into/loading from metastore #14712
Conversation
What is the difference with the other statistics field? Could you give an example?
Why not just use the existing statistics field?
In this way, we can separate the CBO path from the original path when we add a conf for CBO in follow-up tasks. If CBO is enabled, we use completeStats; if it is off, we still use the original statistics.
When CBO is mature in the future, we can remove the original statistics and use the new completeStats instead.
You can merge the two even now. In the worst case they can be controlled via a config?
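For illustration, a minimal sketch of what "controlled via a config" could look like; the conf key and the two stats placeholders are assumptions, not the PR's actual API:

// Hypothetical: one merged stats field, with detailed stats consulted only when a CBO flag is on.
val cboEnabled = spark.conf.get("spark.sql.cbo.enabled", "false").toBoolean
val planStats =
  if (cboEnabled) statsFromMetastore   // e.g. Statistics(sizeInBytes = 5812, rowCount = Some(500))
  else sizeOnlyEstimate                // e.g. Statistics(sizeInBytes = 5812)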
OK, I've updated this PR based on your comments.
Test build #3227 has finished for PR 14712 at commit
Except noscan, are there other options we may support in the future?
Yeah, there are. AFAIK we will also support column-level statistics.
@cloud-fan noscan won't scan files; it only collects statistics like total size. Without noscan, we will collect other stats like row count and column-level stats.
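For illustration, the two variants side by side (the table name src is just a placeholder, not from this PR):

sql("ANALYZE TABLE src COMPUTE STATISTICS noscan") // cheap: collects size-related stats only
sql("ANALYZE TABLE src COMPUTE STATISTICS")        // scans the data: also collects row count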
So far, the test coverage is weak. Could we add more test cases to cover all the corner cases? Thanks!
If the goal also includes interoperability with Hive, the test cases should also verify whether the table properties (e.g. totalSize) are updated correctly.
Before calling ANALYZE TABLE, the table properties already have the latest values of numFiles and totalSize. Thus, totalSize is not added/updated by ANALYZE TABLE.
A high-level question: it looks like the current design depends on some features of the Hive metastore, e.g. the
@cloud-fan @gatorsmile Actually, we desperately need Spark SQL to have its own metastore, because we need to persist statistics like histograms, which AFAIK the Hive metastore doesn't support.
@wzhfy I am kind of worried about the dependency on the Hive metastore.
This is a design decision we need to make. @rxin @yhuai @hvanhovell @liancheng @cloud-fan @clockfly
My proposal is: like data source table metadata, we store the table statistics using different names from Hive; if a statistic is Hive-compatible, like row count, we also store the corresponding Hive entries. In this way, we won't be affected by possible Hive metastore bugs, and Hive can also recognize table statistics generated by Spark. When we read in a Hive table whose statistics already exist but in Hive format, we can generate the corresponding Spark SQL entries. Then Spark SQL can also recognize table statistics generated by Hive.
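A rough sketch of this dual-entry idea (the property keys below are illustrative, not necessarily the ones the PR actually writes; the values reuse numbers from the tests further down):

val numRows = 500L
val totalSize = 5812L
// Spark-specific entries, insulated from Hive metastore quirks:
val sparkEntries = Map(
  "spark.sql.statistics.numRows"   -> numRows.toString,
  "spark.sql.statistics.totalSize" -> totalSize.toString)
// Hive-compatible mirrors for the stats Hive understands, so Hive can read them too:
val hiveEntries = Map(
  "numRows"   -> numRows.toString,
  "totalSize" -> totalSize.toString)
val tableProperties = sparkEntries ++ hiveEntries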
I like @cloud-fan's proposal. : ) When the values of these two copies are different, which one is preferred? a) Hive: if we prefer Hive's version, we might be affected by Hive bugs. b) Spark: if we choose Spark's copy, we might lose the benefit of Hive-generated statistics. Do we add a new configuration parameter, with option a as the default? (I also do not like adding any extra parameter.)
If it is a Hive table, I think we should respect Hive's statistics.
Not related to this PR, but it looks like AnalyzeTableCommand doesn't handle the possible NoSuchTableException thrown by sessionState.catalog.lookupRelation. It would be better to handle it and provide a clear error message.
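A hedged sketch of the suggested handling, assuming it sits inside AnalyzeTableCommand where sessionState and tableIdent are already in scope:

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

val relation =
  try {
    sessionState.catalog.lookupRelation(tableIdent)
  } catch {
    case _: NoSuchTableException =>
      throw new AnalysisException(s"Table $tableIdent not found; ANALYZE TABLE requires an existing table")
  }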
I suggest that at the current stage we still follow Hive's convention. When Spark SQL has its own metastore, we can bridge the two metastores with a mapping between the two different sets of names/data structures, and then provide a config for users to declare their preference.
Spark SQL already has its own metastore: InMemoryCatalog.
I think target 1 is more important, and we do need an implementation that does not depend on Hive features.
We store table statistics in table properties; why would the Hive metastore not support that? Do you mean Hive can't recognize it? But I think that's OK; we should not limit our table statistics to what Hive supports.
@cloud-fan The above comment is outside the scope of this PR. A table property is a string-to-string map, which means we need to transform every statistic into/from a string. This is OK for simple table statistics like "numRows", but not a good choice for complicated column statistics like histograms.
InMemoryCatalog is "in memory"; we can have our own properties or data structures there, but currently we still need the Hive metastore API to persist these statistics. Hive has APIs for storing/loading table properties and ColumnStats, but no API for histograms. What I'm trying to say is: we can use Hive as a persistence layer, but what we can store/load is still limited by its API. Of course we can put everything into properties, but it's not elegant.
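To make the limitation concrete, a sketch of what stringly-typed persistence forces; the property keys and bucket encoding here are made up purely for illustration:

import scala.collection.mutable

val props = mutable.Map.empty[String, String]
// a scalar statistic round-trips through strings easily:
props("spark.sql.statistics.numRows") = 500L.toString
// a histogram has no dedicated Hive API, so it would need an ad-hoc encoding,
// e.g. one property per bucket serialized as "lo,hi,count":
val histogram = Seq((0L, 10L, 120L), (10L, 20L, 380L))
histogram.zipWithIndex.foreach { case ((lo, hi, count), i) =>
  props(s"spark.sql.statistics.colStats.c1.histogram.$i") = s"$lo,$hi,$count"
}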
Test build #64789 has finished for PR 14712 at commit
@cloud-fan @hvanhovell @gatorsmile Please review again, thanks!
createTime: Long = System.currentTimeMillis,
lastAccessTime: Long = -1,
properties: Map[String, String] = Map.empty,
stats: Option[Statistics] = None,
nit: we should also update toString to include stats.
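Something along these lines, assuming toString accumulates its field strings in a buffer (the buffer name here is assumed, not the actual code):

// inside CatalogTable.toString, next to the other optional fields:
stats.foreach { s =>
  output += s"Statistics: sizeInBytes=${s.sizeInBytes}" +
    s.rowCount.map(c => s", rowCount=$c").getOrElse("")
}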
@cloud-fan and also simpleString in LogicalRelation?
LogicalRelation doesn't need to be updated, I think.
Looks pretty good now! I left some comments about some small issues. Thanks for working on it!
Test build #64848 has finished for PR 14712 at commit
retest this please
Test build #3244 has finished for PR 14712 at commit
Test build #64850 has finished for PR 14712 at commit
LGTM, @hvanhovell can you take another look?
| sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS") | ||
| checkMetastoreRelationStats(textTable, expectedStats = | ||
| Some(Statistics(sizeInBytes = 5812, rowCount = Some(500)))) | ||
|
|
Just here, could you add a few lines?

sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
// when the total size is not changed, the old row count is kept
checkMetastoreRelationStats(textTable, expectedStats =
  Some(Statistics(sizeInBytes = 5812, rowCount = Some(500))))
Below is a test case for a table with zero columns. Could you also add it here?

test("statistics collection of a table with zero column") {
  val table_no_cols = "table_no_cols"
  withTable(table_no_cols) {
    val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
    val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
    dfNoCols.write.format("json").saveAsTable(table_no_cols)
    sql(s"ANALYZE TABLE $table_no_cols COMPUTE STATISTICS")
    checkLogicalRelationStats(table_no_cols, expectedStats =
      Some(Statistics(sizeInBytes = 30, rowCount = Some(10))))
  }
}

In the future, we will do column-level statistics collection. This might help you when you implement collection of column-level statistics.
LGTM except two minor comments about test cases.
@gatorsmile Thank you for the good test cases!
Test build #64886 has finished for PR 14712 at commit
LGTM. Merging to master. Thanks!
I have created https://issues.apache.org/jira/browse/SPARK-17408. @wzhfy Can you take a look?
// noscan won't count the number of rows
sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
checkMetastoreRelationStats(textTable, expectedStats =
  Some(Statistics(sizeInBytes = 5812, rowCount = None)))
Sorry I missed this; we should avoid hardcoding nondeterministic values (like file size) in tests. For this case, we only need to make sure the first sizeInBytes is greater than 0, and that the second sizeInBytes is equal to the first one.
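A sketch of that assertion style (fetchStats is a hypothetical stand-in for however the suite reads the stats back):

sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS")
val scannedSize = fetchStats(textTable).sizeInBytes
assert(scannedSize > 0)
sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
// the noscan run should not change the size computed by the full scan
assert(fetchStats(textTable).sizeInBytes == scannedSize)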
This is probably caused by Hive's hive.exec.compress.output; try setting it to false. I do agree with @cloud-fan that equality testing in these cases is very brittle.
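For example, in the test setup (assuming the setting is forwarded to Hive; not verified here):

sql("SET hive.exec.compress.output=false") // keep output files uncompressed so sizes stay stable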
These tests block multiple PRs. It is midnight in China. : ) Let me do a quick fix based on the comments of @cloud-fan and @hvanhovell.
@yhuai @hvanhovell @cloud-fan Sorry for the late response; I was out of the office for two days.
What changes were proposed in this pull request?

Support table-level statistics generation for:
- hive tables in HiveExternalCatalog
- data source tables in HiveExternalCatalog
- data source tables in InMemoryCatalog

Add a new stats field in CatalogTable to hold statistics on the Spark side, and store them into / load them from the metastore through HiveExternalCatalog. Currently we support rowCount (will add estimatedSize when we have column stats).

How was this patch tested?

Added unit tests.