[SPARK-17072] [SQL] support table-level statistics generation and storing into/loading from metastore #14712

wzhfy · 2016-08-19T06:29:15Z

What changes were proposed in this pull request?

Support generating table-level statistics for:
- Hive tables in HiveExternalCatalog
- data source tables in HiveExternalCatalog
- data source tables in InMemoryCatalog.
Add a property "stats" to CatalogTable to hold statistics on the Spark side.
Put the logic for converting statistics between Spark and Hive in HiveExternalCatalog.
Extend the Statistics class by adding rowCount (estimatedSize will be added once we have column stats).
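
As a rough illustration of the data-model change described above, the sketch below models a table entry that carries optional statistics. The types are simplified stand-ins, not the actual Spark classes: the field names `sizeInBytes`, `rowCount` and `stats` follow this PR, while `TableEntry` and `describe` are hypothetical names introduced only for this example.

```scala
object StatsModelSketch {
  // Simplified stand-in for the extended Statistics class:
  // the size is always present, the row count only when it has been computed.
  final case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // Hypothetical, pared-down table entry; the real CatalogTable has many more fields.
  final case class TableEntry(
      name: String,
      properties: Map[String, String] = Map.empty,
      stats: Option[Statistics] = None)

  // Render the statistics for display, distinguishing the three possible states.
  def describe(t: TableEntry): String = t.stats match {
    case Some(Statistics(size, Some(rows))) => s"${t.name}: $size bytes, $rows rows"
    case Some(Statistics(size, None))       => s"${t.name}: $size bytes, row count unknown"
    case None                               => s"${t.name}: no statistics"
  }
}
```

Keeping `stats` as an `Option` lets a caller tell "never analyzed" apart from "analyzed, but row count not collected".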

How was this patch tested?

add unit tests

scwf · 2016-08-19T06:34:21Z

/cc @cloud-fan @rxin @hvanhovell

hvanhovell · 2016-08-19T07:11:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

What is the difference with the other statistics field? Could you give an example?

why not just use the existing statistics field?

This way, we can separate the CBO path from the original path when we add a conf for CBO in follow-up tasks. If CBO is enabled, we use completeStats; if it is off, we still use the original statistics.
When CBO matures, we can remove the original statistics and use the new completeStats instead.

You can merge the two even now. In the worst case they can be controlled via a config?

ok, i've updated this pr based on your comments

SparkQA · 2016-08-19T09:02:52Z

Test build #3227 has finished for PR 14712 at commit 4375e76.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(
- case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand

cloud-fan · 2016-08-19T13:20:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

Besides noscan, are there other options we may support in the future?

Yeah there are. AFAIK we will also support column level statistics.

@cloud-fan noscan won't scan files, it only collects statistics like total size. Without noscan, we will collect other stats like row count and column level stats.
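
The difference between the two modes can be sketched in plain Scala. This is a minimal model of the behavior described in the comment above, not the actual AnalyzeTableCommand: `DataFile`, `lengthInBytes` and `analyze` are hypothetical names, and a "file" here is just a recorded length plus a function that reads its rows on demand.

```scala
object AnalyzeSketch {
  final case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt])

  // Hypothetical file handle: the length is known from metadata,
  // while the rows are only produced when the function is called.
  final case class DataFile(lengthInBytes: Long, rows: () => Iterator[String])

  def analyze(files: Seq[DataFile], noscan: Boolean): Statistics = {
    // The total size needs only file metadata in both modes.
    val size = files.map(f => BigInt(f.lengthInBytes)).sum
    // Only a full scan reads the rows; with noscan the row count stays unknown.
    val rowCount =
      if (noscan) None
      else Some(files.map(f => BigInt(f.rows().size)).sum)
    Statistics(size, rowCount)
  }
}
```

With `noscan = true` the `rows` functions are never invoked, which mirrors the point that noscan does not read the files.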

gatorsmile · 2016-08-19T17:22:52Z

So far, the test coverage is weak. Could we add more test cases to cover all the corner cases? Thanks!

gatorsmile · 2016-08-19T20:07:25Z

If the goal also includes the interoperability with Hive, the test cases should also verify whether the table property COLUMN_STATS_ACCURATE is true or not. This should be implicitly updated by Hive.

gatorsmile · 2016-08-19T20:40:27Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

Before calling ANALYZE TABLE, table properties already have the latest values of numFiles and totalSize. Thus, totalSize is not added/updated by ANALYZE TABLE.

cloud-fan · 2016-08-20T05:06:45Z

A high-level question: it looks like the current design depends on some features of the Hive metastore, e.g. the STATS_GENERATED_VIA_STATS_TASK flag. Could we just treat the Hive metastore as a persistence layer, so that the statistics still work if Spark SQL has its own metastore in the future?

wzhfy · 2016-08-20T14:58:46Z

@cloud-fan @gatorsmile Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

gatorsmile · 2016-08-20T15:27:37Z

@wzhfy I am kind of worried about the dependency on the Hive metastore. I see three options:

- Use the same table property names as Hive: the Hive metastore could change them in an unexpected way (for example, through a Hive metastore bug), and the behavior might differ between metastore versions. However, the statistics can then be shared by Hive execution and Spark execution, which benefits Spark users who run Hive and Spark together on the same set of tables.
- Use different names, as we did for the data source table schema: Spark fully controls them and we just treat the Hive metastore as persistent storage for these statistics.
- Provide a configuration parameter so that users can choose which option they prefer.

This is a design decision we need to make. @rxin @yhuai @hvanhovell @liancheng @cloud-fan @clockfly

cloud-fan · 2016-08-20T15:43:42Z

My proposal: as with data source table metadata, we store the table statistics under names different from Hive's. If a statistic is Hive-compatible, like row count, we also store the corresponding Hive entry. This way, we won't be affected by possible Hive metastore bugs, and Hive can still recognize table statistics generated by Spark.

When we read in a Hive table, if its statistics already exist but only in Hive format, we can generate the corresponding Spark SQL entries. Then Spark SQL can also recognize table statistics generated by Hive.
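
A minimal sketch of this dual-entry idea, assuming statistics are kept in a string-to-string property map. The Hive-side keys `totalSize` and `numRows` are the names mentioned elsewhere in this thread; the Spark-side key names, the helper names, and the choice to prefer the Spark copy on read are assumptions made for illustration only (which copy should win is still an open question in this discussion).

```scala
object StatsPropertiesSketch {
  final case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // Hive-side property names, as referenced in this thread.
  val HiveTotalSize = "totalSize"
  val HiveNumRows = "numRows"
  // Spark-side property names: hypothetical, chosen for this sketch.
  val SparkTotalSize = "spark.sql.statistics.totalSize"
  val SparkNumRows = "spark.sql.statistics.numRows"

  // Write both copies: Spark-owned keys plus the Hive-compatible ones.
  def toProperties(s: Statistics): Map[String, String] = {
    val size = Map(
      SparkTotalSize -> s.sizeInBytes.toString,
      HiveTotalSize -> s.sizeInBytes.toString)
    val rows = s.rowCount.toSeq.flatMap { n =>
      Seq(SparkNumRows -> n.toString, HiveNumRows -> n.toString)
    }
    size ++ rows
  }

  // Parse a numeric property; malformed or negative values count as missing (an assumption).
  private def parse(props: Map[String, String], key: String): Option[BigInt] =
    props.get(key).flatMap(v => scala.util.Try(BigInt(v)).toOption).filter(_ >= 0)

  // Read the Spark copy first and fall back to the Hive entries when it is absent.
  def fromProperties(props: Map[String, String]): Option[Statistics] = {
    val size = parse(props, SparkTotalSize).orElse(parse(props, HiveTotalSize))
    val rows = parse(props, SparkNumRows).orElse(parse(props, HiveNumRows))
    size.map(Statistics(_, rows))
  }
}
```

A table analyzed only by Hive would carry just the Hive keys, and `fromProperties` would still return statistics for it through the fallback.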

gatorsmile · 2016-08-20T15:52:11Z

I like @cloud-fan 's proposal. : )

When the values of these two copies are different, which one is preferred? a) Hive: if we prefer Hive's version, we might be affected by Hive. b) Spark: if we choose Spark's copy, we might lose the benefit of Hive-generated statistics. Adding a new configuration parameter? The default is Option a? (I also do not like adding any extra parameter)

viirya · 2016-08-21T03:04:34Z

If it is a hive table, I think we should respect hive's statistics.

viirya · 2016-08-21T03:07:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala

Not related to this PR, but it looks like AnalyzeTableCommand doesn't handle the possible NoSuchTableException thrown by sessionState.catalog.lookupRelation. It would be better to handle it and provide a clear error message.
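
One way such handling could look, sketched with stand-in types rather than the real Spark classes. Everything here is hypothetical: the exception class is a local substitute with the same name, the catalog is a plain map from table name to size in bytes, and the error message wording is invented for the example.

```scala
object AnalyzeLookupSketch {
  // Local stand-in for the catalog's "table not found" exception.
  final class NoSuchTableException(table: String)
      extends Exception(s"Table or view '$table' not found")

  // Hypothetical catalog lookup: table name -> size in bytes, throwing when absent.
  def lookupRelation(catalog: Map[String, Long], table: String): Long =
    catalog.getOrElse(table, throw new NoSuchTableException(table))

  // Wrap the lookup so that a missing table becomes a readable error
  // instead of an uncaught exception.
  def analyze(catalog: Map[String, Long], table: String): Either[String, Long] =
    try Right(lookupRelation(catalog, table))
    catch {
      case e: NoSuchTableException =>
        Left(s"ANALYZE TABLE failed: ${e.getMessage}")
    }
}
```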

wzhfy · 2016-08-21T04:08:51Z

I suggest in the current stage, we still follow hive's convention. When spark sql has its own metastore, we can bridge these two metastores by a mapping between two different sets of names/data structures, and then provide a config for users to declare their preference.

cloud-fan · 2016-08-21T04:34:05Z

Spark SQL already has its own metastore: InMemoryCatalog. And we do have an abstraction for the metastore: ExternalCatalog. We have 2 targets here:

1. Add table statistics in Spark SQL.
2. Spark SQL and Hive should recognize table statistics from each other.

I think target 1 is more important, and we do need an implementation that does not depend on Hive features.

> Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

We store table statistics in table properties, why would hive metastore not support it? Do you mean Hive can't recognize it? But I think it's ok, we should not limit our table statistics by what Hive supports.

wzhfy · 2016-08-21T08:54:39Z

> Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

@cloud-fan The above comment is outside the scope of this PR.

Table property is a string-string map, which means we need to transform every statistic into/from a string. This is ok for simple table statistics like "numRows", but not a good choice for complicated column statistics like histograms.

InMemoryCatalog is "in memory", we can have our own properties or data structures, but currently we still need Hive metastore api to “persist” these statistics. Hive has APIs for storing/loading table properties and ColumnStats, but no api for histograms.

What I'm trying to say is: we can use Hive as a persistence layer, but what we can store/load is still limited by its API. Of course we can put everything into properties, but it's not elegant.

SparkQA · 2016-09-01T17:02:18Z

Test build #64789 has finished for PR 14712 at commit b6c655a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2016-09-02T01:33:22Z

@cloud-fan @hvanhovell @gatorsmile Please review again, thanks!

cloud-fan · 2016-09-02T06:51:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

    createTime: Long = System.currentTimeMillis,
    lastAccessTime: Long = -1,
    properties: Map[String, String] = Map.empty,
+    stats: Option[Statistics] = None,


nit: we should also update toString to include stats.

@cloud-fan and also simpleString in LogicalRelation?

LogicalRelation doesn't need to be updated I think.

cloud-fan · 2016-09-02T07:28:56Z

looks pretty good now! I left some comments about some small issues, thanks for working on it!

SparkQA · 2016-09-02T10:54:18Z

Test build #64848 has finished for PR 14712 at commit b946df0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2016-09-02T11:21:23Z

retest this please

SparkQA · 2016-09-02T13:12:53Z

Test build #3244 has finished for PR 14712 at commit b946df0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-02T13:36:10Z

Test build #64850 has finished for PR 14712 at commit b946df0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-09-02T14:17:46Z

LGTM, @hvanhovell can you take another look?

gatorsmile · 2016-09-02T16:17:58Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

+      sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS")
+      checkMetastoreRelationStats(textTable, expectedStats =
+        Some(Statistics(sizeInBytes = 5812, rowCount = Some(500))))
+


Just here, could you add a few lines?

    sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
    // when the total size is not changed, the old row count is kept
    checkMetastoreRelationStats(textTable, expectedStats =
      Some(Statistics(sizeInBytes = 5812, rowCount = Some(500))))

gatorsmile · 2016-09-02T16:53:25Z

Below is a test case for a table with zero column. Could you also add it here?

  test("statistics collection of a table with zero column") {
    val table_no_cols = "table_no_cols"
    withTable(table_no_cols) {
      val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
      val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
      dfNoCols.write.format("json").saveAsTable(table_no_cols)
      sql(s"ANALYZE TABLE $table_no_cols COMPUTE STATISTICS")
      checkLogicalRelationStats(table_no_cols, expectedStats =
        Some(Statistics(sizeInBytes = 30, rowCount = Some(10))))
    }
  }

In the future, we will do column-level statistics collection. This might help you when you implement collection of column-level statistics.

gatorsmile · 2016-09-02T16:53:43Z

LGTM except two minor comments about test cases.

wzhfy · 2016-09-03T01:30:47Z

@gatorsmile Thank you for the good test cases!

SparkQA · 2016-09-03T03:38:58Z

Test build #64886 has finished for PR 14712 at commit 5d6e559.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-05T15:31:58Z

LGTM. Merging to master. Thanks!

yhuai · 2016-09-06T04:17:21Z

Can you take a look at the test at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64956/testReport/junit/org.apache.spark.sql.hive/StatisticsSuite/test_statistics_of_LogicalRelation_converted_from_MetastoreRelation/? It is flaky.

yhuai · 2016-09-06T04:18:50Z

I have created https://issues.apache.org/jira/browse/SPARK-17408. @wzhfy Can you take a look?

cloud-fan · 2016-09-06T04:41:32Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

+      // noscan won't count the number of rows
+      sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
+      checkMetastoreRelationStats(textTable, expectedStats =
+        Some(Statistics(sizeInBytes = 5812, rowCount = None)))


Sorry I missed this. We should avoid hardcoding nondeterministic values (like file size) in tests. For this case, we only need to make sure that the first sizeInBytes is greater than 0 and that the second sizeInBytes is equal to the first one.

This is probably caused by Hive's hive.exec.compress.output; try setting this to false. I do agree with @cloud-fan, that equality testing in these cases is very brittle.
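
The relative check suggested above can be sketched as a small predicate. This is an illustration under stated assumptions, not the fix that was eventually committed: `Statistics` is a simplified stand-in and `sizesConsistent` is a hypothetical helper name.

```scala
object StableStatsCheck {
  final case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // True when the first run reports a positive size and the second run
  // reports exactly the same size; no literal byte count is involved.
  def sizesConsistent(first: Statistics, second: Statistics): Boolean =
    first.sizeInBytes > 0 && second.sizeInBytes == first.sizeInBytes
}
```

A test would then assert `sizesConsistent(statsAfterFirstAnalyze, statsAfterSecondAnalyze)` and check the row count separately, so that differences in output compression or file layout do not change the outcome.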

gatorsmile · 2016-09-06T17:29:31Z

These tests block multiple PRs. It is midnight in China. : ) Let me do a quick fix based on the comments of @cloud-fan and @hvanhovell

wzhfy · 2016-09-07T01:25:32Z

@yhuai @hvanhovell @cloud-fan Sorry for the late response, I was out of office for two days.
@gatorsmile Thanks for fixing it!

hvanhovell reviewed Aug 19, 2016

wzhfy force-pushed the tableStats branch from 4375e76 to 26c3a0e Compare August 19, 2016 09:09

cloud-fan reviewed Aug 19, 2016

gatorsmile reviewed Aug 19, 2016

viirya reviewed Aug 21, 2016

cloud-fan reviewed Sep 2, 2016

update based on comments

b946df0

gatorsmile reviewed Sep 2, 2016

add test cases

5d6e559

asfgit closed this in 6d86403 Sep 5, 2016

cloud-fan reviewed Sep 6, 2016

gatorsmile mentioned this pull request Jan 31, 2018

[SPARK-23203][SQL] make DataSourceV2Relation immutable #20448

Closed
