Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Aug 19, 2016

What changes were proposed in this pull request?

  1. Support generation table-level statistics for
    • hive tables in HiveExternalCatalog
    • data source tables in HiveExternalCatalog
    • data source tables in InMemoryCatalog.
  2. Add a property "stats" in CatalogTable to hold statistics in Spark side.
  3. Put logics of statistics transformation between Spark and Hive in HiveExternalCatalog.
  4. Extend Statistics class by adding rowCount (will add estimatedSize when we have column stats).

How was this patch tested?

add unit tests

@scwf
Copy link
Contributor

scwf commented Aug 19, 2016

/cc @cloud-fan @rxin @hvanhovell

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference with the other statistics field? Could you give an example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use the existing statistics field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this way, we can separate cbo path from the original path when we add a conf for cbo in following tasks. If cbo is open, we use completeStats, if it is off, we still use the original statistics.
When cbo is mature in the future, we can remove the original statistics and use the new completeStats instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge the two even now. In the worst case they can be controlled via a config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i've updated this pr based on your comments

@SparkQA
Copy link

SparkQA commented Aug 19, 2016

Test build #3227 has finished for PR 14712 at commit 4375e76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(
    • case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except noscan, are there some other options we may support in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah there are. AFAIK we will also support column level statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan noscan won't scan files, it only collects statistics like total size. Without noscan, we will collect other stats like row count and column level stats.

@gatorsmile
Copy link
Member

So far, the test coverage is weak. Could we add more test cases to cover all the corner cases? Thanks!

@gatorsmile
Copy link
Member

If the goal also includes the interoperability with Hive, the test cases should also verify whether the table property COLUMN_STATS_ACCURATE is true or not. This should be implicitly updated by Hive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before calling ANALYZE TABLE, table properties already have the latest values of numFiles and totalSize. Thus, totalSize is not added/updated by ANALYZE TABLE.

@cloud-fan
Copy link
Contributor

a high-level question: Looks like the current design depends on some features of hive metastore, e.g. the STATS_GENERATED_VIA_STATS_TASK flag. Is it possible that we just treat hive metastore as a persistent level? So that the statistics can still work if Spark SQL has its own metastore in the future.

@wzhfy
Copy link
Contributor Author

wzhfy commented Aug 20, 2016

@cloud-fan @gatorsmile Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

@gatorsmile
Copy link
Member

@wzhfy I am kind of worried about the dependency of Hive metastore.

  • Using the same table property names like Hive: Hive metastore could change them in an unexpected way (for example, by a Hive metastore bug). The behaviors might be different in different metastore versions. However, the statistics can be shared by Hive execution and Spark execution. Thus, it could benefit the Spark users who are using Hive and Spark together on the same set of tables.
  • Using the different names, like what we did for data source table schema: Spark fully controls them and we just treat the Hive metastore as a persistent storage of these statistics.
  • Provided a configuration parameter so that users can control which option they preferred?

This is a design decision we need to make. @rxin @yhuai @hvanhovell @liancheng @cloud-fan @clockfly

@cloud-fan
Copy link
Contributor

My proposal is: Like data source table metadata, we store the table statistics using different names from hive, if the statistics is hive compatible, like row count, we also store the corresponding hive entries. In this way, we won't be affected by possible hive metastore bugs, and hive can also recognize table statistics generated by spark.

When we read in a hive table, if its statistics already exists but in hive format, we can generate the corresponding spark sql entries. Then spark sql can also recognize table statistics generated by hive.

@gatorsmile
Copy link
Member

I like @cloud-fan 's proposal. : )

When the values of these two copies are different, which one is preferred? a) Hive: if we prefer Hive's version, we might be affected by Hive. b) Spark: if we choose Spark's copy, we might lose the benefit of Hive-generated statistics. Adding a new configuration parameter? The default is Option a? (I also do not like adding any extra parameter)

@viirya
Copy link
Member

viirya commented Aug 21, 2016

If it is a hive table, I think we should respect hive's statistics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but looks like AnalyzeTableCommand doesn't handle the possible NoSuchTableException caused by sessionState.catalog.lookupRelation. It should be better to handle it and provide error message.

@wzhfy
Copy link
Contributor Author

wzhfy commented Aug 21, 2016

I suggest in the current stage, we still follow hive's convention. When spark sql has its own metastore, we can bridge these two metastores by a mapping between two different sets of names/data structures, and then provide a config for users to declare their preference.

@cloud-fan
Copy link
Contributor

Spark SQL already has its own metastore: InMemoryCatalog. And we do have an abstraction for metasotre: ExternalCatalog. We have 2 targets here:

  1. add table statistics in Spark SQL
  2. Spark SQL and Hive should recognize table statistics from each other.

I think target 1 is more important, and we do need an implementation that not depend on hive features.

Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

We store table statistics in table properties, why would hive metastore not support it? Do you mean Hive can't recognize it? But I think it's ok, we should not limit our table statistics by what Hive supports.

@wzhfy
Copy link
Contributor Author

wzhfy commented Aug 21, 2016

Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support.

@cloud-fan The above comment is out of the range of this pr.

Table property is a string-string map, which means we need to transform every statistic into/from a string. This is ok for simple table statistics like "numRows", but not a good choice for complicated column statistics like histograms.

InMemoryCatalog is "in memory", we can have our own properties or data structures, but currently we still need Hive metastore api to “persist” these statistics. Hive has APIs for storing/loading table properties and ColumnStats, but no api for histograms.

What I'm trying to say is, we can use hive as a persistent level, but what we can store/load is still limited by its api. Of course we can put everything into properties, but it's not elegant.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64789 has finished for PR 14712 at commit b6c655a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 2, 2016

@cloud-fan @hvanhovell @gatorsmile Please review again, thanks!

createTime: Long = System.currentTimeMillis,
lastAccessTime: Long = -1,
properties: Map[String, String] = Map.empty,
stats: Option[Statistics] = None,
Copy link
Contributor

@cloud-fan cloud-fan Sep 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should also update toString to include stats.

Copy link
Contributor Author

@wzhfy wzhfy Sep 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan and also simpleString in LogicalRelation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogicalRelation doesn't need to be updated I think.

@cloud-fan
Copy link
Contributor

looks pretty good now! I left some comments about some small issues, thanks for working on it!

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64848 has finished for PR 14712 at commit b946df0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 2, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #3244 has finished for PR 14712 at commit b946df0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 2, 2016

Test build #64850 has finished for PR 14712 at commit b946df0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, @hvanhovell can you take another look?

sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS")
checkMetastoreRelationStats(textTable, expectedStats =
Some(Statistics(sizeInBytes = 5812, rowCount = Some(500))))

Copy link
Member

@gatorsmile gatorsmile Sep 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just here, could you add a few lines?

      sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
      // when the total size is not changed, the old row count is kept
      checkMetastoreRelationStats(textTable, expectedStats =
        Some(Statistics(sizeInBytes = 5812, rowCount = Some(500))))

@gatorsmile
Copy link
Member

gatorsmile commented Sep 2, 2016

Below is a test case for a table with zero column. Could you also add it here?

  test("statistics collection of a table with zero column") {
    val table_no_cols = "table_no_cols"
    withTable(table_no_cols) {
      val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
      val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
      dfNoCols.write.format("json").saveAsTable(table_no_cols)
      sql(s"ANALYZE TABLE $table_no_cols COMPUTE STATISTICS")
      checkLogicalRelationStats(table_no_cols, expectedStats =
        Some(Statistics(sizeInBytes = 30, rowCount = Some(10))))
    }
  }

In the future, we will do column-level statistics collection. This might help you when you implement collection of column-level statistics.

@gatorsmile
Copy link
Member

LGTM except two minor comments about test cases.

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 3, 2016

@gatorsmile Thank you for the good test cases!

@SparkQA
Copy link

SparkQA commented Sep 3, 2016

Test build #64886 has finished for PR 14712 at commit 5d6e559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

LGTM. Merging to master. Thanks!

@asfgit asfgit closed this in 6d86403 Sep 5, 2016
@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

@yhuai
Copy link
Contributor

yhuai commented Sep 6, 2016

I have created https://issues.apache.org/jira/browse/SPARK-17408. @wzhfy Can you take a look?

// noscan won't count the number of rows
sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
checkMetastoreRelationStats(textTable, expectedStats =
Some(Statistics(sizeInBytes = 5812, rowCount = None)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I missed this, we should avoid hardcode nondeterministic values(like file size) in test, for this case, we only need to make sure the first sizeInBytes is greater than 0, and the second sizeInBytes is equal to the first one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably caused by Hive's hive.exec.compress.output; try setting this to false. I do agree with @cloud-fan, that equality testing in these cases is very brittle.

@gatorsmile
Copy link
Member

These tests block multiple PRs. It is midnight in China. : ) Let me do a quick fix based on the comments of @cloud-fan and @hvanhovell

@wzhfy
Copy link
Contributor Author

wzhfy commented Sep 7, 2016

@yhuai @hvanhovell @cloud-fan Sorry for the late response, I'm out of office for two days.
@gatorsmile Thanks for fixing it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants