[SPARK-27403][SQL] Fix updateTableStats to update table stats always with new stats or None
#24315
Conversation
… auto update feature is enabled (spark.sql.statistics.size.autoUpdate.enabled=true)

What changes were proposed in this pull request?
When the user sets spark.sql.statistics.size.autoUpdate.enabled=true, table statistics for INSERT OVERWRITE commands should be computed automatically and recorded in the metastore. This does not happen today because of the table.stats.nonEmpty validation: statistics are never recorded for a newly created table, and that check does not hold when the auto update feature is enabled. As part of the fix, the autoSizeUpdateEnabled check has been pulled up into a separate validation, which ensures that when the feature is enabled the system calculates the table size on every insert command and records it in the metastore.

How was this patch tested?
A unit test was added and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).
/** Change statistics after changing data by commands. */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  if (table.stats.nonEmpty) {
Because of this condition check, the insert table command is never able to calculate the table size, even if the user enables sparkSession.sessionState.conf.autoSizeUpdateEnabled.
This check only holds good when autoSizeUpdateEnabled is false, in which case the size is calculated from the Hadoop relation.
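For context, a minimal sketch of the reordered check is shown below, assuming the SessionCatalog and CommandUtils helpers (getTableMetadata, calculateTotalSize, alterTableStats) that surround this code; it illustrates the proposed ordering and is not a verbatim copy of the patch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
import org.apache.spark.sql.execution.command.CommandUtils

/** Change statistics after changing data by commands (sketch of the reordered checks). */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  val catalog = sparkSession.sessionState.catalog
  if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
    // Auto update enabled: recompute the size from the files and record it,
    // even for a freshly created table that has no statistics yet.
    val newTable = catalog.getTableMetadata(table.identifier)
    val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
    catalog.alterTableStats(table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
  } else if (table.stats.nonEmpty) {
    // Auto update disabled: only invalidate the now-stale statistics.
    catalog.alterTableStats(table.identifier, None)
  }
}
```

With the branch order flipped, a table that never had statistics still gets its size recorded on insert when the flag is on, while the existing behavior of dropping stale stats is preserved when the flag is off.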
Thank you for pinging me, @sujith71955. Got it. I'll take a look. BTW, I updated the JIRA ID in the title.
Sure. Thanks.
Please review and let me know if there are any suggestions/clarifications. Thanks.
Test build #104362 has finished for PR 24315 at commit
val autoUpdate = true
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
  withTable(table) {
    sql(s"CREATE TABLE $table (i int, j string) STORED AS PARQUET")
STORED AS is only supported in the Hive module. Please use USING PARQUET.
Right, I updated it. Thanks for the input.
wangyum left a comment
Please see the previous discussion: https://github.com/apache/spark/pull/20430/files#r164662154
@wangyum: I am not removing the table.stats.nonEmpty validation here; it is intact. The scenario mentioned by Zhenhua is a valid concern, and this PR does not have any impact on the scenario he mentioned.
I hope I have clarified your point.
Test build #104371 has finished for PR 24315 at commit
retest this please
Test build #104398 has finished for PR 24315 at commit
Seems to be a false alarm; it is not relevant to this PR. I will re-trigger the build.
retest this please
Test build #104419 has finished for PR 24315 at commit
retest this please
Test build #104428 has finished for PR 24315 at commit
@dongjoon-hyun Fixed the review comment; please let me know if there are any further clarifications/suggestions. Thanks.
| test("auto gather stats after insert command") { | ||
| val table = "change_stats_insert_datasource_table" | ||
| val autoUpdate = true | ||
| withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) { |
Thank you for updating, @sujith71955. Here are two issues:
1. We need to test both cases by using Seq(false, true).foreach { autoUpdate =>
2. The current indentation is wrong at 343.
If you fix (1), (2) will be fixed together.
Handled the comment. Thanks.
// analyze to get initial stats
// insert into command
sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
val stats = getCatalogTable(table).stats
Also, fix the indentation here, too.
Fixed. Thanks.
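Putting the two suggestions together (looping over both flag values and using USING PARQUET), the test plausibly ends up shaped like the sketch below; it assumes the suite's existing helpers such as withSQLConf, withTable, sql, and getCatalogTable are in scope, and the assertions are illustrative rather than a copy of the committed test.

```scala
test("auto gather stats after insert command") {
  val table = "change_stats_insert_datasource_table"
  Seq(false, true).foreach { autoUpdate =>
    withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
      withTable(table) {
        sql(s"CREATE TABLE $table (i int, j string) USING PARQUET")
        // insert into command
        sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
        val stats = getCatalogTable(table).stats
        if (autoUpdate) {
          // auto update on: the size is recorded even though the table never had stats
          assert(stats.isDefined)
          assert(stats.get.sizeInBytes > 0)
        } else {
          // auto update off: a table without prior stats still has none
          assert(stats.isEmpty)
        }
      }
    }
  }
}
```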
Test build #104489 has finished for PR 24315 at commit
@dongjoon-hyun All comments are handled. Thanks a lot for the valuable inputs. Let me know if there are any further suggestions/clarifications.
+1, LGTM. Merged to master. Thank you, @sujith71955.
cc @gatorsmile and @cloud-fan
Changed the title from "updateTableStats to update table stats with new stats or None" to "updateTableStats to update table stats always with new stats or None".
…s with new stats or None
Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 239082d) Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to branch-2.4, too.
What changes were proposed in this pull request?
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when the feature is enabled.
Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev
As part of the fix, the autoSizeUpdateEnabled validation is done first, so that the system calculates the table size for the user automatically and records it in the metastore, as the user expects.
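As a usage illustration (the spark variable is an active SparkSession, and the table names here are invented for the example), enabling the flag and running an insert should leave refreshed size statistics in the metastore:

```scala
// Enable automatic size updates for commands that change table data.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")

// An insert now recomputes the table's total size and writes it to the metastore.
spark.sql("INSERT INTO TABLE sales SELECT * FROM staging_sales")

// The Statistics entry (sizeInBytes) in the extended description should reflect the new size.
spark.sql("DESCRIBE TABLE EXTENDED sales").show(truncate = false)
```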
How was this patch tested?
A unit test was added and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).
Before fix / after fix: screenshots attached to the original PR (not reproduced here).