Conversation

@sujith71955 (Contributor)
What changes were proposed in this pull request?

The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are computed automatically by default when the feature is enabled.
Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev

As part of the fix, the autoSizeUpdateEnabled validation is performed first, so that the system calculates the table size automatically and records it in the metastore, as the user expects.

How was this patch tested?

A unit test was added, and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).

Before fix: (screenshot)

After fix: (screenshot)

…auto update feature is enabled (spark.sql.statistics.size.autoUpdate.enabled=true)

What changes were proposed in this pull request?

For a table, INSERT OVERWRITE statistics should be computed automatically when the user sets spark.sql.statistics.size.autoUpdate.enabled=true, and the statistics should be recorded in the metastore. This does not happen currently because of the table.stats.nonEmpty validation: statistics are never recorded for a newly created table, and that check does not hold when the user enables the auto-update feature.
As part of the fix, the autoSizeUpdateEnabled check has been pulled up into a separate validation, which ensures that when this feature is enabled the system calculates the table size on every insert command and records it in the metastore.

How was this patch tested?
A unit test was added, and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).
@sujith71955 changed the title from "[SPARK-274034][SQL] Table Statisics shall be updated automatically if auto update feature is enabled(spark.sql.statistics.size.autoUpdate.enabled =true)" to "[SPARK-274034][SQL] Table Statisics shall be updated automatically if auto size update feature is enabled(spark.sql.statistics.size.autoUpdate.enabled =true)" on Apr 7, 2019.

/** Change statistics after changing data by commands. */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  if (table.stats.nonEmpty) {
@sujith71955 (Contributor, Author) commented Apr 7, 2019:

Because of this condition check, the insert table command can never calculate the table size, even if the user enables sparkSession.sessionState.conf.autoSizeUpdateEnabled.
The check holds good when autoSizeUpdateEnabled is false, in which case it ensures the size is calculated from the Hadoop relation.
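A rough sketch of the reworked check this PR describes, with the autoSizeUpdateEnabled branch pulled up so newly created tables are no longer skipped. This is not the exact patch: CommandUtils.calculateTotalSize, CatalogStatistics, and the surrounding wiring are Spark internals, and the structure here is approximate.

```scala
/** Change statistics after changing data by commands. */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  // Enter whenever the table already has stats OR auto size update is enabled,
  // instead of gating everything on table.stats.nonEmpty alone.
  if (table.stats.nonEmpty || sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
    val catalog = sparkSession.sessionState.catalog
    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
      // Recompute the table size and record it in the metastore.
      val newTable = catalog.getTableMetadata(table.identifier)
      val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
      catalog.alterTableStats(table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
    } else {
      // Auto update disabled: invalidate the now-stale stats instead.
      catalog.alterTableStats(table.identifier, None)
    }
  }
}
```

With this shape, an insert into a freshly created table updates the recorded size when the feature is on, while the old invalidation behavior is preserved when it is off.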

Member:

Thank you for pinging me, @sujith71955. Got it. I'll take a look. BTW, I updated the JIRA ID in the title.

Contributor (Author):

Sure. Thanks.

@sujith71955 (Contributor, Author):
Please review and let me know if you have any suggestions/clarifications. Thanks.

@dongjoon-hyun changed the title from "[SPARK-274034][SQL] Table Statisics shall be updated automatically if auto size update feature is enabled(spark.sql.statistics.size.autoUpdate.enabled =true)" to "[SPARK-27403][SQL] Table Statisics shall be updated automatically if auto size update feature is enabled(spark.sql.statistics.size.autoUpdate.enabled =true)" on Apr 7, 2019.
@SparkQA

SparkQA commented Apr 7, 2019

Test build #104362 has finished for PR 24315 at commit b143e84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val autoUpdate = true
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
  withTable(table) {
    sql(s"CREATE TABLE $table (i int, j string) STORED AS PARQUET")
Member:

STORED AS is only supported in the Hive module. Please use USING PARQUET.

Contributor (Author):

Right, I updated it. Thanks for the input.

@wangyum (Member) left a comment.

@sujith71955 (Contributor, Author) commented Apr 8, 2019:

Please see the previous discussion: https://github.com/apache/spark/pull/20430/files#r164662154

@wangyum: I am not removing the table.stats.nonEmpty validation here; it is intact. The scenario Zhenhua mentioned is a valid concern, and this PR has no impact on it. If the user enables the auto size update feature, they expect the system to calculate the size automatically, which was not happening; this PR addresses that issue. Other logic is intact. Thanks.

@sujith71955 (Contributor, Author):

Hope I clarified your point.

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104371 has finished for PR 24315 at commit f87639e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 (Contributor, Author):
retest this please

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104398 has finished for PR 24315 at commit f87639e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 (Contributor, Author):

Seems to be a false alarm; it is not relevant to this PR. I will re-trigger the build.

@sujith71955 (Contributor, Author):

retest this please
retest this please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104419 has finished for PR 24315 at commit f87639e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 (Contributor, Author):
retest this please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104428 has finished for PR 24315 at commit f87639e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 (Contributor, Author):
@dongjoon-hyun Fixed per the review comment. Please let me know if you need any further clarifications/suggestions. Thanks.

test("auto gather stats after insert command") {
  val table = "change_stats_insert_datasource_table"
  val autoUpdate = true
  withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
Member:

Thank you for updating, @sujith71955 . Here are two issues.

  1. We need to test both cases by using Seq(false, true).foreach { autoUpdate =>
  2. The current indentation is wrong at 343.

If you fix (1), (2) will be fixed together.
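The shape the reviewer is asking for can be sketched as below, looping over both settings so the disabled case is exercised too. This is an illustrative sketch, not the merged test: getCatalogTable is assumed to be the suite's existing helper, and the assertions on sizeInBytes are approximate.

```scala
test("auto gather stats after insert command") {
  val table = "change_stats_insert_datasource_table"
  Seq(false, true).foreach { autoUpdate =>
    withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
      withTable(table) {
        sql(s"CREATE TABLE $table (i int, j string) USING PARQUET")
        sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
        val stats = getCatalogTable(table).stats
        if (autoUpdate) {
          // Size must have been computed and recorded automatically.
          assert(stats.nonEmpty && stats.get.sizeInBytes > 0)
        } else {
          // Without auto update, no stats are recorded for a new table.
          assert(stats.isEmpty)
        }
      }
    }
  }
}
```

Folding both branches into one loop also keeps the body at a single indentation level, which addresses the reviewer's second point.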

Contributor (Author):

Handled the comment. Thanks.

// analyze to get initial stats
// insert into command
sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
val stats = getCatalogTable(table).stats
Member:

Also, fix the indentation here, too.

Contributor (Author):

Fixed. Thanks.

@SparkQA

SparkQA commented Apr 10, 2019

Test build #104489 has finished for PR 24315 at commit 3692775.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 (Contributor, Author):

@dongjoon-hyun All comments are handled. Thanks a lot for the valuable inputs. Let me know if you have any further suggestions/clarifications.

@dongjoon-hyun changed the title from "[SPARK-27403][SQL] Table Statisics shall be updated automatically if auto size update feature is enabled(spark.sql.statistics.size.autoUpdate.enabled =true)" to "[SPARK-27403][SQL] Fix updateTableStats to update table stats with new stats or None" on Apr 11, 2019.
@dongjoon-hyun (Member) left a comment:

+1, LGTM. Merged to master. Thank you, @sujith71955 .

cc @gatorsmile and @cloud-fan

@dongjoon-hyun changed the title from "[SPARK-27403][SQL] Fix updateTableStats to update table stats with new stats or None" to "[SPARK-27403][SQL] Fix updateTableStats to update table stats always with new stats or None" on Apr 11, 2019.
dongjoon-hyun pushed a commit that referenced this pull request Apr 17, 2019
…s with new stats or None

## What changes were proposed in this pull request?

The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are computed automatically by default when the feature is enabled.
Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev

As part of the fix, the autoSizeUpdateEnabled validation is performed first, so that the system calculates the table size automatically and records it in the metastore, as the user expects.

## How was this patch tested?
A unit test was added, and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).

Before fix:

![image](https://user-images.githubusercontent.com/12999161/55688682-cd8d4780-5998-11e9-85da-e1a4e34419f6.png)

After fix
![image](https://user-images.githubusercontent.com/12999161/55688654-7d15ea00-5998-11e9-973f-1f4cee27018f.png)

Closes #24315 from sujith71955/master_autoupdate.

Authored-by: s71955 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 239082d)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member):
Merged to branch-2.4, too.

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019.