[SPARK-27403][SQL] Fix updateTableStats to update table stats always with new stats or None
#24315
Conversation
… auto update feature is enabled (spark.sql.statistics.size.autoUpdate.enabled=true)

What changes were proposed in this pull request?
When the user sets spark.sql.statistics.size.autoUpdate.enabled=true, table statistics for INSERT OVERWRITE commands should be computed automatically and recorded in the metastore. This does not happen today because of the table.stats.nonEmpty validation: statistics are never recorded for a newly created table, and that check does not hold when the auto update feature is enabled. As part of the fix, the autoSizeUpdateEnabled check has been pulled up into a separate validation, which ensures that when the feature is enabled the system calculates the table size on every insert command and records it in the metastore.

How was this patch tested?
A unit test was added and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).
/** Change statistics after changing data by commands. */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  if (table.stats.nonEmpty) {
Because of this condition check, the insert table command is never able to calculate the table size, even if the user enables sparkSession.sessionState.conf.autoSizeUpdateEnabled.
This check only holds good when autoSizeUpdateEnabled is false, in which case the size is calculated from the Hadoop relation.
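For context, a minimal sketch of the reordered check is shown below, assuming the SessionCatalog and CommandUtils helpers (getTableMetadata, calculateTotalSize, alterTableStats) that surround this code; it illustrates the proposed ordering and is not a verbatim copy of the patch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
import org.apache.spark.sql.execution.command.CommandUtils

/** Change statistics after changing data by commands (sketch of the reordered checks). */
def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
  val catalog = sparkSession.sessionState.catalog
  if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
    // Auto update enabled: recompute the size from the files and record it,
    // even for a freshly created table that has no statistics yet.
    val newTable = catalog.getTableMetadata(table.identifier)
    val newSize = CommandUtils.calculateTotalSize(sparkSession, newTable)
    catalog.alterTableStats(table.identifier, Some(CatalogStatistics(sizeInBytes = newSize)))
  } else if (table.stats.nonEmpty) {
    // Auto update disabled: only invalidate the now-stale statistics.
    catalog.alterTableStats(table.identifier, None)
  }
}
```

With the branch order flipped, a table that never had statistics still gets its size recorded on insert when the flag is on, while the existing behavior of dropping stale stats is preserved when the flag is off.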
Thank you for pinging me, @sujith71955. Got it. I'll take a look. BTW, I updated the JIRA ID in the title.
Sure. Thanks.
Please review and let me know if there are any suggestions/clarifications. Thanks.
Test build #104362 has finished for PR 24315 at commit
val autoUpdate = true
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
  withTable(table) {
    sql(s"CREATE TABLE $table (i int, j string) STORED AS PARQUET")
STORED AS is only supported in the Hive module. Please use USING PARQUET.
Right, I updated it. Thanks for the input.
wangyum left a comment
Please see the previous discussion: https://github.com/apache/spark/pull/20430/files#r164662154
@wangyum: I am not removing the table.stats.nonEmpty validation here; it is intact. The scenario mentioned by Zhenhua is a valid concern, and this PR does not have any impact on the scenario he mentioned.
I hope I have clarified your point.
Test build #104371 has finished for PR 24315 at commit
retest this please
Test build #104398 has finished for PR 24315 at commit
Seems to be a false alarm; it is not relevant to this PR. I will re-trigger the build.
retest this please
Test build #104419 has finished for PR 24315 at commit
retest this please
Test build #104428 has finished for PR 24315 at commit
@dongjoon-hyun Fixed the review comment; please let me know if there are any further clarifications/suggestions. Thanks.
| test("auto gather stats after insert command") { | ||
| val table = "change_stats_insert_datasource_table" | ||
| val autoUpdate = true | ||
| withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) { |
Thank you for updating, @sujith71955. Here are two issues:
1. We need to test both cases by using Seq(false, true).foreach { autoUpdate =>
2. The current indentation is wrong at 343.
If you fix (1), (2) will be fixed together.
Handled the comment. Thanks.
// analyze to get initial stats
// insert into command
sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
val stats = getCatalogTable(table).stats
Also, fix the indentation here, too.
Fixed. Thanks.
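Putting the two suggestions together (looping over both flag values and using USING PARQUET), the test plausibly ends up shaped like the sketch below; it assumes the suite's existing helpers such as withSQLConf, withTable, sql, and getCatalogTable are in scope, and the assertions are illustrative rather than a copy of the committed test.

```scala
test("auto gather stats after insert command") {
  val table = "change_stats_insert_datasource_table"
  Seq(false, true).foreach { autoUpdate =>
    withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> autoUpdate.toString) {
      withTable(table) {
        sql(s"CREATE TABLE $table (i int, j string) USING PARQUET")
        // insert into command
        sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
        val stats = getCatalogTable(table).stats
        if (autoUpdate) {
          // auto update on: the size is recorded even though the table never had stats
          assert(stats.isDefined)
          assert(stats.get.sizeInBytes > 0)
        } else {
          // auto update off: a table without prior stats still has none
          assert(stats.isEmpty)
        }
      }
    }
  }
}
```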
Test build #104489 has finished for PR 24315 at commit
@dongjoon-hyun All comments are handled. Thanks a lot for the valuable inputs. Let me know if there are any further suggestions/clarifications.
+1, LGTM. Merged to master. Thank you, @sujith71955.
cc @gatorsmile and @cloud-fan
Changed the title from "updateTableStats to update table stats with new stats or None" to "updateTableStats to update table stats always with new stats or None".
…s with new stats or None
Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 239082d) Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to branch-2.4, too.
What changes were proposed in this pull request?
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when the feature is enabled.
Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev
As part of the fix, the autoSizeUpdateEnabled validation is done first, so that the system calculates the table size for the user automatically and records it in the metastore, as the user expects.
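As a usage illustration (the spark variable is an active SparkSession, and the table names here are invented for the example), enabling the flag and running an insert should leave refreshed size statistics in the metastore:

```scala
// Enable automatic size updates for commands that change table data.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")

// An insert now recomputes the table's total size and writes it to the metastore.
spark.sql("INSERT INTO TABLE sales SELECT * FROM staging_sales")

// The Statistics entry (sizeInBytes) in the extended description should reflect the new size.
spark.sql("DESCRIBE TABLE EXTENDED sales").show(truncate = false)
```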
How was this patch tested?
A unit test was added and the change was manually verified on a cluster (unit tests plus some internal tests on a real cluster).
Before fix / after fix: screenshots attached to the original PR (not reproduced here).