[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable #19252
Conversation
Test build #81842 has finished for PR 19252 at commit
```scala
} else {
  catalog.alterTableStats(table.identifier, None)
}
catalog.refreshTable(table.identifier)
```
Add a comment above this line: `// Invalidate the table relation cache`
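Applied to the hunk above, the result would read roughly as follows (a sketch of the requested change, not the committed diff):

```scala
} else {
  catalog.alterTableStats(table.identifier, None)
}
// Invalidate the table relation cache
catalog.refreshTable(table.identifier)
```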
Actually, the right fix should add the `refreshTable` call in `alterTableStats` itself.
@gatorsmile thanks for the feedback. I also covered that.

Why are the results different? Is it a bug?
This is not a bug. We just follow the behavior of Hive's dynamic partition insert.
```scala
// compute stats based on the catalog table metadata and
// put the relation into the catalog cache
sql(s"EXPLAIN COST SELECT DISTINCT * FROM $table")
```
Could you replace the usage of EXPLAIN COST by:

```scala
// Table lookup will make the table cached.
spark.table(table)
```

```scala
requireTableExists(tableIdentifier)
externalCatalog.alterTableStats(db, table, newStats)
// Invalidate the table relation cache
refreshTable(identifier)
```
Could you remove the unneeded refreshTable calls in AnalyzeTableCommand and AnalyzeColumnCommand?
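With the invalidation centralized this way, a sketch of how the updated SessionCatalog.alterTableStats could look (the method shape and helper names around the hunk above are assumed, not copied from the committed source):

```scala
// Sketch only: members like formatDatabaseName, requireDbExists and
// externalCatalog belong to Spark's SessionCatalog class.
def alterTableStats(
    identifier: TableIdentifier,
    newStats: Option[CatalogStatistics]): Unit = {
  val db = formatDatabaseName(identifier.database.getOrElse(getCurrentDatabase))
  val table = formatTableName(identifier.table)
  val tableIdentifier = TableIdentifier(table, Some(db))
  requireDbExists(db)
  requireTableExists(tableIdentifier)
  externalCatalog.alterTableStats(db, table, newStats)
  // Invalidate the table relation cache so later lookups re-read the new stats
  refreshTable(identifier)
}
```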
Test build #81896 has finished for PR 19252 at commit
Test build #81897 has finished for PR 19252 at commit
Test build #81941 has finished for PR 19252 at commit
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is not valid anymore. Consider the example below.
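The example code itself did not survive extraction; below is a hypothetical reconstruction of the four-step sequence it describes, assuming the table `tab1` referenced in the note after the steps and size-based stats auto-update (`spark.sql.statistics.size.autoUpdate.enabled`) turned on:

```scala
// Hypothetical reconstruction; the exact steps in the original PR may differ.
// Step 1: create a table.
spark.range(100).write.saveAsTable("tab1")
// Step 2: compute statistics for it.
spark.sql("analyze table tab1 compute statistics")
// Step 3: look the table up, which puts its relation into the catalog cache.
spark.table("tab1")
// Step 4: modify the table so its statistics change in the metastore. The
// catalog metadata is updated, but the cached relation from step 3 is not
// invalidated, so this session keeps using the stale statistics.
spark.sql("insert into tab1 select * from tab1")
```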
After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.
By the way, adding `spark.sql("analyze table tab1 compute statistics")` between step 3 and step 4 would also solve the problem.

How was this patch tested?
Current and additional unit tests.