
Conversation

@aokolnychyi
Contributor

What changes were proposed in this pull request?

Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will keep using the cached information even though it is no longer valid. Consider the example below.

// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)

After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.

By the way, running spark.sql("analyze table tab1 compute statistics") between steps 3 and 4 would also work around the problem.

How was this patch tested?

Current and additional unit tests.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81842 has finished for PR 19252 at commit ba963b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

TruncateTableCommand and AlterTableAddPartitionCommand also have similar issues. Could you also fix them in this PR?

} else {
catalog.alterTableStats(table.identifier, None)
}
catalog.refreshTable(table.identifier)
Member

Add a comment above this line:

Invalidate the table relation cache

@gatorsmile
Member

Actually, the right fix should add refreshTable(identifier) to the SessionCatalog's alterTableStats API.
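A rough sketch of what that change could look like (an illustration, not the exact patch; the helper names follow existing SessionCatalog conventions and are assumptions here):

```scala
// In SessionCatalog: invalidate the cached relation whenever table
// statistics change, so subsequent lookups re-read fresh metadata.
def alterTableStats(
    identifier: TableIdentifier,
    newStats: Option[CatalogStatistics]): Unit = {
  val db = formatDatabaseName(identifier.database.getOrElse(getCurrentDatabase))
  val table = formatTableName(identifier.table)
  val tableIdentifier = TableIdentifier(table, Some(db))
  requireDbExists(db)
  requireTableExists(tableIdentifier)
  externalCatalog.alterTableStats(db, table, newStats)
  // Invalidate the table relation cache
  refreshTable(identifier)
}
```

Putting the refresh inside the catalog API means every caller (ANALYZE TABLE, TRUNCATE TABLE, and so on) gets cache invalidation for free, instead of each command having to remember to refresh.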

@aokolnychyi
Contributor Author

@gatorsmile thanks for the feedback. I also covered TruncateTableCommand with additional tests. However, I see somewhat strange behavior while creating a test for AlterTableAddPartitionCommand.

sql(s"CREATE TABLE t1 (col1 int, col2 int) USING PARQUET")
sql(s"INSERT INTO TABLE t1 SELECT 1, 2")
sql(s"INSERT INTO TABLE t1 SELECT 2, 4")
sql("SELECT * FROM t1").show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   4|
+----+----+

sql(s"CREATE TABLE t2 (col1 int, col2 int) USING PARQUET PARTITIONED BY (col1)")
sql(s"INSERT INTO TABLE t2 SELECT 1, 2")
sql(s"INSERT INTO TABLE t2 SELECT 2, 4")
sql("SELECT * FROM t2").show()
+----+----+
|col2|col1|
+----+----+
|   2|   4|
|   1|   2|
+----+----+

Why are the results different? Is it a bug?

@gatorsmile
Member

This is not a bug. We just follow the behavior of Hive's dynamic partition insert.

The dynamic partition columns must be specified last in both part_spec and the input result set (of the row value lists or the select query). They are resolved by position, instead of by names. Thus, the orders must be exactly matched.
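For illustration, one way to make the positional matching explicit (a sketch; `t2` is the table from the example above, partitioned by `col1`):

```scala
// col1 is the (dynamic) partition column, so it must come LAST in the
// SELECT list; values are bound to columns by position, not by name.
sql("INSERT INTO TABLE t2 SELECT 2, 1")   // col2 = 2, col1 = 1

// Alternatively, pin the partition value statically so that only data
// columns remain in the SELECT list:
sql("INSERT INTO TABLE t2 PARTITION (col1 = 1) SELECT 2")
```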


// compute stats based on the catalog table metadata and
// put the relation into the catalog cache
sql(s"EXPLAIN COST SELECT DISTINCT * FROM $table")
Member

Could you replace the usage of EXPLAIN COST with

// Table lookup will make the table cached.
spark.table(table)

requireTableExists(tableIdentifier)
externalCatalog.alterTableStats(db, table, newStats)
// Invalidate the table relation cache
refreshTable(identifier)
Member

Could you remove the unneeded refreshTable calls in AnalyzeTableCommand and AnalyzeColumnCommand?

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81896 has finished for PR 19252 at commit ca09962.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81897 has finished for PR 19252 at commit a5cb16d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81941 has finished for PR 19252 at commit 63f9dc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in ee13f3e Sep 19, 2017