
Conversation

@aokolnychyi
Contributor

What changes were proposed in this pull request?

Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will keep using the cached information even though it is no longer valid. Consider the example below.

// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)

After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.

By the way, running spark.sql("analyze table tab1 compute statistics") between steps 3 and 4 would also work around the problem.

How was this patch tested?

Current and additional unit tests.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81842 has finished for PR 19252 at commit ba963b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

TruncateTableCommand and AlterTableAddPartitionCommand also have similar issues. Could you also fix them in this PR?

} else {
catalog.alterTableStats(table.identifier, None)
}
catalog.refreshTable(table.identifier)
Member

Add a comment above this line:

Invalidate the table relation cache

@gatorsmile
Member

Actually, the right fix should add refreshTable(identifier) to the SessionCatalog's alterTableStats API.
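A rough sketch of what that change could look like (an illustration, not the exact patch; the helper names follow existing SessionCatalog conventions and are assumptions here):

```scala
// In SessionCatalog: invalidate the cached relation whenever table
// statistics change, so subsequent lookups re-read fresh metadata.
def alterTableStats(
    identifier: TableIdentifier,
    newStats: Option[CatalogStatistics]): Unit = {
  val db = formatDatabaseName(identifier.database.getOrElse(getCurrentDatabase))
  val table = formatTableName(identifier.table)
  val tableIdentifier = TableIdentifier(table, Some(db))
  requireDbExists(db)
  requireTableExists(tableIdentifier)
  externalCatalog.alterTableStats(db, table, newStats)
  // Invalidate the table relation cache
  refreshTable(identifier)
}
```

Putting the refresh inside the catalog API means every caller (ANALYZE TABLE, TRUNCATE TABLE, and so on) gets cache invalidation for free, instead of each command having to remember to refresh.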

@aokolnychyi
Contributor Author

@gatorsmile thanks for the feedback. I also covered TruncateTableCommand with additional tests. However, I see somewhat strange behavior while creating a test for AlterTableAddPartitionCommand.

sql(s"CREATE TABLE t1 (col1 int, col2 int) USING PARQUET")
sql(s"INSERT INTO TABLE t1 SELECT 1, 2")
sql(s"INSERT INTO TABLE t1 SELECT 2, 4")
sql("SELECT * FROM t1").show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   4|
+----+----+

sql(s"CREATE TABLE t2 (col1 int, col2 int) USING PARQUET PARTITIONED BY (col1)")
sql(s"INSERT INTO TABLE t2 SELECT 1, 2")
sql(s"INSERT INTO TABLE t2 SELECT 2, 4")
sql("SELECT * FROM t2").show()
+----+----+
|col2|col1|
+----+----+
|   2|   4|
|   1|   2|
+----+----+

Why are the results different? Is it a bug?

@gatorsmile
Member

This is not a bug. We just follow the behavior of Hive's dynamic partition insert.

The dynamic partition columns must be specified last in both part_spec and the input result set (of the row value lists or the select query). They are resolved by position, instead of by names. Thus, the orders must be exactly matched.
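For illustration, one way to make the positional matching explicit (a sketch; `t2` is the table from the example above, partitioned by `col1`):

```scala
// col1 is the (dynamic) partition column, so it must come LAST in the
// SELECT list; values are bound to columns by position, not by name.
sql("INSERT INTO TABLE t2 SELECT 2, 1")   // col2 = 2, col1 = 1

// Alternatively, pin the partition value statically so that only data
// columns remain in the SELECT list:
sql("INSERT INTO TABLE t2 PARTITION (col1 = 1) SELECT 2")
```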


// compute stats based on the catalog table metadata and
// put the relation into the catalog cache
sql(s"EXPLAIN COST SELECT DISTINCT * FROM $table")
Member

Could you replace the usage of EXPLAIN COST with

// Table lookup will make the table cached.
spark.table(table)

requireTableExists(tableIdentifier)
externalCatalog.alterTableStats(db, table, newStats)
// Invalidate the table relation cache
refreshTable(identifier)
Member

Could you remove the unneeded refreshTable calls in AnalyzeTableCommand and AnalyzeColumnCommand?

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81896 has finished for PR 19252 at commit ca09962.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81897 has finished for PR 19252 at commit a5cb16d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81941 has finished for PR 19252 at commit 63f9dc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in ee13f3e Sep 19, 2017