[SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached #13818
Conversation
partitioned metastore relation when searching the internal table cache

The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for nonempty partitioned tables. As a result, cached lookups on nonempty partitioned tables always miss.
|
@hvanhovell I'm mentioning you here because you commented on my previous PR for this Jira issue. In response to your original question, yes, I have added a unit test for this patch. |
  }
}

test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
Could you take a look at CachedTableSuite and add the test there (and also use a similar approach)?
I looked in CachedTableSuite. I'm not sure that's a good place for this kind of test. That test suite seems focused on testing tables cached by the CacheManager. This patch is focused on table caching in HiveMetastoreCatalog.
It's difficult to find the best place for these kinds of caching tests. I chose this file because it already had some of these tests. Perhaps HiveMetastoreCatalogSuite would be a good candidate for an alternative?
|
Test build #3124 has finished for PR 13818 at commit
|
|
cc @cloud-fan / @liancheng |
cachedDataSourceTables.getIfPresent(tableIdentifier) match {
  case null => None // Cache miss
  case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
    val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq
Sorry, I may not have enough background knowledge to understand this. Can you explain a bit more about why this doesn't work?
`metastoreRelation.catalogTable.storage.locationUri.toSeq` returns the base path of the relation. This is then compared to `relation.location.paths` to validate the cached entry. For nonempty partitioned tables (by that I mean partitioned tables with one or more metastore partitions), `relation.location.paths` returns the locations of the partitions. Hence, these values will never be equal and `useCached` will always be false.
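To make the mismatch concrete, here is a minimal standalone sketch. This is not the actual Spark source; the names mirror the snippets quoted in this thread, and the paths are made up:

```scala
// Hypothetical paths, for illustration only.
val pathsInMetastore: Seq[String] = Seq("hdfs://nn/warehouse/tbl")    // table base path only
val pathsInRelation: Seq[String] = Seq(
  "hdfs://nn/warehouse/tbl/part=1",  // a cached HadoopFsRelation for a nonempty
  "hdfs://nn/warehouse/tbl/part=2")  // partitioned table holds the partition locations

// getCached (roughly) requires the two path sets to agree before reusing the cache.
val useCached = pathsInMetastore.toSet == pathsInRelation.toSet
assert(!useCached) // always false here, so the cached entry is never reused
```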
`relation.location.paths` returns the locations of the partitions
How does this happen?
This is where the relation's paths are computed and the logic for empty versus non-empty partitioned tables diverges: https://github.com/VideoAmp/spark-public/blob/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L489-L493.
I believe this is the PR where this behavior was introduced: #13022.
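A loose, self-contained paraphrase of that divergence, using stand-in types rather than the Spark classes at the link above:

```scala
// Stand-ins for the metastore-side structures referenced above (hypothetical names).
case class PartitionInfo(location: String)
case class TableInfo(baseLocation: String, partitions: Seq[PartitionInfo])

// The relation's paths come from the partitions when there are any; only an
// empty partitioned table falls back to the table's base location. getCached,
// however, compares against the base location alone, hence the permanent miss.
def relationPaths(table: TableInfo): Seq[String] =
  if (table.partitions.isEmpty) Seq(table.baseLocation)
  else table.partitions.map(_.location)
```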
|
seems a reasonable fix to me, thanks for working on it! |
|
You are very welcome. Thank you for taking time to review it! 😃 |
|
LGTM, cc @liancheng |
|
ok to test |
|
test this please |
|
Test build #61594 has finished for PR 13818 at commit
|
    |)
    |PARTITIONED BY (part INT)
    |STORED AS PARQUET
  """.stripMargin)
Nit: Indentation is off here.
|
LGTM except for minor styling issues. Thanks! |
and tidy up the other two tests from which it was copy-pasta'd
|
I believe I've addressed @liancheng's style issues in my new unit test, along with the same in the two tests from which it was copy-pasta'd (boy scout rule). Hopefully I didn't cock it up. |
|
Test build #61730 has finished for PR 13818 at commit
|
|
thanks, merging to master! |
|
Shall we also have this in branch-2.0? This seems to be a pretty serious bug. cc @rxin. |
|
I have a few questions.
My feeling is that if it is a perf issue and it is not a regression from 1.6, merging to master should be good enough. |
|
FYI this breaks Scala 2.10: |
I don't know about 1.6. I know it's a regression from 1.5.
It is a performance issue.
The problem this PR addresses occurs in the analysis phase of query planning. Regarding the impact, I'll quote from the last paragraph of the PR description:
For some (like us), I'd say this extends beyond a performance issue into a usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to build a query plan. |
|
@zsxwing I was able to do the following without error: |
This PR backports your fix (#13818) to branch 2.0. This PR addresses [SPARK-15968](https://issues.apache.org/jira/browse/SPARK-15968).

## What changes were proposed in this pull request?

The `getCached` method of [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala) computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.

Rather than get `pathsInMetastore` from `metastoreRelation.catalogTable.storage.locationUri.toSeq`, I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5: https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.

## How was this patch tested?

I've added a new unit test to `parquetSuites.scala`:

SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation

Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.

Note I also modified a different but related unit test in `parquetSuites.scala`:

SPARK-15248: explicitly added partitions should be readable

This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.

In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.

Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.

Author: Reynold Xin <[email protected]>
Author: Michael Allman <[email protected]>

Closes #14064 from yhuai/spark-15968-branch-2.0.
(Please note this is a revision of PR #13686, which has been closed in favor of this PR.)
This PR addresses SPARK-15968.
What changes were proposed in this pull request?
The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss.

Rather than get `pathsInMetastore` from

metastoreRelation.catalogTable.storage.locationUri.toSeq

I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5:

spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala, line 444 in e0c3212
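A rough sketch of what that signature change looks like. The parameter list is my approximation (the surrounding types are Spark internals, so this is not compile-ready outside the Spark source tree), not a verbatim copy of the patch:

```scala
// pathsInMetastore is now supplied by the caller, which computes it from the
// Hive metastore: the table's base path, or the partition locations when the
// table has metastore partitions.
private def getCached(
    tableIdentifier: QualifiedTableName,
    pathsInMetastore: Seq[Path],                 // new argument
    metastoreRelation: MetastoreRelation,
    schemaInMetastore: StructType,
    expectedFileFormat: Class[_ <: FileFormat],
    partitionSpecInMetastore: Option[PartitionSpec]): Option[LogicalRelation] = ???
```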
I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog.

How was this patch tested?
I've added a new unit test to `parquetSuites.scala`:

SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation

Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch.
Note I also modified a different but related unit test in `parquetSuites.scala`:

SPARK-15248: explicitly added partitions should be readable

This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added. I changed the test so that, instead of adding the data as a parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test.

In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for SPARK-15248, the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating.
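For illustration, the reworked SPARK-15248 test now goes through the metastore along these lines. The table and column names here are hypothetical, and I'm assuming the usual suite helpers (`sql`, `checkAnswer`, `Row`) are in scope:

```scala
// Before, the test dropped a Parquet file directly into the partition's
// directory; now the partition's data is added with DML so the metastore
// (and hence the cached relation) knows about it.
sql("CREATE TABLE test_added_part (a STRING) PARTITIONED BY (b INT) STORED AS PARQUET")
sql("ALTER TABLE test_added_part ADD PARTITION (b = 1)")
sql("INSERT INTO TABLE test_added_part PARTITION (b = 1) SELECT 'baz' AS a")
checkAnswer(sql("SELECT a, b FROM test_added_part"), Row("baz", 1))
```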
Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.

Thanks.