-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-28341][SQL] create a public API for V2SessionCatalog #25104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...ore/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalogSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this end-to-end test checks the actual behavior the end-users will hit, as we don't set a fake session catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the test name a little bit, because we will always use default catalog when the config is set, no matter what the table provider is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, this is never supported. The test can pass before because we set a fake session catalog which returns writable tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, this doesn't work because the ORC source is broken.
The test case is still valid, so I don't see a reason why we should get rid of it. I also think that it is a valid use to override the v2 session catalog implementation, which this relies on for tests.
|
Test build #107498 has finished for PR 25104 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use the Java-like "get"? I think def catalog(name: String) is sufficient.
I try to avoid adding get to names. It doesn't help understanding unless the method is a getter -- and this isn't -- so why include an extra word?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. No need for get.
|
This looks okay overall, but it breaks the test for v2 providers with the session catalog. I don't think this can be committed until that is resolved. @cloud-fan, can you test whether one of the test providers will work in place of ORC v2 in this test? The second issue is that this removes the ability to override the v2 session catalog implementation. I think we want to support that so you can modify the behavior of the v2 session catalog, but continue to use the v1 catalog for v1 cases. @cloud-fan, do you have any idea about how to make this possible? I think @brkyvz is interested in overriding the v2 session catalog as well. |
brkyvz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue I still have some confusion around the default.catalog conf and catalog.session.
spark.sql.default.catalog:Name of the default v2 catalog, used when a catalog is not identified in queriesspark.sql.catalog.session:Name of the default v2 catalog, used when a catalog is not identified in queries.
The description of both is identical. The default v2 catalog can be configured on a session basis just like the v2_session_catalog. Why can't DEFAULT_V2_CATALOG be used to configure the session catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The session catalog needs to be configurable. This is how custom data sources / table formats will plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the nice error message handling is gone, if there's an error, what does it look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error message handling is in defaultCatalog() and v2SessionCatalog.
|
@brkyvz, the description must have been copied by mistake. The session property controls what implementation we use for the v2 session catalog. See my comment above for the difference between the v2 session catalog and the default catalog. |
6a6b362 to
4f89b88
Compare
4f89b88 to
5f0c5e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be a StaticConf? Wouldn't this make it impossible to plug in your own session catalog when using the Spark Shell for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to make this a static conf because the v2 session catalog cannot change after a session loads it the first time. Static confs can still be set using --conf right?
|
Test build #107596 has finished for PR 25104 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a method, so it isn't a marker interface.
This should note that this is for implementations that want to use Spark's session catalog for storage, which is needed for writing replacements for V2SessionCatalog. Maybe also mention that this will go away when the SessionCatalog is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't replace the existing catalog initialization method, then it should have a different name. How about setSessionCatalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not fall back to using an implementation other than the one the user configured. If the configured v2 session catalog cannot be instantiated, then the error should be thrown each time there is an attempt to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not swallow all non-fatal errors. It should catch CatalogNotFoundException only.
|
Test build #107598 has finished for PR 25104 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this TableCatalog and not CatalogPlugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should reference a JIRA issue to track the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the original test case to minimize changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd that this requires passing in the session when getTestCatalog does not. I would prefer not using this function in places where the session is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this import used? Same with V2SessionCatalog. And check HiveSessionStateBuilder, too.
5f0c5e1 to
c68d63b
Compare
|
Test build #110162 has finished for PR 25104 at commit
|
|
Test build #110169 has finished for PR 25104 at commit
|
|
Test build #110170 has finished for PR 25104 at commit
|
|
Test build #110184 has finished for PR 25104 at commit
|
|
Test build #110218 has finished for PR 25104 at commit
|
|
retest this please |
| try { | ||
| catalogs.getOrElseUpdate(CatalogManager.SESSION_CATALOG_NAME, loadV2SessionCatalog()) | ||
| } catch { | ||
| case NonFatal(_) => defaultSessionCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd log the error. The user asked for a specific catalog, but we're giving them the default. There's no way to figure out the discrepancy without having the error
brkyvz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except the one case around logging the error. Should we also make V2SessionCatalog private[spark] after this case?
|
Test build #110248 has finished for PR 25104 at commit
|
|
Another question is, some guidance on how a StagingTableCatalog can leverage the CatalogExtension API. |
|
If If users want to override the staging logic, they can implement |
|
Test build #110328 has finished for PR 25104 at commit
|
|
retest this please |
|
Test build #110336 has finished for PR 25104 at commit
|
|
thanks for the review, merging to master! |
|
BTW, |
|
@cloud-fan, if execution is intended to be private, should we move those classes into catalyst? |
|
We can do it if |
## What changes were proposed in this pull request? The `V2SessionCatalog` has 2 functionalities: 1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`. 2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog). To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage. This PR does 2 things: 1. refine the document of the config `spark.sql.catalog.session`. 2. add a public abstract class `CatalogExtension` for users to write implementations. TODOs for followup PRs: 1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one. 2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names. ## How was this patch tested? existing tests Closes apache#25104 from cloud-fan/session-catalog. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The
V2SessionCataloghas 2 functionalities:SessionCatalog.To leverage the second functionality, users must extend
V2SessionCatalogwhich is an internal class. There is no doc to explain this usage.This PR does 2 things:
spark.sql.catalog.session.CatalogExtensionfor users to write implementations.TODOs for followup PRs:
How was this patch tested?
existing tests