API Spec: Add ConnectionConfigInfo to ExternalCatalog #1026
@@ -850,9 +850,93 @@ components:

```yaml
        - $ref: "#/components/schemas/Catalog"
        - type: object
          properties:
            remoteUrl:
              type: string
              description: URL to the remote catalog API
            connectionConfigInfo:
```
Contributor

How do updates happen? It looks like changing any config property in an […] Having clients re-submit credentials on every config change is probably not ideal 🤔 WDYT?
Contributor
Author

Yeah, the way we express updates probably needs some rework anyways. Even though the original structure of […]

Which is a bit ambiguous, especially when it comes to partially specifying optional portions of a required outer struct (e.g. if […]).

In practice, there's currently some very specialized logic for deciding which fields are supposed to be "total replace", "delete through omission", or "ignore if omitted", which is definitely confusing when using the API. On the plus side, at least we didn't just make the body of […]

In theory the HTTP verb should've been […]

I'm starting to think I'll leave […]
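To illustrate the ambiguity being discussed, here is a hypothetical sketch; the request shape and field names below are illustrative assumptions, not the actual spec:

```yaml
# Stored catalog entity (before update):
catalog:
  properties:
    default-base-location: "s3://bucket/path"
  storageConfigInfo:
    storageType: S3
    allowedLocations:
      - "s3://bucket/path"

# Update request that omits storageConfigInfo.allowedLocations:
updateRequest:
  properties:
    default-base-location: "s3://bucket/new-path"
  storageConfigInfo:
    storageType: S3

# The ambiguity: should the omitted allowedLocations be (a) deleted,
# (b) left unchanged, or (c) is the entire storageConfigInfo replaced?
```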
Contributor

@dennishuo @sfc-gh-rxing this change seems to remove an existing field called remoteUrl. I am wondering, would that break existing users?
Contributor
Author

Yeah, normally I'd be more hesitant to make breaking changes, but in this case there wasn't any code in Polaris actually using it. It's conceivable that someone who customized Polaris for their own services uses it, so maybe we could see if anyone vetoes it, but in general it doesn't really make sense to have a […]

I've at least checked that one of the large stakeholders of a customized Polaris deployment (Snowflake OpenCatalog) does not use it.
Contributor

As a project, I think we have to allow some leeway for breaking API changes. It may be worth marking certain areas as "alpha"/"beta" even when they are merged, until the API stabilizes. I do not think we can assume that every API change results in a "final" API contract at this stage of the project.
Contributor

I see. I am not against making this change; as long as we have thought through the consequences, I think we should be fine.
```yaml
              $ref: "#/components/schemas/ConnectionConfigInfo"
```
Member

We need to make […]
Contributor
Author

For current backwards-compatibility, I was actually thinking of it like: […]

Admittedly, it might be kind of a hack to use only the presence of connection config to determine static vs passthrough facade, but conceptually it makes sense that an ExternalCatalog that can't dial out to a remote catalog fundamentally must behave as a "static facade" where content is "pushed" into the ExternalCatalog.
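A hypothetical sketch of the two modes inferred from field presence; the payload shapes and names here are illustrative assumptions, not the actual spec:

```yaml
# "Static facade": no connection config, so content must be pushed
# into Polaris (e.g. by notifications or migration tooling).
staticFacadeCatalog:
  type: EXTERNAL
  name: pushed_catalog
  properties:
    default-base-location: "s3://bucket/pushed"

# "Passthrough facade": a connection config lets Polaris dial out
# to the remote catalog on demand.
passthroughCatalog:
  type: EXTERNAL
  name: federated_catalog
  connectionConfigInfo:
    connectionType: ICEBERG_REST
    uri: "https://remote-catalog.example.com/api/catalog"
```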
Member

Or should we introduce a new catalog type like […]?
Contributor
Author

I'd actually prefer to move away from more "type-differentiation" in favor of "capability differentiation", since we've talked about wanting the ability to convert from an external catalog to an internal catalog someday. Then, a catalog simply may or may not have a […]

Aside from conversion to INTERNAL, there's also a close relationship between EXTERNAL catalogs with and without ConnectionConfigInfo; maybe people start out with a typical migration tool and a plain notification-based EXTERNAL catalog, but then want to enable federation on the existing catalog that might already have grants defined. The entity metadata we might have "cached" locally would then be able to be "verified" during […]

It might still be good to have an enum of some sort so that we're not just inferring a mode-of-operation based on what fields are present, but maybe that would be better as a separate […]
Contributor

Your comment makes sense to me, @dennishuo! However, if we follow that path to its logical end, I think we're looking at a full redesign of the Management REST API 😅 This is actually fine from my POV. We can have APIs v1 and v2 co-existing for a while.
```yaml
    ConnectionConfigInfo:
```
Member

Should we allow updating the ConnectionConfigInfo on the catalog entity? See […]
Contributor
Author

Good catch, yes, I'll add it to the update as well.
Member

Thanks! For updating connection configs, I think we should only allow modifying the secret info so that users can rotate secrets. For the other connection configs, if we allow customers to modify them, they may point the catalog at a different remote catalog.
Contributor
Author

Thinking about it more, I replied to @dimas-b's comment here as well: https://github.com/apache/polaris/pull/1026/files#r1978040280

I think we might want to rework how we express updates so that it's less confusing. Right now it almost looks like the API takes a strict REST-style "replace entire object" approach, but it's already subtly conditioned on which fields are "special" enough to be ignored vs deleted vs modified. If we do it Iceberg's […]

This is possibly going to be a bit complicated to hash out, so maybe it's best to leave ConnectionConfigInfo out of […]
Contributor

Agreed, we've run into a lot of issues with updating the storage config. For connection config, it would be helpful to define which properties are alterable and establish a fine-grained update spec.
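The "Iceberg-style" update approach alluded to above presumably refers to the explicit update list used in Iceberg REST commit requests. A hypothetical adaptation for catalog entities might look like the following; none of these action names or fields are part of the actual Polaris spec:

```yaml
# Hypothetical explicit-update-list request, loosely modeled on the
# Iceberg REST "updates" list; every name below is invented for
# illustration.
updateCatalogRequest:
  updates:
    - action: set-connection-secret   # rotate only the secret
      clientSecret: "<new-secret>"
    - action: set-property
      key: default-base-location
      value: "s3://bucket/new-path"
    - action: remove-property
      key: legacy-flag
```

With an explicit list, "ignore if omitted" becomes the default and deletions must be requested by name, removing the ambiguity of partial-object replacement.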
```yaml
      type: object
      description: A connection configuration representing a remote catalog service. IMPORTANT - Specifying a
        ConnectionConfigInfo in an ExternalCatalog is currently an experimental API and is subject to change.
      properties:
        connectionType:
          type: string
          enum:
            - ICEBERG_REST
          description: The type of remote catalog service represented by this connection
        uri:
          type: string
          description: URI to the remote catalog service
        authenticationParameters:
          $ref: "#/components/schemas/AuthenticationParameters"
      required:
        - connectionType
      discriminator:
        propertyName: connectionType
        mapping:
          ICEBERG_REST: "#/components/schemas/IcebergRestConnectionConfigInfo"
```
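As a concrete instance of the ConnectionConfigInfo schema, a payload might look like the following; the URI and token values are placeholders invented for the example:

```yaml
# Illustrative ConnectionConfigInfo instance; all values are made up.
connectionConfigInfo:
  connectionType: ICEBERG_REST
  uri: "https://remote-catalog.example.com/api/catalog"
  authenticationParameters:
    authenticationType: BEARER
    bearerToken: "<opaque-token>"
```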
```yaml
    IcebergRestConnectionConfigInfo:
```
Contributor

Is there any particular reason to put Iceberg in the name? RestConnectionConfigInfo seems like something that could also be used for non-Iceberg REST services. In your opinion, what would it look like if we wanted to generalize this to support non-Iceberg REST services?
Contributor
Author

This one is specifically for the Iceberg REST-specific subclass, where the […]

Anyways, right now the Iceberg REST connection does have the […]
Contributor

Thanks for the explanation! I am thinking more of the case where we might want to enable connections to other catalog services like Glue/Unity Catalog/Hive Metastore in general, not just Iceberg endpoints. I haven't looked into how the discriminator works, and have one quick question: if we remove the discriminator in the future, will there be any user-facing impact?
Contributor
Author

Right, each other […]

If we remove the discriminator, the JSON on the wire is still generally compatible, provided we flatten the fields of all possible subtypes into the base type. The internal autogenerated Java classes will not be compatible, but can be rewritten if we want that.
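A sketch of the flattening idea described above; the payload shapes are assumed for illustration:

```yaml
# With the discriminator: connectionType selects the mapped subtype
# (IcebergRestConnectionConfigInfo), which declares remoteCatalogName.
withDiscriminator:
  connectionType: ICEBERG_REST
  uri: "https://remote.example.com/api/catalog"
  remoteCatalogName: "prod_warehouse"   # subtype-specific field

# Without the discriminator: identical JSON on the wire, but every
# possible subtype field is declared directly on the base type.
flattenedBaseType:
  connectionType: ICEBERG_REST
  uri: "https://remote.example.com/api/catalog"
  remoteCatalogName: "prod_warehouse"
```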
```yaml
      type: object
      description: Configuration necessary for connecting to an Iceberg REST Catalog
      allOf:
        - $ref: '#/components/schemas/ConnectionConfigInfo'
      properties:
        remoteCatalogName:
          type: string
          description: The name of a remote catalog instance within the remote catalog service; in some older systems
            this is specified as the 'warehouse' when multiple logical catalogs are served under the same base
            uri, and often translates into a 'prefix' added to all REST resource paths
```
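To make the remoteCatalogName semantics concrete, a sketch of how it might map onto an Iceberg REST server that multiplexes logical catalogs; the URLs and names are illustrative assumptions:

```yaml
# Illustrative mapping from remoteCatalogName to REST paths; the
# server, warehouse name, and prefix below are made up.
connectionConfigInfo:
  connectionType: ICEBERG_REST
  uri: "https://remote.example.com/api/catalog"
  remoteCatalogName: "prod_warehouse"

# Conceptually, requests would then target prefixed resource paths:
#   GET {uri}/v1/{prefix}/namespaces
#   e.g. https://remote.example.com/api/catalog/v1/prod_warehouse/namespaces
# where the server derives {prefix} from the 'warehouse' sent at
# config time.
```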
```yaml
    AuthenticationParameters:
      type: object
      description: Authentication-specific information for a REST connection
      properties:
        authenticationType:
          type: string
          enum:
            - OAUTH
            - BEARER
          description: The type of authentication to use when connecting to the remote rest service
      required:
        - authenticationType
      discriminator:
        propertyName: authenticationType
        mapping:
          OAUTH: "#/components/schemas/OAuthClientCredentialsParameters"
          BEARER: "#/components/schemas/BearerAuthenticationParameters"
```
```yaml
    OAuthClientCredentialsParameters:
      type: object
      description: OAuth authentication based on client_id/client_secret
```
Contributor

If it's just about the client credentials flow, maybe we can rename the type to something like […]
Contributor
Author

Done, including the casing.
```yaml
      allOf:
        - $ref: '#/components/schemas/AuthenticationParameters'
      properties:
        tokenUri:
          type: string
          description: Token server URI
        clientId:
          type: string
          description: oauth client id
        clientSecret:
          type: string
          format: password
          description: oauth client secret (input-only)
        scopes:
          type: array
          items:
            type: string
          description: oauth scopes to specify when exchanging for a short-lived access token
```
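An illustrative instance of the OAuth client-credentials parameters above; the endpoint, client id, and scope values are placeholders, not real credentials:

```yaml
# Illustrative OAuth client-credentials parameters; all values are
# made up for the example.
authenticationParameters:
  authenticationType: OAUTH
  tokenUri: "https://auth.example.com/oauth/token"
  clientId: "polaris-federation-client"
  clientSecret: "<secret>"   # format: password, i.e. input-only
  scopes:
    - "catalog"
```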
```yaml
    BearerAuthenticationParameters:
      type: object
      description: Bearer authentication directly embedded in request auth headers
      allOf:
        - $ref: '#/components/schemas/AuthenticationParameters'
      properties:
        bearerToken:
```
Contributor

I'm not sure storing credentials inside catalog objects is a good idea in general. I believe we briefly discussed it elsewhere. Also, apache/iceberg#12197 opens a lot of other authentication possibilities. I suppose we have these options to proceed: […]

I'm personally fine either way, but I want to emphasize that with option 1 the API will go through a series of non-backward-compatible changes as we approach the finish line. WDYT?
Contributor
Author

Yeah, the intent here is to support a variety of authentication models; this could be something we hash out more on the dev mailing list as well. I agree we could iterate quickly by marking the feature as alpha initially, as I suspect the end-to-end interaction will make the considerations more obvious as we iterate on it.

As for "storing", I think there are two aspects: […]

Are you concerned about the API structure for (1)? I agree we probably don't want (2) to default to storing plaintext in the persistence DB at any point in time. There are a few different models that could coexist for (2): […]

The nice thing about such a model is that it's fairly flexible for different secrets-management flows under a single interface. For example, to add direct sigv4-based auth to AWS Glue or S3Tables, the part where the SecretsManager transforms the entity would not actually be handling a secret directly, but instead performing the assumeRole relationship like how the StorageConfiguration does it; in this world, true "secrets" are only implicitly handled within the environment, but the flow looks the same -- embed the […]
Contributor

I believe even passing secrets through the Management API is a security risk. Ideally, the act of configuring an external catalog would reference secrets (e.g. by URN) as opposed to submitting them directly. It might be best to start another dev ML thread and doc for this, though. I'm sure there are a lot of details to iron out. The iterative approach LGTM.
```yaml
          type: string
          format: password
          description: Bearer token (input-only)
```
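An illustrative Bearer instance; the token value is a placeholder. Fields declared with `format: password` above are input-only, so a sketch of the write/read asymmetry:

```yaml
# Illustrative bearer authentication parameters; the token is made up.
# Submitted on create/update:
authenticationParameters:
  authenticationType: BEARER
  bearerToken: "<opaque-token>"   # input-only; a read of the entity
                                  # would omit or redact this field
```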
```yaml
    StorageConfigInfo:
      type: object
```
For external catalogs, is the storage config required if Polaris just passes through the response sent from the remote catalog and does not generate the subscoped creds?
Eventually, StorageConfig might become more optional. However, this is actually an important design point about whether we're willing to return the remote catalog's subscoped creds.

At least some of the known use cases explicitly want Polaris to be the one responsible for access control and credential vending, while the remote catalog does not perform credential vending. So we want the ability for Polaris to mix in vended credentials.

Returning the remote catalog's vended credentials will probably need to be configurable. For most real use cases we'd probably want some formal protocol for declaring the "on-behalf-of delegation chain"; e.g. the ConnectionConfig contains a "system identity", but we'd want a way to declare the identity of the calling Principal in the request to the remote catalog.
Agree that we should provide a configuration for customers!

The vended credentials sent back from the remote catalog represent the system identity; they are very powerful, and we need some sort of request-scoped identity. This could be achieved by passing an HTTP header to Polaris.
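A hypothetical sketch of the on-behalf-of delegation chain discussed above; the header name and token flow are invented for illustration and are not part of any spec:

```yaml
# Hypothetical request flow, sketched in comments:
#
#   client  -> Polaris:
#     Authorization: Bearer <principal-token>
#
#   Polaris -> remote catalog:
#     Authorization: Bearer <system-identity-token>   # from ConnectionConfig
#     X-On-Behalf-Of: <calling-principal-id>          # invented header
#
# The remote catalog could then scope vended credentials to the
# calling principal rather than to the powerful system identity.
```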