@@ -203,7 +203,6 @@ private static void createCatalog(
             .setStorageConfigInfo(storageConfig)
             .build()
         : ExternalCatalog.builder()
-            .setRemoteUrl("http://faraway.com")
             .setName(catalogName)
             .setType(catalogType)
             .setProperties(props)
@@ -466,12 +466,10 @@ public void testCreateExternalCatalog() {
             .setAllowedLocations(List.of("s3://my-old-bucket/path/to/data"))
             .build();
     String catalogName = client.newEntityName("my-external-catalog");
-    String remoteUrl = "http://localhost:8080";
     Catalog catalog =
         ExternalCatalog.builder()
             .setType(Catalog.TypeEnum.EXTERNAL)
             .setName(catalogName)
-            .setRemoteUrl(remoteUrl)
             .setProperties(new CatalogProperties("s3://my-bucket/path/to/data"))
             .setStorageConfigInfo(awsConfigModel)
             .build();
@@ -484,7 +482,6 @@ public void testCreateExternalCatalog() {
         .isNotNull()
         .isInstanceOf(ExternalCatalog.class)
         .asInstanceOf(InstanceOfAssertFactories.type(ExternalCatalog.class))
-        .returns(remoteUrl, ExternalCatalog::getRemoteUrl)
         .extracting(ExternalCatalog::getStorageConfigInfo)
         .isNotNull()
         .isInstanceOf(AwsStorageConfigInfo.class)
@@ -173,7 +173,6 @@ public void before(
             .setName(externalCatalogName)
             .setProperties(externalProps)
             .setStorageConfigInfo(awsConfigModel)
-            .setRemoteUrl("http://dummy_url")
             .build();

     managementApi.createCatalog(externalCatalog);
@@ -65,7 +65,6 @@ public class CatalogEntity extends PolarisEntity {
   // translated into "s3://my-bucket/base/location/ns1/ns2/table1".
   public static final String REPLACE_NEW_LOCATION_PREFIX_WITH_CATALOG_DEFAULT_KEY =
       "replace-new-location-prefix-with-catalog-default";
-  public static final String REMOTE_URL = "remoteUrl";

   public CatalogEntity(PolarisBaseEntity sourceEntity) {
     super(sourceEntity);
@@ -86,9 +85,6 @@ public static CatalogEntity fromCatalog(Catalog catalog) {
             .setProperties(catalog.getProperties().toMap())
             .setCatalogType(catalog.getType().name());
     Map<String, String> internalProperties = new HashMap<>();
-    if (catalog instanceof ExternalCatalog) {
-      internalProperties.put(REMOTE_URL, ((ExternalCatalog) catalog).getRemoteUrl());
-    }
     internalProperties.put(CATALOG_TYPE_PROPERTY, catalog.getType().name());
     builder.setInternalProperties(internalProperties);
     builder.setStorageConfigurationInfo(
@@ -120,7 +116,6 @@ public Catalog asCatalog() {
         : ExternalCatalog.builder()
             .setType(Catalog.TypeEnum.EXTERNAL)
             .setName(getName())
-            .setRemoteUrl(getInternalPropertiesAsMap().get(REMOTE_URL))
             .setProperties(catalogProps)
             .setCreateTimestamp(getCreateTimestamp())
             .setLastUpdateTimestamp(getLastUpdateTimestamp())
spec/polaris-management-service.yml (90 changes: 87 additions & 3 deletions)

@@ -850,9 +850,93 @@ components:
         - $ref: "#/components/schemas/Catalog"
Member:

For an external catalog, is the storage config required if Polaris just passes through the response sent from the remote catalog and does not generate the subscoped creds?

Contributor Author:

Eventually, StorageConfig might become more optional. However, this is actually an important design point about whether we're willing to return the remote catalog's subscoped creds.

At least some of the known use cases explicitly want Polaris to be the one responsible for access control and credential vending, while the remote catalog does not perform credential vending. So we want the ability for Polaris to mix-in vended credentials.

Returning the remote catalog's vended credentials will probably need to be configurable. For most real use cases we'd probably want some formal protocol for declaring the "on-behalf-of delegation chain"; e.g. the ConnectionConfig contains a "system identity" but we'd want a way to declare the identity of the calling Principal in the request to the remote catalog.

Member:

Agree that we should provide a configuration for customers!

The vended credentials sent back from the remote catalog represent the system identity; they're very powerful, and we need some sort of request-scoped identity. This could be achieved by passing an HTTP header to Polaris.

         - type: object
           properties:
-            remoteUrl:
-              type: string
-              description: URL to the remote catalog API
+            connectionConfigInfo:
Contributor @dimas-b (Mar 3, 2025):

How do updates happen? It looks like changing any config property in an ExternalCatalog requires re-submitting the whole object... Is that so?

Having the client re-submit credentials on every config change is probably not ideal 🤔 WDYT?

Contributor Author:

Yeah, the way we express updates probably needs some rework anyways. Even though the original structure of UpdateCatalogRequest kind of looks like a "replace all", it also says:

Any fields which are required in the Catalog will remain unaltered if omitted from the contents of this Update request.

Which is a bit ambiguous, especially when it comes to partially specifying optional portions of a required outer struct (e.g. if StorageConfigInfo is being partially updated).

In practice, there's currently some very specialized logic for deciding which fields are supposed to be "total replace", "delete through omission", or "ignore if omitted", which is definitely confusing when using the API.

On the plus side, at least we didn't just make the body of UpdateCatalogRequest be the full Catalog like a naive REST pattern would. Probably we need to switch to a more imperative/verb-style approach like the one used in the Iceberg updateTable API and basically just flatten out all the possible individual-field updates, maybe accepting a list of updates in a single request.
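
A rough sketch of what that flattened, verb-style shape could look like (all schema names and action values here are hypothetical, loosely modeled on Iceberg's updateTable; not part of this PR):

    UpdateCatalogRequest:
      type: object
      properties:
        updates:
          type: array
          items:
            $ref: "#/components/schemas/CatalogUpdate"
    CatalogUpdate:
      type: object
      properties:
        action:
          type: string    # e.g. set-properties, remove-properties, set-storage-config
      required:
        - action
      discriminator:
        propertyName: action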

In theory the HTTP verb should've been PATCH to most correctly match "partial-update" semantics, but support for PATCH is inconsistent. It's also interesting that Iceberg's updateTable is a POST, as is createTable (on the parent namespace), which is presumably why they needed to make namespace-update a POST on /v1/{prefix}/namespaces/{namespace}/properties instead of on /v1/{prefix}/namespaces/{namespace} with PUT as the usual update verb.

I'm starting to think I'll leave ConnectionConfigInfo out of scope of the current UpdateCatalog definition so we can put more thought into updates overall without blocking basic progress on federation.

Contributor:

@dennishuo @sfc-gh-rxing this change seems to remove an existing field called remoteUrl. I am wondering, would that break existing users?

Contributor Author:

Yeah normally I'd be more hesitant to make breaking changes, but in this case there wasn't any code in Polaris actually using it.

It's conceivable that someone who customized Polaris for their own services uses it, so maybe we could see if anyone vetoes it, but in general it doesn't really make sense to have a remoteUrl without connection authentication settings anyways. At best it would be used as a cosmetic string to display somewhere.

I've at least checked that one of the large stakeholders of a customized Polaris deployment (Snowflake OpenCatalog) does not use it.

Contributor @dimas-b (Mar 20, 2025):

As a project, I think we have to allow some leeway for breaking API changes. It may be worth marking certain areas as "alpha" / "beta" even when they are merged... until the API stabilizes.

I do not think we can assume that every API change results in a "final" API contract at this stage of the project.

Contributor:

I see. I am not against doing this change; as long as we have thought about the consequences, I think we should be fine.

$ref: "#/components/schemas/ConnectionConfigInfo"
Member:

We need to make ConnectionConfigInfo a required property for external catalogs, just like StorageConfigInfo is for internal catalogs.

Contributor Author:

For current backwards-compatibility, I was actually thinking of it like:

    if (ExternalCatalog.getConnectionConfigInfo() == null) {
      internalSubtype = STATIC_FACADE;
    } else {
      internalSubtype = PASSTHROUGH_FACADE;
    }

Admittedly that might be kind of a hack to only use the presence of connection config to determine static vs passthrough facade, but conceptually, it makes sense that an ExternalCatalog that can't dial out to a remote catalog fundamentally must behave as a "static facade" where content is "pushed" into the ExternalCatalog.

Member:

Or should we introduce a new catalog type like federated?

Contributor Author:

I'd actually prefer to move away from more "type-differentiation" in favor of "capability differentiation", since we've talked about wanting the ability to convert from an external catalog to an internal catalog in the future someday.

Then, a catalog simply may or may not have a ConnectionConfigInfo; during a migration, it doesn't exactly matter what "type" we call the catalog, but it should be able to start functioning like a normal INTERNAL catalog at some point, while still using the ConnectionConfigInfo to detect updates in the remote catalog.

Aside from conversion to INTERNAL, there's also a close relationship between EXTERNAL catalogs with and without ConnectionConfigInfo; maybe people start out with a typical migration tool and plain notification-based EXTERNAL catalog, but then want to enable federation on the existing catalog that might already have grants defined. The entity metadata we might have "cached" locally would then be able to be "verified" during loadTable from the remote catalog, and then updateTable requests could be accepted as well.

It might still be good to have an enum of some sort so that we're not just inferring a mode-of-operation based on what fields are present, but maybe that would be better as a separate modeOfOperation enum than as the type, with key differences being (rough sketch after the list):

  • The mode of operation doesn't necessarily strictly define the set of attributes in the catalog object
  • There would be no API-spec "discriminator" for sub-object based on the modeOfOperation -- the discriminator-style inheritance is honestly quite painful to work with
  • The mode of operation is more fluid, intended to be able to be changed over time
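
For illustration only, such a field might look roughly like this in the spec (the field name and enum values are placeholders from this thread, not a concrete proposal):

    modeOfOperation:
      type: string
      enum:
        - STATIC_FACADE       # no ConnectionConfigInfo; content is pushed in
        - PASSTHROUGH_FACADE  # dials out to the remote catalog when serving requests
      description: How the catalog currently operates; unlike 'type', this is
        intended to be changeable over the catalog's lifetime.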

Contributor:

Your comment makes sense to me, @dennishuo !

However, if we follow that path to the logical end, I think we're looking at a full redesign of the Management REST API 😅 This is actually fine from my POV. We can have APIs v1 and v2 co-existing for a while.


+    ConnectionConfigInfo:
Member @XJDKC (Feb 20, 2025):

Should we allow updating the ConnectionConfigInfo on the catalog entity? See UpdateCatalogRequest.

Contributor Author:

Good catch, yes, I'll add it to the update as well.

Member:

Thanks! For updating connection configs, I think we should only allow modifying the secret info so that users can rotate the secrets. If we allow customers to modify the other connection config fields, they may point the catalog at another remote catalog.

Contributor Author:

Thinking about it more, I replied to @dimas-b's comment here as well: https://github.com/apache/polaris/pull/1026/files#r1978040280

I think we might want to rework how we express updates so that it's less confusing. Right now it almost looks like the API takes a strict REST-style "replace entire object" approach, but it's already subtly conditioned on which fields are "special" to be ignored vs deleted vs modified.

If we do it Iceberg's updateTable-style, we'd flatten out verbs to take update objects like UpdateCatalogConnectionConfigRemoteUri and UpdateCatalogConnectionConfigSecrets.

This is possibly going to be a bit complicated to hash out, so maybe it's best to leave ConnectionConfigInfo out of UpdateCatalog for now.
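
A minimal sketch of one such verb-specific update object, reusing the hypothetical name from above and restricted to secret rotation per the earlier suggestion:

    UpdateCatalogConnectionConfigSecrets:
      type: object
      properties:
        action:
          type: string
          enum:
            - update-connection-config-secrets
        clientSecret:
          type: string
          format: password
          description: replacement secret; non-secret connection fields are untouched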

Contributor @sfc-gh-rxing (Mar 18, 2025):

Agreed, we’ve run into a lot of issues with updating the storage config. For connection config, it would be helpful to define which properties are alterable and establish a fine-grained update spec.

+      type: object
+      description: A connection configuration representing a remote catalog service. IMPORTANT - Specifying a
+        ConnectionConfigInfo in an ExternalCatalog is currently an experimental API and is subject to change.
+      properties:
+        connectionType:
+          type: string
+          enum:
+            - ICEBERG_REST
+          description: The type of remote catalog service represented by this connection
+        uri:
+          type: string
+          description: URI to the remote catalog service
+        authenticationParameters:
+          $ref: "#/components/schemas/AuthenticationParameters"
+      required:
+        - connectionType
+      discriminator:
+        propertyName: connectionType
+        mapping:
+          ICEBERG_REST: "#/components/schemas/IcebergRestConnectionConfigInfo"

+    IcebergRestConnectionConfigInfo:
Contributor:

Is there any particular reason to put Iceberg as part of the name? RestConnectionConfigInfo seems like something that could be used for non-Iceberg REST services also. In your opinion, what would it look like if we want to generalize this to support non-Iceberg REST services?

Contributor Author:

This one is specifically for the Iceberg REST-specific subclass, where the discriminator connectionType is ICEBERG_REST. I suppose if all different connection types end up needing the same set of parameters then we don't need to use discriminator-based subclasses and could just make the enum function by itself. Though if we then find out there are type-specific fields we'd need to basically add an "optional" of each possible specialized connection type if we didn't start out with the discriminator approach.

Anyways, right now the Iceberg REST connection does have the remoteCatalogName, which is intended to be used in a way that is somewhat specific to Iceberg REST -- in particular, to pass it in as the warehouse property when calling getConfig and possibly expecting an override of the URL PREFIX. If this old-fashioned handshake is deprecated someday, then PREFIX would need to become first-class, and again it would be specific to the way Iceberg REST constructs paths.
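
For reference, the handshake being described is the Iceberg REST getConfig exchange, roughly sketched below (illustrative values only; the prefix override is what rewrites subsequent resource paths to /v1/{prefix}/...):

    GET {uri}/v1/config?warehouse={remoteCatalogName}

    HTTP/1.1 200 OK
    { "defaults": { ... }, "overrides": { "prefix": "my-prefix" } }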

Contributor:

Thanks for the explanation! I am thinking more of the case where we might want to enable connections to other catalog services like Glue / Unity Catalog / Hive Metastore in general, not just Iceberg endpoints. I haven't looked into how the discriminator works, and have one quick question: if in the future we remove the discriminator, will there be any user-facing impact?

Contributor Author:

Right, each other connectionType should have a corresponding type-specific struct defined.

  discriminator:
    propertyName: connectionType
    mapping:
      ICEBERG_REST: "#/components/schemas/IcebergRestConnectionConfigInfo"
      HIVE: "#/components/schemas/HiveConnectionConfigInfo"

  HiveConnectionConfigInfo:
    ..

If we remove the discriminator, the JSON on the wire is still generally compatible, provided we flatten the fields of all possible subtypes into the base type. The internal autogenerated Java classes will not be compatible, but they can be rewritten if we want that.
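
As a sketch of that flattened alternative (hypothetical; HIVE here is just the example type from above, and per-type fields simply become optional on the base type):

    ConnectionConfigInfo:
      type: object
      properties:
        connectionType:
          type: string
          enum: [ICEBERG_REST, HIVE]
        uri:
          type: string
        remoteCatalogName:
          type: string    # only meaningful when connectionType is ICEBERG_REST
      required:
        - connectionType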

+      type: object
+      description: Configuration necessary for connecting to an Iceberg REST Catalog
+      allOf:
+        - $ref: '#/components/schemas/ConnectionConfigInfo'
+      properties:
+        remoteCatalogName:
+          type: string
+          description: The name of a remote catalog instance within the remote catalog service; in some older systems
+            this is specified as the 'warehouse' when multiple logical catalogs are served under the same base
+            uri, and often translates into a 'prefix' added to all REST resource paths

+    AuthenticationParameters:
+      type: object
+      description: Authentication-specific information for a REST connection
+      properties:
+        authenticationType:
+          type: string
+          enum:
+            - OAUTH
+            - BEARER
+          description: The type of authentication to use when connecting to the remote rest service
+      required:
+        - authenticationType
+      discriminator:
+        propertyName: authenticationType
+        mapping:
+          OAUTH: "#/components/schemas/OAuthClientCredentialsParameters"
+          BEARER: "#/components/schemas/BearerAuthenticationParameters"

+    OAuthClientCredentialsParameters:
+      type: object
+      description: OAuth authentication based on client_id/client_secret
Contributor:

If it's just about the client credentials flow, maybe we can rename the type to be something like OAuthClientCredentialsParameters? Iceberg REST servers do not have to be restricted to client credentials; other flows may be available (e.g. delegation), which are still OAuth2 but will require different parameters.

Contributor Author:

Done, including the OAuth casing to match conventions elsewhere.

+      allOf:
+        - $ref: '#/components/schemas/AuthenticationParameters'
+      properties:
+        tokenUri:
+          type: string
+          description: Token server URI
+        clientId:
+          type: string
+          description: oauth client id
+        clientSecret:
+          type: string
+          format: password
+          description: oauth client secret (input-only)
+        scopes:
+          type: array
+          items:
+            type: string
+          description: oauth scopes to specify when exchanging for a short-lived access token

+    BearerAuthenticationParameters:
+      type: object
+      description: Bearer authentication directly embedded in request auth headers
+      allOf:
+        - $ref: '#/components/schemas/AuthenticationParameters'
+      properties:
+        bearerToken:
Contributor:

I'm not sure storing credentials inside catalog objects is a good idea in general. I believe we briefly discussed it elsewhere.

Also apache/iceberg#12197 opens a lot of other authentication possibilities.

I suppose we have these options to proceed:

  1. Merge "as is" but treat the external catalogs feature as "alpha", "subject to change", then incrementally improve connection auth and secret management.
  2. Hold this PR and work on improving those related areas, then re-do this API.

I'm personally fine either way, but I want to emphasize that with option 1 the API will go through a series of non-backward-compatible changes as we approach the finish line.

WDYT?

Contributor Author:

Yeah, the intent here is to support a variety of authentication models, this could be something we hash out more on the dev mailing list as well.

I agree we could iterate quickly by marking the feature as alpha initially, as I suspect the end-to-end interaction will make the considerations more obvious as we iterate on it.

As for "storing", I think there are two aspects:

  1. Expressing the secret directly in the ConnectionConfig struct of the API object
  2. Persisting the secret somewhere

Are you concerned about the API structure for (1)?

I agree we probably don't want (2) to default to storing plaintext in the persistence DB at any point in time. There are a few different models that could coexist for (2):

  1. Fully require an external "secret" to already exist as a reference, e.g. to a Vault URI, and specify that Vault URI in the config info -- but this only shifts the problem to the meta-secret for accessing Vault, so at some point the caller needs to configure some secret on the Polaris side
  2. Require a separate SecretIntegration type of Polaris entity to already be created whose sole purpose is plumbing the secrets into some other configurable secret store (e.g. Polaris would manage Vault in this case), but this is somewhat complex for the API and ultimately the same kinds of internal secrets plumbing would need to be built.
  3. (Plan to do this): Within Polaris's logic for handling entities that define secrets, we can have a SecretsManager class whose purpose is to take secrets from API models and actually store them somewhere else that's safe - Vault, KMS, some other keystore, etc., and then add internal references to the stored secret inside the Polaris entity so that the secret can be found when needed. Callsites that need the unpacked secret again go through the SecretsManager to "extract" the secret from the Polaris entity; the implementation of the SecretsManager gets to decide for itself how it wants to store a reference and then unpack it again.
  4. (Should consider also supporting this model): Similar to (3), but if we want non-uniform types of secrets with a uniform secrets-management layer, the SecretsManager might, instead of only storing a reference to an external resource, lease a new encryption key from the external system and then embed the encrypted secret in some field within the Polaris entity. Then, when the callsite again needs the unpacked secret, this SecretsManager would know to decrypt the encrypted blob in order to retrieve the original secrets.

The nice thing about such a model is that it's fairly flexible for different secrets-management flows under a single interface. For example, to add direct sigv4-based auth to AWS Glue or S3Tables, the part where the SecretsManager transforms the entity would not actually be handling a secret directly, but instead performing the assumeRole relationship like how the StorageConfiguration does it; in this world, true "secrets" are only implicitly handled within the environment, but the flow looks the same -- embed the userArn and externalId into the Polaris entity, and then later when you want to unpack a secret, you use something externally-managed to become that userArn before doing an assumeRole to mint a new subscoped token.
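
A hypothetical Java sketch of the SecretsManager described in (3) and (4) -- all names here are placeholders from this thread, not existing Polaris interfaces:

    /** Hypothetical sketch of models (3)/(4) above; not existing Polaris code. */
    public interface SecretsManager {
      /**
       * Takes a secret off the incoming API model and either stores it in an
       * external store (Vault, KMS, ...) or encrypts it with a leased key,
       * returning an opaque reference/blob to embed in the Polaris entity.
       */
      String writeSecret(String secretName, String secretValue);

      /**
       * Resolves the original secret from the reference previously embedded in
       * the entity -- by fetching from the external store or decrypting the
       * embedded blob, depending on the implementation.
       */
      String readSecret(String secretReference);
    }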

Contributor:

I believe even passing secrets through the Management API is a security risk.

Ideally, the act of configuring an external catalog would reference secrets (e.g. by URN) as opposed to submitting them directly. It might be best to start another dev ML thread and doc for this, though. I'm sure there are a lot of details to iron out.

Iterative approach LGTM.

+          type: string
+          format: password
+          description: Bearer token (input-only)

     StorageConfigInfo:
       type: object