Remove redundant locations when constructing access policies #2149

eric-maynard · 2025-07-21T21:01:03Z

Iceberg tables can technically store data across any number of paths, but Polaris currently uses 3 different locations for credential vending:

The table's base location
The table's write.data.path, if set
The table's write.metadata.path, if set

This was intended to capture scenarios where e.g. (2) is not a child path of (1), so that the vended credentials can still be valid for reading the entire table. However, there are systems that seem to always set (2) and (3), such as:

s3:/my-bucket/base/iceberg
s3:/my-bucket/base/iceberg/data
s3:/my-bucket/base/iceberg/metadata

In such cases the extra paths (e.g. extra resources in the AWS Policy) are redundant. In one such case, these redundant paths caused the policy to exceed the maximum allowable 2048 characters.

This PR removes redundant paths -- those that are the child of another path -- from the list of accessible locations tracked for a given table and does some slight refactoring to consolidate the logic for extracting these paths from a TableMetadata.

polaris-core/src/main/java/org/apache/polaris/core/storage/StorageUtil.java

snazy · 2025-07-23T07:39:03Z

polaris-core/src/main/java/org/apache/polaris/core/storage/StorageUtil.java

+
+  /** Removes "redundant" locations, so {/a/b/, /a/b/c, /a/b/d} will be reduced to just {/a/b/} */
+  private static @Nonnull Set<String> removeRedundantLocations(Set<String> locationStrings) {
+    HashSet<String> result = new HashSet<>(locationStrings);


Since this is a new collection, it can be produced by

locationStrings.stream() .filter(Objects::nonNull) .map(StorageLocation::of) .collect(Collectors.toCollection(HashSet::new));

That would remove duplicate locations, but not redundant locations like we want to. We'd still need to loop over the collection.

Yes, but you'd save the exponential instantiation of SotrageLocation objects

Actually you could safe the inner loop with a sorted collection, if the locations end with a /.

Oh true, let's try to save the extra StorageLocation calls. Just pushed an update, let me know what you think!

I don't think we need to get too crazy trying to optimize this loop, considering the real number of locations is <=3 in practice and that we loop over them elsewhere when doing the overlap check.

Not saying it's not possible (in the code here), but there's no IAM for individual paths.

Ah, yes, that's true -- basically #1801 is needed.

The main change to the javadocs seems to be comments like Must be overridden by subclasses, but I don't think that's true. In fact it's the opposite -- to date we've only created subclasses where we wish to override, but we could create subclasses where it's not necessary. For example, a LocalStorageLocation.

True, "must" might be too strong. The implementations may however be very specific to the storage-type - that's what I've been thinking of.

That's a good point, let me make this explicit in a javadoc

polaris-core/src/test/java/org/apache/polaris/core/storage/StorageUtilTest.java

service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java

snazy · 2025-07-23T07:41:59Z

service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java

        callContext, ioImpl, properties, identifier, locations, storageActions, resolvedPath);
  }

-  private void blockedUserSpecifiedWriteLocation(Map<String, String> properties) {


loadFileIO is also unused and should be removed as well.

Agreed that it can be removed, but unlike blockedUserSpecifiedWriteLocation it's not related to this PR so I think we should do that separately.

…into smaller-policy

snazy · 2025-08-04T09:12:53Z

polaris-core/src/main/java/org/apache/polaris/core/storage/StorageUtil.java

+      for (StorageLocation potentialChild : locations) {
+        if (!potentialParent.equals(potentialChild)) {
+          if (potentialChild.isChildOf(potentialParent)) {
+            locations.remove(potentialChild);


Looks like this can run into multiple error conditions: the Set being not mutable (There are no guarantees on the type, mutability, ... of the Set returned from Collectors.toSet()), a ConcurrentModificationException in the Iterators of the nested loops, or one of the iterators (depending on the Set implementation) skipping an element after a .remove()

We can revert the change, but what about just wrapping the set in new HashSet?

I went with a revert, because it seems that the logic to do this is actually much more complicated than originally thought. The logic becomes more stateful, more imperative, and just generally longer. It seems that the extra 6 calls to StorageLocation.of could be preferable.

snazy

LGTM!

* fix(deps): update dependency io.projectreactor.netty:reactor-netty-http to v1.2.9 (apache#2326) * Add getting-started example with external authentication (apache#2244) * chore(deps): update quay.io/keycloak/keycloak docker tag to v26.3.2 (apache#2331) * fix(deps): update immutables to v2.11.3 (apache#2333) * JWTBroker: move error message (apache#2330) This change moves the `LOGGER.error` call when a token cannot be verified from `verify()` to `generateFromToken()`. On the token generation path, this should be a no-op; however, on the authentication path, this log message was excessive, especially when using mixed authentication since a failure to decode a token is perfectly normal when the token is from an external IDP. * Let CI archive html test reports (apache#2327) when having to debug CI test failures its much more convenient to be able to download the html report compared to the XML reports (as the latter requires to you find the right file/failure manually). * Make S3 `roleARN` optional (apache#2329) Fixes apache#2325 * Remove spotbugs-annotations (apache#2320) we dont seem to be running spotbugs/findbugs in our build, so depending on the annotations is not necessary. also fix name of common-codec lib. * Remove redundant locations when constructing access policies (apache#2149) Iceberg tables can technically store data across any number of paths, but Polaris currently uses 3 different locations for credential vending: 1. The table's base location 2. The table's `write.data.path`, if set 3. The table's `write.metadata.path`, if set This was intended to capture scenarios where e.g. (2) is not a child path of (1), so that the vended credentials can still be valid for reading the entire table. However, there are systems that seem to always set (2) and (3), such as: 1. `s3:/my-bucket/base/iceberg` 2. `s3:/my-bucket/base/iceberg/data` 3. `s3:/my-bucket/base/iceberg/metadata` In such cases the extra paths (e.g. extra resources in the AWS Policy) are redundant. In one such case, these redundant paths caused the policy to exceed the maximum allowable 2048 characters. This PR removes redundant paths -- those that are the child of another path -- from the list of accessible locations tracked for a given table and does some slight refactoring to consolidate the logic for extracting these paths from a TableMetadata. * Remove CallContext from IcebergPropertiesValidation (apache#2338) it is sufficient to pass the `RealmConfig`. same applies to helpers in `PolarisEndpoints`. * Add entitySubType param to BasePersistence.listEntities (apache#2317) `BasePersistence.listEntities` has 3 variants: ``` Page<EntityNameLookupRecord> listEntities(..., PageToken); Page<EntityNameLookupRecord> listEntities(..., Predicate<PolarisBaseEntity>, PageToken) <T> Page<T> listEntities(..., Predicate<PolarisBaseEntity>, Function<PolarisBaseEntity, T>, PageToken); ``` the 1st method exists to only return the subset of entity properties required to build an `EntityNameLookupRecord`. the 3rd method supports a predicate and transformer function on the underlying `PolarisBaseEntity`, which means it has to load all entity properties. the 2nd method is weird as it supports a full `Predicate<PolarisBaseEntity>`, which means it has to load all entity properties under the hood for filtering but then throws most of them away to return a `EntityNameLookupRecord`. this explains why the implementations of the 2nd method simply forward to the 3rd method usually. any performance benefits of returning a `EntityNameLookupRecord` are lost. as it turns out the 2nd method is only used, because methods 1 and 3 dont support passing a `PolarisEntitySubType` parameter to filter down the retrieved data. Note that the sub type property is available from both the `PolarisBaseEntity` as well as the `EntityNameLookupRecord`. By adding this parameter, the 2nd method can go away completely. we can even push down the sub type filtering into the queries of some of our persistence implementations. other existing implementations are free to decide whether they want to push it down as well or filter on the query results in memory. note that since we have no `TransactionalPersistence` implementation in the codebase that provides an optimized variant of method 1 we can have a default method in the interface that forwards to method 3. * Add PyIceberg example (apache#2315) It is not obvious how to connect PyIceberg to a Polaris catalog. This PR clears that up by providing an example in the getting-started section of the documentation. * fix(docs): fix some broken url. (apache#2335) * fix(docs): fix entity doc API links. (apache#2316) * fix(deps): update dependency io.netty:netty-codec-http2 to v4.2.4.final (apache#2342) * NoSQL: Misc ports * Adopt to the state of apache#2131 (OSS NoSQL PR / idgen) * Track "base locations" and use an index to detect conflicts (via PolarisMetaStoreManager.hasOverlappingSiblings). Feature must be enabled in the Polaris config. Implementation prepared for intentional overlaps. Backwards compatible, except for checks against already existing tables. * Cosmetic changes (bunch of) * Some more adoptions from OSS ... based on a `git diff` against the OSS `persistence-nosql` PR branch. * Last merged commit 4c23eb7 --------- Co-authored-by: Mend Renovate <[email protected]> Co-authored-by: Alexandre Dutra <[email protected]> Co-authored-by: Christopher Lambert <[email protected]> Co-authored-by: Eric Maynard <[email protected]> Co-authored-by: Frederic Khayat <[email protected]> Co-authored-by: Yujiang Zhong <[email protected]>

eric-maynard added 6 commits July 21, 2025 10:35

initial commit

1f73d94

remove redundancy

343293c

autolint

d169eb9

all calls

e87a836

Test

5da8ac7

autolint

7890465

github-project-automation bot added this to Basic Kanban Board Jul 21, 2025

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Jul 21, 2025

eric-maynard added 6 commits July 21, 2025 14:01

one more test

3a2c438

autolint

240f9f1

typofix

0cb6fd8

autolint

82dcda7

polish

e3cf2cc

Merge branch 'main' into smaller-policy

2d05d74

snazy reviewed Jul 23, 2025

View reviewed changes

eric-maynard added 5 commits July 30, 2025 21:01

changes per review

c0382d1

autolint

e2fe773

another fix

f1e9bc6

Merge branch 'smaller-policy' of github.202132.xyz-oss:eric-maynard/polaris …

97aedbd

…into smaller-policy

autolint

54028ac

eric-maynard requested a review from snazy July 30, 2025 12:02

eric-maynard added 3 commits August 2, 2025 14:41

Merge branch 'main' of github.com:apache/polaris into smaller-policy

617f00a

remove repeated StorageLocation.of calls

6adb556

autolint

854ad52

snazy reviewed Aug 4, 2025

View reviewed changes

eric-maynard added 5 commits August 4, 2025 08:47

flip removal

85c1fc5

revert

8aeed87

autolint

73284c7

tweak loop

7968ec3

tweak loop

7efdc0c

eric-maynard requested a review from snazy August 5, 2025 17:53

eric-maynard added 3 commits August 11, 2025 14:25

Merge branch 'main' of github.com:apache/polaris into smaller-policy

cd93fe9

improve StorageLocation

272d6c2

autolint

75daf08

snazy approved these changes Aug 12, 2025

View reviewed changes

github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Aug 12, 2025

eric-maynard merged commit 027d80b into apache:main Aug 13, 2025
12 checks passed

github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Aug 13, 2025

dennishuo mentioned this pull request Aug 29, 2025

Service: Validate allowed locations from Catalog Storage Config #2473

Merged

Remove redundant locations when constructing access policies #2149

Remove redundant locations when constructing access policies #2149

Uh oh!

Conversation

eric-maynard commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snazy Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snazy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eric-maynard commented Jul 21, 2025 •

edited

Loading

snazy Aug 1, 2025 •

edited

Loading

eric-maynard Aug 7, 2025 •

edited

Loading