Persistence implementations for list pagination #1555

eric-maynard · 2025-05-09T21:08:26Z

In #1528, we introduced the interface changes necessary to paginate requests to listTables, listViews, and listNamespaces. This PR adds the persistence-level logic for pagination and a new PageToken type, EntityIdPageToken, which is used to paginate requests based on entity ID.

snazy

I don't think that the approach implemented here yields correct results. The code assumes strict ordering of integer IDs, which, from general experience with relational DBs and in particular from looking at org.apache.polaris.extension.persistence.relational.jdbc.IdGenerator, is not the case.

snazy · 2025-05-12T07:50:11Z

...n/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcBasePersistenceImpl.java

    String query = QueryGenerator.generateSelectQuery(new ModelEntity(), params);
+
+    if (pageToken instanceof EntityIdPageToken entityIdPageToken) {
+      query += String.format(" AND id > %d ORDER BY id ASC", entityIdPageToken.getId());


How would this work with org.apache.polaris.extension.persistence.relational.jdbc.IdGenerator?

I think you're right that it won't; this logic is copied from EclipseLink where IDs are always increasing but does not work with the current way that the JDBC metastore creates IDs.

I'd propose to change the Page/PageToken contract in a way to push the parameter "as is" down to the persistence layer and let the persistence implementation deal with it.

I spoke with @singhpk234 who noted this is probably the same discussion as here on the old PR. With that context, I think we might be OK here.

IMO it's alright that the list ordering you'd get across metastores won't be the same. Other than that difference, seems like everything should work with JDBC's IdGenerator. Although the IDs aren't generated sequentially, pagination only uses the entity ID as an essentially arbitrary consistent ordering.

The key implication here is that if an entity gets created in the middle of a listing operation (e.g. between list calls 2 and 3) it may or may not show up in the next page. An alternative would be to try to filter it out so that the behavior is more obvious & consistent, but I think the simple approach that ultimately gives the user a chance to see these new entities is good.

Losing new entities that are stored after pagination start is fine from my POV. The JDBC persistence does not implement catalog-level versioning, so this is unavoidable, I guess.

Agreed that we will naturally lose some entities, the question is whether we are OK with entities stored after pagination start being lost nondeterministically rather than always. Right now, whether the new entity is lost or not depends on what entity ID it gets. If it gets a high entity ID you might see it in a later page and if it gets a low ID you might not.
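To make the nondeterminism concrete, here is a minimal in-memory sketch. It is not the PR's code: the class and method names are hypothetical, a TreeSet stands in for the entities table, and IDs are assumed to be positive so that 0 can serve as the initial cursor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ConcurrentInsertSketch {

  // Same keyset step as "id > lastSeenId ORDER BY id ASC", limited to pageSize rows.
  static List<Long> nextPage(TreeSet<Long> ids, long lastSeenId, int pageSize) {
    List<Long> page = new ArrayList<>();
    for (long id : ids.tailSet(lastSeenId, false)) {
      if (page.size() == pageSize) break;
      page.add(id);
    }
    return page;
  }

  // Lists everything in pages of two, inserting newId after the first page was served.
  static List<Long> listWithInsertAfterFirstPage(long newId) {
    TreeSet<Long> ids = new TreeSet<>(List.of(10L, 20L, 30L, 40L));
    List<Long> seen = new ArrayList<>(nextPage(ids, 0L, 2)); // first page: [10, 20]
    ids.add(newId); // entity created between list calls
    long cursor = seen.get(seen.size() - 1);
    List<Long> page;
    while (!(page = nextPage(ids, cursor, 2)).isEmpty()) {
      seen.addAll(page);
      cursor = page.get(page.size() - 1);
    }
    return seen;
  }

  public static void main(String[] args) {
    // New ID sorts after the cursor, so it shows up: [10, 20, 30, 35, 40]
    System.out.println(listWithInsertAfterFirstPage(35L));
    // New ID sorts before the cursor, so it is missed: [10, 20, 30, 40]
    System.out.println(listWithInsertAfterFirstPage(15L));
  }
}
```

Whether the new entity is listed depends only on where its ID falls relative to the cursor at the time it is created.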

My thought on this question is "yes", because it's better to show the entity if we can and it simplifies the code.

But if we feel like this is too unintuitive, we can add a secondary filter on the entity's creation time to try and get rid of these entities (on a best-effort basis, since clocks are not perfect).
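A rough sketch of what such a secondary filter could look like, purely for illustration. It assumes each entity carries a creation timestamp and that the page token would also carry the time the listing started; neither is part of this PR, and all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CreationTimeFilterSketch {

  // Hypothetical helper: keys are entity IDs, values are creation timestamps in millis.
  // Returns up to pageSize IDs greater than lastSeenId, skipping entities whose
  // creation timestamp is later than the start of the listing.
  static List<Long> nextPage(
      NavigableMap<Long, Long> createdAtById, long lastSeenId, long listingStartMillis, int pageSize) {
    List<Long> page = new ArrayList<>();
    for (Map.Entry<Long, Long> entry : createdAtById.tailMap(lastSeenId, false).entrySet()) {
      if (page.size() == pageSize) break;
      if (entry.getValue() <= listingStartMillis) page.add(entry.getKey());
    }
    return page;
  }

  public static void main(String[] args) {
    NavigableMap<Long, Long> createdAtById = new TreeMap<>();
    createdAtById.put(10L, 100L);
    createdAtById.put(20L, 100L);
    createdAtById.put(30L, 100L);
    createdAtById.put(35L, 250L); // created after the listing started at t=200
    createdAtById.put(40L, 100L);
    System.out.println(nextPage(createdAtById, 20L, 200L, 2)); // [30, 40]
  }
}
```

The filter is only as good as the timestamps: with skewed clocks, an entity created just after the listing started could still slip through, or one created just before could be dropped.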

I think current pagination behaviour wrt concurrent changes is fine.

Making it deterministic would be a great addition to Polaris, but that, I think, has a much broader scope. For example, if an entry is deleted after pagination starts, but a client re-submits a page request using an old token, the new response would still be inconsistent with the old response.

From my POV a complete and deterministic pagination solution implies catalog-level versioning.

eric-maynard · 2025-05-12T17:22:21Z

The code assumes strict ordering of integer IDs

On this note, I think it's not true actually. The code assumes that IDs are sortable but it doesn't rely on any kind of semantic meaning behind this comparison. So IDs can be created totally randomly and you can still paginate simply by breaking that random key space into pages of some size. There's no assumption that new entries will appear at the end, either.
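As a small illustration of that point, the sketch below pages through randomly generated IDs using nothing but "greater than the last ID seen". It is an in-memory stand-in rather than the JDBC code: the class and method names are hypothetical, and a TreeSet plays the role of "ORDER BY id ASC".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.Random;
import java.util.TreeSet;

public class RandomIdPagingSketch {

  // Returns up to pageSize IDs strictly greater than lastSeenId, in ascending order.
  // A null lastSeenId means "start from the beginning".
  static List<Long> nextPage(NavigableSet<Long> ids, Long lastSeenId, int pageSize) {
    Iterable<Long> candidates = lastSeenId == null ? ids : ids.tailSet(lastSeenId, false);
    List<Long> page = new ArrayList<>();
    for (Long id : candidates) {
      if (page.size() == pageSize) break;
      page.add(id);
    }
    return page;
  }

  // Walks all pages, using the last ID of each page as the cursor for the next one.
  static List<Long> listAll(NavigableSet<Long> ids, int pageSize) {
    List<Long> seen = new ArrayList<>();
    Long cursor = null;
    while (true) {
      List<Long> page = nextPage(ids, cursor, pageSize);
      if (page.isEmpty()) break;
      seen.addAll(page);
      cursor = page.get(page.size() - 1);
    }
    return seen;
  }

  public static void main(String[] args) {
    NavigableSet<Long> ids = new TreeSet<>();
    Random random = new Random(42);
    while (ids.size() < 25) ids.add(random.nextLong()); // IDs with no meaningful order
    List<Long> listed = listAll(ids, 10);
    System.out.println(listed.equals(new ArrayList<>(ids))); // true: every ID listed exactly once
  }
}
```

As long as the set of IDs does not change during the listing, every entity is returned exactly once, regardless of how the IDs were generated.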

adnanhemani

Small comments, overall LGTM

adnanhemani · 2025-05-13T20:25:39Z

polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/PageToken.java

+      try {
+        String[] parts = token.split("/");
+        if (parts.length < 1) {
+          throw new IllegalArgumentException("Invalid token format: " + token);
+        } else if (parts[0].equals(EntityIdPageToken.PREFIX)) {
+          int resolvedPageSize = pageSize == null ? Integer.parseInt(parts[2]) : pageSize;
+          return new EntityIdPageToken(Long.parseLong(parts[1]), resolvedPageSize);
+        } else {
+          throw new IllegalArgumentException("Unrecognized page token: " + token);
+        }
+      } catch (NumberFormatException | IndexOutOfBoundsException e) {
+        LOGGER.debug(e.getMessage());
+        throw new IllegalArgumentException("Invalid token format: " + token);
+      }


I, personally, find this fragment a bit more complex than it may need to be. Is there a reason why we cannot defensively check for the right array length right after splitting? Same for the NumberFormatException?

I wanted to structure the code in a way that obviously leaves the door open for other PageToken implementations -- those would have different array length expectations. So we check the prefix first, and then parse the token using the logic appropriate for the PageToken implementation that the prefix corresponds to. Ideally we could even push this parsing logic down into some method in the PageToken.

In the old PR, there were 2 parseable PageToken implementations. I do agree that it looks a little clunky with the single PageToken implementation we have now. If this really is too confusing I can simplify this logic and then we can re-complicate it if/when we add a new PageToken.

If this really is too confusing I can simplify this logic and then we can re-complicate it if/when we add a new PageToken.

I'd personally prefer this - but don't care enough if we do this or not, I can understand the reasoning.
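For reference, a simplified version along the lines discussed might look like the sketch below. It is not the code that was committed: the class name, the PREFIX value, and the long[] return type are placeholders chosen to keep the example self-contained, and it assumes the "prefix/entityId/pageSize" layout implied by the fragment above.

```java
public class PageTokenParseSketch {

  // Hypothetical prefix value; the real EntityIdPageToken.PREFIX is not shown in this thread.
  static final String PREFIX = "entity-id";

  // Parses "<prefix>/<entityId>/<pageSize>" and returns {entityId, resolvedPageSize}.
  // A non-null requestedPageSize overrides the page size stored in the token.
  static long[] parse(String token, Integer requestedPageSize) {
    String[] parts = token.split("/");
    // Check shape and prefix up front, before touching individual parts.
    if (parts.length != 3 || !parts[0].equals(PREFIX)) {
      throw new IllegalArgumentException("Invalid token format: " + token);
    }
    try {
      long entityId = Long.parseLong(parts[1]);
      int pageSize = requestedPageSize != null ? requestedPageSize : Integer.parseInt(parts[2]);
      return new long[] {entityId, pageSize};
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Invalid token format: " + token, e);
    }
  }

  public static void main(String[] args) {
    long[] parsed = parse(PREFIX + "/42/100", null);
    System.out.println(parsed[0] + " " + parsed[1]); // 42 100
  }
}
```

One behavioral difference from the fragment above: this version always requires three parts, whereas the original only reads the third part when no page size is supplied. The up-front length check is also what would need to be relaxed again if a second token format were added.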

...ce/common/src/test/java/org/apache/polaris/service/persistence/pagination/PageTokenTest.java

Following up on apache#1555

* Refactor pagination code to delineate API-level page tokens and internal "pointers to data"
* Requests deal with the "previous" token, user-provided page size (optional) and the previous request's page size.
* Concentrate the logic of combining page size requests and previous tokens in PageTokenUtil
* PageToken subclasses are no longer necessary. EntityIdPaging handles pagination over ordered result sets with static helper methods.

Co-authored-by: Eric Maynard <[email protected]>

github-actions · 2025-06-29T02:13:18Z

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Based on apache#1838, following up on apache#1555

* Allows multiple implementations of `Token` referencing the "next page", encapsulated in `PageToken`. No changes to `polaris-core` needed to add custom `Token` implementations.
* Extensible to (later) support (cryptographic) signatures to prevent tampering with page tokens
* Refactor pagination code to delineate API-level page tokens and internal "pointers to data"
* Requests deal with the "previous" token, user-provided page size (optional) and the previous request's page size.
* Concentrate the logic of combining page size requests and previous tokens in `PageTokenUtil`
* `PageToken` subclasses are no longer necessary.
* Serialization of `PageToken` uses Jackson serialization (smile format)

Since no (metastore level) implementation handling pagination existed before, no backwards compatibility is needed.

Co-authored-by: Dmitri Bourlatchkov <[email protected]>
Co-authored-by: Eric Maynard <[email protected]>

eric-maynard added 3 commits May 9, 2025 13:10

add pagetoken impl

3375ff5

persistence impls

255e44b

stable

015e637

eric-maynard requested review from adutra, ashvina, collado-mike, dennishuo, dimas-b, jackye1995, jbonofre and vvcephei as code owners May 9, 2025 21:08

github-project-automation bot added this to Basic Kanban Board May 9, 2025

eric-maynard requested review from HonahX, MonkeyCanCode, RussellSpitzer, ebyhr, flyrain, pingtimeout, snazy and takidau as code owners May 9, 2025 21:08

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board May 9, 2025

eric-maynard added 3 commits May 9, 2025 14:12

another test

434ffb1

another small test

8ffc29d

autolint

66d20aa

eric-maynard force-pushed the pagination-persistence branch from 1c383cf to 66d20aa Compare May 9, 2025 21:23

typofix

a5d62ed

snazy requested changes May 12, 2025

eric-maynard requested a review from snazy May 12, 2025 17:04

adnanhemani approved these changes May 13, 2025

dimas-b mentioned this pull request Jun 9, 2025

Persistence implementation for pagination in some requests #1838

Closed

snazy mentioned this pull request Jun 25, 2025

Extensible pagination token implementation #1938

Merged

github-actions bot added the Stale label Jun 29, 2025

github-actions bot closed this Jul 6, 2025

github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board Jul 6, 2025
