Contributor

@eric-maynard eric-maynard commented May 9, 2025

In #1528, we introduced the interface changes necessary to paginate requests to listTables, listViews, and listNamespaces. This PR adds the persistence-level logic for pagination and a new PageToken type, EntityIdPageToken, which is used to paginate requests based on entity ID.

@eric-maynard eric-maynard force-pushed the pagination-persistence branch from 1c383cf to 66d20aa Compare May 9, 2025 21:23
Member

@snazy snazy left a comment

I don't think that the approach implemented here yields correct results. The code assumes strict ordering of integer IDs, which, from general experience with relational DBs and in particular from looking at org.apache.polaris.extension.persistence.relational.jdbc.IdGenerator, is not the case.

String query = QueryGenerator.generateSelectQuery(new ModelEntity(), params);

if (pageToken instanceof EntityIdPageToken entityIdPageToken) {
  query += String.format(" AND id > %d ORDER BY id ASC", entityIdPageToken.getId());
Member

How would this work with org.apache.polaris.extension.persistence.relational.jdbc.IdGenerator?

Contributor Author

I think you're right that it won't; this logic is copied from EclipseLink, where IDs are always increasing, but it does not work with the way the JDBC metastore currently creates IDs.

Member

I'd propose changing the Page/PageToken contract so that the parameter is pushed down "as is" to the persistence layer, letting the persistence implementation deal with it.

Contributor Author

@eric-maynard eric-maynard May 12, 2025

I spoke with @singhpk234, who noted this is probably the same discussion as the one on the old PR. With that context, I think we might be OK here.

IMO it's alright that the list ordering you'd get across metastores won't be the same. Other than that difference, it seems like everything should work with JDBC's IdGenerator. Although the IDs aren't generated sequentially, pagination only uses the entity ID as an essentially arbitrary but consistent ordering.

The key implication here is that if an entity gets created in the middle of a listing operation (e.g. between list calls 2 and 3) it may or may not show up in the next page. An alternative would be to try to filter it out so that the behavior is more obvious & consistent, but I think the simple approach that ultimately gives the user a chance to see these new entities is good.
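
The "may or may not show up" behavior can be reproduced with a small in-memory sketch. This is not the PR's code: `ConcurrentInsertSketch` and its methods are hypothetical names, and a `TreeSet` stands in for a table queried with `id > ? ORDER BY id ASC` plus a page-size limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class ConcurrentInsertSketch {
  // Returns up to pageSize IDs strictly greater than lastSeenId, ascending.
  // A null lastSeenId means "start from the beginning".
  static List<Long> nextPage(NavigableSet<Long> ids, Long lastSeenId, int pageSize) {
    NavigableSet<Long> remaining = lastSeenId == null ? ids : ids.tailSet(lastSeenId, false);
    List<Long> page = new ArrayList<>();
    for (Long id : remaining) {
      if (page.size() == pageSize) {
        break;
      }
      page.add(id);
    }
    return page;
  }

  // Lists everything in pages of two, creating newId right after the first page is returned.
  static List<Long> listWithInsertAfterFirstPage(List<Long> initialIds, long newId) {
    NavigableSet<Long> store = new TreeSet<>(initialIds);
    List<Long> seen = new ArrayList<>();
    List<Long> page = nextPage(store, null, 2);
    boolean inserted = false;
    while (!page.isEmpty()) {
      seen.addAll(page);
      if (!inserted) {
        store.add(newId); // simulates a concurrent create between list calls
        inserted = true;
      }
      page = nextPage(store, seen.get(seen.size() - 1), 2);
    }
    return seen;
  }

  public static void main(String[] args) {
    List<Long> initial = List.of(10L, 20L, 30L, 40L);
    // ID 15 sorts before the cursor (20) and is skipped; ID 35 sorts after it and is listed.
    System.out.println(listWithInsertAfterFirstPage(initial, 15L));
    System.out.println(listWithInsertAfterFirstPage(initial, 35L));
  }
}
```

Whether the new entity is visible depends only on where its ID falls relative to the current cursor, not on when it was created.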

Contributor

Losing new entities that are stored after pagination start is fine from my POV. The JDBC persistence does not implement catalog-level versioning, so this is unavoidable, I guess.

Contributor Author

@eric-maynard eric-maynard May 13, 2025

Agreed that we will naturally lose some entities; the question is whether we are OK with entities stored after pagination start being lost nondeterministically rather than always. Right now, whether the new entity is lost or not depends on what entity ID it gets. If it gets a high entity ID, you might see it in a later page, and if it gets a low ID, you might not.

My thought on this question is "yes", because it's better to show the entity if we can and it simplifies the code.

But if we feel like this is too unintuitive, we can add a secondary filter on the entity's creation time to try and get rid of these entities (on a best-effort basis, since clocks are not perfect).
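
For reference, a rough sketch of what such a secondary filter could look like. Everything here is an assumption made for illustration: the `Entity` shape, the `createdAtMillis` field, and the idea of carrying the listing's start time alongside the cursor are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CreationTimeFilterSketch {
  // Hypothetical entity shape: only the fields the filter needs.
  static final class Entity {
    final long id;
    final long createdAtMillis;

    Entity(long id, long createdAtMillis) {
      this.id = id;
      this.createdAtMillis = createdAtMillis;
    }
  }

  // Returns the next page of IDs after lastSeenId (null = first page), ignoring
  // entities created after the listing started. Best effort only: the result is
  // only as reliable as the clocks that produced createdAtMillis.
  static List<Long> nextPage(
      List<Entity> store, Long lastSeenId, long listingStartMillis, int pageSize) {
    List<Long> ids = new ArrayList<>();
    store.stream()
        .filter(e -> lastSeenId == null || e.id > lastSeenId)
        .filter(e -> e.createdAtMillis <= listingStartMillis)
        .sorted(Comparator.comparingLong((Entity e) -> e.id))
        .limit(pageSize)
        .forEach(e -> ids.add(e.id));
    return ids;
  }
}
```

With this filter, an entity created mid-listing is hidden regardless of the ID it gets, at the cost of carrying extra state with the cursor.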

Contributor

I think current pagination behaviour wrt concurrent changes is fine.

Making it deterministic would be a great addition to Polaris, but that, I think, has a much broader scope. For example, if an entry is deleted after pagination starts, but a client re-submits a page request using an old token, the new response would still be inconsistent with the old response.

From my POV a complete and deterministic pagination solution implies catalog-level versioning.

@eric-maynard eric-maynard requested a review from snazy May 12, 2025 17:04
@eric-maynard
Contributor Author

> The code assumes strict ordering of integer IDs

On this note, I think it's not true actually. The code assumes that IDs are sortable but it doesn't rely on any kind of semantic meaning behind this comparison. So IDs can be created totally randomly and you can still paginate simply by breaking that random key space into pages of some size. There's no assumption that new entries will appear at the end, either.
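
To illustrate, here is a minimal in-memory sketch (not the PR's code; `KeysetPagingSketch` and its methods are hypothetical names). It assumes only that IDs are unique and comparable, generates them randomly, and walks them the way `id > ? ORDER BY id ASC` with a page-size limit would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.Random;
import java.util.TreeSet;

final class KeysetPagingSketch {
  // Returns up to pageSize IDs strictly greater than lastSeenId, ascending.
  // A null lastSeenId means "start from the beginning".
  static List<Long> nextPage(NavigableSet<Long> ids, Long lastSeenId, int pageSize) {
    NavigableSet<Long> remaining = lastSeenId == null ? ids : ids.tailSet(lastSeenId, false);
    List<Long> page = new ArrayList<>();
    for (Long id : remaining) {
      if (page.size() == pageSize) {
        break;
      }
      page.add(id);
    }
    return page;
  }

  // Follows the cursor page by page until an empty page is returned.
  static List<Long> listAll(NavigableSet<Long> ids, int pageSize) {
    List<Long> all = new ArrayList<>();
    Long cursor = null;
    while (true) {
      List<Long> page = nextPage(ids, cursor, pageSize);
      if (page.isEmpty()) {
        return all;
      }
      all.addAll(page);
      cursor = page.get(page.size() - 1); // the last ID of a page becomes the next cursor
    }
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    NavigableSet<Long> ids = new TreeSet<>();
    while (ids.size() < 25) {
      ids.add(random.nextLong()); // IDs are random, not sequential
    }
    // Every ID is listed exactly once as long as the set is not modified mid-listing.
    System.out.println(listAll(ids, 7).equals(new ArrayList<>(ids)));
  }
}
```

The walk depends only on the IDs having a total order; nothing requires newly created IDs to sort last.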

Contributor

@adnanhemani adnanhemani left a comment

Small comments, overall LGTM

Comment on lines +62 to +75
try {
  String[] parts = token.split("/");
  if (parts.length < 1) {
    throw new IllegalArgumentException("Invalid token format: " + token);
  } else if (parts[0].equals(EntityIdPageToken.PREFIX)) {
    int resolvedPageSize = pageSize == null ? Integer.parseInt(parts[2]) : pageSize;
    return new EntityIdPageToken(Long.parseLong(parts[1]), resolvedPageSize);
  } else {
    throw new IllegalArgumentException("Unrecognized page token: " + token);
  }
} catch (NumberFormatException | IndexOutOfBoundsException e) {
  LOGGER.debug(e.getMessage());
  throw new IllegalArgumentException("Invalid token format: " + token);
}
Contributor

I, personally, find this fragment a bit more complex than it may need to be. Is there a reason why we cannot defensively check for the right array length right after splitting? Same for the NumberFormatException?

Contributor Author

@eric-maynard eric-maynard May 13, 2025

I wanted to structure the code in a way that obviously leaves the door open for other PageToken implementations -- those would have different array length expectations. So we check the prefix first, and then parse the token using the logic appropriate for the PageToken implementation that the prefix corresponds to. Ideally we could even push this parsing logic down into some method in the PageToken.

In the old PR, there were 2 parseable PageToken implementations. I do agree that it looks a little clunky with the single PageToken implementation we have now. If this really is too confusing I can simplify this logic and then we can re-complicate it if/when we add a new PageToken.
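
As a rough sketch of that "push the parsing down" idea (not code from this PR): the dispatcher below only inspects the prefix, and the token type validates its own shape. The class names, the `fromParts` method, and the `"entity-id"` prefix value are all hypothetical. Unlike the fragment under review, this version always requires all three parts, even when a page size is supplied by the caller.

```java
final class PageTokenParsingSketch {
  static final String ENTITY_ID_PREFIX = "entity-id"; // assumed value, for illustration only

  // Hypothetical token type that owns its own parsing rules.
  static final class EntityIdToken {
    final long entityId;
    final int pageSize;

    EntityIdToken(long entityId, int pageSize) {
      this.entityId = entityId;
      this.pageSize = pageSize;
    }

    // Expects "<prefix>/<entityId>/<pageSize>"; a caller-supplied page size wins.
    static EntityIdToken fromParts(String token, String[] parts, Integer requestedPageSize) {
      if (parts.length != 3) {
        throw new IllegalArgumentException("Invalid token format: " + token);
      }
      try {
        long entityId = Long.parseLong(parts[1]);
        int pageSize =
            requestedPageSize != null ? requestedPageSize : Integer.parseInt(parts[2]);
        return new EntityIdToken(entityId, pageSize);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Invalid token format: " + token, e);
      }
    }
  }

  // Checks the prefix first, then delegates to the matching token type.
  static EntityIdToken parse(String token, Integer requestedPageSize) {
    String[] parts = token.split("/");
    if (parts.length == 0) { // e.g. the token "/" splits into an empty array
      throw new IllegalArgumentException("Invalid token format: " + token);
    }
    if (ENTITY_ID_PREFIX.equals(parts[0])) {
      return EntityIdToken.fromParts(token, parts, requestedPageSize);
    }
    throw new IllegalArgumentException("Unrecognized page token: " + token);
  }
}
```

Adding another token type would then mean adding one more prefix branch and one more `fromParts`, without touching the length checks of existing types.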

Contributor

> If this really is too confusing I can simplify this logic and then we can re-complicate it if/when we add a new PageToken.

I'd personally prefer this, but I don't feel strongly either way; I can understand the reasoning.

dimas-b added a commit to dimas-b/polaris that referenced this pull request Jun 10, 2025
Following up on apache#1555

* Refactor pagination code to delineate API-level page tokens and
  internal "pointers to data"

* Requests deal with the "previous" token, user-provided page size
  (optional) and the previous request's page size.

* Concentrate the logic of combining page size requests
  and previous tokens in PageTokenUtil

* PageToken subclasses are no longer necessary. EntityIdPaging handles
  pagination over ordered result sets with static helper methods.

Co-authored-by: Eric Maynard <[email protected]>
@github-actions

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 29, 2025
@github-actions github-actions bot closed this Jul 6, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board Jul 6, 2025
snazy added a commit to snazy/polaris that referenced this pull request Jul 15, 2025
Based on apache#1838, following up on apache#1555

* Allows multiple implementations of `Token` referencing the "next page", encapsulated in `PageToken`. No changes to `polaris-core` needed to add custom `Token` implementations.
* Extensible to (later) support (cryptographic) signatures to prevent tampering with page tokens
* Refactor pagination code to delineate API-level page tokens and internal "pointers to data"
* Requests deal with the "previous" token, user-provided page size (optional) and the previous request's page size.
* Concentrate the logic of combining page size requests and previous tokens in `PageTokenUtil`
* `PageToken` subclasses are no longer necessary.
* Serialization of `PageToken` uses Jackson serialization (smile format)

Since no (metastore level) implementation handling pagination existed before, no backwards compatibility is needed.

Co-authored-by: Dmitri Bourlatchkov <[email protected]>
Co-authored-by: Eric Maynard <[email protected]>
snazy added a commit that referenced this pull request Jul 16, 2025
Based on #1838, following up on #1555

* Allows multiple implementations of `Token` referencing the "next page", encapsulated in `PageToken`. No changes to `polaris-core` needed to add custom `Token` implementations.
* Extensible to (later) support (cryptographic) signatures to prevent tampering with page tokens
* Refactor pagination code to delineate API-level page tokens and internal "pointers to data"
* Requests deal with the "previous" token, user-provided page size (optional) and the previous request's page size.
* Concentrate the logic of combining page size requests and previous tokens in `PageTokenUtil`
* `PageToken` subclasses are no longer necessary.
* Serialization of `PageToken` uses Jackson serialization (smile format)

Since no (metastore level) implementation handling pagination existed before, no backwards compatibility is needed.

Co-authored-by: Dmitri Bourlatchkov <[email protected]>
Co-authored-by: Eric Maynard <[email protected]>