-
Notifications
You must be signed in to change notification settings - Fork 331
Add PolarisMetaStoreManager.loadEntities #2290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PolarisMetaStoreManager.loadEntities #2290
Conversation
b3363a7 to
e10c856
Compare
c4acd84 to
4a62e67
Compare
4a62e67 to
9793869
Compare
9793869 to
f960edd
Compare
| PolarisEntitySubType.NULL_SUBTYPE, | ||
| PolicyEntity::of) | ||
| .stream() | ||
| .filter(policyEntity -> policyType == null || policyEntity.getPolicyType() == policyType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note:
if the required policyType is null we could in theory use the optimized listEntities call, as we only need the name of the entity to build the PolicyIdentifier, but for filtering by policyType we need to load the full entity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note:
if the required policyType is null we could in theory use the optimized listEntities call, as we only need the name of the entity to build the PolicyIdentifier, but for filtering by policyType we need to load the full entity.
This could be worth adding in a code comment here as a // TODO either as an optional followup task or if nothing else, calling attention to the different flavors of listing so that future code changes will put more thought into the choice of listing methods.
Ideally any use of these open-ended filters will either be a short-term crutch or a proven "small" use case where we think pushdown isn't worth the complexity. Longer term we could define a more structured definition of pushdown predicates that is still extensible but communicates filter semantics down to the BasePersistence layer enough to work with different database-level indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've added a TODO and pushed a follow-up to #2370
f960edd to
a478c1d
Compare
dennishuo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a big improvement, thanks for taking this on!
I can do a more detailed dive, but at a high level, I'd like to see if we can better sort out the responsibilities of the MetaStoreManager layer vs the BasePersistence layer here. Specifically, if we can pull up the entityFilter/transformer from BasePersistence to only live in the MetaStoreManager level, just as your #2317 is a push down of the entitySubType filter. Essentially:
- We want the [Base]Persistence layer to most closely represent what a lower-level database is directly capable of doing
- We put more "advanced" logic in the MetaStoreManager layer
- In the filter/transformer scenario, what we're really saying is "BasePersistence needs to give back complete PolarisBaseEntities instead of only EntityNameLookupRecords", and then something might run Polaris-side filtering/transforms -- logically this would be in MetaStoreManager or higher.
This also tells us why it made sense for the loadTasks to use a Predicate<> as the filter, but why subTypeCode should not be an opaque Predicate<> -- subTypeCode filtering is something the lower-level database is capable of understanding so therefore it is pushed down to the layer that represents the raw database.
Whereas advanced timestamp comparison, leasing, etc., of TaskEntities is a Polaris concept that the database isn't designed to understand directly (for now) so we fallback to the opaque Predicate.
Overall, I'd like to see if we can:
- Also rename the
filter/transformervariations oflistEntitieswithin theBasePersistence, similar to how you introduced a totally different method name in the MetaStoreManager layer -- because now it's really not an overload of the same method, but really a different action entirely. I thinkloadEntitiesis reasonable to push down as the method name, but we just might want to consider the difference between "list and load full entities under a parent" and "load the full entities provided in this Collection" that we might need in the future - Pull filter/transformer evaluation out of BasePersistence into MetaStoreManager
- Reconsider whether we actually need a
Function<PolarisBaseEntity, T>generic type return value -- as far as I recall, originally we just abused that to "transform" a PolarisBaseEntity into a trimmed EntityNameLookupRecord, which was an antipattern as you identified, and then nearly every other case was just the "identity transformation". Now that we clarified the "EntityNameLookup" case, perhaps callsites never actually use the transformer anymore? I didn't have time to dig deeper yet into all the transformer use cases.
|
thanks for the feedback, since the other PR is somewhat merge-ready i will wait for that and rebase this one afterwards.
yeah we can use the same
i havent investigated this in detail yet but my current guess is that the filter needs to remain "pushed down" in order for pagination to work correctly, but will double check that later.
afaict the transformer is used heavily by all callers of unless you mean that this transformation can also happen on the "outside" ? |
18e3529 to
9c7d42d
Compare
|
rebased on latest main due to a few merge conflict. also included the rename of one of the |
polaris-core/src/main/java/org/apache/polaris/core/persistence/PolarisMetaStoreManager.java
Show resolved
Hide resolved
9c7d42d to
36657d7
Compare
|
rebased after trivial merge-conflict in |
36657d7 to
ab37780
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is on the right track, thanks!
i havent investigated this in detail yet but my current guess is that the filter needs to remain "pushed down" in order for pagination to work correctly, but will double check that later.
Good point, I wonder if we've sufficiently defined the expected semantics of inline-filtering vs post-filtering, especially in how it interacts with pagination. But indeed we shouldn't break any existing behavior. Is there anything other than loadTasks right now that relies on the inline-filtering semantic? I believe the PolicyCatalog filter scenario is a post-filter right now. I'm now wondering if the Filter can just remain only in the BasePersistence to avoid breaking changes, and opting not to add a filter/transformer for now in the PolarisMetaStoreManager layer.
For defining the PolarisMetaStoreManager interface, as the PolarisMetaStoreManager is one of the main SPI components it's worth being somewhat more conservative in adding methods.
unless you mean that this transformation can also happen on the "outside" ?
Yeah, if possible, all the syntactic sugar type-cases like PrincipalEntity.of I'd prefer to just move out to the caller for now.
The more I mull it over the the more it seems like the previous additions of filter and transformer to BasePersistence were a mistake in hindsight, especially as long as it was used simply as syntactic sugar rather than defining any true pushdown semantic, except for the filtering in loadTasks (which we want to rework anyways), and we might want to avoid perpetuating it for now until we:
- Have a concrete use case that we're sure needs to be truly pushed-down
- Have a plan with guardrails in place to carefully define what's allowed to be pushed down
Right now, every method signature in PolarisMetaStoreManager is PolarisCallContext plus JSON-serializable Plain Old Data types, and this was intentional historically, but maybe never fully documented in developer guidelines. This means PolarisMetaStoreManager is a runtime-configurable SPI type that is capable of being an RPC implementation. (Incidentally, for full disclosure I'm mentioning this not just as a philosophical observation, but can vouch for the RPC pattern being an actual stakeholder use case in production).
It's technically possible to add basic loadEntities that preserves this invariant for now without the filter/transformer arguments, while still providing the benefit of the more fundamental functionality of fetching entire entities all at once.
It's also certainly conceivable to intentionally evolve the interface to no longer satisfy the Plain Old Data types constraint, but would simply be a longer discussion.
How would you feel about these next steps?
- Remove
filterandtransformerfrom the newPolarisMetaStoreManager::loadEntitiesmethod for now - No longer have the different
loadEntitiesvsloadEntitiesAllfor now since we'll deal with filters/transformers later, and even if filter/transformer use cases are really added, the base method remains useful forexisting callsites (I appreciate you breaking these out separately though! Very helpful to start distinguishing callsites that wanted syntactic-sugar tranformations without filters, vs those that actually need filters) - Relegate "syntactic-sugar" transformation conversions to be the responsibility of the caller for now -- AFAICT in the current code all of the transformers just mean moving a
FooEntity::ofout of the method argument down two lines into astream().map(...)call -- I know technically this is still less efficient than pushdown stream-transformation, but the size of the immediate members of a PolarisBaseEntity are small compared to the actual internalProperties/properties which would be copied by reference to the type-specific wrapper - Followup discussion on mailing list for how to properly think about "filter pushdown", with PolicyCatalog likely being the primary driving use case for PolicyType filtering.
For filtering, it seems to me:
- Use cases that are fine with strict post-filtering remain at the caller
- Partial-pushdown "stream filtering" is basically what we have for
loadTaskstoday and really nothing else. This doesn't help much with efficiency of what's loaded from the database, but is helpful for already-filtered pagination. Whether we want to generalize this partial-pushdown scenario needs to be discussed - For critical high-scale filtering use cases in the future, we really need a structure that allows more extensible "full pushdown" to the database. This means expressing the filters in well-structured/enumerated ways that cooperate with definitions of secondary indexes on the database. This means a departure from just taking a java
Predicate. Structured predicate pushdown here would not only be higher-performance, but would also allow preserving having only Plain Old Data interfaces in PolarisMetaStoreManager.
| PolarisEntitySubType.NULL_SUBTYPE, | ||
| PolicyEntity::of) | ||
| .stream() | ||
| .filter(policyEntity -> policyType == null || policyEntity.getPolicyType() == policyType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note:
if the required policyType is null we could in theory use the optimized listEntities call, as we only need the name of the entity to build the PolicyIdentifier, but for filtering by policyType we need to load the full entity.
This could be worth adding in a code comment here as a // TODO either as an optional followup task or if nothing else, calling attention to the different flavors of listing so that future code changes will put more thought into the choice of listing methods.
Ideally any use of these open-ended filters will either be a short-term crutch or a proven "small" use case where we think pushdown isn't worth the complexity. Longer term we could define a more structured definition of pushdown predicates that is still extensible but communicates filter semantics down to the BasePersistence layer enough to work with different database-level indexes.
| null, | ||
| PolarisEntityType.CATALOG, | ||
| PolarisEntitySubType.ANY_SUBTYPE, | ||
| CatalogEntity::of) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these type-wrappers from https://github.com/apache/polaris/pull/2261/files the only usage of transformer that aren't Identity transformations now? Alongside the PolicyEntity.of in PolicyCatalog?
I guess I can see how it's a bit cleaner being able to pass these into the metaStoreManager and if callers don't end up doing yet another .stream().map(...).toList() maybe there's avoidance of a bit of memory copying between lists, but as we're also starting to clarify better in the mailing list discussion about Polaris SPIs, the PolarisMetaStoreManager is one critical interface boundary, and having the open-ended transformer introduces some subtle pitfalls.
Previously, it was already a bit dicey having the Transformer in the BasePersistence::listEntities method but at least it was contained because BasePersistence has been de-facto package-private in its usage (i.e. IcebergCatalog/PolarisAdminService interacts with MetaStoreManager, not BasePersistence), so the blast radius has luckily remained somewhat contained.
One example pitfall is when considering how this pushdown benefits from now having the ability to achieve SNAPSHOT_READ isolation for list contents, and how that translates into running inside the runInTransaction for the TransactionalMetaStoreManager branch, then if we're willing to run arbitrary transformer functions inside the transactional critical block we need callers to "promise" they'll only provide "trivial" transformers.
Otherwise, I could imagine someday a "convenient" heavyweight transformation, such as for example, a remote-catalog entity resolver that uses a PolarisBaseEntity as just a passthrough facade, leaking in and causing problems by making slow external remote-catalog network requests within the transactional/snapshot read section.
Also, right now the interfaces permit the MetaStoreManager implementation and/or BasePersistence implementation to potentially perform concurrent underlying operations if desired to parallelize i.e. sharded list results; depending on whether the filter/transformer are applied inline or as post-processing, we'd also need to know if the provided functions are thread-safe.
If we really want to expose it, we should at least document some initial basic contraints/guidelines in the method jaavadocs:
- Must be "lightweight" without network/IO dependencies
- Must not cause re-entrant transactions in the database layer
- Must be threadsafe
These probably need to apply to both the Transformer and the Filter.
Otherwise, my preferred approach is to change all of these FooEntity::of transformer use cases to simply pop them out to the caller like:
return metaStoreManager
.loadEntitiesAll(
getCurrentPolarisContext(),
null,
PolarisEntityType.CATALOG,
PolarisEntitySubType.ANY_SUBTYPE) // No 'transformer' arg
.stream()
.map(CatalogEntity::of);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've pushed a review commit that removes filter+transformer from the metastoremanager api, as currently its not necessary indeed.
in a followup we should revisit whether the BasePersistence api really needs those params or not but i would prefer to not handle it in this PR.
the filter/transformer arguments were specifically added to support the
I think we need to add this anyway. I've been planning on some proposed changes for fetching principal roles during authentication and improving our list iteration. Both |
ab37780 to
e099796
Compare
we could use those but we still would be running 2 queries instead of 1 i.e. for getting all principals. note that implementations of yet while |
e099796 to
3cc904d
Compare
dennishuo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the updates! LGTM other than a minor comment about expanding the javadoc description.
| @Nonnull PageToken pageToken); | ||
|
|
||
| /** | ||
| * Load entities with pagination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's expand this javadoc description just a bit to make it more obvious that this method is effectively a "list and load entities under a parent", and maybe include some statement pointing at {@link #listEntities} to make the intention clear of making this listing/pagination behavior here match that of listEntities.
Michael's mention of future additions of loadEntitiesById and loadEntitiesByName had just reminded me that the naming could feel ambiguous, so javadocs can save developers time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've added another commit to try and clarify the differences in the javadoc, hope that matches your suggestion
@collado-mike I interpreted your statement to just be an observation that such "batch load" methods would indeed be a useful option while my mention may have been ambiguous between "load entities under parent", "load batch by id", and "load batch by name", rather than a statement against a "load entities under parent", is that right? |
currently `PolarisMetaStoreManager.listEntities` only exposes a limited subset of the underlying `BasePersistence.listEntities` functionality. most of the callers have to post-process the `EntityNameLookupRecord` of `ListEntitiesResult` and call `PolarisMetaStoreManager.loadEntity` on the individual items sequentually to transform and filter them. this is bad for the following reasons: - suboptimal performance as we run N+1 queries to basically load every entity twice from the persistence backend - suffering from race-conditions when entities get dropped between the `listEntities` and `loadEntity` call - a lot of repeated code in all the callers (of which only some are dealing with the race-condition by filtering out null values) as a solution we add `PolarisMetaStoreManager.loadEntities` that takes advantage of the already existing `BasePersistence` methods. we rename one of the `listEntities` methods to `loadEntities` for consistency. since many callers dont need paging and want the result as a list, we add `PolarisMetaStoreManager.loadEntitiesAll` as a convenient wrapper. we also remove the `PolarisEntity.nameAndId` method as callers who only need name and id should not be loading the full entity to begin with. note we rework `testCatalogNotReturnedWhenDeletedAfterListBeforeGet` from `ManagementServiceTest` because the simulated race-condition scenario can no longer happen.
snazy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this PR is ready to be merged.
3cc904d to
597e81b
Compare
this is a follow-up to apache#2290 the optimization is to use `listEntities` instead of `loadEntities` when there is `policyType` filter to apply
this is a follow-up to #2290 the optimization is to use `listEntities` instead of `loadEntities` when there is `policyType` filter to apply
this is a followup to #2261
currently
PolarisMetaStoreManager.listEntitiesonly exposes a limitedsubset of the underlying
BasePersistence.listEntitiesfunctionality.most of the callers have to post-process the
EntityNameLookupRecordofListEntitiesResultand callPolarisMetaStoreManager.loadEntityon the individual items sequentually to transform and filter them.
this is bad for the following reasons:
entity twice from the persistence backend
listEntitiesandloadEntitycalldealing with the race-condition by filtering out null values)
as a solution we add
PolarisMetaStoreManager.loadEntitiesthat takesadvantage of the already existing
BasePersistencemethods.we rename one of the
listEntitiesmethods toloadEntitiesforconsistency.
since many callers dont need paging and want the result as a list, we
add
PolarisMetaStoreManager.loadEntitiesAllas a convenient wrapper.we also remove the
PolarisEntity.nameAndIdmethod as callers who onlyneed name and id should not be loading the full entity to begin with.
note we rework
testCatalogNotReturnedWhenDeletedAfterListBeforeGetfrom
ManagementServiceTestbecause the simulated race-conditionscenario can no longer happen.