Allow BasePolarisTableOperations to skip refreshing metadata after a commit #1378
Conversation
```java
.description(
    "If true, BasePolarisTableOperations should cache metadata that has been committed"
        + " which can reduce File IO")
.defaultValue(false)
```
Disabling by default for the time being while I investigate some failing Iceberg tests
singhpk234 left a comment
[doubt] `.refresh()` is gated by `.doRefresh()`, which hits persistence to check whether the latest pointer is the same as the current metadata? If the current metadata is already up to date, it's just a wasted persistence call.

Can you please elaborate on which storage-layer call you are referring to here, explicitly?

From the Iceberg perspective, what happens today is that ...
The current workflow for a commit request looks like this:

1. `TableOperations::refresh` loads metadata from S3
2. Generate new metadata with the update
3. `TableOperations::commit`
   - Write the new metadata to S3
   - Call `TableOperations::requestRefresh()` to set `shouldRefresh = true`
4. `TableOperations::current()`
   - Since `shouldRefresh` is true, it calls `TableOperations::refresh()`
   - This leads to `doRefresh()`, which updates `currentMetadataLocation` and `currentMetadata`

Given this flow, I think a cleaner approach would be:

- In `BasePolarisTableOperations::doCommit`, we update `recentlyCommittedMetadata` and still set `shouldRefresh = true`
- In `BasePolarisTableOperations::doRefresh`, if `recentlyCommittedMetadata` is not `null`, we update `currentMetadataLocation` and `currentMetadata`, then clear the `shouldRefresh` flag and return, without refreshing the metadata from S3 again

This keeps things simpler and avoids needing to override a bunch of methods. From what I understand, `BaseMetastoreTableOperations` intentionally doesn't implement `doCommit` and `doRefresh`; it expects subclasses to plug in their own logic, which makes these the ideal spots for our custom handling. If the Iceberg SDK changes the code in `BaseMetastoreTableOperations`, we don't need to copy-paste it into our operations class.
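The proposed commit/refresh flow can be sketched with self-contained stand-in classes. All names here (`RefreshSkippingOps`, `TableMetadataStub`, the counters) are illustrative only, not the actual Polaris or Iceberg types:

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for Iceberg's TableMetadata; only the location matters for this sketch.
class TableMetadataStub {
    final String location;
    TableMetadataStub(String location) { this.location = location; }
}

class RefreshSkippingOps {
    private volatile boolean shouldRefresh = false;
    private TableMetadataStub currentMetadata;
    // Metadata we just committed, held so doRefresh can reuse it once.
    private final AtomicReference<TableMetadataStub> recentlyCommittedMetadata =
        new AtomicReference<>();
    int storageReads = 0; // counts simulated trips to object storage

    void doCommit(TableMetadataStub newMetadata) {
        // ... the real implementation would write newMetadata to object storage here ...
        recentlyCommittedMetadata.set(newMetadata);
        shouldRefresh = true; // mimics requestRefresh()
    }

    TableMetadataStub current() {
        if (shouldRefresh) {
            doRefresh();
        }
        return currentMetadata;
    }

    private void doRefresh() {
        TableMetadataStub cached = recentlyCommittedMetadata.getAndSet(null);
        if (cached != null) {
            // Reuse what we just committed; skip the storage round trip.
            currentMetadata = cached;
        } else {
            storageReads++; // simulated load from object storage
            currentMetadata = new TableMetadataStub("s3://bucket/metadata.json");
        }
        shouldRefresh = false;
    }
}
```

With this shape, a commit immediately followed by `current()` never touches storage, while any later refresh (after the cached reference is consumed) falls back to a real load.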
Force-pushed f97ba53 to 08e78e7
```java
PolarisConfiguration.<Boolean>builder()
    .key("TABLE_OPERATIONS_COMMIT_UPDATE_METADATA")
    .description(
        "If true, BasePolarisTableOperations should cache metadata that has been committed"
```
Based on this description, I'd propose renaming the property to `CACHE_COMMITTED_METADATA`.
Good callout; there is a mismatch between the description and the name of this flag after my rebase. I will adjust the description for now, but let me know if you think we need further changes.
```java
this.tableIdentifier = tableIdentifier;
this.fullTableName = fullTableName(catalogName, tableIdentifier);
this.tableFileIO = defaultFileIO;
this.updateMetadataOnCommit = updateMetadataOnCommit;
```
Why not do the same for views?
In the current iteration of the PR I'm making a judgement call between the complexity of the change and consistency across views & tables. Ideally we can contribute this upstream to Iceberg and then the reflection hack goes away. In the meantime, views tend to have much less metadata than tables (and also to get updated less frequently) and so I'm less concerned about the extra round trip to object storage.
service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java (outdated; resolved)
```java
try {
  Field tableMetadataField = TableMetadata.class.getDeclaredField("metadataFileLocation");
  tableMetadataField.setAccessible(true);
  tableMetadataField.set(metadata, newLocation);
```
I believe we should avoid modifying arguments passed to this method, especially in ways that can hardly be expected by the caller.
Is the storage call we're trying to avoid relevant in production runtime of Polaris or only in tests?
It's relevant for Polaris running for real, yes. In `TableOperations.doCommit` we call `TableOperations.requestRefresh`, which makes the subsequent call to `TableOperations.current` actually make a trip to object storage.
For the more philosophical aspect of this comment, I'm not sure if I can agree. I'm newer to Iceberg and so I'm unfamiliar with how TableMetadata is meant to be handled.
To me it seems a little odd that we have this field `metadataLocation` which never gets set (there is no setter); it seems like the expectation baked into the API is that you'll always read-after-write in order to update the in-memory `TableMetadata` object. In practice, doing this is expensive.
After commit is done with a TableMetadata, I would expect that the TableMetadata is updated to reflect the outcome of that commit. That, or it could return a new TableMetadata -- but we don't have control over that.
Sorry, @eric-maynard , I'm lost here. I do not see Polaris code calling TableOperations.requestRefresh.
No worries, this was a pretty nuanced bug to pin down.

The basic flow is like this:

1. `IcebergCatalog.registerTable`
2. `BaseMetastoreTableOperations.commit`
   a. `BasePolarisTableOperations.doCommit`
   b. `BaseMetastoreTableOperations.requestRefresh`

We do not override `commit` in `BasePolarisTableOperations`, though I tried that in an earlier iteration of this fix. Ultimately, after chatting with @singhpk234, I came to realize that `BaseMetastoreTableOperations` already has logic that's meant to handle this exact case -- this line here. However, the bug is that `currentMetadataLocation` does not properly get set during commit, so this check never actually lets `doRefresh` skip the trip to object storage.
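The short-circuit being described can be sketched in isolation: if the resolved metadata location already matches what is loaded, the refresh can return without re-reading the file. Class and field names below are stand-ins, not the real `BaseMetastoreTableOperations` internals:

```java
// Sketch of a location-guarded refresh. The bug discussed above is that the
// equivalent of currentMetadataLocation was never set during commit, so the
// early-return branch never fired in practice.
class LocationGuardedOps {
    private String currentMetadataLocation; // null until the first load
    int metadataFileReads = 0; // counts simulated reads of the metadata JSON

    void refreshFromMetadataLocation(String newLocation) {
        // Short-circuit: nothing to do if this exact location is already loaded.
        if (newLocation != null && newLocation.equals(currentMetadataLocation)) {
            return;
        }
        metadataFileReads++; // simulated trip to object storage
        currentMetadataLocation = newLocation;
    }
}
```

If commit recorded the new location into `currentMetadataLocation`, the post-commit refresh would hit the early return and cost nothing.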
What actually initiates the reload, yes. Coming back to IcebergCatalog.registerTable, we return the TableOperations object wrapped in a BaseTable:
```java
return new BaseTable(ops, fullTableName(name(), identifier), metricsReporter());
```
Then, coming up a layer to IcebergCatalogHandler, we see that the actual call is to:
```java
return CatalogHandlers.registerTable(baseCatalog, namespace, request);
```
Which in turn takes the result of IcebergCatalog.registerTable and does:
```java
return LoadTableResponse.builder()
    .withTableMetadata(((BaseTable) table).operations().current())
    .build();
```
This TableOperations.current() call will trigger a refresh as requestRefresh was set on it previously
Could we simply keep current metadata in BaseTable and use it to build the response?
oh, I see that CatalogHandlers comes from Iceberg jars... well, IMHO, we're facing this optimization problem because of overly tight coupling to Iceberg jars, which perhaps do not handle the Polaris use case very well.
Why not call baseCatalog.registerTable() directly from IcebergCatalogHandler? This way we'd be able to pass back actual metadata via a Table sub-class with less hackery, I guess 🤔
For some reason IcebergCatalogHandler is written to be semi-agnostic of the underlying Catalog type, and the code does not assume it's an IcebergCatalog -- you can see this all over the place, where we do things like:

```java
this.viewCatalog = (baseCatalog instanceof ViewCatalog) ? (ViewCatalog) baseCatalog : null;
```

We could add a case here where we similarly check whether the underlying catalog is an IcebergCatalog, and then add a method to IcebergCatalog for register/commit which goes through our own route and our own TableOperations etc. As you note, we'd also have to write our own Table, and very quickly this starts to look like the refactor/rewrite discussed elsewhere. tbh I'm happy to take that work on, but I wanted to see if we are OK with this smaller patch in the interim.
```java
// TODO: we should not need to do this hack, but there's no other way to modify
// currentMetadata / currentMetadataLocation
```
Can't we add a Map<String, TableMetadata> to the IcebergCatalog and access it from inside the BasePolarisTableOperations? The IcebergCatalog is created for every request, so the map isn't long-lived, but it will be there between the commit and the refresh. I don't see a need for the reflection below
currentMetadataLocation is checked in a few places where we need it to be set. Notably, for this fix to work, it needs to be set inside BaseMetastoreTableOperations.refreshFromMetadataLocation.
Yeah, understood. But `BasePolarisTableOperations` is an inner class of `IcebergCatalog`. It can read any field defined in `IcebergCatalog`, so we can just create a Map that is written to during `commitTable`, and then when we call `ops().current()`, we can just read the map value. We can also store the `TableMetadata` with `TableMetadata.buildFrom(metadata).withMetadataLocation(latestMetadataLocation)` in the commit function. No reflection necessary.
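The map-based alternative amounts to a short-lived, catalog-scoped cache keyed by metadata location: commit writes into it, refresh consults it before going to storage. A minimal sketch with illustrative names (this is not the actual Polaris code, and metadata is modeled as a plain JSON string for simplicity):

```java
import java.util.HashMap;
import java.util.Map;

// Lives on the per-request catalog object, so it exists exactly between
// the commit and the subsequent refresh, as described above.
class CatalogScopedMetadataCache {
    private final Map<String, String> committedMetadata = new HashMap<>();
    int storageLoads = 0; // counts simulated loads from object storage

    void recordCommit(String location, String metadataJson) {
        committedMetadata.put(location, metadataJson);
    }

    String load(String location) {
        String cached = committedMetadata.get(location);
        if (cached != null) {
            return cached; // cache hit: no storage trip needed
        }
        storageLoads++; // cache miss: fall back to object storage
        return "<loaded from storage: " + location + ">";
    }
}
```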
What you're describing closely resembles a previous iteration of this fix, which is validating. For reference I have exhumed it from my git reflog, see here:
main...eric-maynard:polaris:less-refresh-retro
This is maybe even more convoluted than what you're describing, but does this more or less match what you're looking for?
The unfortunate thing is that this doesn't work; it fails the new test as written. I didn't debug too much further as @singhpk234 convinced me that this new approach is cleaner, but from what I remember the issue is that `TableMetadata.buildFrom(...)` does not really work the way you'd like it to. The invocation you shared fails with something like:

```
Cannot set metadata location with changes to table metadata: 2 changes
```
As @RussellSpitzer put it to me when we were discussing this, Our builders are too "smart" :)
Long-term, I would like to move away from using BaseMetastoreTableOperations altogether and then we can be a bit more free in how we handle the metadata. We may still need to contribute changes to TableMetadata upstream.
Ugh. I see the problem - it is here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadata.java#L985 . The TableMetadata builder copies the changes field from the original, so when we set the metadataLocation, it looks like we're trying to set it on an object that has changes already 🙄
We can do this without reflection if we roundtrip to JSON. Something like this diff does work. It does require a roundtrip through Jackson, so there's a performance cost 😢. Worth considering whether we'd rather pay the perf cost or deal with reflection in the code. The perf cost would scale with the size of the TableMetadata, though for larger metadata files, we're probably not loading the TableMetadata in memory at all.
```diff
diff --git a/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java b/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
index 2e8fadab..8d77fa84 100644
--- a/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
+++ b/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
@@ -1242,6 +1242,9 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
           SHOULD_RETRY_REFRESH_PREDICATE,
           getMaxMetadataRefreshRetries(),
           metadataLocation -> {
+            if (tblMetadata.containsKey(metadataLocation)) {
+              return tblMetadata.get(metadataLocation);
+            }
             String latestLocationDir =
                 latestLocation.substring(0, latestLocation.lastIndexOf('/'));
             // TODO: Once we have the "current" table properties pulled into the resolvedEntity
@@ -1259,6 +1262,8 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
       }
     }

+    Map<String, TableMetadata> tblMetadata = new HashMap<>();
+
     @Override
     public void doCommit(TableMetadata base, TableMetadata metadata) {
       LOGGER.debug(
@@ -1400,9 +1405,9 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
       // TODO: we should not need to do this hack, but there's no other way to modify
       // currentMetadata / currentMetadataLocation
       if (updateMetadataOnCommit) {
-        TableOperationsReflectionUtil reflectionUtil = TableOperationsReflectionUtil.getInstance();
-        reflectionUtil.setMetadataFileLocation(metadata, newLocation);
-        reflectionUtil.setCurrentMetadata(this, metadata, newLocation);
+        TableMetadata updated =
+            TableMetadataParser.fromJson(newLocation, TableMetadataParser.toJson(metadata));
+        tblMetadata.put(newLocation, updated);
       }
       if (null == existingLocation) {
         createTableLike(tableIdentifier, entity);
```

```java
Field currentMetadataLocationField =
    BaseMetastoreTableOperations.class.getDeclaredField("currentMetadataLocation");
currentMetadataLocationField.setAccessible(true);
currentMetadataLocationField.set(this, newLocation);
```
Forcing the new location may be ok on the assumption that the commit succeeds, but can we be sure at this point that the commit will succeed?
From my understanding yes, or at least if it fails we should not re-use the TableMetadata object. The fact that tests pass gives me some confidence here but this is probably the part of the fix that needs the most scrutiny.
Initially I was doing this lower in the method, right before the return. However I found that moving it here helped me uncover some issues where we'd throw a CommitFailedException below if the metadata is left in an inconsistent state.
Well, TBH, that statement makes me worry more about the correctness of this change 😅 Could you give more details?
Well maybe not worry but at least it's good we are being cautious. The bug and the fix here are both subtle.
What I mean is, once the metadata is actually written to the metadata location, it should be correct to attach the location to it. If the commit ends up failing, it seems like we (in the tests, at least) don't make any assumptions that the metadata is still useful after the failed commit. So the location of this in the method should not matter. If anything, I would think that the TableOperations pointing at a "failed" commit could be troublesome, but previously it was just pointing at null.
```java
if (updateMetadataOnCommit) {
  try {
    tableMetadataField.set(metadata, newLocation);
    unsafe.putObject(metadata, changesFieldOffset, new ArrayList<MetadataUpdate>());
```
If we have to resort to this kind of trickery in order to make Iceberg code behave as we want it to behave, I wonder if it might be preferable to revamp this class as a whole and directly implement the logic for storing and loading metadata from storage. WDYT?
In general I agree that we are hitting the limits of BaseMetastoreTableOperations; I actually hit a similar issue on another open PR of mine. I further agree that the use of reflection here is unfortunate, and I hope that we can fix this upstream and rip this logic out whether or not we rip out BaseMetastoreTableOperations.
However I think that rewriting this whole stack is a nontrivial task that'll be pretty hard to test in our current setup. In this case the performance benefit of an immediate fix is so dramatic that I feel it may outweigh the hackiness.
I'm very uneasy about modifying private stuff in the `metadata` method argument (especially its final fields).

Do I understand this correctly, that you guys talked about this offline and this is still the easiest fix short of rewriting IcebergCatalog for direct storage access?
That's my understanding -- the alternative is to basically rewrite our own TableOperations / CatalogHandler and then handle storage access etc. on our own. Essentially what you commented here. I'm supportive of doing that work, but it will be a significant refactor / overhaul.
If you're interested in an alternate implementation, please check out #1426 where I refactor the TableOperations / ViewOperations classes in such a way that I could follow up with logic to manage this metadata without reflection!
service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java (outdated; resolved)
I agree with @XJDKC on this one. It's probably better for us to hack refresh to do essentially a no-op refresh rather than modifying the args passed to commit. That also wouldn't require the reflection tricks. Is there a downside to tackling the problem in that direction? i.e., commit sets a special variable; on refresh, set metadata to this and return without touching storage?
RussellSpitzer left a comment
After discussing more with Eric offline, I think this may be the simplest approach to fixing this at the moment. I think we probably have two approaches we could take to fix this in the future, though:

- Do a complete replacement of Operations to use our own code top to bottom and not extend the Iceberg base classes
- Attempt to do the fix in OSS with the rationale that doing so will help all REST catalogs which are in similar situations
singhpk234 left a comment
+1 from my end too, with a minor suggestion.

I think we would want to re-use the metadata we committed as-is, without reloading and reconstructing it. There should be a way to reset things like `changes` once we are done with the commit, and the same goes for re-using `BaseTableOps` to increase re-usability.
polaris-core/src/main/java/org/apache/polaris/core/config/BehaviorChangeConfiguration.java (outdated; resolved)
Force-pushed 9f68e8b to f62da00
This sounds reasonable to me, but I guess the cycle might be too long. Also, if we run into another similar issue, we'd have to repeat that cycle again. I wonder why Polaris needs to jump through the ...

Closing in favor of #1426 now, where I'll attempt to recreate this behavior in a cleaner way. Thanks for the review, all!
For REST catalogs, `TableOperations.refresh()` can result in an expensive trip to object storage. Despite this, the method is called quite frequently, and it is currently called after every commit when we construct a `BaseTable` to return to a client that requested a commit.

In some cases, the same `TableOperations` is used for one commit and then is immediately used for another. We can see this in some tests, such as Iceberg's `CatalogTests.testUpdateTableSchemaThenRevert`.

This PR proposes that the `TableOperations` can update its `currentMetadata` on commit so that future calls to `doRefresh` might be able to skip a trip to object storage.