
Conversation

@eric-maynard
Contributor

@eric-maynard commented Apr 16, 2025

For rest catalogs, TableOperations.refresh() can result in an expensive trip to object storage. Despite this, the method is called quite frequently and is currently called after every commit when we construct a BaseTable to return back to a client that requested a commit.

In some cases, the same TableOperations is used for one commit and then is immediately used for another. We can see this in some tests, such as Iceberg's CatalogTests.testUpdateTableSchemaThenRevert:

table.updateSchema().addColumn("col1", StringType.get()).addColumn("col2", StringType.get()).addColumn("col3", StringType.get()).commit();
table.updateSchema().deleteColumn("col1").deleteColumn("col2").deleteColumn("col3").commit();      

This PR proposes that the TableOperations can update its currentMetadata on commit so that future calls to doRefresh might be able to skip a trip to object storage.

.description(
"If true, BasePolarisTableOperations should cache metadata that has been committed"
+ " which can reduce File IO")
.defaultValue(false)
Contributor Author

Disabling by default for the time being while I investigate some failing Iceberg tests

Contributor

@singhpk234 left a comment

[doubt] .refresh() is gated by .doRefresh(), which would hit the persistence layer to check whether the latest pointer is the same as the current metadata, right? If the current metadata is already up to date, it's just a wasted persistence call.
Can you please elaborate on which storage-layer call you are referring to here, explicitly?

@eric-maynard
Contributor Author

eric-maynard commented Apr 16, 2025

@singhpk234

Can you please elaborate which storage layer call you are referring here explicitly ?

From the Iceberg perspective, it's TableMetadataParser.read. From the Polaris perspective, it's any trip to object storage.

What happens today is that TableOperations.commit calls requestRefresh, and then BaseTable calls TableOperations.current, which triggers the metadata to get loaded from object storage. In this simple case (commit followed by returning the TableMetadata) we can try to skip that additional load.
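The flow described above can be modeled with a minimal sketch (simplified, illustrative class and field names; the real bookkeeping lives in Iceberg's BaseMetastoreTableOperations, and the "load" here stands in for the metadata read from object storage):

```java
// Simplified model of the commit-then-refresh flow; not the actual Iceberg code.
public class RefreshFlowSketch {
    static class TableOps {
        String currentMetadata;       // stands in for the parsed TableMetadata
        boolean shouldRefresh = true;
        int storageLoads = 0;         // counts trips to object storage

        void commit(String newMetadata) {
            // ... write newMetadata to object storage (elided) ...
            requestRefresh();         // forces the next current() to reload
        }

        void requestRefresh() { shouldRefresh = true; }

        String current() {
            if (shouldRefresh) {
                refresh();
            }
            return currentMetadata;
        }

        void refresh() {
            // doRefresh(): reads back from object storage the very metadata
            // we just wrote during commit()
            storageLoads++;
            currentMetadata = "metadata-from-storage";
            shouldRefresh = false;
        }
    }

    public static void main(String[] args) {
        TableOps ops = new TableOps();
        ops.commit("v1");
        ops.current(); // triggers a storage read for metadata we already hold
        System.out.println("storage loads after one commit: " + ops.storageLoads);
    }
}
```

In this model, every commit followed by current() costs one storage round trip, which is the load the PR tries to skip.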

Member

@XJDKC left a comment

The current workflow for a commit request looks like this:

  1. TableOperations::refresh: loads metadata from S3
  2. Generate new metadata with the update
  3. TableOperations::commit
    • Write the new metadata to S3
    • Call TableOperations::requestRefresh() to set shouldRefresh = true
    • TableOperations::current()
      • Since shouldRefresh is true, it calls TableOperations::refresh()
      • This leads to doRefresh(), which updates currentMetadataLocation and currentMetadata

Given this flow, I think a cleaner approach would be:

  1. In BasePolarisTableOperations::doCommit, we update recentlyCommittedMetadata and still set shouldRefresh = true
  2. In BasePolarisTableOperations::doRefresh, if recentlyCommittedMetadata is not null, we update currentMetadataLocation and currentMetadata, and then clear the shouldRefresh flag and return, without refreshing the metadata from S3 again.

This keeps things simpler and avoids needing to override a bunch of methods. From what I understand, BaseMetastoreTableOperations intentionally doesn't implement doCommit and doRefresh; it expects subclasses to plug in their own logic, which makes these the ideal spots for our custom handling. If the Iceberg SDK changes the code in BaseMetastoreTableOperations, we won't need to copy-paste it into our operations class.
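The proposed short-circuit could be sketched like this (a simplified model with illustrative names; recentlyCommittedMetadata is the field suggested above, not actual Polaris code):

```java
// Illustrative sketch: doCommit stashes the just-committed metadata and
// doRefresh consumes it instead of reloading from S3. Names are assumptions.
public class CachedCommitSketch {
    static class PolarisTableOps {
        String currentMetadata;
        String recentlyCommittedMetadata; // set by doCommit, cleared by doRefresh
        boolean shouldRefresh = true;
        int storageLoads = 0;

        void doCommit(String newMetadata) {
            // ... write newMetadata to object storage (elided) ...
            recentlyCommittedMetadata = newMetadata;
            shouldRefresh = true; // keep the existing contract
        }

        void doRefresh() {
            if (recentlyCommittedMetadata != null) {
                // Short-circuit: we already hold the metadata we just wrote
                currentMetadata = recentlyCommittedMetadata;
                recentlyCommittedMetadata = null;
                shouldRefresh = false;
                return;
            }
            storageLoads++; // normal path: reload from object storage
            currentMetadata = "metadata-from-storage";
            shouldRefresh = false;
        }

        String current() {
            if (shouldRefresh) doRefresh();
            return currentMetadata;
        }
    }

    public static void main(String[] args) {
        PolarisTableOps ops = new PolarisTableOps();
        ops.doCommit("v1");
        String md = ops.current(); // served from the stash, no storage read
        System.out.println("metadata=" + md + ", storage loads=" + ops.storageLoads);
    }
}
```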

PolarisConfiguration.<Boolean>builder()
.key("TABLE_OPERATIONS_COMMIT_UPDATE_METADATA")
.description(
"If true, BasePolarisTableOperations should cache metadata that has been committed"
Contributor

Based on this description, I'd propose renaming the property to CACHE_COMMITTED_METADATA.

Contributor Author

Good callout: there is a mismatch between the description and the name of this flag after my rebase. I will adjust the description for now, but let me know if you think we need further changes.

this.tableIdentifier = tableIdentifier;
this.fullTableName = fullTableName(catalogName, tableIdentifier);
this.tableFileIO = defaultFileIO;
this.updateMetadataOnCommit = updateMetadataOnCommit;
Contributor

Why not do the same for views?

Contributor Author

In the current iteration of the PR I'm making a judgement call between the complexity of the change and consistency across views & tables. Ideally we can contribute this upstream to Iceberg and then the reflection hack goes away. In the meantime, views tend to have much less metadata than tables (and also to get updated less frequently) and so I'm less concerned about the extra round trip to object storage.

try {
Field tableMetadataField = TableMetadata.class.getDeclaredField("metadataFileLocation");
tableMetadataField.setAccessible(true);
tableMetadataField.set(metadata, newLocation);
Contributor

I believe we should avoid modifying arguments passed to this method, especially in ways that can hardly be expected by the caller.

Is the storage call we're trying to avoid relevant in production runtime of Polaris or only in tests?

Contributor Author

It's relevant for Polaris running for real, yes. In TableOperations.doCommit we call TableOperations.requestRefresh, which makes the subsequent call to TableOperations.current actually make a trip to object storage.

Contributor Author

For the more philosophical aspect of this comment, I'm not sure if I can agree. I'm newer to Iceberg and so I'm unfamiliar with how TableMetadata is meant to be handled.

To me it seems a little odd that we have this field metadataLocation which never gets set (there is no setter); it seems like the expectation baked into the API is that you'll always read-after-write in order to update the in-memory TableMetadata object. In practice, doing this is expensive.

After commit is done with a TableMetadata, I would expect that the TableMetadata is updated to reflect the outcome of that commit. That, or it could return a new TableMetadata -- but we don't have control over that.

Contributor

Sorry, @eric-maynard , I'm lost here. I do not see Polaris code calling TableOperations.requestRefresh.

Contributor Author

No worries, this was a pretty nuanced bug to pin down.

The basic flow is like this:

  1. IcebergCatalog.registerTable
  2. BaseMetastoreTableOperations.commit
    a. BasePolarisTableOperations.doCommit
    b. BaseMetastoreTableOperations.requestRefresh

We do not override commit in BasePolarisTableOperations, though I tried that in an earlier iteration of this fix. Ultimately after chatting with @singhpk234 I came to realize that BaseMetastoreTableOperations already has logic that's meant to handle this exact case -- this line here. However, the bug is that currentMetadataLocation does not properly get set during commit, so this check never actually lets doRefresh skip the trip to object storage.

Contributor Author

What actually initiates the reload, yes. Coming back to IcebergCatalog.registerTable, we return the TableOperations object wrapped in a BaseTable:

    return new BaseTable(ops, fullTableName(name(), identifier), metricsReporter());

Then, coming up a layer to IcebergCatalogHandler, we see that the actual call is to:

    return CatalogHandlers.registerTable(baseCatalog, namespace, request);

Which in turn takes the result of IcebergCatalog.registerTable and does:

      return LoadTableResponse.builder()
          .withTableMetadata(((BaseTable) table).operations().current())
          .build();

This TableOperations.current() call will trigger a refresh, as requestRefresh was set on it previously.

Contributor

Could we simply keep current metadata in BaseTable and use it to build the response?

Contributor

Oh, I see that CatalogHandlers comes from the Iceberg jars... well, IMHO, we're facing this optimization problem because of overly tight coupling to the Iceberg jars, which perhaps do not handle the Polaris use case very well.

Contributor

Why not call baseCatalog.registerTable() directly from IcebergCatalogHandler? This way we'd be able to pass back actual metadata via a Table sub-class with less hackery, I guess 🤔

Contributor Author

@eric-maynard Apr 19, 2025

For some reason IcebergCatalogHandler is written to be semi-agnostic of the underlying Catalog type and the code does not assume it's an IcebergCatalog -- you can see this all over the place, where we do things like:

    this.viewCatalog = (baseCatalog instanceof ViewCatalog) ? (ViewCatalog) baseCatalog : null;

We could add a case here where we check, similarly, if the underlying catalog is an IcebergCatalog, and then we can add a method to IcebergCatalog for register/commit which goes through our own route and our own TableOperations etc. As you note, we'd also have to write our own Table, and very quickly this starts to look like the refactor/rewrite discussed elsewhere. TBH I'm happy to take that work on, but I wanted to see if we are OK with this smaller patch in the interim.

Comment on lines 1349 to 1350
// TODO: we should not need to do this hack, but there's no other way to modify
// currentMetadata / currentMetadataLocation
Contributor

Can't we add a Map<String, TableMetadata> to the IcebergCatalog and access it from inside the BasePolarisTableOperations? The IcebergCatalog is created for every request, so the map isn't long-lived, but it will be there between the commit and the refresh. I don't see a need for the reflection below

Contributor Author

currentMetadataLocation is checked in a few places where we need it to be set. Notably, for this fix to work, it needs to be set inside BaseMetastoreTableOperations.refreshFromMetadataLocation.

Contributor

Yeah, understood. But BaseMetastoreTableOperations is an inner class of IcebergCatalog. It can read any field defined in IcebergCatalog, so we can just create a Map that is written to during commitTable and then when we call ops().current(), we can just read the map value. We can also store the TableMetadata with TableMetadata.buildFrom(metadata).withMetadataLocation(latestMetadataLocation) in the commit function. No reflection necessary.
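The map-based idea above could be sketched as follows (a hedged, simplified model: the map, field, and method names are illustrative, and in the real code the map would live on IcebergCatalog and be read by its inner table-operations class):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a per-request cache keyed by metadata location,
// written during commit and consulted before falling back to a storage read.
public class MetadataCacheSketch {
    // Populated during the commit path, keyed by the new metadata location
    static final Map<String, String> committedMetadata = new HashMap<>();

    static int storageLoads = 0;

    static String loadMetadata(String metadataLocation) {
        String cached = committedMetadata.get(metadataLocation);
        if (cached != null) {
            return cached;          // skip the trip to object storage
        }
        storageLoads++;             // fall back to reading the metadata file
        return "metadata-read-from-" + metadataLocation;
    }

    public static void main(String[] args) {
        // commit path: remember what we just wrote, keyed by its new location
        committedMetadata.put("s3://bucket/meta/v2.json", "{...v2 metadata...}");

        String hit = loadMetadata("s3://bucket/meta/v2.json");   // cache hit
        String miss = loadMetadata("s3://bucket/meta/v1.json");  // storage read
        System.out.println("hit=" + hit + ", storage loads=" + storageLoads);
    }
}
```

Since the catalog is created per request, the cache stays small and only needs to survive between the commit and the immediately following refresh.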

Contributor Author

@eric-maynard Apr 18, 2025

What you're describing closely resembles a previous iteration of this fix, which is validating. For reference I have exhumed it from my git reflog, see here:
main...eric-maynard:polaris:less-refresh-retro

This is maybe even more convoluted than what you're describing, but does this more or less match what you're looking for?

The unfortunate thing is that this doesn't work, it fails the new test as written. I didn't debug too much further as @singhpk234 convinced me that this new approach is cleaner, but from what I remember the issue is that TableMetadata.buildFrom(...) does not really work the way you'd like it to. The invocation you shared fails with something like:

Cannot set metadata location with changes to table metadata: 2 changes

As @RussellSpitzer put it to me when we were discussing this, Our builders are too "smart" :)

Long-term, I would like to move away from using BaseMetastoreTableOperations altogether and then we can be a bit more free in how we handle the metadata. We may still need to contribute changes to TableMetadata upstream.

Contributor

Ugh. I see the problem - it is here: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadata.java#L985 . The TableMetadata builder copies the changes field from the original, so when we set the metadataLocation, it looks like we're trying to set it on an object that has changes already 🙄

We can do this without reflection if we roundtrip to JSON. Something like this diff does work. It does require a roundtrip through Jackson, so there's a performance cost 😢. Worth considering whether we'd rather pay the perf cost or deal with reflection in the code. The perf cost would scale with the size of the TableMetadata, though for larger metadata files, we're probably not loading the TableMetadata in memory at all.

diff --git a/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java b/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
index 2e8fadab..8d77fa84 100644
--- a/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
+++ b/service/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java
@@ -1242,6 +1242,9 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
             SHOULD_RETRY_REFRESH_PREDICATE,
             getMaxMetadataRefreshRetries(),
             metadataLocation -> {
+              if (tblMetadata.containsKey(metadataLocation)) {
+                return tblMetadata.get(metadataLocation);
+              }
               String latestLocationDir =
                   latestLocation.substring(0, latestLocation.lastIndexOf('/'));
               // TODO: Once we have the "current" table properties pulled into the resolvedEntity
@@ -1259,6 +1262,8 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
       }
     }

+    Map<String, TableMetadata> tblMetadata = new HashMap<>();
+
     @Override
     public void doCommit(TableMetadata base, TableMetadata metadata) {
       LOGGER.debug(
@@ -1400,9 +1405,9 @@ public class IcebergCatalog extends BaseMetastoreViewCatalog
       // TODO: we should not need to do this hack, but there's no other way to modify
       // currentMetadata / currentMetadataLocation
       if (updateMetadataOnCommit) {
-        TableOperationsReflectionUtil reflectionUtil = TableOperationsReflectionUtil.getInstance();
-        reflectionUtil.setMetadataFileLocation(metadata, newLocation);
-        reflectionUtil.setCurrentMetadata(this, metadata, newLocation);
+        TableMetadata updated =
+            TableMetadataParser.fromJson(newLocation, TableMetadataParser.toJson(metadata));
+        tblMetadata.put(newLocation, updated);
       }
       if (null == existingLocation) {
         createTableLike(tableIdentifier, entity);

Field currentMetadataLocationField =
BaseMetastoreTableOperations.class.getDeclaredField("currentMetadataLocation");
currentMetadataLocationField.setAccessible(true);
currentMetadataLocationField.set(this, newLocation);
Contributor

Forcing the new location may be ok on the assumption that the commit succeeds, but can we be sure at this point that the commit will succeed?

Contributor Author

From my understanding yes, or at least if it fails we should not re-use the TableMetadata object. The fact that tests pass gives me some confidence here but this is probably the part of the fix that needs the most scrutiny.

Initially I was doing this lower in the method, right before the return. However I found that moving it here helped me uncover some issues where we'd throw a CommitFailedException below if the metadata is left in an inconsistent state.

Contributor

Well, TBH, that statement makes me worry more about the correctness of this change 😅 Could you give more details?

Contributor Author

Well maybe not worry but at least it's good we are being cautious. The bug and the fix here are both subtle.

What I mean is, once the metadata is actually written to the metadata location, it should be correct to attach the location to it. If the commit ends up failing, it seems like we (in the tests, at least) don't make any assumptions that the metadata is still useful after the failed commit. So the location of this in the method should not matter. If anything, I would think that the TableOperations pointing at a "failed" commit could be troublesome, but previously it was just pointing at null.

if (updateMetadataOnCommit) {
try {
tableMetadataField.set(metadata, newLocation);
unsafe.putObject(metadata, changesFieldOffset, new ArrayList<MetadataUpdate>());
Contributor

If we have to resort to this kind of trickery in order to make Iceberg code behave the way we want it to behave, I wonder if it might be preferable to revamp this class as a whole and directly implement the logic for storing and loading metadata from storage. WDYT?

Contributor Author

In general I agree that we are hitting the limits of BaseMetastoreTableOperations; I actually hit a similar issue on another open PR of mine. I further agree that the use of reflection here is unfortunate, and I hope that we can fix this upstream and rip this logic out, whether or not we rip out BaseMetastoreTableOperations.

However I think that rewriting this whole stack is a nontrivial task that'll be pretty hard to test in our current setup. In this case the performance benefit of an immediate fix is so dramatic that I feel it may outweigh the hackiness.

Contributor

@dimas-b Apr 18, 2025

I'm very uneasy about modifying private state in the metadata method argument (especially its final fields).

Do I understand correctly that you talked about this offline and this is still the easiest fix short of rewriting IcebergCatalog for direct storage access?

Contributor Author

@eric-maynard Apr 19, 2025

That's my understanding -- the alternative is to basically rewrite our own TableOperations / CatalogHandler and then handle storage access etc. on our own. Essentially what you commented here. I'm supportive of doing that work, but it will be a significant refactor / overhaul.

Contributor Author

If you're interested in an alternate implementation, please check out #1426 where I refactor the TableOperations / ViewOperations classes in such a way that I could follow up with logic to manage this metadata without reflection!

@RussellSpitzer
Member

I agree with @XJDKC for this one. It's probably better for us to hack Refresh to do essentially a no-op refresh rather than modifying the args passed to commit. That also wouldn't require doing the reflection tricks. Is there a downside to tackling the problem in that direction?

I.e., commit sets a special variable; on refresh, set metadata to this and return without touching storage?

RussellSpitzer previously approved these changes Apr 17, 2025
Member

@RussellSpitzer left a comment

After discussing more with Eric offline, I think this may be the simplest approach to fixing this at the moment. I think we probably have two approaches we could take to fix this in the future, though:

  1. Do a complete replace of Operations to use our own code top to bottom and not extend the Iceberg base Classes
  2. Attempt to do the fix in OSS with the rationale that doing so will help all REST catalogs which are in similar situations

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Apr 17, 2025
Contributor

@singhpk234 left a comment

+1 from my end too, with a minor suggestion.

I think we would want to re-use the metadata we committed as-is, without reloading and reconstructing it. To do that, basic things like changes need to be reset once we have, let's say, committed it; there should be a way to reset changes once we are done with the commit. The same goes for re-using BaseTableOps, to increase re-usability.

@dimas-b
Contributor

dimas-b commented Apr 18, 2025

Attempt to do the fix in OSS with the rational that doing so will help all REST catalogs which are in similar situations

This sounds reasonable to me, but I guess the cycle might be too long. Also, if we run into another similar issue, we'd have to repeat that cycle again.

I wonder why Polaris needs to jump through the Catalog hoop, where we implement an Iceberg Catalog which is then plugged into the REST Catalog API? Why not implement the REST Catalog API in a way that accesses storage directly (perhaps reusing FileIO implementations), without going through another (non-REST) Catalog layer?

@eric-maynard
Contributor Author

Closing in favor of #1426 now where I'll attempt to recreate this behavior in a cleaner way. Thanks for the review all!
