Skip to content

Conversation

@gh-yzou
Copy link
Contributor

@gh-yzou gh-yzou commented May 30, 2025

Although we do a check of !realmContextInstance.isUnsatisfied() it only checks if there is matching bean, however, it doesn't check if the context is active or not. Similarly isResolvable also don't check if the scope is active. Therefore if getConfiguration is called when there is no active scope, it will get Exception like ContextNotActiveException.
This actually fails the TaskExecutor because it runs in a separate thread.

In order to fix the problem, we introduces a new getConfiguration function to handle the background tasks, and also use isResolvable instead of isUnsatisfied to handle ambiguities.

Tests:

  1. Manually verified the test with S3 setup
  2. Tried to add a SparkIntegration tests for the purge, however, either runs into dependency problem like following
 javax.annotation.Nonnull
java.lang.ClassNotFoundException: javax.annotation.Nonnull
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
	at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:141)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)

or runs into prelivege issue when directly using rest call
follow up this in a separate branch here https://github.com/gh-yzou/polaris/pull/new/yzou-purge-spark-integration

Follow up tasks:

  1. PR to switch non loadTasks API to use the new API
  2. PR to update the loadTasks to use the new API

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CDI scopes are a problem for tasks, it may be worth considering making a different, @ApplicaitonScoped Configuration Store impl. for them (and inject it only into tasks). WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keep this comment?

Comment on lines 70 to 72
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a robust impl. for getConfiguration... The exception can be anything. Could we make conditions and error handling more specific?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the config store can be user provided, so we can't anticipate many specifics about the exception type.

Perhaps we should make this RuntimeException though, so:

  1. We don't swallow things like interrupts
  2. The user-provided config store has the ability to not get its exception swallowed if it so chooses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The originally fix is not correct, we now fixes the back group task getConfiguration by directly passing the reaml into the function

@gh-yzou
Copy link
Contributor Author

gh-yzou commented May 30, 2025

@dimas-b After some more investigation, actually, i don't think CDI injection here is a robust way to handle configuration store. The TaskExecutor was failing with error like ContextNotActiveException, the reason seems that CDI injection doesn't work well when the access is made outside the scoped, like background task, which is what TaskExecutor does. Since when handle task is called, it does have a callContext passed, like this

protected void handleTask(long taskEntityId, CallContext ctx, int attempt)

It does mean it need to get the configuration of a particular realm, instead of the default. I think the real robust way is just passing the callContext to the call its-self, instead of use CDI injection for this case. in other words, have call like following

public <T> @Nullable T getConfiguration(@Nonnull CallContext ctx, String configName)

and we will remove

@Inject private Instance<RealmContext> realmContextInstance;

WDYT?

@dimas-b
Copy link
Contributor

dimas-b commented May 30, 2025

In the longer run, I'd prefer to reduce (and eventually remove) the use of CallContext and rely on CDI concepts instead, I believe what CallContext does is supported by CDI in native ways.

In the short term, I'll try and take a stab at this PR trying to solve the config access sub-cases in a CDI-native way. Please give me a couple of days.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole refactoring to use the new interface is fairly large and touches persistent due to the loadTasks API, we will beak the whole refactoring to three parts:

  1. this PR that introduces the new API that only requires realm, and fixes the task executor to unblock the release
  2. a separate PR to switch other non-persistent part to use this new API
  3. a separate PR to switch the loadTasks to use the new API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on adding this interface method because is creates an API skew WRT other methods, which do not have the realm parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see this happen, if all we need here is actually the realm. This largely simplify the interface, which also avoid any complication of CDI. Also all callers of getConfiguration() will have the realm ready to pass in.

A side note: why would a new method have to be the same with other methods for its parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b i am actually planning to completely remove the usage of the other API that doesn't take realm, but I need to move towards it step by step due to the complexity. I can mark the other ones as deprecated to avoid new usages, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b is the concern actually we don't pass the whole RealmContext? i can definitely pass the whole RealmContext to the function if that solves the concern

Copy link
Contributor

@dimas-b dimas-b Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is having materially different inputs in the various lookup methods. Specifically, if one method has a "realm" parameter, all of them should have it.

I'm fine with either RealmContext or String as the type of the "realm" parameter (slightly prefer RealmContext).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would a new method have to be the same with other methods for its parameters?

Not all parameters, but those that affect lookup results, but are not always specified in method calls. In this case - realm ID (since configuration can be different in different realms). This is what I meant by "API skew".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on using RealContext to avoid hard-wiring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b For the getConfiguration function, i think it make sense to have realmContext, instead of PolarisCallContext since the configuration is per Realm. So i think in short term, it make sense to deprecate the usage of the calls that uses PolarisCallContext, and start using the realmContext one, i will do that as soon as possible.
in long run, we can definitely leverage CDI once we figure out how to propagate request scoped context in the background thread correctly.

@gh-yzou gh-yzou requested a review from dimas-b June 2, 2025 01:58
@adutra
Copy link
Contributor

adutra commented Jun 2, 2025

The TaskExecutor was failing with error like ContextNotActiveException, the reason seems that CDI injection doesn't work well when the access is made outside the scoped, like background task, which is what TaskExecutor does

That is correct and that is precisely why the following methods were created:

  • org.apache.polaris.core.context.CallContext#copyOf
  • org.apache.polaris.core.PolarisCallContext#copyOf

It is also because of that that CallContext.copyOf() takes care of creating a copy of the current RealmContext and of the current PolarisCallContext:

RealmContext realmContext = () -> realmId;
PolarisCallContext polarisCallContext = PolarisCallContext.copyOf(base.getPolarisCallContext());

From that point onwards, it's 100% safe to use that instance of RealmContext, since it's not a CDI bean.

The instance returned by PolarisCallContext.copyOf() is also not a CDI-managed bean; however, we observe that it holds references to 4 beans: BasePersistence, PolarisDiagnostics, PolarisConfigurationStore and Clock:

return new PolarisCallContext(
original.getMetaStore().detach(),
original.getDiagServices(),
original.getConfigurationStore(),
original.getClock());

Of these 4 beans, 3 are application-scoped and should be usable in any thread: PolarisDiagnostics, PolarisConfigurationStore and Clock. As for BasePersistence, it's not a CDI bean and should be usable in any thread as well (I'll notice that @dimas-b added a detach() method to BasePersistence recently, which has no implementation for now but the default one that returns this – was it with the intent of creating non-CDI beans out of CDI ones?).

Taking a step back:

One could argue here that we should rely on context propagation done by Quarkus. And yes, that was my first intention when I ported the TaskExecutor code to Quarkus. If you look at the default task executor:

public ManagedExecutor taskExecutor(TaskHandlerConfiguration config) {
return SmallRyeManagedExecutor.builder()
.injectionPointName("task-executor")
.propagated(ThreadContext.ALL_REMAINING)
.maxAsync(config.maxConcurrentTasks())
.maxQueued(config.maxQueuedTasks())
.build();
}

You'll notice it's a ManagedExecutor with context propagation enabled.

But unfortunately, CallContext is forcibly closed by a dispose method:

public void closeCallContext(@Disposes CallContext callContext) {
callContext.close();
}

And my experimentations with Quarkus showed that in this case, the CallContext instance is propagated to the TaskExecutor threads, but is also closed as soon as the initial HTTP request is done, leading to failures in the task executor threads.

That was the reason why I switched to manual copies.

FYI @dimas-b @gh-yzou

@adutra
Copy link
Contributor

adutra commented Jun 2, 2025

@gh-yzou can you share the stack trace of your ContextNotActiveException? Is it happening within a task executor thread or somewhere else?

Comment on lines 46 to 49
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just get rid of this function then? I know it's used in a lot of places but it sounds unsafe to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole refactor actually turns out pretty big, and touches persistent, so I am breaking this into couple of PRs. There will be follow up PRs coming to completely remove this function if we are not able to figure out how to propogate the beans :
PR to switch non loadTasks API to use the new API
PR to update the loadTasks to use the new API

This is also in the PR description.

Comment on lines 210 to 211
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't realm implied by polarisCallContext?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you need CallContext to access RealmContext and the realm ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the realm ID is not conveniently available from PolarisCallContext, but PolarisCallContext has a BasePersistence instance, which is specific to a realm (cf. JdbcBasePersistenceImpl)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. The boundaries between CallContext and PolarisCallContext are unclear. PolarisCallContext could store the realm ID since an instance of that type cannot be constructed without knowing the realm ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, i am always getting confused about what should be in CallContext, and what should be in PolarisCallConext. Maybe we should follow up to make a clear definition of those contexts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is more about potentially confusing realm ID is this new realm parameter and the implicit realm ID inside PolarisCallContext (even if it does not hold it as a field).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, i updated the getConfiguration to take realmContext, instead of realm. and updated this function to just take the whole CallContext to be more clear.

@gh-yzou
Copy link
Contributor Author

gh-yzou commented Jun 2, 2025

@adutra Yes, we were trying to leverage CDI to get rid of the thread local usage CallContext.getCurrentContext() in DefaultConfigurationStore, and didn't realize that it doesn't work with task executor.

The proper way to pass the realm context here would be as a method parameter, for example by passing CallContext instead of PolarisCallContext

We actually only need the realm, not even the whole call context. This PR currently introduces the new method, i will have follow up PRs to remove the other usages step by step, the whole refactor is pretty big and touches persistent, i need to move step by step to make thing simplier

@adutra
Copy link
Contributor

adutra commented Jun 2, 2025

We actually only need the realm, not even the whole call context. This PR currently introduces the new method, i will have follow up PRs to remove the other usages step by step, the whole refactor is pretty big and touches persistent, i need to move step by step to make thing simplier

OK, in this case for consistency with other components, maybe we could pass RealmContext? Something like:

  public <T> @Nullable T getConfiguration(@Nonnull RealmContext realmCtx, String configName) {

This could also answer @dimas-b concerns voiced here:

#1758 (comment)

@dimas-b
Copy link
Contributor

dimas-b commented Jun 2, 2025

getConfiguration(@Nonnull RealmContext realmCtx, String configName)

LGTM as far as this PR is concerned.

In the long run, I still believe it is preferable to have RealmContext injected via CDI as opposed to explicit method parameters.

@dimas-b
Copy link
Contributor

dimas-b commented Jun 2, 2025

The BasePersistence.detach() method is not specifically about context or CDI, it's about sharing the same object in different threads... Calling A.detach() is expected to return an object that has the same contextual properties (e.g. realm ID) as A.

@gh-yzou gh-yzou force-pushed the yzou-fix-s3-delte branch from 84327ad to e417244 Compare June 2, 2025 18:50
@gh-yzou
Copy link
Contributor Author

gh-yzou commented Jun 2, 2025

@adutra Thanks a lot! I didn't realize that was the concern. @dimas-b I think let's follow up on the long run, in long run i definitely want to remove one of the getConfiguration usage, and we can discuss about which one to remove, and how to remove them.

* @return the current value set for the configuration key for the given realm, or null if not set
* @param <T> the type of the configuration value
*/
default <T> @Nullable T getConfiguration(@Nonnull RealmContext realmContext, String configName) {
Copy link
Contributor

@dimas-b dimas-b Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with RealmContext, I believe it should be present in all methods.

As discussed in another thread, PolarisCallContext implies a particular realm, but does not expose the realm ID. Therefore, consistent behaviour is not guaranteed across methods using PolarisCallContext and RealmContext.

-1 This is a critical issue for this PR from my POV.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: having convenience lookup methods based on CallContext would be fine, because we have CallContext.getRealmContext()

Copy link
Contributor

@dimas-b dimas-b Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing -1 in the interest of making progress. There is only one implementation, which appears to processes both lookup call paths correctly (as far as I can tell from the diff). I'd be fine merging in this state after the other minor comments are addressed.

@gh-yzou : Would you mind opening a GH issue for followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PolarisCallContext is actually not used anywhere during getConfiguration, and only realmContext is needed. So i think we should deprecate the usage of the methods that uses PolarisCallContext, and i plan to remove the usages step by step in follow up PRs, remove all of them in one PR makes the change really big and touches persistent. i can mark the other APIs as deprecated for now if that helps, would that work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added a TODO to all other functions that we will update them to use RealmContext instead of PolarisCallContext

Copy link
Contributor

@dimas-b dimas-b Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with improving config lookup with a series of PRs. I'd still appreciate a GH issue for visibility. I believe it would be a 1.0 blocker due to large lookup API impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b yes, absolutely. i created an issue here #1775, i will also add this in the comment. That will be my next highest priority on my tasks

* @param configName the name of the configuration key to check
* @return the current value set for the configuration key or null if not set
* @param <T> the type of the configuration value
* <p>This function needs to be used with caution, it can not be called outside of active
Copy link
Contributor

@dimas-b dimas-b Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this warning above parameter descriptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link

@sfc-gh-jojiang sfc-gh-jojiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @gh-yzou for working on this and everyone for reviewing!

@HonahX HonahX merged commit dd2fdf9 into apache:main Jun 2, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jun 2, 2025
travis-bowen pushed a commit to travis-bowen/polaris that referenced this pull request Jun 20, 2025
…rationStore (apache#1758) (apache#47)

Although we do a check of !realmContextInstance.isUnsatisfied() it only checks if there is matching bean, however, it doesn't check if the context is active or not. Similarly isResolvable also don't check if the scope is active. Therefore if getConfiguration is called when there is no active scope, it will get Exception like ContextNotActiveException.
This actually fails the TaskExecutor because it runs in a separate thread.

In order to fix the problem, we introduces a new getConfiguration function to handle the background tasks, and also use isResolvable instead of isUnsatisfied to handle ambiguities.

Co-authored-by: gh-yzou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants