Skip to content

Conversation

@dimas-b
Copy link
Contributor

@dimas-b dimas-b commented Jun 19, 2025

This change enables using non-AWS S3 implementations (#1530) when SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION is set to false.

  • Add optional endpoint to AWS AwsStorageConfigInfo (API change)
  • Introduce a dedicated interface for StsClient suppliers and implement it using a pool of cached clients.
  • All client are "thin" and share the same SdkHttpClient. The latter is closed when the server shuts down.
  • If an endpoint is set, propagate it to S3 clients and local FileIO.
  • Use the same endpoint for STS requests (credential vending).
  • Add utility classes for testing with MinIO .
  • Add integration test based on MinIO.

Existing catalogs are not affected by this change.

Dev ML discussion: https://lists.apache.org/thread/72psyjb40fb9l73sxld2qcr69l4tf4cw

@eric-maynard
Copy link
Contributor

I took a quick look -- pooling the STS clients seems like a reasonable thing to do and this approach to doing so looks pretty correct to me. However it's not totally clear to me why this is needed to implement #32. Looking at #389, this part of the code wasn't touched at all.

Could we instead start more top-down with S3-compatible storage support and figure out if these changes are actually necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we copying such generic code from another project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is convenient as opposed to writing it from scratch.

Copy link
Contributor

@flyrain flyrain Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what will happen if anyone change this class later? Should we keep CODE_COPIED_TO_POLARIS for ever?

To reduce the maintenance burden, I'd suggest to rewrite to avoid copying from another project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OSS is meant to be reused. Rewriting to avoid attribution is against the grain of OSS concepts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: forever: I'd guess it's very likely as I do not personally see a reason for rewriting this code in any substantial way in the foreseeable future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-maynard @flyrain We have a ton of code taken from other projects already in the code base. Most of that was already part of the initial contribution to Apache. A lot of that was also "generic" and literally copies of rather trivial code. So I seriously do not understand what you're trying to say here. The goal should really not be to reinvent "all the wheels".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s an example of similarly generic code that’s been copied?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always a good practice to reduce the dependency for code. In this case, we have to add additional information in the LICENSE. If rewriting isn't a big task which I believe so, we mighty avoid extra maintenance burden. That's also just a suggestion, I'm NOT gonna block this PR for that.

Copy link
Contributor Author

@dimas-b dimas-b Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License checks for CODE_COPIED_TO_POLARIS marks are automated - there's minimal cognitive effort involved in maintaining these license references (plus they are fixed and do not need to change over time).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is, as mentioned, a lot of other projects' code in Polaris (OpenAPI generator files, the web site, code copied from Iceberg, build scripts, etc etc etc).

There is no ongoing maintenance burden, but there is a huge plus taking code that's tested and known to work in production for years. New developed code is way more work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really related to S3-compatible storage support?

Copy link
Contributor Author

@dimas-b dimas-b Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats are not directly relates to supporting S3-compatible storage. However, the STS clients pool is. Having a pool in the system, it is good to expose metrics about it for observability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are using AWS ask it has metrics inbuilt https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html is the current mechanism more comprehensive than that ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these stats are about the cache of STS clients we keep in Polaris code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we just plotting a gauge for max clients, wouldn't it be better to have the stats for the http connection pool ? for ex lets say i have 2 entires in the cache and one entry hogs my 49 connection and other only 1 i would want to know who hogs 49 connection ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats for the HTTP client would be nice to have for sure, but let's defer it to another PR. I do not have that code handy ATM and I would not want to bloat the scope of this PR even more :)

Comment on lines +78 to +79
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only usage of SdkHttpClient outside of QuarkusProducers itself, I wonder if we can simplify things by just injecting the StsClientBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StsClientsPool is injected where it is needed. SdkHttpClient is injected into StsClientsPool. Are you suggesting to hide the SdkHttpClient from CDI? If we did that, it would not be possible to close the singleton SdkHttpClient using CDI callbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the suggestion is to make SdkHttpClient an inner (owned) fields in StsClientsPool and close the pool, IMHO, that introduces unnecessary coupling, because the pool only needs the interface (namely SdkHttpClient), but QuarkusProducers chooses a specific implementation of that interface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the suggestion. They are always coupled together, so encapsulating the client inside the pool simplifies usage of the pool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the pool and the HTTP client are not coupled. The pool needs a client. Yes, the client can be reused by other parts of Polaris (if it becomes necessary in the future).

The pool is easily injectable in the current state of the PR. Use sites need not be aware of how the client is constructed.

I'd like to keep the current approach of having distinct CDI beans for the HTTP client and the STS client pool.

Copy link
Contributor Author

@dimas-b dimas-b Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, as @singhpk234 commented in another thread, we're going to support multiple HTTP client implementations (later), selectable in runtime (by config).

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my comment doesn't shows up linking here

  1. #1913 (comment)
  2. #1913 (comment)
  3. #1913 (comment)

@dimas-b
Copy link
Contributor Author

dimas-b commented Jun 20, 2025

Could we instead start more top-down with S3-compatible storage support and figure out if these changes are actually necessary?

I can do that, but the diff is going to be larger ;)

@dimas-b
Copy link
Contributor Author

dimas-b commented Jun 20, 2025

@singhpk234 : Sorry, I do not see your comments (even via above links). Could you re-summit them?

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i resubmitted !
I think i messed up last time in submitting, thank you for patience ~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if HTTP connection pool is shared, shouldn't we restrict max connection per route ? as if lets say my pool size is 50 and for route 1 hit with 50 concurrent request so pool now only has 50 connection of route 1 and if now route 2 comes it starves ? I am not sure how to do it here HTTPClient use to PoolingConnectionManager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the AWS SDK not able to deal with this situation on its side?

Did you observe this kind of starvation in practice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is practically possible, i think since we don't specify it default kicks in from http client, TBH i haven't hit in practice as in AWS EMR spark jobs generally have one endpoint to connect to, but with a long running service like polaris, i think the possibility being increased, WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snazy WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only ApacheHttpClient and not URLConnection client ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it configurable in runtime if you prefer. Let's do that as a follow-up PR, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ApacheHttpClient is more capable in general, AFAIK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg supports both, hence coming from there, Its good to have alternative specially considering how they behave. for example The Apache HTTP connection will always read the entire stream on close, which results in much more data being read than needed. The URL connection client does not behave this way. for more details : apache/iceberg#7262

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly. If do not mind supporting both, but let's do that in a follow-up PR to avoid bloating the scope of this PR (the main idea in this PR is not the ApacheHttpClient client, but supporting multiple endpoints).

@dimas-b
Copy link
Contributor Author

dimas-b commented Jun 20, 2025

Marking this PR as "draft" to add end-to-end support for MinIO. Please feel free to comment anyway.

@dimas-b dimas-b marked this pull request as draft June 20, 2025 16:49
@dimas-b dimas-b changed the title Add STS clients pool feat: Support customizing S3 endpoints Jun 25, 2025
@dimas-b dimas-b marked this pull request as ready for review June 25, 2025 04:09
@dimas-b
Copy link
Contributor Author

dimas-b commented Jun 25, 2025

CatalogSerializationTest currently fails because unset endpoint values serialize as null. This is technically a breaking change for older clients. I'll see how it can be fixed later (under this PR). Please review the main change for now.

@dimas-b
Copy link
Contributor Author

dimas-b commented Jun 27, 2025

Prerequisite PR: #1955

@dimas-b dimas-b marked this pull request as draft June 28, 2025 00:52
@dimas-b dimas-b force-pushed the sts-client-pool branch 2 times, most recently from 8363e38 to de9c323 Compare June 30, 2025 21:45
@dimas-b dimas-b marked this pull request as ready for review June 30, 2025 21:45
snazy
snazy previously approved these changes Jul 2, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jul 2, 2025
@dimas-b
Copy link
Contributor Author

dimas-b commented Jul 3, 2025

@eric-maynard PTAL :)

@flyrain
Copy link
Contributor

flyrain commented Jul 3, 2025

Thanks for doing this. I think it's the right direction. We could also make CLI to support new fields, which could be done as a followup.

@dimas-b
Copy link
Contributor Author

dimas-b commented Jul 3, 2025

This is not the final PR for non-AWS S3 :) We should certainly add CLI support, but probably in a follow-up PR.

snazy
snazy previously approved these changes Jul 3, 2025
dimas-b added 10 commits July 3, 2025 15:00
No functional change.

Introduce a dedicated interface for StsClient suppliers
and implement it using a pool of cached clients.

All client are "thin" and share the same `SdkHttpClient`.
The latter is closed when the server shuts down.

This is a step towards supporting non-AWS S3 storage (apache#1530).
For this reason the STS endpoint is present in new
interfaces, but is not used yet.
@dimas-b dimas-b force-pushed the sts-client-pool branch from 91a6ee4 to c61861b Compare July 3, 2025 19:00
private final Function<StsDestination, StsClient> clientBuilder;

public StsClientsPool(
S3AccessConfig effectiveSts, SdkHttpClient sdkHttpClient, MeterRegistry meterRegistry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass effectiveSts as an integer instead of passing the whole S3AccessConfig object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

certainly - updated.

@snazy snazy merged commit 95d1eac into apache:main Jul 4, 2025
11 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jul 4, 2025
snazy added a commit to snazy/polaris that referenced this pull request Jul 17, 2025
snazy added a commit that referenced this pull request Jul 21, 2025
…ess (#2127)

This change adds support for endpoint, sts-endpoint, path-style-access to the Polaris Python client.

Amends #1913 and #2012
ashevchuk123 pushed a commit to mapr/polaris that referenced this pull request Oct 27, 2025
No functional change.

Introduce a dedicated interface for `StsClient` suppliers and implement it using a pool of cached clients.

All client are "thin" and share the same `SdkHttpClient`. The latter is closed when the server shuts down.

This is a step towards supporting non-AWS S3 storage (apache#1530).
For this reason the STS endpoint is present in new interfaces, but is not used yet.

(cherry picked from commit 95d1eac)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants