feat: Support customizing S3 endpoints #1913
Conversation
I took a quick look -- pooling the STS clients seems like a reasonable thing to do, and this approach to doing so looks pretty correct to me. However, it's not totally clear to me why this is needed to implement #32. Looking at #389, this part of the code wasn't touched at all. Could we instead start more top-down with S3-compatible storage support and figure out if these changes are actually necessary?
Why are we copying such generic code from another project?
Because it is convenient as opposed to writing it from scratch.
I'm curious what will happen if anyone changes this class later. Should we keep CODE_COPIED_TO_POLARIS forever?
To reduce the maintenance burden, I'd suggest rewriting it to avoid copying from another project.
OSS is meant to be reused. Rewriting to avoid attribution goes against the grain of OSS concepts.
Re: forever: I'd guess it's very likely as I do not personally see a reason for rewriting this code in any substantial way in the foreseeable future.
@eric-maynard @flyrain We have a ton of code taken from other projects already in the code base. Most of that was already part of the initial contribution to Apache. A lot of that was also "generic" and literally copies of rather trivial code. So I seriously do not understand what you're trying to say here. The goal should really not be to reinvent "all the wheels".
What’s an example of similarly generic code that’s been copied?
It's always good practice to reduce dependencies on external code. In this case, we have to add additional information to the LICENSE. If rewriting isn't a big task, which I believe it isn't, we might avoid extra maintenance burden. That's also just a suggestion; I'm NOT gonna block this PR over it.
License checks for CODE_COPIED_TO_POLARIS marks are automated - there's minimal cognitive effort involved in maintaining these license references (plus they are fixed and do not need to change over time).
There is, as mentioned, a lot of other projects' code in Polaris (OpenAPI generator files, the web site, code copied from Iceberg, build scripts, etc.).
There is no ongoing maintenance burden, but there is a huge plus in taking code that's been tested and known to work in production for years. Newly developed code is way more work.
Is this really related to S3-compatible storage support?
Stats are not directly related to supporting S3-compatible storage; however, the STS client pool is. Since we have a pool in the system, it is good to expose metrics about it for observability.
If we are using the AWS SDK, it has metrics built in (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html). Is the current mechanism more comprehensive than that?
these stats are about the cache of STS clients we keep in Polaris code.
I see, we're just plotting a gauge for max clients. Wouldn't it be better to have stats for the HTTP connection pool? For example, let's say I have 2 entries in the cache and one entry hogs 49 of my connections while the other holds only 1; I would want to know which one hogs the 49 connections.
Stats for the HTTP client would be nice to have for sure, but let's defer it to another PR. I do not have that code handy ATM and I would not want to bloat the scope of this PR even more :)
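As an illustration of the kind of cache gauges being discussed, here is a minimal sketch using Micrometer (the metric names, the Caffeine cache type, and the helper class are assumptions for illustration, not necessarily what the PR registers):

```java
import com.github.benmanes.caffeine.cache.Cache;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

final class StsClientCacheMetrics {

  // Registers simple gauges describing the STS client cache; metric names are hypothetical.
  static void register(MeterRegistry registry, Cache<?, ?> cache, int maxSize) {
    Gauge.builder("polaris.sts.clients.cache.size", cache, c -> c.estimatedSize())
        .description("Approximate number of cached STS clients")
        .register(registry);
    Gauge.builder("polaris.sts.clients.cache.max", () -> maxSize)
        .description("Configured maximum number of cached STS clients")
        .register(registry);
  }
}
```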
This is the only usage of SdkHttpClient outside of QuarkusProducers itself; I wonder if we can simplify things by just injecting the StsClientBuilder.
StsClientsPool is injected where it is needed. SdkHttpClient is injected into StsClientsPool. Are you suggesting to hide the SdkHttpClient from CDI? If we did that, it would not be possible to close the singleton SdkHttpClient using CDI callbacks.
If the suggestion is to make SdkHttpClient an inner (owned) field in StsClientsPool and close it with the pool, IMHO that introduces unnecessary coupling, because the pool only needs the interface (namely SdkHttpClient), but QuarkusProducers chooses a specific implementation of that interface.
Yes, that is the suggestion. They are always coupled together, so encapsulating the client inside the pool simplifies usage of the pool.
Currently the pool and the HTTP client are not coupled. The pool needs a client. Yes, the client can be reused by other parts of Polaris (if it becomes necessary in the future).
The pool is easily injectable in the current state of the PR. Use sites need not be aware of how the client is constructed.
I'd like to keep the current approach of having distinct CDI beans for the HTTP client and the STS client pool.
Also, as @singhpk234 commented in another thread, we're going to support multiple HTTP client implementations (later), selectable at runtime (by config).
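For readers following along, a minimal sketch of what a distinct CDI bean for the shared HTTP client could look like, including closing it via a CDI callback (class and method names are illustrative, not the actual QuarkusProducers code; assumes Quarkus with Jakarta CDI):

```java
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Disposes;
import jakarta.enterprise.inject.Produces;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;

public class HttpClientProducers {

  // Single shared HTTP client, reused by all "thin" STS clients.
  @Produces
  @ApplicationScoped
  public SdkHttpClient sdkHttpClient() {
    return ApacheHttpClient.builder().build();
  }

  // CDI disposer: closes the singleton client on server shutdown.
  public void closeSdkHttpClient(@Disposes SdkHttpClient client) {
    client.close();
  }
}
```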
singhpk234
left a comment
My comment doesn't show up, linking it here.
I can do that, but the diff is going to be larger ;)
@singhpk234: Sorry, I do not see your comments (even via the above links). Could you re-submit them?
Sure, I resubmitted!
I think I messed up submitting last time, thank you for your patience ~
If the HTTP connection pool is shared, shouldn't we restrict the max connections per route? Let's say my pool size is 50 and route 1 is hit with 50 concurrent requests, so the pool now only holds connections for route 1; if route 2 now comes along, it starves. I am not sure how to do it here; the Apache HttpClient normally uses a PoolingConnectionManager for this.
Is the AWS SDK not able to deal with this situation on its side?
Did you observe this kind of starvation in practice?
This is practically possible. I think since we don't specify it, the HTTP client's default kicks in. TBH I haven't hit it in practice, as AWS EMR Spark jobs generally have one endpoint to connect to, but with a long-running service like Polaris I think the possibility increases. WDYT?
@snazy WDYT?
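For reference, the knob the SDK's Apache-based client exposes today is a global connection cap on the shared client; as far as I know its builder does not expose a per-route limit, so a per-route policy would need deeper customization. A minimal sketch (class name and values are arbitrary):

```java
import java.time.Duration;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;

final class SharedHttpClientFactory {
  // Caps the total number of pooled connections shared across all routes;
  // the Apache client builder does not expose a per-route limit.
  static SdkHttpClient create() {
    return ApacheHttpClient.builder()
        .maxConnections(50) // arbitrary example value
        .connectionAcquisitionTimeout(Duration.ofSeconds(10))
        .build();
  }
}
```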
Why only ApacheHttpClient and not the URLConnection client?
I can make it configurable at runtime if you prefer. Let's do that as a follow-up PR, though.
ApacheHttpClient is more capable in general, AFAIK.
Iceberg supports both, hence this is coming from there. It's good to have an alternative, especially considering how they behave. For example, the Apache HTTP connection will always read the entire stream on close, which results in much more data being read than needed. The URL connection client does not behave this way. For more details: apache/iceberg#7262
Certainly. I do not mind supporting both, but let's do that in a follow-up PR to avoid bloating the scope of this one (the main idea in this PR is not the ApacheHttpClient, but supporting multiple endpoints).
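A rough sketch of what the runtime-selectable client could look like in that follow-up (the enum, class name, and selection mechanism are made up for illustration; `UrlConnectionHttpClient` additionally requires the SDK's `url-connection-client` module on the classpath):

```java
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;

enum HttpClientType {
  APACHE,
  URL_CONNECTION
}

final class HttpClients {
  // Picks the SdkHttpClient implementation based on a (hypothetical) config value.
  static SdkHttpClient create(HttpClientType type) {
    switch (type) {
      case URL_CONNECTION:
        return UrlConnectionHttpClient.builder().build();
      case APACHE:
      default:
        return ApacheHttpClient.builder().build();
    }
  }
}
```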
Marking this PR as "draft" to add end-to-end support for MinIO. Please feel free to comment anyway.
Prerequisite PR: #1955
@eric-maynard PTAL :)
Thanks for doing this. I think it's the right direction. We could also make the CLI support the new fields, which could be done as a follow-up.
This is not the final PR for non-AWS S3 :) We should certainly add CLI support, but probably in a follow-up PR.
No functional change. Introduce a dedicated interface for StsClient suppliers and implement it using a pool of cached clients. All clients are "thin" and share the same `SdkHttpClient`. The latter is closed when the server shuts down. This is a step towards supporting non-AWS S3 storage (apache#1530). For this reason the STS endpoint is present in the new interfaces, but is not used yet.
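To make the commit description concrete, here is a rough sketch of what such a pool could look like. This is not the actual Polaris code: the Caffeine cache, the `URI` key, and the class and method names are assumptions; the real constructor under review appears in the snippet just below.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.net.URI;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.services.sts.StsClient;

final class StsClientsPoolSketch {
  // Bounded cache of "thin" STS clients, keyed by STS endpoint; all share one SdkHttpClient.
  private final LoadingCache<URI, StsClient> clients;

  StsClientsPoolSketch(SdkHttpClient sharedHttpClient, int maxClients) {
    this.clients =
        Caffeine.newBuilder()
            .maximumSize(maxClients)
            // Close evicted clients; the shared SdkHttpClient itself stays open.
            .removalListener(
                (URI endpoint, StsClient client, RemovalCause cause) -> {
                  if (client != null) {
                    client.close();
                  }
                })
            .build(
                endpoint ->
                    StsClient.builder()
                        .httpClient(sharedHttpClient) // reuse the shared HTTP connection pool
                        .endpointOverride(endpoint) // per-destination STS endpoint
                        .build());
  }

  StsClient get(URI stsEndpoint) {
    return clients.get(stsEndpoint);
  }
}
```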
private final Function<StsDestination, StsClient> clientBuilder;

public StsClientsPool(
    S3AccessConfig effectiveSts, SdkHttpClient sdkHttpClient, MeterRegistry meterRegistry) {
Can we pass effectiveSts as an integer instead of passing the whole S3AccessConfig object?
certainly - updated.
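In other words, something along these lines (an illustrative signature; the parameter name is made up):

```java
// Only the effective maximum number of cached clients is passed in,
// rather than the whole S3AccessConfig object.
public StsClientsPool(int maxStsClients, SdkHttpClient sdkHttpClient, MeterRegistry meterRegistry) {
  // ...
}
```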
No functional change. Introduce a dedicated interface for `StsClient` suppliers and implement it using a pool of cached clients. All clients are "thin" and share the same `SdkHttpClient`. The latter is closed when the server shuts down. This is a step towards supporting non-AWS S3 storage (apache#1530). For this reason the STS endpoint is present in the new interfaces, but is not used yet. (cherry picked from commit 95d1eac)
This change enables using non-AWS S3 implementations (#1530) when `SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION` is set to `false`.
- Adds `endpoint` to the AWS `AwsStorageConfigInfo` (API change).
- STS clients share the same `SdkHttpClient`. The latter is closed when the server shuts down.
Existing catalogs are not affected by this change.
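At the AWS SDK level, pointing a client at a non-AWS S3 implementation such as MinIO boils down to an endpoint override plus, typically, path-style addressing. A minimal sketch (the class name, endpoint URL, and credentials are placeholders):

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

final class MinioS3ClientExample {
  // Placeholder endpoint and credentials for a local MinIO instance.
  static S3Client create() {
    return S3Client.builder()
        .endpointOverride(URI.create("http://localhost:9000"))
        .region(Region.US_EAST_1) // MinIO ignores the region, but the SDK requires one
        .credentialsProvider(
            StaticCredentialsProvider.create(
                AwsBasicCredentials.create("minioadmin", "minioadmin")))
        .forcePathStyle(true) // path-style addressing, commonly needed for S3-compatible stores
        .build();
  }
}
```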
Dev ML discussion: https://lists.apache.org/thread/72psyjb40fb9l73sxld2qcr69l4tf4cw