[WIP][SPARK-44937][CORE] Add SSL/TLS support for RPC and Shuffle communications #42685
Conversation
cc: @JoshRosen, @mridulm, @vanzin and @turp1twin as potential reviewers to look at this PR based on blame. Happy to discuss/share more context, I understand this is a big PR. I'm looking to gather high level feedback so I can address any concerns and work towards getting this merged. Thank you in advance!

@hasnain-db Wow, this is a blast from the past... Happy to review... Cheers!
common/network-common/src/main/java/org/apache/spark/network/ssl/SSLFactory.java
I wonder if the failing test is flaky; it consistently passes for me locally.
Force-pushed from 716b3c8 to cd6810f.
Overall, this looks good to me:

- Overall, this PR's changes are well-flagged.
- There is only one small change which isn't guarded by a flag, and it relates to faster marking of "close()-in-progress" connections in order to avoid race conditions where those connections are used for new requests. I believe that this is a safe change and don't feel a need to flag it.
- Key functionality is end-to-end(-ish) tested via subclasses of existing suites.
- I took a careful look at the changes to the network server and client code, including Netty buffer management.
- Some of the files (e.g. `SslMessageEncoder`) are substantially similar to existing ones, but have small differences in one or two key places. I think this duplication is okay, though, as the code is slow-changing and isn't super complex: abstracting to de-duplicate would make the code harder to understand and could potentially have performance impacts.
- Confirmed that various SSL objects have the expected lifecycle (e.g. ensuring that we're reusing the `SSLContext` so that session resumption can work; see the sketch after this list).
- Spark's pre-existing SSL configurations only apply to its Jetty servers (e.g. the web UI, history server, and master / worker UIs), and this PR is introducing new types of SSL configurations for SSL features that are only supported in the Netty RPC encryption code. This has some potential to be confusing if users set new keys at the global SSL configuration namespace, because some of those configurations would only apply to RPC or UI encryption but not both. I don't see any great clean solutions for this, so I think this PR's current approach of "call out the caveats clearly in documentation" is a reasonable option.
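For readers unfamiliar with why `SSLContext` reuse matters: a JSSE `SSLContext` holds the TLS session cache, so creating one context and deriving every `SSLEngine` from it is what makes session resumption possible across connections. A minimal standalone sketch of the pattern (not Spark's actual `SSLFactory` code; default key/trust material is assumed for brevity):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public final class SharedContextExample {
  // One SSLContext per factory: its internal session cache is what
  // makes TLS session resumption possible across connections.
  private final SSLContext context;

  public SharedContextExample() throws Exception {
    context = SSLContext.getInstance("TLS");
    context.init(null, null, null); // default key/trust material for brevity
  }

  // Each connection gets its own engine, but all engines share the
  // context (and therefore the session cache).
  public SSLEngine newClientEngine(String host, int port) {
    SSLEngine engine = context.createSSLEngine(host, port);
    engine.setUseClientMode(true);
    return engine;
  }
}
```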
@mridulm @dongjoon-hyun, would be great if you could also take a look at this.
```
 * @see <a href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Hadoop MapReduce Next Generation - Encrypted Shuffle</a>
 */
public final class ReloadingX509TrustManager
```
Bundling this as source avoids the need to add hadoop-common dependencies to common/network-common. Currently it looks like nothing under common/ depends on Hadoop stuff. Based on https://github.com/c9n/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ssl/ReloadingX509TrustManager.java it looks like this code hasn't changed in over a decade, so this is unlikely to pose maintainability risks.
My read of https://infra.apache.org/licensing-howto.html is that this is permissible via
- https://infra.apache.org/licensing-howto.html#alv2-dep
- https://infra.apache.org/licensing-howto.html#bundle-asf-product
and I don't see any NOTICE portions in hadoop-common that specifically pertain to this file, so I believe we're clear from a license perspective 👍
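For context on what the bundled class does: it delegates to a standard `X509TrustManager` and swaps the delegate when the trust-store file on disk changes. A simplified sketch of that delegation-plus-reload pattern (the real Hadoop class reloads from a background thread rather than only at construction):

```java
import java.io.File;
import java.io.FileInputStream;
import java.security.KeyStore;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.concurrent.atomic.AtomicReference;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public final class ReloadingTrustManagerSketch implements X509TrustManager {
  private final File trustStoreFile;
  private final String password;
  private final AtomicReference<X509TrustManager> delegate = new AtomicReference<>();
  private volatile long lastLoaded = -1;

  public ReloadingTrustManagerSketch(File trustStoreFile, String password) throws Exception {
    this.trustStoreFile = trustStoreFile;
    this.password = password;
    reloadIfNeeded();
  }

  // Re-read the trust store only when the file's mtime has changed.
  private void reloadIfNeeded() throws Exception {
    long modified = trustStoreFile.lastModified();
    if (modified == lastLoaded) return;
    KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
    try (FileInputStream in = new FileInputStream(trustStoreFile)) {
      ks.load(in, password.toCharArray());
    }
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init(ks);
    for (TrustManager tm : tmf.getTrustManagers()) {
      if (tm instanceof X509TrustManager) {
        delegate.set((X509TrustManager) tm);
        lastLoaded = modified;
        return;
      }
    }
  }

  @Override
  public void checkClientTrusted(X509Certificate[] chain, String authType)
      throws CertificateException {
    delegate.get().checkClientTrusted(chain, authType);
  }

  @Override
  public void checkServerTrusted(X509Certificate[] chain, String authType)
      throws CertificateException {
    delegate.get().checkServerTrusted(chain, authType);
  }

  @Override
  public X509Certificate[] getAcceptedIssuers() {
    return delegate.get().getAcceptedIssuers();
  }
}
```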
```
// Mark the connection as timed out, so we do not return a connection that's being closed
// from the TransportClientFactory if closing takes some time (e.g. with SSL)
this.timedOut = true;
```
This change is necessary for SSL because SSL connections can take a long time to close, but I think it's also okay to have it apply for regular unencrypted connections (as you've done here): if we have a client that is in the process of closing then marking it as timed out during that close helps to avoid a race condition in which a pooled client might be handed to a requester only to immediately close. 👍
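To make that race concrete, here is a toy, hypothetical version of the pattern (class and field names are illustrative, not Spark's actual code): flipping the liveness flag before the potentially slow teardown means the pool's check fails fast and the dying client is never handed out.

```java
// Hypothetical, heavily simplified version of the pooled-client pattern
// discussed above; the real TransportClient/TransportClientFactory have
// far more state than this.
final class PooledClient {
  private volatile boolean timedOut = false;

  // The factory only returns clients for which this is true.
  boolean isActive() {
    return !timedOut;
  }

  void close() {
    // Mark unusable *before* the potentially slow teardown (e.g. an SSL
    // close_notify exchange) so the factory stops returning this client.
    timedOut = true;
    slowTeardown();
  }

  private void slowTeardown() { /* flush, TLS shutdown, socket close... */ }
}
```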
Thanks for working on this @hasnain-db, this is a very nice addition to Spark! We can keep this PR as a reference - to get an idea of what the final version would look like.

Thanks @mridulm! Happy to do that. I can think of a nice split in line with the bullet points listed in the summary here.

Just to confirm (since I'm not sure this repo has support for stacked PRs - if there is, please link me to an example): you're proposing I put up one PR, get it approved and merged, then put up the second PR, and so on, right (since most changes depend on each other)?

One option I'd prefer (if possible) is to split this PR into N commits which can be reviewed independently - based on the outline above - would that suffice for now? A lot of the changed lines are adding test files or just copying existing tests to use the new configs, which are more "mechanical" to review. The changes to Spark core aren't too large, so hopefully that would make it easier to focus the review on the important bits. Please let me know if that works for you.

+CC @otterc

If there are independent PRs, those can be done in parallel.
### What changes were proposed in this pull request?

This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory, only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a bit longer and it is much easier to trigger this issue. Looking at the history of the code, I believe this was an oversight in #9853.

### Why are the changes needed?

Without this change, some of the new tests added in #42685 would fail.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests were run in CI. Without this change, some of the new tests added in #42685 fail.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43162 from hasnain-db/spark-tls-timeout.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?

This PR introduces test keys for SSL functionality. The keys were generated using something like the following:

```
openssl genpkey -algorithm RSA -out key.pem -hexseed deadbeef
openssl pkcs8 -topk8 -in key.pem -out key.pem.out
openssl req -new -key key.pem -out csr.csr -days 3650 -subj "/CN=test/ST=California/L=San Francisco/OU=Org1/O=Org2/C=US"
openssl x509 -req -in csr.csr -signkey key.pem -out certchain.pem -days 3650
rm key.pem csr.csr
mv key.pem.enc key.pem
```

And then copied to all the relevant folders. I also copied over the keystore and truststore files (did not regenerate those).

### Why are the changes needed?

We need these test files to run tests using PEM keys for SSL connections.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

These test files will be used by follow-up PRs. This was tested as part of #42685, which is being split.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43163 from hasnain-db/spark-tls-test-files.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?

Handle `InputStream`s in the `NettyLogger` so we can print out how many available bytes there are.

### Why are the changes needed?

As part of the SSL support we are going to transfer `InputStream`s via Netty, and this functionality makes it easy to see the size of the streams in the log at a glance.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI. Tested as part of the changes in #42685, which this is split out of; I observed the logs there.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43165 from hasnain-db/spark-tls-netty-logger.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
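As a rough illustration of the kind of logging described (a sketch, not Spark's actual `NettyLogger` code): a Netty handler intercepting writes can report `available()` for an `InputStream` message, since hex-dumping a stream is not possible without consuming it.

```java
import java.io.IOException;
import java.io.InputStream;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;

// Sketch: log a compact description of InputStream messages as they
// pass through the outbound pipeline.
public class StreamAwareLoggingHandler extends ChannelDuplexHandler {
  @Override
  public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
      throws Exception {
    if (msg instanceof InputStream) {
      int available;
      try {
        available = ((InputStream) msg).available();
      } catch (IOException e) {
        available = -1; // the stream may not support available()
      }
      System.out.println("WRITE InputStream, available bytes: " + available);
    }
    super.write(ctx, msg, promise); // forward the message unchanged
  }
}
```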
### What changes were proposed in this pull request?

As the title suggests. In addition to that API, add a config to the `TransportConf` to configure the default block size if desired.

### Why are the changes needed?

Netty's SSL support does not support zero-copy transfers. In order to support SSL using Netty, we need to add another API to the `ManagedBuffer` which lets buffers return a different data type.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI. This will have tests added later - it's tested as part of #42685, from which this is split out.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43166 from hasnain-db/spark-tls-buffers.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
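To make the motivation concrete: `convertToNetty()` may return a zero-copy `FileRegion`, whose bytes never pass through the pipeline as `ByteBuf`s and therefore cannot be encrypted by `SslHandler`. A hypothetical shape for the additional API (method name and return types are for illustration, not the merged signature):

```java
import java.io.IOException;

// Illustrative only: the real ManagedBuffer lives in
// org.apache.spark.network.buffer and has more methods.
public abstract class ManagedBufferSketch {
  // Zero-copy path: may return a Netty FileRegion, which SslHandler
  // cannot encrypt because the bytes bypass the pipeline.
  public abstract Object convertToNetty() throws IOException;

  // SSL path: must return something whose bytes flow through the
  // pipeline (e.g. a ByteBuf, or an InputStream read in chunks).
  public abstract Object convertToNettyForSsl() throws IOException;
}
```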
### What changes were proposed in this pull request?

This PR adds a new dependency on this library so that we can use it for SSL functionality. We also include the network-common test helper library in a few places, which will be needed for reusing some test configuration files in the future.

### Why are the changes needed?

These dependency changes are needed so we can use this library to provide SSL functionality, and for test helpers.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI. This is used in unit tests and functionality as part of #42685, from which this is being split out.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43164 from hasnain-db/spark-tls-deps.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?

This change adds new settings to `TransportConf` which are needed for the RPC SSL functionality to work. Additionally, add some sample configurations which are used by tests in follow-up PRs (see #42685 for the full context).

### Why are the changes needed?

These changes are needed so that other modules can easily access configurations, and so that the sample configurations are easily accessible for tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test, then ran:

```
./build/sbt
> project network-common
> testOnly org.apache.spark.network.TransportConfSuite
```

There are more follow-up tests coming (see #42685).

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43220 from hasnain-db/spark-tls-configs-low.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
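As an illustration of the accessor style such settings typically get (a sketch only; the key names below are assumptions for illustration, not verified against the merged patch):

```java
import java.util.Properties;

// Sketch of TransportConf-style accessors for new SSL settings.
// Key names and defaults here are illustrative assumptions.
public class SslTransportConfSketch {
  private final Properties props;

  public SslTransportConfSketch(Properties props) {
    this.props = props;
  }

  public boolean sslRpcEnabled() {
    return Boolean.parseBoolean(props.getProperty("spark.ssl.rpc.enabled", "false"));
  }

  public String sslKeyStore() {
    return props.getProperty("spark.ssl.rpc.keyStore");
  }

  // Default chunk size used when streaming buffers for SSL (since
  // zero-copy is unavailable); 32 KiB is an arbitrary example default.
  public int sslChunkSize() {
    return Integer.parseInt(props.getProperty("spark.ssl.rpc.chunkSize", "32768"));
  }
}
```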
### What changes were proposed in this pull request?

This adds in support for trust store reloading, mirroring the Hadoop implementation (see source comments for a link). I believe reusing the existing code instead of adding a dependency is fine license-wise (see https://github.com/apache/spark/pull/42685/files#r1333667328).

### Why are the changes needed?

This helps us refresh trust stores without needing downtime.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests (also copied from upstream):

```
build/sbt
> project network-common
> testOnly org.apache.spark.network.ssl.ReloadingX509TrustManagerSuite
```

The rest of the changes and integration were tested as part of #42685.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43249 from hasnain-db/spark-tls-reloading.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…portConf

### What changes were proposed in this pull request?

This PR adds the options added in #43220 to `SSLOptions` and `SparkTransportConf`. By adding them to `SSLOptions` we can support inheritance of options, so settings for the UI and the RPC SSL settings can be shared as much as possible. The `SparkTransportConf` changes are needed to support propagating these settings.

I also make some changes to `SecurityManager` to log when this feature is enabled, and make the existing `spark.network.crypto` options mutually exclusive with these new settings (combining them would just mean double encryption).

Lastly, make these flags propagate down to when executors are launched, and allow the passwords to be sent via environment variables (similar to how it's done for an existing secret). This ensures they are not visible in plaintext, but also ensures they are available at executor startup (otherwise it can never talk to the driver/worker).

### Why are the changes needed?

The propagation of these options is needed for the RPC functionality to work.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI. Added some unit tests which I verified passed:

```
build/sbt
> project core
> testOnly org.apache.spark.SparkConfSuite org.apache.spark.SSLOptionsSuite org.apache.spark.SecurityManagerSuite org.apache.spark.deploy.worker.CommandUtilsSuite
```

The rest of the changes and integration were tested as part of #42685.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43238 from hasnain-db/spark-tls-ssloptions.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
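To illustrate the inheritance idea in isolation (a generic sketch, not the actual `SSLOptions.parse` signature): a namespace-specific key falls back to the shared `spark.ssl.*` default when unset.

```java
import java.util.Properties;

// Sketch: resolve "spark.ssl.<ns>.<key>" with fallback to "spark.ssl.<key>",
// mirroring the inheritance described above. Key names are illustrative.
public final class SslOptionInheritance {
  public static String resolve(Properties conf, String ns, String key) {
    String specific = conf.getProperty("spark.ssl." + ns + "." + key);
    return specific != null ? specific : conf.getProperty("spark.ssl." + key);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("spark.ssl.keyStore", "/etc/spark/global.jks");
    conf.setProperty("spark.ssl.rpc.keyStore", "/etc/spark/rpc.jks");
    // The RPC namespace overrides; the UI namespace inherits the global value.
    System.out.println(resolve(conf, "rpc", "keyStore")); // /etc/spark/rpc.jks
    System.out.println(resolve(conf, "ui", "keyStore"));  // /etc/spark/global.jks
  }
}
```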
### What changes were proposed in this pull request?

This PR adds helper classes for SSL RPC communication that are needed to work around the fact that `netty` does not support zero-copy transfers. These mirror the existing `MessageWithHeader` and `MessageEncoder` classes with very minor differences. But the differences were just enough that it didn't seem easy to refactor/consolidate, and since we don't expect these classes to change much, I hope it's okay.

### Why are the changes needed?

These are needed to support converting `ManagedBuffer`s into a form that can be transferred by `netty` over the network, since netty's encryption support does not support zero-copy transfers.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests:

```
build/sbt
> project network-common
> testOnly org.apache.spark.network.protocol.EncryptedMessageWithHeaderSuite
```

The rest of the changes and integration were tested as part of #42685.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43244 from hasnain-db/spark-tls-helpers.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
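The core trick such helpers rely on: instead of handing Netty a zero-copy `FileRegion`, the body is read in bounded chunks into `ByteBuf`s that `SslHandler` can encrypt. A minimal sketch of chunked reading (not the actual classes; the chunk size is an arbitrary example):

```java
import java.io.IOException;
import java.io.InputStream;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Sketch: produce one encryptable ByteBuf chunk at a time from a stream,
// the way a ChunkedInput-style encoder would for SSL transfers.
public final class ChunkedSslRead {
  static final int CHUNK_SIZE = 32 * 1024;

  // Returns null when the stream is exhausted.
  public static ByteBuf nextChunk(ByteBufAllocator alloc, InputStream in)
      throws IOException {
    ByteBuf buf = alloc.heapBuffer(CHUNK_SIZE);
    try {
      int written = buf.writeBytes(in, CHUNK_SIZE); // reads up to CHUNK_SIZE bytes
      if (written < 0) {
        buf.release(); // end of stream: release the unused buffer
        return null;
      }
      return buf;
    } catch (IOException e) {
      buf.release();
      throw e;
    }
  }
}
```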
### What changes were proposed in this pull request?

As titled - add a factory which supports creating SSL engines, and a corresponding builder for it. This will be used in a follow-up PR by the `TransportContext` and related files to add SSL support.

### Why are the changes needed?

We need a mechanism to initialize the appropriate SSL implementation with the configured settings (such as protocol, ciphers, etc.) for RPC SSL support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. This will be more thoroughly tested in a follow-up PR which adds callsites to it. It has been integration tested as part of #42685.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43386 from hasnain-db/spark-tls-factory.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
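A generic picture of the factory/builder shape being described (names are hypothetical; the real `SSLFactory` builder has many more knobs, e.g. ciphers, key/trust stores, and an openSSL-vs-JDK toggle):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Hypothetical, minimal builder in the spirit of the factory described above.
public final class SslEngineFactorySketch {
  private final SSLContext context;
  private final String[] protocols;

  private SslEngineFactorySketch(SSLContext context, String[] protocols) {
    this.context = context;
    this.protocols = protocols;
  }

  public SSLEngine createEngine(boolean clientMode) {
    SSLEngine engine = context.createSSLEngine();
    engine.setUseClientMode(clientMode);
    engine.setEnabledProtocols(protocols);
    return engine;
  }

  public static final class Builder {
    private String[] protocols = {"TLSv1.2", "TLSv1.3"};

    public Builder protocols(String... protocols) {
      this.protocols = protocols;
      return this;
    }

    public SslEngineFactorySketch build() throws Exception {
      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(null, null, null); // default key/trust material for brevity
      return new SslEngineFactorySketch(ctx, protocols);
    }
  }
}
```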
…rtConf

### What changes were proposed in this pull request?

This change ensures that RPC SSL options settings inheritance works properly after #43238 - we pass `sslOptions` wherever we call `fromSparkConf`. In addition to that minor mechanical change, duplicate/add tests for every place that calls this method, to add a test case that runs with SSL support in the config.

### Why are the changes needed?

These changes are needed to ensure that the RPC SSL functionality can work properly with settings inheritance. In addition, through these tests we can ensure that any changes to these modules are also tested with SSL support and avoid regressions in the future.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Full integration testing was also done as part of #42685. Added some tests and ran them:

```
build/sbt
> project core
> testOnly org.apache.spark.*Ssl*
> testOnly org.apache.spark.network.netty.NettyBlockTransferSecuritySuite
```

and

```
build/sbt -Pyarn
> project yarn
> testOnly org.apache.spark.network.yarn.SslYarnShuffleServiceWithRocksDBBackendSuite
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43387 from hasnain-db/spark-tls-integrate-everywhere.

Authored-by: Hasnain Lakhani <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
The overall feature has been merged.
What changes were proposed in this pull request?
This PR adds support for SSL/TLS based communication for Spark RPCs and block transfers - providing an alternative to the existing encryption / authentication implementation documented at https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes
This is based on an existing PR from 2015 (#9416), with some refactoring and a number of updates to make it work with the changes to Spark since then.
I understand this is a large PR, so I've broken this down by a high level summary of changes and a suggested review order:
- `netty-tcnative-boringssl-static`, which provides support for faster TLS communication
- `docs/security.md`, which describes the new flags and configuration
- `SSLOptions` and `TransportConf`, to support the new flags for this feature
- `TransportContext`, to optionally add an SSL based handler if configured, with similar changes in the transport client and server factories
- `ManagedBuffer`, to convert objects to a Netty object for SSL encoding (since we can't use zero-copy transfers, we can't use `convertToNetty()` directly)
- `EncryptedMessageWithHeader` and `SSLMessageEncoder`, which are quite similar to the existing variants but just different enough that it was hard to consolidate
- `SSLFactory`, to create the JDK / Netty SSL handlers as appropriate; this handles configuration of the protocols, ciphers, etc.
- `ReloadingX509TrustManager`, to support trust store reloading
- `SecurityManager`, to disable the existing authentication/encryption mechanisms if this new feature is enabled
- `SparkConf` and `CommandUtils`, to pass passwords via env variables if needed, to preserve security guarantees (similar to the existing SSL password propagation)
- `SparkTransportConf`, to propagate SSL options
- An `SSLSampleConfigs` class, for sample configurations used in tests
- Test suites (subclasses of existing suites, e.g. for `TransportContext`) which rerun the same tests but with this new feature enabled (this caught a bunch of bugs); this will ensure features are compatible with SSL going forward

Why are the changes needed?
Spark currently does not support TLS/SSL for RPC and block transfers. It is helpful to have this as an alternative encryption method for users who may need to use more standard encryption mechanisms instead of one that is internal to Spark.
Does this PR introduce any user-facing change?
Yes, this includes new configuration options and updates associated documentation. Besides that, though, all other aspects of Spark should remain unchanged (modulo performance differences).
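To make the user-facing surface concrete, enabling the feature might look roughly like the following (a sketch only; the key names follow the `spark.ssl.rpc.*` namespace discussed in the review above, but the merged `docs/security.md` is the authority on the actual configuration keys):

```java
import org.apache.spark.SparkConf;

public final class EnableRpcSslExample {
  public static void main(String[] args) {
    // Illustrative settings only; consult docs/security.md for the
    // authoritative key names and semantics.
    SparkConf conf = new SparkConf()
        .set("spark.ssl.rpc.enabled", "true")
        .set("spark.ssl.rpc.keyStore", "/path/to/keystore.jks")
        .set("spark.ssl.rpc.keyStorePassword", "changeit")
        .set("spark.ssl.rpc.trustStore", "/path/to/truststore.jks")
        .set("spark.ssl.rpc.trustStorePassword", "changeit");
    System.out.println(conf.toDebugString());
  }
}
```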
How was this patch tested?
Added a bunch of unit tests that pass.
I also ran some queries locally to ensure they still work.
I verified the traffic was encrypted with TLS via two mechanisms:
Was this patch authored or co-authored using generative AI tooling?
No