-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[Spark-5682] Add spark encrypted shuffle by using chimera lib #5307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
core/pom.xml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not generally needed.
|
Left some initial comments. I like the idea of using a library for this, my main concern is about the stability of that library (and who'll be maintaining it going forward). Aside from that, there are some parts of the code that need cleaning up and some style updates, and some broken error handling, but nothing too major. |
|
@vanzin :
i have updated the code according to your valuable suggestions. Chimera is a project which is maintained by Intel team(My teammate dianfu is the contributor of it). Chimera is a project which strips code related to the encryption and decryption part of Transparent Encryption in HDFS from hadoop. Transparent Encryption is imported since hadoop 2.6 and the code is stable after hadoop 2.6 release. Although the current version of Chimera is 0.0.1, it is same stable as current hadoop because we just strip code from current trunk of hadoop. If someone reports new bugs to hadoop about the encryption and decryption part later, my teammate will watch it and patch the patches of those bugs to Chimera. More information about encrypted shuffle you can reference https://issues.apache.org/jira/browse/SPARK-5682. Any suggestions/advices/guidance are welcome. |
4021dc6 to
f8a0083
Compare
9121bb2 to
e7908aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you cannot get the user credential, you can write a default number to configuration file every time when a certain user started the jobs, and retrieve this number in such case.
however, i think it maybe unsafe to store SPARK_SHUFFLE_TOKEN and related information in plain text in configuration file, it is better to cypher the texts then store the values.
|
@vanzin, could you help to clarify whether Spark currently includes encryption of shuffle data in transit? I believe that your SASL patch added this encryption? If so, I think that we should close this PR and the following JIRAs:
I don't see any mention of SASL on https://spark.apache.org/docs/1.5.0/security.html, so if we do indeed have encryption then we should update the docs to say so. Right now, the docs claim that "SSL is not supported yet for WebUI and block transfer service", making it sound like we have no encryption support for data in-transit. |
|
Spark supports encrypted communication for shuffle data; we should fix the docs (I'll file a bug for that). But this PR is not about on-the-wire encryption, it's data at rest encryption (i.e. the shuffle files on disk). |
|
Ah, gotcha. We should update this PR / JIRA title to better reflect this. |
|
And to be clear, I do think we can close https://issues.apache.org/jira/browse/SPARK-6373, no? |
|
re: SPARK-6373, it would just add another way of encrypting shuffle traffic; personally I don't see much value in it, but maybe others disagree. |
|
Hi @JoshRosen and @vanzin, do you have any further comments about this PR? The library Chimera is maintained by Intel. One way to address your comments is to move the library into Spark project. I'd like to move this jira forwards. So I am seeking for your suggestions and thoughts. Thank you! |
|
@vanzin, given that you've been spending a lot of time on encryption-related things in Spark, I'll leave it to you to make a judgement call here. |
|
@winningsix at the very least this PR needs to be updated to resolve conflicts. I need to take a look at the library being pulled; my main concern is that being a new library, it may not be super stable, and having it outside the project would make it more difficult to quickly address bugs (because you have to go thorough another project's release cycle before you can fix Spark). If it's not a big library, it might be worth it to pull it into Spark for the first release of this code, and later pull it out so that other Hadoop projects can also use it. Unless other projects are already using it, in which case my concerns are probably outdated. |
|
Thanks @vanzin for your prompt reply. In the current patch it consists of two parts. One is the encryption framework using JCE encryption and another is performance acceleration using Openssl library. Maybe we can get the first part in since it doesn't require the Chimera library. Then we could move the second part forwards either using Chimera or pull them into Spark. Any thoughts? Thank you! |
|
@winningsix that would be awesome. |
|
@vanzin I create a new PR(#8880) addressing the common part and JCE key provider support. Could you help me review it? Ticket is also filed in SPARK-10771. Thank you! |
|
Can one of the admins verify this patch? |
Chimera is a project which strips code related to CryptoInputStream/CryptoOutputStream from Hadoop to facilitate AES-NI based data encryption in other projects. It provides JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec. JceAesCtrypoCodec uses encrypted algorithms jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl provides. We can directly use Chimera to implement spark encrypted shuffle.