Support for accessing secured HDFS in Standalone Mode #2320
Can one of the admins verify this patch?
Yes, as @pwendell mentions in PR #265, if you are going to store anything in a file, you have to make sure it's in a secure location, and if it's transferred, people might want the option for that connection to be secure. Also, we don't want this running in YARN mode, so we need a way to enable it only for other modes. The JIRA doesn't have many details. Can you explain the end goal? Are you wanting to support multiple users, each using different credentials? Can you give more details on the entire setup? If something like this does go in, we definitely need to document how people configure it. Who do you have the Spark daemons running as, and how do they have permission to read the user's keytab? That in itself could be a huge security hole, if one user can use the keytab of another.
@pwendell @tgravescs Thanks for your advice. Yes, I want to support multiple users, each using different credentials; in our application scenario, users cannot access each other's keytabs. In my opinion, the Spark administrator is responsible for producing the Kerberos keytab file and placing it on the Spark client, for example setting "/home/tim/.tim_spark.keytab" with permission "400" and configuring the parameter "_HOST/._USER_spark.keytab" in spark-defaults.conf, so it is transparent to the Spark user. I don't think we need to worry here about one user using the keytab of another; that is a separate concern. Reading the keytab happens while the driver creates the SparkContext instance, so it depends on who starts the driver, not on who starts the Spark daemons. As for documentation and disabling the feature in YARN mode, I will do that once the main problem is solved. The most important question is how to distribute the token file securely, and there are two feasible methods:
1. Extend the HTTP file server interface to let users share files with a secret (set in a conf file on all of the worker nodes, or in some other way) that authenticates the HTTP request, for example "def addFileSecurity(path: String)". Can we share a random number as an application-level secret with all slaves by setting it in the Spark conf when starting a driver? Then every slave shares this secret, and the HTTP file server's getFile request must be validated against it ("http://spark:${secret}@someip:20813/files/spark.token"). Users will need such common functionality in the long run.
2. Encrypt the token file before distributing it, and have the executor decrypt it before use.
I don't know which is more suitable. Or is there a better design?
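Option 1 can be sketched roughly as follows. This is a hypothetical illustration of a secret-checked file server, not Spark's actual HttpFileServer; the header name, secret handling, and token payload are all invented for the example:

```python
# Hypothetical sketch: an HTTP file server that only serves files when the
# request carries the application-level shared secret. Not Spark's real API.
import hmac
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen
from urllib.error import HTTPError

APP_SECRET = "s3cr3t"  # would come from the Spark conf when the driver starts

class SecretCheckingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect the secret in a header; compare in constant time.
        sent = self.headers.get("X-App-Secret", "")
        if hmac.compare_digest(sent, APP_SECRET):
            body = b"fake-delegation-token-bytes"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(401)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), SecretCheckingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def fetch(secret):
    req = Request(f"http://127.0.0.1:{port}/files/spark.token",
                  headers={"X-App-Secret": secret})
    return urlopen(req).read()

token = fetch(APP_SECRET)   # fetch succeeds with the right secret
try:
    fetch("wrong-secret")
    denied = None
except HTTPError as e:
    denied = e.code          # request without the secret is rejected (401)
server.shutdown()
```

Note that this only authenticates the fetch; the token bytes themselves still travel unencrypted, which is the concern raised below.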
Well, standalone supports cluster mode. In cluster mode, the Worker daemons start the driver. Since the Worker daemons currently have no way to execute child processes as a different user, if you send the keytab to the driver, then any other process spawned by that worker can read it. Using secrets in Spark today is an iffy solution because, as far as I can see, Spark does not support TLS for communication, so your secret would be sent in the clear over the wire, unless you're willing to implement a Diffie-Hellman key exchange for the parties involved so that you can negotiate a private key over an insecure channel. I think that unless you're willing to fix one of the two problems (add TLS to Spark's communication layer, or allow Workers to run children as different users), there isn't a secure solution to this problem. (BTW, YARN has both of those features.)
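For reference, the Diffie-Hellman idea mentioned above looks like this in miniature. This is a toy sketch, not a real implementation; plain DH also still needs endpoint authentication to stop man-in-the-middle attacks:

```python
# Toy finite-field Diffie-Hellman: two parties derive a common key over an
# insecure channel without ever transmitting the key itself. The tiny prime
# is for illustration only; real systems use vetted groups (e.g. RFC 3526
# MODP groups) or elliptic curves such as X25519.
import secrets

p = 2**61 - 1   # a small Mersenne prime; far too small for real security
g = 2           # public generator

# Each side keeps a private exponent and publishes only g^x mod p.
a = secrets.randbelow(p - 2) + 1
b = secrets.randbelow(p - 2) + 1
A = pow(g, a, p)   # driver -> worker, visible to eavesdroppers
B = pow(g, b, p)   # worker -> driver, visible to eavesdroppers

# Both ends compute the same shared key; an eavesdropper who saw only A and
# B would have to solve a discrete logarithm to recover it.
key_driver = pow(B, a, p)
key_worker = pow(A, b, p)
```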
QA tests have started for PR 2320 at commit
QA tests have finished for PR 2320 at commit
I just learned that standalone supports cluster mode. I guess the keytab should not be transferred over the network in any mode. In cluster mode, the token should be generated on the Spark client in advance, and then the token and driver jars transferred to the Worker that will start the driver. How does Spark transfer driver jars to Workers securely in standalone mode? Reusing the existing security mechanisms to share tokens is a feasible approach. Adding TLS to Spark's communication layer is a larger effort. For now, I want to find a simple way to support accessing secured HDFS in standalone mode. Supporting cluster mode within Spark standalone mode will be difficult.
@huozhanfeng I don't think there's any way to transfer files securely to workers right now. Perhaps a mode where the launcher / driver uses HDFS to distribute files instead of how it's done today, but that's a different change. I'm really uncomfortable with this change because of the limitations of standalone mode today. If standalone supported running executors as different users, this would be a lot more palatable. But if this is blocking you in any way, it seems like it would be possible to add this feature to your application without having to modify Spark, if the security concerns are not an issue. That way you can have something working for you, and we have more time to come up with a proper solution for Spark.
@vanzin I agree with you; let's think about it as follow-up work. Thanks.
@tgravescs if a user is manually setting the shared secret on the driver and worker nodes, am I correct in understanding that addFile will be properly authenticated, at least for the transfer? This is my understanding based on the original design, but please correct me if that's wrong. I think the user requirement is the following:

AFAIK there are a fairly wide number of use cases like this. I think this is achievable if the user just manually sets the shared secret. I'll update the requirements on the JIRA to match those proposed in this comment.
Yes, that is how it works in standalone mode, and it will continue to. The master, workers, and all the applications/clients/drivers need to have the same shared secret, and authentication is performed before an added file can be fetched. I think this is fine to support as long as we make it very clear exactly what is and is not supported.
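For context, the standalone shared-secret setup described here maps to Spark's `spark.authenticate` settings in `spark-defaults.conf`. The secret value below is a placeholder and must be identical on the master, workers, and all drivers:

```properties
spark.authenticate         true
spark.authenticate.secret  <same-random-secret-on-master-workers-and-drivers>
```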
Okay, then let's aim to provide this level of security with standalone mode, with clear documentation about what it provides.
But that's just authentication, right? Or is it actually encrypting the bytes being transferred after authentication happens? (My main reluctance here is that to get the keytab or the delegation tokens to the right places, it either needs to be done over an unencrypted channel or via the filesystem, which is sub-optimal in standalone mode since everything runs as the same user.)
I was responding to your comment about needing Diffie-Hellman. We already use an authentication approach based on DH. Encryption is an orthogonal issue. In general, though, Spark only provides authentication, not encryption; we assume that someone can't sniff the network at will. That's the model we have right now in the rest of Spark, for YARN etc. We do protect against someone connecting to well-known ports and trying to, e.g., ask for credential data. That's what we are trying to protect against here.
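A minimal sketch of that model, shared-secret challenge-response authentication without any payload encryption (names and the header-free framing are illustrative, not Spark's actual SASL implementation):

```python
# Shared-secret challenge-response: the client proves it knows the secret
# via an HMAC over a server-issued nonce, without sending the secret.
import hashlib
import hmac
import os

SECRET = b"cluster-shared-secret"  # known to master, workers, and drivers

def respond(challenge: bytes, secret: bytes) -> str:
    """Client side: answer the challenge using the shared secret."""
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

# Server side: issue a random challenge and verify the client's response.
challenge = os.urandom(16)
good = respond(challenge, SECRET)
bad = respond(challenge, b"wrong-guess")

expected = hmac.new(SECRET, challenge, hashlib.sha256).hexdigest()
accepted = hmac.compare_digest(good, expected)   # legitimate client passes
rejected = hmac.compare_digest(bad, expected)    # impostor fails

# Note: nothing here encrypts the payload. After authentication succeeds,
# the transferred bytes still cross the wire in the clear, which is exactly
# the authentication-vs-encryption distinction discussed above.
```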
This PR has gone stale. Do we want to update it or close it out? Also, @huozhanfeng, could you update the title of the PR to "[SPARK-3438] Add support for accessing secured HDFS"? This matches our current convention.
Let's close this issue. There is an alternative PR that is currently in progress.
@pwendell: The alternative PR (#4106) is also being closed. I think you and @tgravescs have narrowed down the use cases for allowing standalone Spark to access secured HDFS. This is useful for those launching mini Spark clusters deployed in Docker-based containers (via Kubernetes, for example) that have an ephemeral lifetime and are shared with maybe one or two users rather than run in a multi-tenant model.
@hsaputra if you don't care about multiple users, just create a principal / keytab pair for each of your workers, and set up a cron job to "kinit" that principal periodically. Then all processes running as the kinit'ed user (you should run the Worker processes as that user) will be able to talk to a secured HDFS. There's no need to modify Spark for that.
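For example, the periodic kinit could be a simple cron entry on each worker host. The paths, principal, realm, and schedule below are illustrative; your KDC's ticket lifetime determines how often to renew:

```shell
# /etc/cron.d/spark-kinit (hypothetical): refresh the Spark user's Kerberos
# ticket every 8 hours from a worker-local keytab readable only by that user.
0 */8 * * * spark /usr/bin/kinit -kt /etc/security/keytabs/spark.keytab spark/$(hostname -f)@EXAMPLE.COM
```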
@vanzin I'm not sure that actually works anymore. It used to, but I think it was broken. See https://issues.apache.org/jira/browse/SPARK-2541
@vanzin: as @tgravescs mentioned, it no longer works, hence the need to fix it. All the PRs and JIRAs for this issue have been closed to point to #4106, which was also closed due to comparisons with YARN mode. Standalone mode serves a different purpose than running in YARN: most deployments are small clusters in a semi-private mode, where we can assume a safe way to add keytabs for Spark users. @tgravescs and @pwendell have made a good case in this PR for what access to secured HDFS is needed and when, so I would suggest we revisit this.
The fix for the bug Tom mentions is separate from distributing keytabs or delegation tokens in an insecure deployment, which is what I'm strongly against.
Sorry, I disagree. If you have Kerberos, it means you care about security, and here you'd be adding a mechanism for Spark to completely break that. So: I'm ok with a fix that allows the workaround I mentioned above, which seems to have been broken by SPARK-2541. I'm against any fix that means exposing users' keytabs or delegation tokens to other users, regardless of whether you think it's ok in your own environment.
Yes, I think @tgravescs's proposal above includes moving the keytabs and tokens to other users. @tgravescs: should we reopen SPARK-2541 to fix this?
That's not what I said at all. What I said is that because Spark standalone is insecure, and it runs everything as the same OS user, it should be up to the admin deploying the cluster to kinit a single user that becomes the "Spark user" for the cluster; everybody submitting jobs to the Spark standalone master is then effectively running Spark apps as that user. That way, Spark does not create a security hole in the cluster, because:
Ah, I need to clarify: when I said "there is a way or process to do kinit for each user", I meant just that. The process of doing kinit or setting up the keytab is NOT done via Spark.
Ok, so it seems like we agree on the overall approach. Regarding SPARK-2541, it used to be that just overriding the ... That way, if you set ...
It seems that now, when the Executor tries to access HDFS, it gets this error:
Well, is the user logged in (i.e., was kinit run) on all the worker nodes, so that processes have proper Kerberos tickets to use?
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-3438
Reading data from secure HDFS into Spark is a useful feature.