
Conversation

@huozhanfeng

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-3438

Reading data from secure HDFS into Spark is a useful feature.

@SparkQA

SparkQA commented Sep 8, 2014

Can one of the admins verify this patch?

@tgravescs
Contributor

Yes, as @pwendell mentions in PR #265, if you are going to store anything in a file, you have to make sure it's in a secure location, and if it's transferred, people might want the option for that connection to be secure.

Also, we don't want this running in YARN mode, so we need a way to enable it only for the other modes.

The JIRA doesn't have many details. Can you explain the end goal? Do you want to support multiple users, each using different credentials? Can you give more details on the entire setup?

If something like this does go in, we definitely need to document how people configure it.

Who are the Spark daemons running as, and how do they get permission to read the users' keytabs? That in itself could be a huge security hole, if one user can use the keytab of another.

@huozhanfeng
Author

@pwendell @tgravescs Thanks for your advice.

Yes, I want to support multiple users, each using different credentials; in our scenario, users cannot access each other's keytabs. In my opinion, the Spark administrator is responsible for producing the Kerberos keytab file and placing it on the Spark client, for example setting "/home/tim/.tim_spark.keytab" with permission 400 and configuring the param "_HOST/._USER_spark.keytab" in spark-defaults.conf, so that it is transparent to the Spark user. I don't think we need to worry about one user using the keytab of another; that is a separate concern.
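
For illustration, the provisioning step could look something like this; the chmod command is standard, but the two configuration key names below are hypothetical placeholders, not necessarily the option names introduced by this patch:

```
# On the Spark client, as the administrator:
#   chmod 400 /home/tim/.tim_spark.keytab
# Then in spark-defaults.conf (key names are hypothetical):
spark.kerberos.keytab      /home/tim/.tim_spark.keytab
spark.kerberos.principal   tim/_HOST@EXAMPLE.COM
```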

The keytab is read while the driver creates the SparkContext instance, so it has nothing to do with who starts the Spark daemons, only with who starts the driver.

As for documentation and keeping the feature from running in YARN mode, I will do that once the main problem is solved.

The most important question is how to distribute the token file securely, and there are two feasible methods.

1. Extend the HTTP file server interface so that a user can share files with a secret (set in a conf file on all of the worker nodes, or some other way) that authenticates the HTTP request, for example def addFileSecurity(path: String).

Could we share a random number as an application-level secret with all slaves by setting it in the Spark conf when starting the driver? Every slave would then share this secret, and the HTTP file server's getFile requests would be validated against it ("http://spark:${secret}@someip:20813/files/spark.token").

Users will need such common functionality in the long run.

2. Encrypt the token file before distributing it, and have executors decrypt it before use (see the sketch below).

I don't know which is more suitable. Or is there a better design?
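
As a rough sketch of option 2 (not part of this patch), the encrypt/decrypt halves could use AES-GCM from the JDK (Java 8+); how the driver and executors obtain sharedKey is exactly the distribution problem described above, so only the two cipher operations are shown:

```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

object TokenFileCrypto {
  // sharedKey must be a 16/24/32-byte AES key already known to driver and executors.
  def encrypt(tokenBytes: Array[Byte], sharedKey: Array[Byte]): Array[Byte] = {
    val iv = new Array[Byte](12)
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(sharedKey, "AES"),
      new GCMParameterSpec(128, iv))
    // Prepend the IV so the executor can initialize its cipher the same way.
    iv ++ cipher.doFinal(tokenBytes)
  }

  def decrypt(blob: Array[Byte], sharedKey: Array[Byte]): Array[Byte] = {
    val (iv, ciphertext) = blob.splitAt(12)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(sharedKey, "AES"),
      new GCMParameterSpec(128, iv))
    cipher.doFinal(ciphertext)
  }
}
```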

@vanzin
Contributor

vanzin commented Sep 10, 2014

So it has nothing to do with who starts the Spark daemons, only with who starts the driver.

Well, standalone supports cluster mode. In cluster mode, the Worker daemons start the driver. Since the Worker daemons currently have no way to execute child processes as a different user, if you send the keytab to the driver, then any other process spawned by that worker can read it.

Using secrets in Spark today is kind of an iffy solution because, as far as I can see, Spark does not support TLS for communication. So your secret would be sent in the clear over the wire, unless you're willing to implement a Diffie-Hellman key exchange for the parties involved, so that you can negotiate a private key over an insecure channel.

I think that unless you're willing to fix one of the two problems (add TLS to Spark's communication layer, or allow Workers to run children as a different user), there isn't a secure solution to this problem. (BTW, YARN has both of those features.)

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2320 at commit 915bc52.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2320 at commit 915bc52.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huozhanfeng
Author

I just learned that standalone supports cluster mode. I believe the keytab should not be transferred over the network in any mode. In cluster mode, the token should be generated on the Spark client in advance, and then the token and driver jars transferred to the Worker that will start the driver.

How does Spark transfer driver jars to Workers securely in standalone mode? Reusing the existing security mechanisms to share tokens would be a feasible approach.

Adding TLS to Spark's communication layer is a much larger undertaking. For now, I want to find a simple way to support accessing secured HDFS in standalone mode.

Supporting cluster mode within Spark standalone mode will be difficult.

@vanzin
Contributor

vanzin commented Sep 11, 2014

@huozhanfeng I don't think there's any way to transfer files securely to workers right now. Perhaps a mode where the launcher / driver uses HDFS to distribute files instead of how it's done today, but that's a different change.

I'm really uncomfortable with this change because of the limitations of standalone mode today. If standalone supported running executors as different users, this would be a lot more palatable.

But if this is blocking you in any way, it seems like it would be possible to add this feature to your application without having to modify Spark, if the security concerns are not an issue. That way you can have something working for you and we have more time to come up with a proper solution for Spark.

@huozhanfeng
Author

@vanzin I agree with you; this is something to think about as follow-up work. Thanks.

@pwendell
Contributor

@vanzin there is currently a path where the addFile HTTP server is authenticated via a shared secret, and under the hood this uses Diffie-Hellman. This is used in YARN mode.

@tgravescs if a user is manually setting the shared secret on the driver and worker nodes, am I correct in understanding that addFile will be properly authenticated, at least for the transfer? This is my understanding based on the original design, but please correct me if that's wrong.

I think the user requirement is the following:

  1. A company is running a standalone cluster.
  2. They are fine if all Spark jobs in the cluster share a global secret, i.e. all Spark jobs can trust one another.
  3. They are able to provide a Hadoop login on the driver node via a keytab or kinit. They want tokens from this login to be distributed to the executors.
  4. They also don't want to trust the network on the cluster, i.e., they don't want to allow someone to fetch HDFS tokens easily over a well-known protocol, without authentication.

AFAIK there is a fairly wide range of use cases like this.

I think this is achievable if the user just manually sets spark.authenticate.secret on the driver and workers, and then we use sc.addFile to disseminate tokens. @tgravescs - does that seem correct? We also need to test whether this works well... for instance, I think at this point we'll actually ship the value of spark.authenticate.secret across the wire if it's set. But architecturally, I do think this would work.
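
A rough sketch of that flow, assuming spark.authenticate and spark.authenticate.secret are already set identically on the master, workers, and driver (e.g. in spark-defaults.conf); the principal, paths, and file name here are placeholders, and Hadoop's Credentials API carries the delegation tokens:

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object TokenShippingSketch {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()

    // Driver side: log in from a keytab (principal and paths are hypothetical)
    // and collect HDFS delegation tokens into a Credentials object.
    UserGroupInformation.loginUserFromKeytab(
      "tim@EXAMPLE.COM", "/home/tim/.tim_spark.keytab")
    val creds = new Credentials()
    FileSystem.get(hadoopConf).addDelegationTokens("tim", creds)
    creds.writeTokenStorageFile(new Path("file:///tmp/spark.token"), hadoopConf)

    // The authenticated file server ships the token file to executors.
    val sc = new SparkContext(new SparkConf())
    sc.addFile("file:///tmp/spark.token")

    // Executor side (e.g. at the start of a task): load the shipped tokens
    // into the current UGI so HDFS calls can authenticate with them.
    sc.parallelize(1 to 1, 1).foreach { _ =>
      val fetched = Credentials.readTokenStorageFile(
        new File(SparkFiles.get("spark.token")), new Configuration())
      UserGroupInformation.getCurrentUser.addCredentials(fetched)
    }
  }
}
```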

I'll update the requirements on the JIRA to match those proposed in this comment.

@tgravescs
Contributor

Yes, that is how it works in standalone mode, and it will work here. The master, workers, and all the applications/clients/drivers need to have the same shared secret. Authentication happens before anyone can fetch a file that was added.

I think this is fine to support as long as we make it very clear exactly what is supported and what is not.

@pwendell
Contributor

Okay - then let's aim to provide this level of security with standalone mode, with some clear documentation about what it provides.

@vanzin
Contributor

vanzin commented Sep 15, 2014

... a path where the addFile HTTP server is authenticated ...

But that's just authentication, right? Or is it actually encrypting the bytes being transferred after the authentication happens?

(So, my main reluctance here is that getting the keytab or the delegation tokens to the right places either needs to happen over an unencrypted channel or via the filesystem, which is sub-optimal in standalone mode since everything runs as the same user.)

@pwendell
Contributor

I was responding to your comment about needing Diffie-Hellman. We already use an authentication approach based on D-H.

Encryption is an orthogonal issue. In general, though, Spark only provides authentication, not encryption. We assume that someone can't sniff the network at will. That's the model we have right now in the rest of Spark, for YARN etc.

We do protect against someone connecting to well-known ports and, e.g., asking for credential data. That's what we are trying to protect against here.

@nchammas
Contributor

This PR has gone stale. Do we want to update it or close it out?

Also, @huozhanfeng, could you update the title of the PR to "[SPARK-3438] Add support for accessing secured HDFS"? This matches our current convention.

@pwendell
Contributor

Let's close this issue. There is an alternative PR that is currently ongoing.

@asfgit asfgit closed this in 24f358b Feb 17, 2015
@huozhanfeng huozhanfeng deleted the my_change branch July 27, 2015 06:34
@huozhanfeng huozhanfeng changed the title Added support for accessing secured HDFS Add support for accessing secured HDFS Jul 27, 2015
@hsaputra
Contributor

@pwendell : The alternative PR (#4106) has also been closed. I think you and @tgravescs have narrowed down the use cases for allowing standalone Spark to access secured HDFS.

This is useful for those launching mini Spark clusters deployed in Docker-based containers, via Kubernetes for example, that have ephemeral lifetimes and are shared with only one or two users rather than run in a multi-tenant model.

@vanzin
Contributor

vanzin commented Feb 17, 2016

@hsaputra if you don't care about multiple users, just create a principal / keytab pair for each of your workers, and set up a cron job to "kinit" that principal periodically. Then all processes running as the kinit'ed user (you should run the Worker processes as that user) will be able to talk to a secured HDFS.
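
For illustration, a per-worker crontab entry could look like this (principal name and keytab path are placeholders):

```
# Refresh the Kerberos ticket cache of the Spark worker user every 8 hours.
0 */8 * * * kinit -kt /etc/security/keytabs/spark.keytab spark/worker01@EXAMPLE.COM
```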

There's no need to modify Spark for that.

@tgravescs
Contributor

@vanzin I'm not sure that actually works anymore. It used to, but I think it got broken. See https://issues.apache.org/jira/browse/SPARK-2541

@hsaputra
Contributor

@vanzin : as @tgravescs mentioned, it does not work anymore, hence the need to fix it.

All the PRs and JIRAs for this issue have been closed in favor of #4106, which has also been closed, due to comparisons with YARN mode.

Standalone mode serves a different purpose than running on YARN. Most deployments would be small clusters in semi-private mode, where we can assume a safe way to add keytabs for Spark users.
So we need to consider those use cases rather than mixing them up with YARN mode.

@tgravescs and @pwendell have laid out a good use case and description in this PR of what access to secured HDFS is needed and when, so I would suggest we revisit this.

@vanzin
Contributor

vanzin commented Feb 17, 2016

@vanzin : as @tgravescs mentioned, it does not work anymore, hence the need to fix it.

The fix for the bug Tom mentions is separate from distributing keytabs or delegation tokens in an insecure deployment, which is what I'm strongly against.

Standalone mode serves a different purpose than running on YARN. Most deployments would be small clusters in semi-private mode, where we can assume a safe way to add keytabs for Spark users.

Sorry, I disagree. If you have Kerberos, it means you care about security, and here you'd be adding a mechanism for Spark to completely break that.

So: I'm ok with a fix that allows the workaround I mentioned above, which seems to have been broken by SPARK-2541. I'm against any fix that means exposing users' keytabs or delegation tokens to other users, regardless of whether you think that's ok in your own environment.

@hsaputra
Contributor

Yes, I think @tgravescs's proposal above includes moving the keytabs and tokens to other users.
What I was agreeing to was adding a fix for what SPARK-2541 broke, which means, as you mentioned, there is a way or process to do "kinit" for each user that needs access to the standalone Spark cluster in order to access secure HDFS.

@tgravescs : should we reopen SPARK-2541 to fix this?

@vanzin
Contributor

vanzin commented Feb 17, 2016

which means, as you mentioned, there is a way or process to do "kinit" for each user

That's not what I said at all. What I said is that because spark standalone is insecure, and it runs everything as the same OS user, it should be up to the admin deploying the cluster to kinit a single user that will become the "Spark User" for the cluster, and everybody submitting jobs to the Spark standalone master will effectively be running Spark apps as that user.

That way, Spark does not create a security hole in the cluster, because:

  • admins have to opt in to this insecure Spark setup
  • they can easily revoke privileges by not allowing that Spark user to do anything else in the cluster
  • users don't risk having their keytabs or delegation tokens being read by other users

@hsaputra
Contributor

Ah, I need to clarify: when I said "there is a way or process to do "kinit" for each user," I meant just that. The process of doing kinit or setting up the keytab is NOT done via Spark.
So, when an admin deploys the standalone Spark cluster, they should do the kinit for a user that will effectively be the active user running the jobs and Spark apps that access secure HDFS as that principal.

@vanzin
Contributor

vanzin commented Feb 17, 2016

Ok, so it seems like we agree on the overall approach. Regarding SPARK-2541, it used to be that just overriding the SPARK_USER env variable would allow you to "impersonate" anyone. Doesn't that work anymore?

That way, if you set SPARK_USER to the name of the principal running the Spark daemons, things should work without any modifications to the current code.
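
For example (the value "spark" below is a placeholder for whatever principal the daemons were kinit'ed as):

```
# Set before launching the driver / submitting the app.
export SPARK_USER=spark
```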

@hsaputra
Contributor

It seems that now, when the Executor tries to access HDFS, it gets this error:

Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS];

@vanzin
Contributor

vanzin commented Feb 18, 2016

Well, is the user logged in (i.e., kinit was run) on all the worker nodes, so that processes have proper Kerberos tickets to use?

@hsaputra
Contributor

@vanzin : yes =(
At least now we have agreement on the approach for standalone mode; I will move the discussion to the mailing list. It was not my intention to hijack this PR's comment section, mea culpa.

Thanks to @vanzin for entertaining my concern and comments.

@huozhanfeng huozhanfeng changed the title Add support for accessing secured HDFS Support for accessing secured HDFS in Standalone Mode Sep 5, 2017