Support for accessing secured HDFS in Standalone Mode #2320
Can one of the admins verify this patch?
Yes, as @pwendell mentions in PR #265, if you are going to store anything in a file, you have to make sure it's in a secure location, and if it's transferred, people might want the option for that connection to be secure. Also, we don't want this running in YARN mode, so we need a way to enable it only for other modes. The JIRA doesn't have many details. Can you explain the end goal? Are you wanting to support multiple users, each using different credentials? Can you give more details on the entire setup? If something like this does go in, we definitely need to document how people configure it. Who do you have the Spark daemons running as, and how do they have permission to read the user's keytab? That in itself could be a huge security hole, if one user can use the keytab of another.
@pwendell @tgravescs Thanks for your advice. Yes, I want to support multiple users, each using different credentials; in our application scenario, users cannot access each other's keytabs. In my opinion, the Spark administrator is responsible for producing the Kerberos keytab file and placing it on the Spark client, for example setting "/home/tim/.tim_spark.keytab" with permission "400" and configuring the parameter "_HOST/._USER_spark.keytab" in spark-defaults.conf, so it is transparent to the Spark user. I don't think we need to worry here about one user using the keytab of another; that is a separate concern. Reading the keytab happens while the driver creates the SparkContext instance, so it depends on who starts the driver, not on who starts the Spark daemons. As for documentation and disabling the feature in YARN mode, I will do that once the main problem is solved. The most important question is how to distribute the token file securely, and there are two feasible methods:
1. Extend the HTTP file server interface to let users share files with a secret (set in a conf file on all of the worker nodes, or in some other way) that authenticates the HTTP request, for example "def addFileSecurity(path: String)". Can we share a random number as an application-level secret with all slaves by setting it in the Spark conf when starting a driver? Then every slave shares this secret, and the HTTP file server's getFile request must be validated against it ("http://spark:${secret}@someip:20813/files/spark.token"). Users will need such common functionality in the long run.
2. Encrypt the token file before distributing it, and have the executor decrypt it before use.
I don't know which is more suitable. Or is there a better design?
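Option 1 can be sketched roughly as follows. This is a hypothetical illustration of a secret-checked file server, not Spark's actual HttpFileServer; the header name, secret handling, and token payload are all invented for the example:

```python
# Hypothetical sketch: an HTTP file server that only serves files when the
# request carries the application-level shared secret. Not Spark's real API.
import hmac
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen
from urllib.error import HTTPError

APP_SECRET = "s3cr3t"  # would come from the Spark conf when the driver starts

class SecretCheckingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect the secret in a header; compare in constant time.
        sent = self.headers.get("X-App-Secret", "")
        if hmac.compare_digest(sent, APP_SECRET):
            body = b"fake-delegation-token-bytes"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(401)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), SecretCheckingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def fetch(secret):
    req = Request(f"http://127.0.0.1:{port}/files/spark.token",
                  headers={"X-App-Secret": secret})
    return urlopen(req).read()

token = fetch(APP_SECRET)   # fetch succeeds with the right secret
try:
    fetch("wrong-secret")
    denied = None
except HTTPError as e:
    denied = e.code          # request without the secret is rejected (401)
server.shutdown()
```

Note that this only authenticates the fetch; the token bytes themselves still travel unencrypted, which is the concern raised below.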
Well, standalone supports cluster mode. In cluster mode, the Worker daemons start the driver. Since the Worker daemons currently have no way to execute child processes as a different user, if you send the keytab to the driver, then any other process spawned by that worker can read it. Using secrets in Spark today is an iffy solution because, as far as I can see, Spark does not support TLS for communication, so your secret would be sent in the clear over the wire, unless you're willing to implement a Diffie-Hellman key exchange for the parties involved so that you can negotiate a private key over an insecure channel. I think that unless you're willing to fix one of the two problems (add TLS to Spark's communication layer, or allow Workers to run children as different users), there isn't a secure solution to this problem. (BTW, YARN has both of those features.)
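For reference, the Diffie-Hellman idea mentioned above looks like this in miniature. This is a toy sketch, not a real implementation; plain DH also still needs endpoint authentication to stop man-in-the-middle attacks:

```python
# Toy finite-field Diffie-Hellman: two parties derive a common key over an
# insecure channel without ever transmitting the key itself. The tiny prime
# is for illustration only; real systems use vetted groups (e.g. RFC 3526
# MODP groups) or elliptic curves such as X25519.
import secrets

p = 2**61 - 1   # a small Mersenne prime; far too small for real security
g = 2           # public generator

# Each side keeps a private exponent and publishes only g^x mod p.
a = secrets.randbelow(p - 2) + 1
b = secrets.randbelow(p - 2) + 1
A = pow(g, a, p)   # driver -> worker, visible to eavesdroppers
B = pow(g, b, p)   # worker -> driver, visible to eavesdroppers

# Both ends compute the same shared key; an eavesdropper who saw only A and
# B would have to solve a discrete logarithm to recover it.
key_driver = pow(B, a, p)
key_worker = pow(A, b, p)
```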
QA tests have started for PR 2320 at commit
QA tests have finished for PR 2320 at commit
I just learned that standalone supports cluster mode. I guess the keytab should not be transferred over the network in any mode. In cluster mode, the token should be generated on the Spark client in advance, and then the token and driver jars transferred to the Worker that will start the driver. How does Spark transfer driver jars to Workers securely in standalone mode? Reusing the existing security mechanisms to share tokens is a feasible approach. Adding TLS to Spark's communication layer is a larger effort. For now, I want to find a simple way to support accessing secured HDFS in standalone mode. Supporting cluster mode within Spark standalone mode will be difficult.
@huozhanfeng I don't think there's any way to transfer files securely to workers right now. Perhaps a mode where the launcher / driver uses HDFS to distribute files instead of how it's done today, but that's a different change. I'm really uncomfortable with this change because of the limitations of standalone mode today. If standalone supported running executors as different users, this would be a lot more palatable. But if this is blocking you in any way, it seems like it would be possible to add this feature to your application without having to modify Spark, if the security concerns are not an issue. That way you can have something working for you, and we have more time to come up with a proper solution for Spark.
@vanzin I agree with you; let's think about it as follow-up work. Thanks.
@tgravescs if a user is manually setting the shared secret on the driver and worker nodes, am I correct in understanding that addFile will be properly authenticated, at least for the transfer? This is my understanding based on the original design, but please correct me if that's wrong. I think the user requirement is the following:

AFAIK there are a fairly wide number of use cases like this. I think this is achievable if the user just manually sets the shared secret. I'll update the requirements on the JIRA to match those proposed in this comment.
Yes, that is how it works in standalone mode, and it will continue to. The master, workers, and all the applications/clients/drivers need to have the same shared secret, and authentication is performed before an added file can be fetched. I think this is fine to support as long as we make it very clear exactly what is and is not supported.
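For context, the standalone shared-secret setup described here maps to Spark's `spark.authenticate` settings in `spark-defaults.conf`. The secret value below is a placeholder and must be identical on the master, workers, and all drivers:

```properties
spark.authenticate         true
spark.authenticate.secret  <same-random-secret-on-master-workers-and-drivers>
```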
Okay, then let's aim to provide this level of security with standalone mode, with clear documentation about what it provides.
But that's just authentication, right? Or is it actually encrypting the bytes being transferred after authentication happens? (My main reluctance here is that to get the keytab or the delegation tokens to the right places, it either needs to be done over an unencrypted channel or via the filesystem, which is sub-optimal in standalone mode since everything runs as the same user.)
I was responding to your comment about needing Diffie-Hellman. We already use an authentication approach based on DH. Encryption is an orthogonal issue. In general, though, Spark only provides authentication, not encryption; we assume that someone can't sniff the network at will. That's the model we have right now in the rest of Spark, for YARN etc. We do protect against someone connecting to well-known ports and trying to, e.g., ask for credential data. That's what we are trying to protect against here.
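A minimal sketch of that model, shared-secret challenge-response authentication without any payload encryption (names and the header-free framing are illustrative, not Spark's actual SASL implementation):

```python
# Shared-secret challenge-response: the client proves it knows the secret
# via an HMAC over a server-issued nonce, without sending the secret.
import hashlib
import hmac
import os

SECRET = b"cluster-shared-secret"  # known to master, workers, and drivers

def respond(challenge: bytes, secret: bytes) -> str:
    """Client side: answer the challenge using the shared secret."""
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

# Server side: issue a random challenge and verify the client's response.
challenge = os.urandom(16)
good = respond(challenge, SECRET)
bad = respond(challenge, b"wrong-guess")

expected = hmac.new(SECRET, challenge, hashlib.sha256).hexdigest()
accepted = hmac.compare_digest(good, expected)   # legitimate client passes
rejected = hmac.compare_digest(bad, expected)    # impostor fails

# Note: nothing here encrypts the payload. After authentication succeeds,
# the transferred bytes still cross the wire in the clear, which is exactly
# the authentication-vs-encryption distinction discussed above.
```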
This PR has gone stale. Do we want to update it or close it out? Also, @huozhanfeng, could you update the title of the PR to "[SPARK-3438] Add support for accessing secured HDFS"? This matches our current convention.
Let's close this issue. There is an alternative PR that is currently in progress.
@pwendell: The alternative PR (#4106) is also being closed. I think you and @tgravescs have narrowed down the use cases for allowing standalone Spark to access secured HDFS. This is useful for those launching mini Spark clusters deployed in Docker-based containers (via Kubernetes, for example) that have an ephemeral lifetime and are shared with maybe one or two users rather than run in a multi-tenant model.
@hsaputra if you don't care about multiple users, just create a principal / keytab pair for each of your workers, and set up a cron job to "kinit" that principal periodically. Then all processes running as the kinit'ed user (you should run the Worker processes as that user) will be able to talk to a secured HDFS. There's no need to modify Spark for that.
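For example, the periodic kinit could be a simple cron entry on each worker host. The paths, principal, realm, and schedule below are illustrative; your KDC's ticket lifetime determines how often to renew:

```shell
# /etc/cron.d/spark-kinit (hypothetical): refresh the Spark user's Kerberos
# ticket every 8 hours from a worker-local keytab readable only by that user.
0 */8 * * * spark /usr/bin/kinit -kt /etc/security/keytabs/spark.keytab spark/$(hostname -f)@EXAMPLE.COM
```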
@vanzin I'm not sure that actually works anymore. It used to, but I think it was broken. See https://issues.apache.org/jira/browse/SPARK-2541
@vanzin: as @tgravescs mentioned, it no longer works, hence the need to fix it. All the PRs and JIRAs for this issue have been closed to point to #4106, which was also closed due to comparisons with YARN mode. Standalone mode serves a different purpose than running in YARN: most deployments are small clusters in a semi-private mode, where we can assume a safe way to add keytabs for Spark users. @tgravescs and @pwendell have made a good case in this PR for what access to secured HDFS is needed and when, so I would suggest we revisit this.
The fix for the bug Tom mentions is separate from distributing keytabs or delegation tokens in an insecure deployment, which is what I'm strongly against.
Sorry, I disagree. If you have Kerberos, it means you care about security, and here you'd be adding a mechanism for Spark to completely break that. So: I'm ok with a fix that allows the workaround I mentioned above, which seems to have been broken by SPARK-2541. I'm against any fix that means exposing users' keytabs or delegation tokens to other users, regardless of whether you think it's ok in your own environment.
Yes, I think @tgravescs's proposal above includes moving the keytabs and tokens to other users. @tgravescs: should we reopen SPARK-2541 to fix this?
That's not what I said at all. What I said is that because Spark standalone is insecure, and it runs everything as the same OS user, it should be up to the admin deploying the cluster to kinit a single user that becomes the "Spark user" for the cluster; everybody submitting jobs to the Spark standalone master is then effectively running Spark apps as that user. That way, Spark does not create a security hole in the cluster, because:
Ah, I need to clarify: when I said "there is a way or process to do kinit for each user", I meant just that. The process of doing kinit or setting up the keytab is NOT done via Spark.
Ok, so it seems like we agree on the overall approach. Regarding SPARK-2541, it used to be that just overriding the ... That way, if you set ...
It seems that now, when the Executor tries to access HDFS, it gets this error:
Well, is the user logged in (i.e., was kinit run) on all the worker nodes, so that processes have proper Kerberos tickets to use?
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-3438
Reading data from secure HDFS into Spark is a useful feature.