[SPARK-22151] : PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly #21468
Conversation
Running in yarn cluster mode, trying to set the python path via spark.yarn.appMasterEnv.PYTHONPATH doesn't work.

The yarn Client code looks at the env variables:
val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
But when you set spark.yarn.appMasterEnv, the value is put into the local env, so the python path set in spark.yarn.appMasterEnv isn't properly picked up.

You can work around this when running in cluster mode by setting it on the client, like:
PYTHONPATH=./addon/python/ spark-submit

In Client.scala, PYTHONPATH was being overridden, so the code was changed to append values to PYTHONPATH instead.
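To make the appending idea concrete, here is a minimal, self-contained Scala sketch; it is illustrative only and not the actual Client.scala code. The env map, the pythonPath sequence, and the ":" separator are simplified stand-ins for the launch-environment map built from spark.yarn.appMasterEnv.*, the entries Spark adds for py-files, and ApplicationConstants.CLASS_PATH_SEPARATOR.

// Illustrative sketch only (not the actual Client.scala code).
object PythonPathAppendSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the launch-environment map that spark.yarn.appMasterEnv.* entries
    // have already been written into.
    val env = scala.collection.mutable.Map("PYTHONPATH" -> "/from/appMasterEnv")
    val pythonPath = Seq("pyspark.zip", "py4j-src.zip")
    val sep = ":" // stand-in for ApplicationConstants.CLASS_PATH_SEPARATOR

    // Before the fix: only the launcher's env and Spark's own entries are read,
    // so assigning this value clobbers the appMasterEnv entry already in `env`.
    val overridden = (sys.env.get("PYTHONPATH") ++ pythonPath).mkString(sep)

    // Idea of the fix: fold the existing appMasterEnv value in as well,
    // appending instead of overriding.
    val appended = (env.get("PYTHONPATH") ++ sys.env.get("PYTHONPATH") ++ pythonPath).mkString(sep)

    env("PYTHONPATH") = appended
    println(s"overriding: $overridden")
    println(s"appending:  $appended")
  }
}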
Jenkins, ok to test

Test build #91345 has finished for PR 21468 at commit

LGTM @pgandhi999. Hope @tgravescs can confirm it.

Thank you for your feedback, @redsanket
.mkString(ApplicationConstants.CLASS_PATH_SEPARATOR)
env("PYTHONPATH") = pythonPathStr
sparkConf.setExecutorEnv("PYTHONPATH", pythonPathStr)
val newValue =
You could just say env.get("PYTHONPATH") ++=: pythonPath before turning the list into a string.
But there are also two extra questions here:
- precedence: should the env come before or after the files added with py-files? I kinda think after makes more sense, since files are generally provided on the command line.
- should appMasterEnv be reflected in executors? With your code it is. I'm not so sure it should.
good questions
- precedence: Right now you can work around this issue by exporting PYTHONPATH before you launch spark-submit. I think that is something that could just be sitting in someone's env on the launcher box and might not be what you want in a yarn container, so I would think that explicitly specifying a python path via spark.yarn.appMasterEnv should take precedence over it, since you explicitly configured it. The second question is where that falls relative to py-files, and that one isn't as clear to me since, like you said, those are explicitly specified too. Maybe we do py-files, then spark.yarn.appMasterEnv.PYTHONPATH, and then last the env PYTHONPATH. That is different from the way it is now, though. Thoughts?
- agree this should not be reflected in the executors, so if it is, we shouldn't do that. We should make sure that spark.executorEnv.PYTHONPATH works.
Thank you for your comments @vanzin, have made the necessary changes. As far as precedence is concerned, I was not sure at first whether I had understood your question, but @tgravescs clarified it for me.
Also note, as @vanzin said, you can just use the ++=: operator with the ListBuffer to prepend and get rid of the if conditions before converting to a string.
env.get("PYTHONPATH") ++=: (sys.env.get("PYTHONPATH") ++=: pythonPath)
@tgravescs Have replaced the if-else code with the ++ operator.
- spark.executorEnv should not be overridden by appMasterEnv - Replacing ++ with ++=:
Test build #91571 has finished for PR 21468 at commit

ok to test

Test build #91603 has finished for PR 21468 at commit

Thank you @HyukjinKwon

Sorry for the delay on this, will get to it next Monday.
vanzin left a comment
The code seems correct but could be simplified a lot, which also would help with clarity.
// Finally, update the Spark config to propagate PYTHONPATH to the AM and executors.
if (pythonPath.nonEmpty) {
val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
val pythonPathStr = (sys.env.get("PYTHONPATH") ++=: pythonPath)
This is modifying pythonPath, but it's not used again after this line, so why?
Makes sense, have changed it to ++
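For context on this exchange, here is a tiny standalone demo (not Spark code) of the difference between the two operators: ++ builds a new collection and leaves the buffer alone, while ++=: prepends in place and mutates the buffer, which is the side effect being pointed out above.

import scala.collection.mutable.ListBuffer

// Standalone demo, not Spark code: contrast of the two operators discussed above.
object PrependDemo {
  def main(args: Array[String]): Unit = {
    val pythonPath = ListBuffer("pyspark.zip", "py4j-src.zip")

    // `++` produces a new collection; `pythonPath` itself is untouched.
    val joined = (sys.env.get("PYTHONPATH") ++ pythonPath).mkString(":")
    println(joined)
    println(pythonPath)   // still ListBuffer(pyspark.zip, py4j-src.zip)

    // `++=:` prepends in place as a side effect and returns the same buffer.
    sys.env.get("PYTHONPATH") ++=: pythonPath
    println(pythonPath)   // now starts with the env value, if PYTHONPATH was set
  }
}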
.mkString(ApplicationConstants.CLASS_PATH_SEPARATOR)
env("PYTHONPATH") = pythonPathStr
sparkConf.setExecutorEnv("PYTHONPATH", pythonPathStr)
val newValue =
This seems unnecessary. What you're trying to do here is concatenate the values of the current PYTHONPATH env variable, the PYTHONPATH created by Spark, and the PYTHONPATH set in spark.yarn.appMasterEnv.
That's, respectively, sys.env.get("PYTHONPATH"), pythonPath, and env.get("PYTHONPATH"). So just concatenate those.
I would have done it, but I need the pythonPathStr variable to set it in executorEnv, so I have not modified that bit. Let me know if you think otherwise.
pythonPathStr
}
env("PYTHONPATH") = newValue
if (!sparkConf.getExecutorEnv.toMap.contains("PYTHONPATH")) {
I see that the previous code was overriding this in the executor env; but perhaps the right thing here is to concatenate them, otherwise the executor might be missing the py4j/pyspark stuff this class adds.
So, basically, what you want is:
- driver: env.get(pp) ++ sys.env.get(pp) ++ pythonPath
- executor: sparkConf.getExecutorEnv(pp) ++ sys.env.get(pp) ++ pythonPath
That is indeed an issue, fixed it. Thanks for pointing that out.
}
env("PYTHONPATH") = newValue
if (!sparkConf.getExecutorEnv.toMap.contains("PYTHONPATH")) {
sparkConf.setExecutorEnv("PYTHONPATH", newValue)
When I did some testing of this, it was causing the job to fail; Parth is investigating. I agree with Vanzin on the executor path comment as well.
…THONPATH Instead of overriding the executorEnv PYTHONPATH, append the existing python path string to it, and avoid passing the value of appMasterEnv.PYTHONPATH to the executorEnv variable
Test build #92434 has finished for PR 21468 at commit
} else {
val pythonPathExecutorEnv = sparkConf.getExecutorEnv.toMap.get("PYTHONPATH").get +
ApplicationConstants.CLASS_PATH_SEPARATOR + pythonPathStr
sparkConf.setExecutorEnv("PYTHONPATH", pythonPathExecutorEnv)
move the setExecutorEnv outside of the if and have the if return the value.
Done
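A rough sketch of that refactor, using the names from the hunk above. The SparkConf executor-env methods are the real API, but the ":" literal stands in for ApplicationConstants.CLASS_PATH_SEPARATOR, and this is not the exact merged code.

import org.apache.spark.SparkConf

// Sketch: compute the executor PYTHONPATH with an if-expression and call
// setExecutorEnv exactly once, outside the branches.
object ExecutorPythonPathSketch {
  def updateExecutorPythonPath(sparkConf: SparkConf, pythonPathStr: String): Unit = {
    val existing = sparkConf.getExecutorEnv.toMap.get("PYTHONPATH")
    val newValue = if (existing.isEmpty) {
      pythonPathStr
    } else {
      existing.get + ":" + pythonPathStr
    }
    sparkConf.setExecutorEnv("PYTHONPATH", newValue)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false).setExecutorEnv("PYTHONPATH", "/user/libs")
    updateExecutorPythonPath(conf, "pyspark.zip:py4j-src.zip")
    println(conf.getExecutorEnv.toMap("PYTHONPATH"))   // /user/libs:pyspark.zip:py4j-src.zip
  }
}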
Test build #93128 has finished for PR 21468 at commit

Test build #93140 has finished for PR 21468 at commit

+1

merged thanks @pgandhi999
What changes were proposed in this pull request?
In Client.scala, PYTHONPATH was being overridden, so the code was changed to append values to PYTHONPATH instead of overriding them.
How was this patch tested?
Added log statements to ApplicationMaster.scala to check for the environment variable PYTHONPATH, ran a spark job in cluster mode before the change, and verified the issue. Performed the same test after the change and verified the fix.
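The patch itself doesn't include those log statements; the check described amounts to printing the variable from inside the container, roughly like the standalone sketch below (inside ApplicationMaster the same thing would go through its logInfo).

// Not the lines from the patch; just the sort of one-off check described above.
object PythonPathCheck {
  def main(args: Array[String]): Unit = {
    println(s"PYTHONPATH in container env: ${sys.env.getOrElse("PYTHONPATH", "<not set>")}")
  }
}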