-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives #15627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #67525 has finished for PR 15627 at commit
|
|
@kishorvpatil please look at the test failure |
9bb1623 to
a3eb6d4
Compare
|
Test build #67600 has finished for PR 15627 at commit
|
… under archives and files
a3eb6d4 to
2c55fc2
Compare
|
Jenkins, test this please |
|
Jenkins, add to whitelist |
| cachedSecondaryJarLinks += localizedPath | ||
| } | ||
| } else { | ||
| require(localizedPath !=null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add space after !=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets change the error to illegal argument exception.
Also lets comment this to indicate jars are ok due to spark 2.0 jar install, everything else shouldn't have multiple of same jar/file/archive.
|
Test build #67648 has finished for PR 15627 at commit
|
|
Test build #67649 has finished for PR 15627 at commit
|
| } else { | ||
| if (localizedPath != null) { | ||
| throw new IllegalArgumentException(s"Attempt to add ($file) multiple times. " + | ||
| "Please check the values of 'spark.yarn.dist.files' and/or " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove the part about check values of those specific configs because there are multiple ways for these to specified (configs or --files, --jars, etc). Perhaps just say please check the values you specified for uploading files,jars, and archives to make sure one isn't specified multiple times..
|
Test build #67719 has finished for PR 15627 at commit
|
|
Test build #67720 has finished for PR 15627 at commit
|
| val (_, localizedPath) = distribute(file, resType = resType) | ||
| if (addToClasspath && localizedPath != null) { | ||
| cachedSecondaryJarLinks += localizedPath | ||
| if (addToClasspath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here explaining what exactly thi sis doing to help explain and keep from breaking in future.
Also can you add another unit test to cover this case.
|
Test build #67826 has finished for PR 15627 at commit
|
| val userLib1 = Utils.createTempDir() | ||
| val userLib2 = Utils.createTempDir() | ||
|
|
||
| val jar1 = TestUtils.createJarWithFiles(Map(), jarsDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used anywhere
| test("distribute archive multiple times") { | ||
| val libs = Utils.createTempDir() | ||
| val jarsDir = new File(libs, "jars") | ||
| assert(jarsDir.mkdir()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see jarsDir being used anywhere either
| val output = new FileOutputStream(target) | ||
| Utils.copyStream(input, output, closeStreams = true) | ||
| target.toURI.toURL | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can cleanup the variables names above I think it would help a lot, the test is confusing. I know you just copy and pasted but would be nice to clean up.
Also can we have 3 tests or 3 asserts,
- one for same file in --files
- one for same file in --archives
- one for same file in --files and --archives
|
Test build #68083 has finished for PR 15627 at commit
|
|
+1 |
… --files and --archives ## What changes were proposed in this pull request? During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives. ## How was this patch tested? Manually tested: 1. if same jar is mentioned in --jars and --files it will continue to submit the job. - basically functionality [SPARK-14423] #12203 is unchanged 1. if same file is mentioned in --files and --archives it will fail to submit the job. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. … under archives and files Author: Kishor Patil <[email protected]> Closes #15627 from kishorvpatil/spark18099. (cherry picked from commit 098e4ca) Signed-off-by: Tom Graves <[email protected]>
|
@kishorvpatil @tgravescs It seems this pr is breaking functionalities of |
| } | ||
| } else { | ||
| require(localizedPath !=null) | ||
| if (localizedPath != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess here is localizedPath == null ?
|
thanks for pointing this out |
|
SPARK-18357 filed to fix |
## What changes were proposed in this pull request? The #15627 broke functionality with yarn --files --archives does not accept any files. This patch ensures that --files and --archives accept unique files. ## How was this patch tested? A. I added unit tests. B. Also, manually tested --files with --archives to throw exception if duplicate files are specified and continue if unique files are specified. Author: Kishor Patil <[email protected]> Closes #15810 from kishorvpatil/SPARK18357. (cherry picked from commit 245e5a2) Signed-off-by: Tom Graves <[email protected]>
## What changes were proposed in this pull request? The apache#15627 broke functionality with yarn --files --archives does not accept any files. This patch ensures that --files and --archives accept unique files. ## How was this patch tested? A. I added unit tests. B. Also, manually tested --files with --archives to throw exception if duplicate files are specified and continue if unique files are specified. Author: Kishor Patil <[email protected]> Closes apache#15810 from kishorvpatil/SPARK18357.
|
@kishorvpatil Thank you for fixing this! |
… --files and --archives ## What changes were proposed in this pull request? During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives. ## How was this patch tested? Manually tested: 1. if same jar is mentioned in --jars and --files it will continue to submit the job. - basically functionality [SPARK-14423] apache#12203 is unchanged 1. if same file is mentioned in --files and --archives it will fail to submit the job. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. … under archives and files Author: Kishor Patil <[email protected]> Closes apache#15627 from kishorvpatil/spark18099.
## What changes were proposed in this pull request? The apache#15627 broke functionality with yarn --files --archives does not accept any files. This patch ensures that --files and --archives accept unique files. ## How was this patch tested? A. I added unit tests. B. Also, manually tested --files with --archives to throw exception if duplicate files are specified and continue if unique files are specified. Author: Kishor Patil <[email protected]> Closes apache#15810 from kishorvpatil/SPARK18357.
What changes were proposed in this pull request?
During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives.
How was this patch tested?
Manually tested:
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
… under archives and files