
Conversation

@nsuthar commented Apr 25, 2014

Getting absolute path instead of relative path: it's the same bug as SPARK-1527.

nirajguavus and others added 4 commits April 23, 2014 16:23

rootDirs in DiskBlockManagerSuite doesn't get the full path from rootDir0, rootDir1.
It's the same issue as SPARK-1527; fixed it to use the absolute path instead of the relative one.
@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14472/

@pwendell
Contributor

Mind changing the name? I created a new JIRA for this:
https://issues.apache.org/jira/browse/SPARK-1623

Also - the fix as-is doesn't seem to compile.

Contributor

Shouldn't this be new File(file.toString).getCanonicalPath ?

Member

(Removed my earlier incorrect comment.) @techaddict is correct, since the value file is a String. Niraj, your two new versions either create a String argument where a File is needed or call File methods on a String. I think the correct invocation is simply deleteBroadcastFile(new File(file)). Everything else would be superfluous. (And this too would simplify if the set contained Files, not Strings of paths.)

Author

Sure Sean, I agree with you, since the HashSet entries are Strings:

private val files = new TimeStampedHashSet[String]

and I am changing it to deleteBroadcastFile(new File(file)) as per your suggestion.

Niraj

On Thu, Apr 24, 2014 at 10:52 PM, Sean Owen wrote, in core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala:

@@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
     val (file, time) = (entry.getKey, entry.getValue)
     if (time < cleanupTime) {
       iterator.remove()
-      deleteBroadcastFile(new File(file.toString))
+      deleteBroadcastFile(new File(file.getCanonicalPath))

@nsuthar
Author

nsuthar commented Apr 25, 2014

Hi Patrick,

By "changing the name", I suppose you meant the assignee name. I have assigned it to myself.

Please correct me if I am wrong.
Thank you,
Niraj

@pwendell
Contributor

@nsuthar ah I wasn't clear - I meant change the title of this pull request to have SPARK-1623 at the front of the title.

@srowen
Member

srowen commented Apr 25, 2014

@nsuthar pretty sure he meant the name of the PR, to something like
SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name
(Note spelling of Canonical ! :) )

nsuthar changed the title Getting absolute path instead of relative path. SPARK-1623: Getting absolute path instead of relative path. Apr 25, 2014
@nsuthar
Author

nsuthar commented Apr 25, 2014

OK, sorry, I misunderstood it. Done.

nsuthar changed the title SPARK-1623: Getting absolute path instead of relative path. SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name Apr 25, 2014
SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name
@nsuthar
Author

nsuthar commented Apr 25, 2014

Fixed it to get the absolute path (SPARK-1623).

@nsuthar
Author

nsuthar commented Apr 25, 2014

Could someone please verify this patch? Thanks.

Niraj

@tdas
Contributor

tdas commented Apr 25, 2014

Jenkins, test this please.

Contributor

Not required to wrap it in another File.

@mridulm
Contributor

mridulm commented Apr 27, 2014

Are you actually seeing problems, or is this a cleanup exercise to use the appropriate API? Creation of the file happens from within Spark and is not externally provided, and that should match how it gets used.

If we are not seeing any actual issues, I would rather not go down the path of fixing this. getCanonicalPath, etc. are expensive operations requiring filesystem I/O to resolve.

@srowen
Member

srowen commented Apr 27, 2014

getAbsolutePath() would also resolve both issues as far as I can tell, and does not involve I/O. IIRC there was a real issue observed in DiskBlockManagerSuite.

For HttpBroadcast, see how the file names are retrieved with getAbsolutePath(), so this is a real issue, albeit probably theoretical right now. (That one's also resolved by just using a Set of Files.)

(Is the I/O going to matter though? These are executed once.)


@mridulm
Contributor

mridulm commented Apr 27, 2014

It goes back to the problem we are trying to solve. If the set/map can contain arbitrary paths, then file.getCanonicalPath is unavoidable. But then the (multiple) I/O and native-call overhead will be there.

If it is in the context of a specific use case, then it becomes a function of that: do we need canonical names, or are the paths always constructed using common patterns that are consistent (always relative to the cwd, or specified w.r.t. some root, which can be a symlink)? IMO Spark does the latter, unless there are recent changes I am missing.

In any case, getAbsolutePath/File does not buy us much.

@srowen
Member

srowen commented Apr 27, 2014

Agreed, and that leads to the changes. For example, HttpBroadcast.scala adds paths to the set using the string from getAbsolutePath (line 173) but removes them with the value of toString, which is getPath, which is not necessarily the same. (It may happen to be here.) It seems better to be consistent. In DiskBlockManagerSuite.scala I also don't see that we have a guarantee that the path is absolute, and it needs to be, IIUC, and the OP suggests this doesn't actually work in some cases.
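The add/remove asymmetry described above can be sketched with plain java.io.File. This is a minimal sketch, not Spark's actual code; the set and the file name here are hypothetical:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class PathMismatch {
    public static void main(String[] args) {
        // A relative path standing in for a broadcast file (name is made up)
        File f = new File("broadcast_1");

        Set<String> files = new HashSet<>();
        files.add(f.getAbsolutePath());   // populated with the absolute path...

        // ...but removed with toString(), which equals getPath(): just "broadcast_1"
        System.out.println(files.remove(f.toString()));        // false: the strings differ
        System.out.println(files.remove(f.getAbsolutePath())); // true: same accessor, same string
    }
}
```

Because a HashSet of Strings compares raw path strings, the two forms of the same path never match; using the same accessor on both sides (or storing File values) removes the asymmetry.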

@pwendell
Contributor

Just fly-by-commenting here, but the performance cost of doing a few directory traversals is not relevant in this code path. It gets called only once in the lifetime of a broadcast variable. I think the bigger issues are around correctness and consistency...

@mridulm
Contributor

mridulm commented May 3, 2014

It is not about a few uses here or there - either the Spark codebase as a whole moves to (a) canonical paths always, or it always sticks to (b) paths relative to the cwd and/or what is returned by File.createTempFile - there is no middle ground, IMO.
(b) is where we are at currently, with some bugs as @srowen mentioned. Either we fix those to conform to (b) or move to (a) entirely.

Btw, regarding cost - we do quite a lot of path manipulation elsewhere (block management, shuffle, etc.) - which adds to the cost.

Trying to carefully cordon off different sections of the code to different idioms is just asking for bugs as the codebase evolves - as we are apparently already hitting!

@pwendell
Contributor

pwendell commented May 7, 2014

@mridulm @srowen sorry, I'm not totally following this one. Is there an immediate, isolated bug here, or is this a broader issue where we need to add consistency across the code base?

I guess my question is - in what case does the current implementation cause a problem for users?

@vanzin
Contributor

vanzin commented May 7, 2014

Just my 2 bits, but using getCanonicalPath() is usually a code smell; unless it's explicitly necessary (i.e. you know you have a symlink and you want to remove the actual file it references), it's generally better to use getAbsolutePath().

(Not that I think this applies here, but just commenting on the choice of API.)
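The difference between the two accessors is visible even without symlinks. A small sketch (the path is made up) showing that getAbsolutePath() is purely lexical while getCanonicalPath() normalizes the path:

```java
import java.io.File;
import java.io.IOException;

public class CanonicalVsAbsolute {
    public static void main(String[] args) throws IOException {
        File f = new File("subdir/../report.txt");  // hypothetical relative path

        // getAbsolutePath() just prefixes the cwd; the "subdir/.." is kept verbatim
        System.out.println(f.getAbsolutePath().contains(".."));   // true

        // getCanonicalPath() collapses "." and ".." and resolves symlinks,
        // which is why it needs filesystem access and is the costlier call
        System.out.println(f.getCanonicalPath().contains(".."));  // false
    }
}
```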

@srowen
Member

srowen commented May 7, 2014

@pwendell @nsuthar

First, note SPARK-1527 (https://issues.apache.org/jira/browse/SPARK-1527) and its PR (#436).

@advancedxy says that was a real issue, and I think it is fixed by the PR, which uses getAbsolutePath(). FWIW I agree with that change.

I think that is slightly preferable to the change in this PR. getCanonicalPath() is probably just fine too, and also solves the issue, but may be unnecessary.

There is a separate issue addressed in this PR / SPARK-1623. That is that the set files, which has Strings, is populated with the result of File.getAbsolutePath() but then later unpopulated with the result of File.toString(), which is File.getPath(), which is not necessarily the same for the same file.

Whether or not it happens to all work out because the arguments are always absolute, I think it's nicer to resolve this by a) calling files.remove(file.getAbsolutePath) or b) just making files a set of File to begin with.

So I suggest:

  • commit the PR for SPARK-1527 and close it
  • abandon this PR and make a new one that implements b) or a) above to resolve SPARK-1623
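Option b) can be sketched like this. It assumes (hypothetically) that both the writer and the cleaner construct the File the same way, from a directory plus a name, so add and remove stay symmetric; the names here are made up:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class FileSet {
    public static void main(String[] args) {
        File broadcastDir = new File("/tmp/broadcast");  // hypothetical directory

        Set<File> files = new HashSet<>();
        files.add(new File(broadcastDir, "broadcast_3"));

        // File.equals() compares the underlying path strings, so an
        // identically constructed File matches without any stringifying
        System.out.println(files.remove(new File(broadcastDir, "broadcast_3"))); // true
    }
}
```

Note this only holds because both sides build the File identically; two File values reaching the same physical file via different relative paths still compare unequal, which is exactly the consistency point being argued above.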

@advancedxy
Contributor

Sorry for being late to this conversation; I am busy with my own project.

First, I agree with @srowen. SPARK-1527 was discovered when there were untracked files in my Spark git dir after running tests. In DiskBlockManagerSuite, files were created using file.getName as parent folders, which is a path relative to the cwd. The cleanup code deletes only the original temp dirs.

val rootDir0 = Files.createTempDir()
rootDir0.deleteOnExit()
val rootDir1 = Files.createTempDir()
rootDir1.deleteOnExit()
val rootDirs = rootDir0.getName + "," + rootDir1.getName

This is a problem, or rather a bug, because the code doesn't do what it intends to do.
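The failure mode can be reproduced directly. A sketch, not the suite's code; the temp-dir prefix is made up:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TempDirByName {
    public static void main(String[] args) throws IOException {
        File rootDir0 = Files.createTempDirectory("spark-local-").toFile();
        rootDir0.deleteOnExit();

        // getName() drops the parent: just "spark-local-...", no separators
        System.out.println(rootDir0.getName().contains(File.separator));  // false

        // Resolving that bare name against the cwd yields a different location,
        // so files created there survive deletion of the real temp dir
        File resolved = new File(rootDir0.getName());
        System.out.println(resolved.getAbsolutePath()
                .equals(rootDir0.getAbsolutePath())); // typically false (unless the cwd is the temp parent)
    }
}
```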

In SPARK-1527, srowen suggested that toString could return a relative path and that we should use getAbsolutePath instead. After thinking twice, I think it would be OK to use toString if we make sure we don't change the working directory between file creation and deletion. That is to say, even if tempDir.toString is a relative path, we can still delete the file using file.toString as the file path.

Back in this pr, as @srowen said,

the set files, which has Strings, is populated with the result of File.getAbsolutePath() but then later unpopulated with the result of File.toString(), which is File.getPath(), which is not necessarily the same for the same file.

There is a problem: File.getAbsolutePath is not necessarily the same as File.getPath, or even the same as File.getCanonicalPath. See the code below:

scala> val fstr = "./1.txt"
fstr: String = ./1.txt

scala> val f = new File(fstr)
f: java.io.File = ./1.txt

scala> f.getPath
res0: String = ./1.txt

scala> f.getAbsolutePath
res1: String = /Users/yexianjin/./1.txt

scala> f.getCanonicalPath
res2: String = /Users/yexianjin/1.txt

@mridulm I think you are right: we are at case (b), paths relative to the cwd and/or what is returned by Files.createTempFile, and I think we should stick to this case. But we need to review the codebase to make sure the path manipulations are correct.

So I suggest (as @srowen said):

  1. Choose getPath or getAbsolutePath. getAbsolutePath can deal with the changing-working-directory situation, but getPath or toString is more consistent with the Spark codebase.
  2. The PR for SPARK-1527 could be committed if we use getAbsolutePath; otherwise, we can abandon both the PR for SPARK-1527 and this PR.
  3. Make a new one that implements (b) or something similar, based on our choice.

What do you think?

@mateiz
Contributor

mateiz commented Jul 29, 2014

Since we merged #749, I believe it's okay to close this, right @pwendell?

@asfgit asfgit closed this in ee91eb8 Aug 27, 2014