[SPARK-17810] [SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path #15382
Conversation
CC @vanzin -- I think your solution at #13868 (comment) is better, will probably switch to it. CC @avulanov and @koertkuipers
docs/sql-programming-guide.md
Outdated
Here I am just fixing some duplication of docs that relate to this change. There were 3 identical stanzas
Test build #66459 has finished for PR 15382 at commit
I think working dir makes more sense than home dir. But could this catch people by surprise, because we now expect write permission in the working dir?
Good point. The docs all say it should be the working dir, so I was following the apparent intent along the way here. I don't know if one is more surprising than the other.
Update on my thinking: the docs say the default should be "spark-warehouse" in the local working directory. The original code did establish a default on the local file system... though in the home dir, not the working dir. So I still favor fixing that, but that's a small, orthogonal question.

Although the default appears to be intended to be on the local FS, I don't think it must be. @avulanov mentions it might be on S3; the Hive warehouse dir is on HDFS too by default, in comparison. I don't have evidence that this has been prohibited, so I guess we shouldn't assume this is a local path. It seems simple to default to "spark-warehouse" in the local working dir of whatever file system is being used, be it local or HDFS. That means SQLConf shouldn't specify a default with any URI scheme or path. SQLConf should then resolve it, though at the moment there's no good way to plumb through a Hadoop Configuration. (I just made a new default one in here for the moment.)

Then we're back to an interesting detail: Hadoop Path's toUri won't add a trailing slash in the case of a local file. Maybe it doesn't in general, since it can't always tell whether a path is a directory or not. Local URIs, however, will have a trailing slash. It's probably best to leave the behavior and adjust the tests to not expect a trailing slash, as the original code seemed to (?)

Does this still work on Windows? As far as I can tell it would. Is it OK to instantiate a new Hadoop Configuration here?

I think this change actually restores the ability to specify a warehouse path with a URI scheme. At the moment, it will be interpreted as a file path only if you write "file:/foo". However, there's no way to write a relative file URI, so no way to guarantee that a default of "spark-warehouse" is a relative path on the local FS. After this change, HDFS users would see that this correctly defaults to "spark-warehouse" in an HDFS working directory.
Here I've just updated the example's location to match the new default.
Test build #66500 has finished for PR 15382 at commit
Test build #66501 has finished for PR 15382 at commit
I don't think there is such a thing as an HDFS working directory, but that probably means it just uses the home dir on HDFS (/user/${username}) for any relative paths.
Maybe it's better to use Utils.resolveURI here? That resolves to file: by default like most of the rest of Spark, and the previous versions of this code. And handles absolute vs. relative paths too.
Utils.resolveURI creates a java.net.URI from a String path, and returns the URI if it has a scheme. Apparently, it treats Windows paths as having a scheme and breaks down with backslashes. Utils.resolveURI returns different results compared to Hadoop Path.toUri:

```scala
scala> (new URI("/my")).getScheme()
res22: String = null

scala> (new URI("c:/my")).getScheme()
res23: String = c

scala> (new URI("c:\\my")).getScheme()
java.net.URISyntaxException: Illegal character
```

Hadoop Path adds a slash to the path:

```scala
scala> new Path("c:/my").toUri
res25: java.net.URI = /c:/my
```
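For reference, those REPL results can be restated as a runnable, JDK-only sketch (the Hadoop `Path` case is omitted since it needs the Hadoop dependency):

```scala
import java.net.{URI, URISyntaxException}

// A plain absolute path has no scheme
val plainScheme = new URI("/my").getScheme            // null

// A Windows drive letter parses as if it were a URI scheme
val driveScheme = new URI("c:/my").getScheme          // "c"

// Backslashes are illegal URI characters, so parsing throws
val backslashRejected =
  try { new URI("c:\\my"); false }
  catch { case _: URISyntaxException => true }        // true
```

This is the core of why Windows paths can't be fed straight into java.net.URI: the drive letter masquerades as a scheme and the separators don't parse.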
@avulanov just so I'm testing the right things, what would a user set for the warehouse dir, on Windows, if they wanted to specify a local path? It would have to be like file:///C:/... right? That seems to be correctly resolved. That is, we can't expect "C:/..." or "C:..." to work, because we need this to be parsed as a URI or relative path -- is that right? If so then I suspect @vanzin is right, let me try that approach.
There are two issues. Linux paths will work without adding file://, and Windows paths will not; that might lead to confusion. Also, I doubt that either Windows or Linux users typically add file:// when they specify a local path. As a rule of thumb, if no scheme is given, then by default it is a local path. The second issue is that URI requires a percent-encoded string, so it does not work with white space or special symbols in the path. As far as I understand, URI is for URIs and not for paths. If we want to use URI then we need to change the variable name to warehouseURI and add proper documentation about this configuration parameter.
Agree, if the Utils method doesn't work well with Windows paths it should be avoided... I seem to remember at some point it used to handle those (using the windowsDrive method in Utils) but that seems to be gone.
Do you mean just updating the documentation or changing the behavior? Updating the docs is fine. Updating the behavior (and name, I guess) would affect a bunch of other places in the code, and that seems like a bigger issue to tackle separately.
I mean fixing the behavior, because it is incorrect. We cannot use this function in its present state; it will introduce more issues. This seems to be part of a bigger issue: paths in Spark are processed in many different ways that are not consistent. Here are a few examples: #12695 (comment), #13868 (comment)
Can we split the difference here and treat those two as different issues?

- for this bug, let's hardcode `.transform(new File(_).toURI().toString())` into the config constant definition; that's on par with the previous version of this code modified in 11a6844, which hardcoded the `file:` scheme into the default value.
- separately, if it's a real issue, let's make changes to support non-local directories.

That way we fix the immediate issue and can have more discussions about how to properly solve the other, and maybe even fix resolveURI.
I could do that. I could make a different PR with a subset/variant of this PR that would assume the path must be local and then rebase this one on top.
The thing is, I'm not clear it was ever actually meant to only be a local path, even though its default is a local path. It was always parsed as a Hadoop Path. Right now in the 2.0.1 release it's definitely going to support non-local paths. And this is supposed to be some analog of the Hive warehouse path, right? which can be on HDFS.
Therefore I kind of convinced myself that forcing this to a local file isn't going to be a valid intermediate state, and so we shouldn't stop there. Then we're just back to the change here.
How strongly do you feel about it, or rather, what's the problem with just allowing it to be non-local as is implemented here?
I see. I just feel we're getting stuck in this "how to fix resolveURI" argument instead of focusing on fixing this bug. If resolveURI is broken then don't use it. Simple. If you care, file a separate bug to fix it.
Instead, how about making WAREHOUSE_PATH an optional config and changing warehousePath to this:

```scala
getConf(WAREHOUSE_PATH).getOrElse(new File("spark-warehouse").toURI().toString())
```

Maybe with an extra check to make sure the resulting path is absolute. Then whoever wants to override the default value is required to provide a valid, complete URI.
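A minimal, self-contained sketch of that fallback logic, with `userSetting` standing in for the result of `getConf(WAREHOUSE_PATH)` on an optional config (the name and shape here are illustrative, not the actual SQLConf API):

```scala
import java.io.File

// If the user set nothing, default to "spark-warehouse" under the
// current working directory, expressed as a file: URI; otherwise
// the explicitly configured value wins untouched.
def warehousePath(userSetting: Option[String]): String =
  userSetting.getOrElse(new File("spark-warehouse").toURI.toString)

val defaulted = warehousePath(None)                       // file:/<cwd>/spark-warehouse
val explicit  = warehousePath(Some("hdfs://nn/warehouse")) // hdfs://nn/warehouse
```

The point of the design is that the scheme only gets baked in for the default; an explicit setting is passed through as-is, so an HDFS or S3 URI keeps working.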
OK, now using Utils.resolveURI. Inputs with a scheme can still be specified to point to HDFS or S3. I adjusted the tests to, I think, further standardize the handling of URIs.
Test build #66580 has finished for PR 15382 at commit
Test build #3310 has finished for PR 15382 at commit
Test build #3314 has finished for PR 15382 at commit
Test build #3322 has finished for PR 15382 at commit
Test build #66810 has finished for PR 15382 at commit
So, I know I suggested this, but looking at it now, it seems wrong.
Before, if a user explicitly set this value in the configuration to "/user/blah/warehouse", that would be interpreted as an HDFS path (or rather a path under fs.defaultFS). With this change it would be a local path. Right? So Utils.resolveURI should really only apply to the default value.
Well, before SPARK-15899 it was always interpreted as a local file path. After that change (i.e. right now) it would be interpreted as an HDFS path if that's what fs.defaultFS says. My proposition is that:

- It should default to `spark-warehouse` in the local filesystem working dir, because that's what the docs say (therefore, we have a bug at the moment after SPARK-15899)
- It should be interpreted as a local path if no scheme is given, because other stuff in Spark works that way via `Utils.resolveURI`
- It should be possible to set this to a non-local path, because you can do this with the Hive warehouse dir, which this option sort of parallels

I think the current change matches those requirements. I think we do not want "/user/blah/warehouse" to be interpreted as an HDFS path by default, because that's not how it worked historically; that it works this way now is, I believe, an error.

Is that convincing?
> interpreted as an HDFS path by default because that's not how it worked historically
Except that is how it's worked historically for this piece of code at least. Taking a small detour, the reason I really dislike the use of resolveURI everywhere (and I commented on this when that code was added) is that it makes it awkward to use HDFS. Instead of relying on fs.defaultFS now you have to know the HDFS scheme (which can be "hdfs", "maprfs", "s3", "webhdfs", "viewfs", or others depending on your particular configuration) or things will break in mysterious ways.
Anyway, I don't really care that much for this particular setting, so if you think changing the behavior is better, it's fine with me.
You are right; the difference was that the default before was always a local path. Right now it ends up defaulting to an HDFS path (when used with HDFS, for example). But explicitly setting "/foo/bar" has been interpreted as an HDFS path in the past, and that would change. Hm, OK, I see your point about only applying this to the default value.

Looking at hive.metastore.warehouse.dir, that seems to accept a path without a scheme, and that's supposed to be read as an HDFS path, I suppose.

The catch remains Windows paths. It should always be possible to write these as URIs, but it then becomes mandatory to write them as URIs if I make this new change. That's a downside, though only if you need to specify a local, non-default path on Windows, which might be a fairly corner case.

I can make the change to have it available for consideration here, which I assume @vanzin would favor, and I can see the logic of. How about @avulanov? I know this has dragged on but I want to get it right. It's complex.
I agree with @vanzin about the dislike for resolveURI. I expect paths without schemes to be on my default filesystem.
@srowen Just for clarity, what is the proposed change in behavior (other than the default being under cwd instead of $HOME)? Thanks
…eURI; a little standardization of related tests
…use" through Utils.resolveURI
@vanzin this may be more like what you're suggesting. I think the Windows issue makes this a little bit more complex than what you're describing but I think this ends up being workable. @avulanov FYI @mridulm I think the change in behavior from what's in
The net change is actually pretty minor after all.
Test build #67191 has finished for PR 15382 at commit
LGTM. This should restore the behavior pre-11a6844 and be compatible with Windows.
Great. Is it going to 2.0.x, too?
cc @yhuai |
```scala
assert(spark.sessionState.conf.warehousePath
  === new Path(s"${System.getProperty("user.dir")}/spark-warehouse").toString)
assert(new Path(Utils.resolveURI("spark-warehouse")).toString ===
  spark.sessionState.conf.warehousePath + "/")
```
could "/" change on Windows?
I think it will be a forward-slash because this will have to be a URI in any event in order to work on Windows. (The reason it's there is because the URI of a local directory will resolve with a trailing slash if the JDK knows the directory exists, otherwise it will be left as-is. The default URI won't have a trailing slash because it doesn't find the directory; it will exist at this point in the test though. Both are equally fine for Spark's purposes but that's why this exists in the test, for the record.)
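That JDK behavior is easy to check directly; here's a small sketch (using a temp directory rather than Spark's actual warehouse path) showing that `java.io.File.toURI` appends a trailing slash only when the path is an existing directory:

```scala
import java.io.File
import java.nio.file.Files

// toURI appends "/" only when the JDK can see that the path
// is a directory that actually exists on disk
val dir = Files.createTempDirectory("spark-warehouse").toFile
val existingUri = dir.toURI.toString                   // ends with "/"
val missingUri  = new File(dir, "nope").toURI.toString // no trailing slash
```

So the same string yields a slightly different URI before and after the directory is created, which is exactly why the test above tolerates the trailing slash.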
I think this would have to go into 2.0.x as I think it's a regression from 2.0.0, in that the default location moved from a local file to HDFS for many users, and it's an HDFS path that can't be created. I'll leave it open for a few more days.
…al FS but can resolve as HDFS path Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir Existing tests. Author: Sean Owen <[email protected]> Closes #15382 from srowen/SPARK-17810. (cherry picked from commit 4ecbe1b) Signed-off-by: Sean Owen <[email protected]>
Merged to master/2.0 |
…al FS but can resolve as HDFS path ## What changes were proposed in this pull request? Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir ## How was this patch tested? Existing tests. Author: Sean Owen <[email protected]> Closes apache#15382 from srowen/SPARK-17810.
What changes were proposed in this pull request?
Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir
How was this patch tested?
Existing tests.