
Conversation

@srowen (Member) commented Oct 6, 2016

What changes were proposed in this pull request?

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

How was this patch tested?

Existing tests.

@srowen (Member Author) commented Oct 6, 2016

CC @vanzin -- I think your solution at #13868 (comment) is better, will probably switch to it.

CC @avulanov and @koertkuipers

Member Author:

Here I am just fixing some duplication of docs that relate to this change. There were 3 identical stanzas

@SparkQA commented Oct 6, 2016

Test build #66459 has finished for PR 15382 at commit 4fb0e58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor):
I think working dir makes more sense than home dir, but could this catch people by surprise, since we now expect write permission in the working dir?

@srowen (Member Author) commented Oct 7, 2016

Good point. The docs all say it should be working dir, so I was following the apparent intent along the way here. I don't know if one is more surprising than the other.

@srowen (Member Author) commented Oct 7, 2016

Update on my thinking:

Docs say the default should be "spark-warehouse" in the local working directory. The original code did establish a default on the local file system... though in the home dir, not working dir. So I still favor fixing that, but that's a small, orthogonal question.

Although the default appears to be intended to be on the local FS, I don't think it must be. @avulanov mentions it might be on S3; the Hive warehouse dir is on HDFS too by default, in comparison. I don't have evidence that this has been prohibited so I guess we shouldn't assume this is a local path.

It seems simple to default to "spark-warehouse" in the local working dir of whatever file system is being used, be it local or HDFS. That means SQLConf shouldn't specify a default with any URI scheme or path. SQLConf should then resolve it, though at the moment there's no good way to plumb through a Hadoop Configuration. (I just made a new default one here for the moment.)

Then we're back to an interesting detail, that Hadoop Path's toURI won't add a trailing slash in the case of a local file. Maybe it doesn't in general since it can't always tell whether a path is a directory or not. Local URIs however will have a trailing slash. It's probably best to leave the behavior and adjust the tests to not expect a trailing slash, as the original code seemed to (?)

Does this still work on Windows? As far as I can tell it would.

Is it OK to instantiate FileSystem.get(new Configuration()) in SQLConf? there's no easy access to the actual conf, but, it is only trying to read the default file system which shouldn't depend on Spark-specific Hadoop conf.

I think this change actually restores the ability to specify a warehouse path with a URI scheme. As it stands, it will be interpreted as a file path only if you write "file:/foo". However, there's no way to write a relative file URI, so there is no way to guarantee that a default of "spark-warehouse" is a relative path on the local FS. After this change, HDFS users would see that this correctly defaults to "spark-warehouse" in an HDFS working directory.
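To make the scheme behavior above concrete, here is a small Java sketch (illustrative only, not Spark code) of how java.net.URI parses the kinds of values being discussed:

```java
import java.net.URI;

public class WarehousePathParsing {
    public static void main(String[] args) {
        // A bare relative name like the default has no scheme; which file
        // system it lands on depends entirely on how Spark resolves it.
        URI relative = URI.create("spark-warehouse");
        System.out.println(relative.getScheme());   // null
        System.out.println(relative.isAbsolute());  // false

        // With an explicit scheme, local and HDFS destinations are unambiguous.
        System.out.println(URI.create("file:/foo").getScheme());                 // file
        System.out.println(URI.create("hdfs:///user/me/warehouse").getScheme()); // hdfs
    }
}
```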

Member Author:

Here I've just updated the example's location to match the new default.

@SparkQA commented Oct 7, 2016

Test build #66500 has finished for PR 15382 at commit 67c2b58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 7, 2016

Test build #66501 has finished for PR 15382 at commit cd4b4b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor) commented Oct 7, 2016

I don't think there is such a thing as an HDFS working directory, but that probably means it just uses the home dir on HDFS (/user/${username}) for any relative paths.

Contributor:

Maybe it's better to use Utils.resolveURI here? That resolves to file: by default like most of the rest of Spark, and the previous versions of this code. And handles absolute vs. relative paths too.

Contributor:

Utils.resolveURI creates a java.net.URI from a String path and returns the URI if it has a scheme. Apparently, it treats Windows paths as having a scheme and breaks on backslashes. Utils.resolveURI returns different results compared to Hadoop Path.toUri:

scala> import java.net.URI
import java.net.URI

scala> (new URI("/my")).getScheme()
res22: String = null

scala> (new URI("c:/my")).getScheme()
res23: String = c

scala> (new URI("c:\\my")).getScheme()
java.net.URISyntaxException: Illegal character

Hadoop Path, by contrast, prepends a slash to the drive letter:

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> new Path("c:/my").toUri
res25: java.net.URI = /c:/my

Member Author:

@avulanov just so I'm testing the right things, what would a user set for the warehouse dir, on Windows, if they wanted to specify a local path? It would have to be like file:///C:/... right? That seems to be correctly resolved. That is, we can't expect "C:/..." or "C:..." to work, because we need this to be parsed as a URI or relative path -- is that right? If so then I suspect @vanzin is right, let me try that approach.

@avulanov (Contributor) commented Oct 8, 2016

There are two issues. Linux paths will work without adding file://, and Windows paths will not, which might lead to confusion. Also, I doubt that either Windows or Linux users typically add file:// when they specify a local path. As a rule of thumb, if no scheme is given, then by default it is a local path. The second issue is that URI requires a percent-encoded string, so it does not work with whitespace or special symbols in the path. As far as I understand, URI is for URIs, not for paths. If we want to use URI, then we need to rename the variable to warehouseURI and add proper documentation for this configuration parameter.
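The percent-encoding pitfall can be seen directly with java.net.URI (a standalone Java sketch, independent of the Spark code in question):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodingPitfall {
    public static void main(String[] args) {
        // A plain path containing a space is not a valid URI string.
        try {
            new URI("/tmp/my warehouse");
            System.out.println("parsed");
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // It only parses once the caller percent-encodes it; getPath()
        // then decodes it back to the original path.
        URI encoded = URI.create("/tmp/my%20warehouse");
        System.out.println(encoded.getPath());  // /tmp/my warehouse
    }
}
```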

Contributor:

Agree, if the Utils method doesn't work well with Windows paths it should be avoided... I seem to remember at some point it used to handle those (using the windowsDrive method in Utils) but that seems to be gone.

Member Author:

Do you mean just updating the documentation, or changing the behavior? Updating the docs is fine. Updating the behavior (and the name, I guess) would affect a bunch of other places in the code, and that seems like a bigger issue to tackle separately.

Contributor:

I mean fixing the behavior, because it is incorrect. We cannot use this function in its present state; it will introduce more issues. This seems to be part of a bigger issue: paths in Spark are processed in many different, inconsistent ways. Here are a few examples: #12695 (comment), #13868 (comment)

Contributor:

Can we split the difference here and treat those two as different issues?

  • for this bug, let's hardcode .transform(new File(_).toURI().toString()) into the config constant definition; that's on par with the previous version of this code modified in 11a6844, which hardcoded the file: scheme into the default value.
  • separately, if it's a real issue, let's make changes to make this support non-local directories.

That way we fix the immediate issue and can have more discussions about how to properly solve the other and maybe even fix resolveURI.

Member Author:

I could do that. I could make a different PR with a subset/variant of this PR that would assume the path must be local and then rebase this one on top.

The thing is, I'm not clear it was ever actually meant to be only a local path, even though its default is a local path. It was always parsed as a Hadoop Path. Right now, in the 2.0.1 release, it's definitely going to support non-local paths. And this is supposed to be some analog of the Hive warehouse path, right? Which can be on HDFS.

Therefore I kind of convinced myself that forcing this to a local file isn't going to be a valid intermediate state, and so we shouldn't stop there. Then we're just back to the change here.

How strongly do you feel about it, or rather, what's the problem with just allowing it to be non-local as is implemented here?

Contributor:

I see. I just feel we're getting stuck in this "how to fix resolveURI" argument instead of focusing on fixing this bug. If resolveURI is broken then don't use it. Simple. If you care, file a separate bug to fix it.

Instead, how about making WAREHOUSE_PATH an optional config and changing warehousePath to this:

getConf(WAREHOUSE_PATH).getOrElse(new File("spark-warehouse").toURI().toString())

Maybe with an extra check to make sure the resulting path is absolute. Then whoever wants to override the default value is required to provide a valid, complete URI.
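The suggestion above is Scala against SQLConf; as a rough Java sketch of the same idea (the names here are hypothetical, not the real SQLConf API), the default falls back to File.toURI, which resolves the relative name against the working directory and always yields a file: URI:

```java
import java.io.File;
import java.util.Optional;

public class WarehouseDefaultSketch {
    // Hypothetical stand-in for an optional WAREHOUSE_PATH config entry.
    static String warehousePath(Optional<String> configured) {
        // Whoever overrides the default must supply a complete URI; the
        // default itself is resolved against the current working directory.
        return configured.orElse(new File("spark-warehouse").toURI().toString());
    }

    public static void main(String[] args) {
        System.out.println(warehousePath(Optional.empty()));
        // e.g. file:/current/working/dir/spark-warehouse
        System.out.println(warehousePath(Optional.of("hdfs:///user/me/warehouse")));
    }
}
```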

@srowen (Member Author) commented Oct 8, 2016

OK, now using Utils.resolveURI. This means all inputs without a scheme resolve as file: URIs, including the default "spark-warehouse". This will continue to resolve to the local working dir by default, as per docs.

Inputs with a scheme can be specified, still, to point to HDFS or S3.

I adjusted the tests to, I think, further standardize the handling of URIs.

@SparkQA commented Oct 8, 2016

Test build #66580 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 10, 2016

Test build #3310 has finished for PR 15382 at commit 71f124a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 10, 2016

Test build #3314 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2016

Test build #3322 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66810 has finished for PR 15382 at commit ef7a141.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

So, I know I suggested this, but looking at it now, it seems wrong.

Before, if a user explicitly set this value in the configuration to "/user/blah/warehouse", that would be interpreted as an HDFS path (or rather a path under fs.defaultFS). With this change it would be a local path. Right? So Utils.resolveURI should really only apply to the default value.

Member Author:

Well, before SPARK-15899 it was always interpreted as a local file path. After that change (i.e. right now) it would be interpreted as an HDFS path if that's what fs.defaultFS says. My proposition is that:

  • It should default to spark-warehouse in the local filesystem working dir, because that's what the docs say (therefore, we have a bug at the moment after SPARK-15899)
  • It should be interpreted as a local path if no scheme is given, because other stuff in Spark works that way via Utils.resolveURI
  • It should be possible to set this to a non-local path, because you can do this with the Hive warehouse dir, which this option sort of parallels

I think the current change matches those requirements. I think we do not want "/user/blah/warehouse" to be interpreted as an HDFS path by default because that's not how it worked historically, and that it works this way now is I believe an error.

Is that convincing?

Contributor:

> interpreted as an HDFS path by default because that's not how it worked historically

Except that is how it's worked historically for this piece of code at least. Taking a small detour, the reason I really dislike the use of resolveURI everywhere (and I commented on this when that code was added) is that it makes it awkward to use HDFS. Instead of relying on fs.defaultFS now you have to know the HDFS scheme (which can be "hdfs", "maprfs", "s3", "webhdfs", "viewfs", or others depending on your particular configuration) or things will break in mysterious ways.

Anyway, I don't really care that much for this particular setting, so if you think changing the behavior is better, it's fine with me.

Member Author:

You are right, the difference was that the default before was always a local path. Right now it ends up defaulting to an HDFS path (when used with HDFS, for example). But explicitly setting "/foo/bar" has been interpreted as an HDFS path in the past, and that would change. Hm, OK, I see your point about only applying this to the default value.

Looking at hive.metastore.warehouse.dir, that seems to accept a path without scheme and that's supposed to be read as an HDFS path, I suppose.

The catch remains Windows paths. It should always be possible to write these as URIs, but it then becomes mandatory to write them as URIs if I make this new change. That's a downside, though only if you need to specify a local, non-default path on Windows, which might be a fairly corner case.

I can make the change to have it available for consideration here which I assume @vanzin would favor, and I can see the logic of. How about @avulanov ? I know this has dragged on but I want to get it right. It's complex.

Contributor:

I agree with @vanzin about the dislike for resolveURI. I expect paths without schemes to be on my default filesystem.

Contributor:

@srowen Just for clarity, what is the expected change in behavior (other than the default being under cwd instead of $HOME)? Thanks

@srowen (Member Author) commented Oct 19, 2016

@vanzin this may be more like what you're suggesting. I think the Windows issue makes this a little bit more complex than what you're describing but I think this ends up being workable.

@avulanov FYI

@mridulm I think the change in behavior from what's in master now is:

  • Default to a local filesystem path (again) regardless of Hadoop config
  • Default to spark-warehouse in working dir not user dir (as in the docs)

The net change is actually pretty minor after all.

@SparkQA commented Oct 19, 2016

Test build #67191 has finished for PR 15382 at commit 69c5383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Oct 19, 2016

LGTM. This should restore the behavior pre-11a6844 and be compatible with Windows.

@dongjoon-hyun (Member):

Great. Is it going to 2.0.x, too?

@gatorsmile (Member):

cc @yhuai

assert(spark.sessionState.conf.warehousePath
=== new Path(s"${System.getProperty("user.dir")}/spark-warehouse").toString)
assert(new Path(Utils.resolveURI("spark-warehouse")).toString ===
spark.sessionState.conf.warehousePath + "/")
Member:

could "/" change on Windows?

Member Author:

I think it will be a forward-slash because this will have to be a URI in any event in order to work on Windows. (The reason it's there is because the URI of a local directory will resolve with a trailing slash if the JDK knows the directory exists, otherwise it will be left as-is. The default URI won't have a trailing slash because it doesn't find the directory; it will exist at this point in the test though. Both are equally fine for Spark's purposes but that's why this exists in the test, for the record.)
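The trailing-slash behavior described above is easy to reproduce with File.toURI alone (a JDK-only sketch, independent of Spark):

```java
import java.io.File;

public class TrailingSlashDemo {
    public static void main(String[] args) {
        // The working directory exists, so the JDK can tell it is a
        // directory and appends a trailing slash to its URI.
        File cwd = new File(System.getProperty("user.dir"));
        System.out.println(cwd.toURI().toString().endsWith("/"));      // true

        // A path that does not exist yet is left without the slash,
        // which is why the default URI has none until the dir is created.
        File missing = new File("spark-warehouse-not-created-yet");
        System.out.println(missing.toURI().toString().endsWith("/"));  // false
    }
}
```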

@srowen (Member Author) commented Oct 21, 2016

I think this would have to go into 2.0.x as I think it's a regression from 2.0.0, in that the default location moved from a local file to HDFS for many users, and, it's an HDFS path that can't be created.

I'll leave it open for a few more days.

@asfgit asfgit closed this in 4ecbe1b Oct 24, 2016
asfgit pushed a commit that referenced this pull request Oct 24, 2016
…al FS but can resolve as HDFS path

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

Existing tests.

Author: Sean Owen <[email protected]>

Closes #15382 from srowen/SPARK-17810.

(cherry picked from commit 4ecbe1b)
Signed-off-by: Sean Owen <[email protected]>
@srowen (Member Author) commented Oct 24, 2016

Merged to master/2.0

@srowen srowen deleted the SPARK-17810 branch October 24, 2016 10:21
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017