
Conversation

@srowen (Member) commented Oct 6, 2016

What changes were proposed in this pull request?

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

How was this patch tested?

Existing tests.

@srowen (Member Author) commented Oct 6, 2016

CC @vanzin -- I think your solution at #13868 (comment) is better, will probably switch to it.

CC @avulanov and @koertkuipers

Member Author:

Here I am just fixing some duplication of docs that relate to this change. There were 3 identical stanzas

@SparkQA commented Oct 6, 2016

Test build #66459 has finished for PR 15382 at commit 4fb0e58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor):
I think working dir makes more sense than home dir, but could this catch people by surprise, since we now expect write permission in the working dir?

@srowen (Member Author) commented Oct 7, 2016

Good point. The docs all say it should be working dir, so I was following the apparent intent along the way here. I don't know if one is more surprising than the other.

@srowen (Member Author) commented Oct 7, 2016

Update on my thinking:

Docs say the default should be "spark-warehouse" in the local working directory. The original code did establish a default on the local file system... though in the home dir, not working dir. So I still favor fixing that, but that's a small, orthogonal question.

Although the default appears to be intended to be on the local FS, I don't think it must be. @avulanov mentions it might be on S3; the Hive warehouse dir is on HDFS too by default, in comparison. I don't have evidence that this has been prohibited so I guess we shouldn't assume this is a local path.

It seems simple to default to "spark-warehouse" in the local working dir of whatever file system is being used, be it local or HDFS. That means SQLConf shouldn't specify a default with any URI scheme or path. SQLConf should then resolve it, though at the moment there's no good way to plumb through a Hadoop Configuration. (I just made a new default one here for the moment.)

Then we're back to an interesting detail, that Hadoop Path's toURI won't add a trailing slash in the case of a local file. Maybe it doesn't in general since it can't always tell whether a path is a directory or not. Local URIs however will have a trailing slash. It's probably best to leave the behavior and adjust the tests to not expect a trailing slash, as the original code seemed to (?)

Does this still work on Windows? As far as I can tell it would.

Is it OK to instantiate FileSystem.get(new Configuration()) in SQLConf? there's no easy access to the actual conf, but, it is only trying to read the default file system which shouldn't depend on Spark-specific Hadoop conf.

I think this change actually restores the ability to specify a warehouse path with a URI scheme. As it stands, it will be interpreted as a file path only if you write "file:/foo". However, there's no way to write a relative file URI, so there is no way to guarantee that a default of "spark-warehouse" is a relative path on the local FS. After this change, HDFS users would see that this correctly defaults to "spark-warehouse" in an HDFS working directory.
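To make the scheme behavior above concrete, here is a small Java sketch (illustrative only, not Spark code) of how java.net.URI parses the kinds of values being discussed:

```java
import java.net.URI;

public class WarehousePathParsing {
    public static void main(String[] args) {
        // A bare relative name like the default has no scheme; which file
        // system it lands on depends entirely on how Spark resolves it.
        URI relative = URI.create("spark-warehouse");
        System.out.println(relative.getScheme());   // null
        System.out.println(relative.isAbsolute());  // false

        // With an explicit scheme, local and HDFS destinations are unambiguous.
        System.out.println(URI.create("file:/foo").getScheme());                 // file
        System.out.println(URI.create("hdfs:///user/me/warehouse").getScheme()); // hdfs
    }
}
```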

Member Author:

Here I've just updated the example's location to match the new default.

@SparkQA commented Oct 7, 2016

Test build #66500 has finished for PR 15382 at commit 67c2b58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 7, 2016

Test build #66501 has finished for PR 15382 at commit cd4b4b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor) commented Oct 7, 2016

I don't think there is such a thing as an HDFS working directory, but that probably means it just uses the home dir on HDFS (/user/${username}) for any relative paths.

Contributor:

Maybe it's better to use Utils.resolveURI here? That resolves to file: by default like most of the rest of Spark, and the previous versions of this code. And handles absolute vs. relative paths too.

Contributor:

Utils.resolveURI creates a java.net.URI from a String path and returns the URI if it has a scheme. Apparently, it treats Windows paths as having a scheme and breaks on backslashes. Utils.resolveURI returns different results compared to Hadoop Path.toUri:

scala> import java.net.URI
import java.net.URI

scala> (new URI("/my")).getScheme()
res22: String = null

scala> (new URI("c:/my")).getScheme()
res23: String = c

scala> (new URI("c:\\my")).getScheme()
java.net.URISyntaxException: Illegal character

Hadoop Path, by contrast, prepends a slash to the drive letter:

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> new Path("c:/my").toUri
res25: java.net.URI = /c:/my

Member Author:

@avulanov just so I'm testing the right things, what would a user set for the warehouse dir, on Windows, if they wanted to specify a local path? It would have to be like file:///C:/... right? That seems to be correctly resolved. That is, we can't expect "C:/..." or "C:..." to work, because we need this to be parsed as a URI or relative path -- is that right? If so then I suspect @vanzin is right, let me try that approach.

@avulanov (Contributor) commented Oct 8, 2016

There are two issues. Linux paths will work without adding file://, and Windows paths will not, which might lead to confusion. Also, I doubt that either Windows or Linux users typically add file:// when they specify a local path. As a rule of thumb, if no scheme is given, then by default it is a local path. The second issue is that URI requires a percent-encoded string, so it does not work with whitespace or special symbols in the path. As far as I understand, URI is for URIs, not for paths. If we want to use URI, then we need to rename the variable to warehouseURI and add proper documentation for this configuration parameter.
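The percent-encoding pitfall can be seen directly with java.net.URI (a standalone Java sketch, independent of the Spark code in question):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodingPitfall {
    public static void main(String[] args) {
        // A plain path containing a space is not a valid URI string.
        try {
            new URI("/tmp/my warehouse");
            System.out.println("parsed");
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // It only parses once the caller percent-encodes it; getPath()
        // then decodes it back to the original path.
        URI encoded = URI.create("/tmp/my%20warehouse");
        System.out.println(encoded.getPath());  // /tmp/my warehouse
    }
}
```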

Contributor:

Agree, if the Utils method doesn't work well with Windows paths it should be avoided... I seem to remember at some point it used to handle those (using the windowsDrive method in Utils) but that seems to be gone.

Member Author:

Do you mean just updating the documentation, or changing the behavior? Updating the docs is fine. Updating the behavior (and the name, I guess) would affect a bunch of other places in the code, and that seems like a bigger issue to tackle separately.

Contributor:

I mean fixing the behavior, because it is incorrect. We cannot use this function in its present state; it will introduce more issues. This seems to be part of a bigger issue: paths in Spark are processed in many different, inconsistent ways. Here are a few examples: #12695 (comment), #13868 (comment)

Contributor:

Can we split the difference here and treat those two as different issues?

  • for this bug, let's hardcode .transform(new File(_).toURI().toString()) into the config constant definition; that's on par with the previous version of this code modified in 11a6844, which hardcoded the file: scheme into the default value.
  • separately, if it's a real issue, let's make changes to make this support non-local directories.

That way we fix the immediate issue and can have more discussions about how to properly solve the other and maybe even fix resolveURI.

Member Author:

I could do that. I could make a different PR with a subset/variant of this PR that would assume the path must be local and then rebase this one on top.

The thing is, I'm not clear it was ever actually meant to be only a local path, even though its default is a local path. It was always parsed as a Hadoop Path. Right now, in the 2.0.1 release, it's definitely going to support non-local paths. And this is supposed to be some analog of the Hive warehouse path, right? Which can be on HDFS.

Therefore I kind of convinced myself that forcing this to a local file isn't going to be a valid intermediate state, and so we shouldn't stop there. Then we're just back to the change here.

How strongly do you feel about it, or rather, what's the problem with just allowing it to be non-local as is implemented here?

Contributor:

I see. I just feel we're getting stuck in this "how to fix resolveURI" argument instead of focusing on fixing this bug. If resolveURI is broken then don't use it. Simple. If you care, file a separate bug to fix it.

Instead, how about making WAREHOUSE_PATH an optional config and changing warehousePath to this:

getConf(WAREHOUSE_PATH).getOrElse(new File("spark-warehouse").toURI().toString())

Maybe with an extra check to make sure the resulting path is absolute. Then whoever wants to override the default value is required to provide a valid, complete URI.
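The suggestion above is Scala against SQLConf; as a rough Java sketch of the same idea (the names here are hypothetical, not the real SQLConf API), the default falls back to File.toURI, which resolves the relative name against the working directory and always yields a file: URI:

```java
import java.io.File;
import java.util.Optional;

public class WarehouseDefaultSketch {
    // Hypothetical stand-in for an optional WAREHOUSE_PATH config entry.
    static String warehousePath(Optional<String> configured) {
        // Whoever overrides the default must supply a complete URI; the
        // default itself is resolved against the current working directory.
        return configured.orElse(new File("spark-warehouse").toURI().toString());
    }

    public static void main(String[] args) {
        System.out.println(warehousePath(Optional.empty()));
        // e.g. file:/current/working/dir/spark-warehouse
        System.out.println(warehousePath(Optional.of("hdfs:///user/me/warehouse")));
    }
}
```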

@srowen (Member Author) commented Oct 8, 2016

OK, now using Utils.resolveURI. This means all inputs without a scheme resolve as file: URIs, including the default "spark-warehouse". This will continue to resolve to the local working dir by default, as per docs.

Inputs with a scheme can be specified, still, to point to HDFS or S3.

I adjusted the tests to, I think, further standardize the handling of URIs.

@SparkQA commented Oct 8, 2016

Test build #66580 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 10, 2016

Test build #3310 has finished for PR 15382 at commit 71f124a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 10, 2016

Test build #3314 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2016

Test build #3322 has finished for PR 15382 at commit 71f124a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2016

Test build #66810 has finished for PR 15382 at commit ef7a141.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

So, I know I suggested this, but looking at it now, it seems wrong.

Before, if a user explicitly set this value in the configuration to "/user/blah/warehouse", that would be interpreted as an HDFS path (or rather a path under fs.defaultFS). With this change it would be a local path. Right? So Utils.resolveURI should really only apply to the default value.

Member Author:

Well, before SPARK-15899 it was always interpreted as a local file path. After that change (i.e. right now) it would be interpreted as an HDFS path if that's what fs.defaultFS says. My proposition is that:

  • It should default to spark-warehouse in the local filesystem working dir, because that's what the docs say (therefore, we have a bug at the moment after SPARK-15899)
  • It should be interpreted as a local path if no scheme is given, because other stuff in Spark works that way via Utils.resolveURI
  • It should be possible to set this to a non-local path, because you can do this with the Hive warehouse dir, which this option sort of parallels

I think the current change matches those requirements. I think we do not want "/user/blah/warehouse" to be interpreted as an HDFS path by default because that's not how it worked historically, and that it works this way now is I believe an error.

Is that convincing?

Contributor:

> interpreted as an HDFS path by default because that's not how it worked historically

Except that is how it's worked historically for this piece of code at least. Taking a small detour, the reason I really dislike the use of resolveURI everywhere (and I commented on this when that code was added) is that it makes it awkward to use HDFS. Instead of relying on fs.defaultFS now you have to know the HDFS scheme (which can be "hdfs", "maprfs", "s3", "webhdfs", "viewfs", or others depending on your particular configuration) or things will break in mysterious ways.

Anyway, I don't really care that much for this particular setting, so if you think changing the behavior is better, it's fine with me.

Member Author:

You are right, the difference was that the default before was always a local path. Right now it ends up defaulting to an HDFS path (when used with HDFS, for example). But explicitly setting "/foo/bar" has been interpreted as an HDFS path in the past, and that would change. Hm, OK, I see your point about only applying this to the default value.

Looking at hive.metastore.warehouse.dir, that seems to accept a path without scheme and that's supposed to be read as an HDFS path, I suppose.

The catch remains Windows paths. It should always be possible to write these as URIs, but it then becomes mandatory to write them as URIs if I make this new change. That's a downside, though only if you need to specify a local, non-default path on Windows, which might be a fairly corner case.

I can make the change to have it available for consideration here which I assume @vanzin would favor, and I can see the logic of. How about @avulanov ? I know this has dragged on but I want to get it right. It's complex.

Contributor:

I agree with @vanzin about the dislike for resolveURI. I expect paths without schemes to be on my default filesystem.

Contributor:

@srowen Just for clarity, what is the expected change in behavior (other than the default being under cwd instead of $HOME)? Thanks

@srowen (Member Author) commented Oct 19, 2016

@vanzin this may be more like what you're suggesting. I think the Windows issue makes this a little bit more complex than what you're describing but I think this ends up being workable.

@avulanov FYI

@mridulm I think the change in behavior from what's in master now is:

  • Default to a local filesystem path (again) regardless of Hadoop config
  • Default to spark-warehouse in working dir not user dir (as in the docs)

The net change is actually pretty minor after all.

@SparkQA commented Oct 19, 2016

Test build #67191 has finished for PR 15382 at commit 69c5383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Oct 19, 2016

LGTM. This should restore the behavior pre-11a6844 and be compatible with Windows.

@dongjoon-hyun (Member):

Great. Is it going to 2.0.x, too?

@gatorsmile (Member):

cc @yhuai

assert(spark.sessionState.conf.warehousePath
=== new Path(s"${System.getProperty("user.dir")}/spark-warehouse").toString)
assert(new Path(Utils.resolveURI("spark-warehouse")).toString ===
spark.sessionState.conf.warehousePath + "/")
Member:

could "/" change on Windows?

Member Author:

I think it will be a forward-slash because this will have to be a URI in any event in order to work on Windows. (The reason it's there is because the URI of a local directory will resolve with a trailing slash if the JDK knows the directory exists, otherwise it will be left as-is. The default URI won't have a trailing slash because it doesn't find the directory; it will exist at this point in the test though. Both are equally fine for Spark's purposes but that's why this exists in the test, for the record.)
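The trailing-slash behavior described above is easy to reproduce with File.toURI alone (a JDK-only sketch, independent of Spark):

```java
import java.io.File;

public class TrailingSlashDemo {
    public static void main(String[] args) {
        // The working directory exists, so the JDK can tell it is a
        // directory and appends a trailing slash to its URI.
        File cwd = new File(System.getProperty("user.dir"));
        System.out.println(cwd.toURI().toString().endsWith("/"));      // true

        // A path that does not exist yet is left without the slash,
        // which is why the default URI has none until the dir is created.
        File missing = new File("spark-warehouse-not-created-yet");
        System.out.println(missing.toURI().toString().endsWith("/"));  // false
    }
}
```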

@srowen (Member Author) commented Oct 21, 2016

I think this would have to go into 2.0.x as I think it's a regression from 2.0.0, in that the default location moved from a local file to HDFS for many users, and, it's an HDFS path that can't be created.

I'll leave it open for a few more days.

@asfgit asfgit closed this in 4ecbe1b Oct 24, 2016
asfgit pushed a commit that referenced this pull request Oct 24, 2016
…al FS but can resolve as HDFS path

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

Existing tests.

Author: Sean Owen <[email protected]>

Closes #15382 from srowen/SPARK-17810.

(cherry picked from commit 4ecbe1b)
Signed-off-by: Sean Owen <[email protected]>
@srowen (Member Author) commented Oct 24, 2016

Merged to master/2.0

@srowen srowen deleted the SPARK-17810 branch October 24, 2016 10:21
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017