Skip to content

Conversation

@zsxwing
Copy link
Member

@zsxwing zsxwing commented Dec 8, 2015

This PR encodes and decodes the file name to fix the issue.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 8, 2015

CC @vanzin

@vanzin
Copy link
Contributor

vanzin commented Dec 8, 2015

LGTM pending tests.

@SparkQA
Copy link

SparkQA commented Dec 9, 2015

Test build #47373 has finished for PR 10208 at commit 56a35a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Utils.resolveURI already covers this functionality.
(It's "URI" and "URL" not "uri" and "url".)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utils.resolveURI cannot handle some file names, such as abc:xyz.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that file name is a problem in a URL though. : is not reserved in that part. You'd have to send this to java.net.URI "manually" like you've done, or else use file:/abc:xyz. Then it's doing the same thing; your method wouldn't escape the colon either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to encode colon. The purpose here is: assume we have a uri prefix: spark://localhost:1234/ and a file name: abc xyz, and want to concat the prefix and the file name to spark://localhost:1234/abc%20xyz so that we can pass it to java.net.URI(String).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, but the existing method already does just that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I pass abc:xyz to URI.resolveURI, it doesn't work:

scala> Utils.resolveURI("abc:xyz").getRawPath
res0: String = null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it must be file:/abc:xyz since you need to tell it it needs to be treated as a file name. Meh, at this point why not just new URI("file:/" + fileName).getRawPath.substring(1) and keep it simple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh, at this point why not just new URI("file:/" + fileName).getRawPath.substring(1) and keep it simple?

The parameter of URI(String) must be an encoded path. But here fileName is what we want to encode.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 9, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 9, 2015

Test build #47437 has started for PR 10208 at commit 2c31643.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 10, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 10, 2015

Test build #47529 has finished for PR 10208 at commit 2c31643.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 11, 2015

@srowen further comments?

@srowen
Copy link
Member

srowen commented Dec 12, 2015

This feels hacky to be partly encoding URIs with ad-hoc methods. I understand the problem but this seems to be making the project code more complex. The methods seems like overkill since you're just calling a single (tested) JDK method. shrug

@zsxwing
Copy link
Member Author

zsxwing commented Dec 14, 2015

@srowen Just found org.apache.commons.httpclient.util.URIUtil has already supported URI decode/encode features and updated my PR with it.

@andrewor14
Copy link
Contributor

LGTM, will merge once tests pass.

@srowen
Copy link
Member

srowen commented Dec 14, 2015

yeah if it doesn't mean adding a dependency on HttpClient into core -- it may happen to come in transitively already

@zsxwing
Copy link
Member Author

zsxwing commented Dec 14, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 14, 2015

Test build #47675 has finished for PR 10208 at commit 8734464.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 14, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 15, 2015

Test build #47683 has finished for PR 10208 at commit 8734464.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only accidentally works since the library is a transitive dependency of core. I'm hesitant to make it a direct dependency and make core yet bigger, but in theory it already is (indirectly). It's not otherwise used in Spark though.

I know we've been around a lot on this; I had though using the JDK URI class directly was simplest rather than drag in a dependency. If anything I was questioning wrapping it up in a method, but, that's not that bad.

Any other opinions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything I was questioning wrapping it up in a method, but, that's not that bad.

Do you suggest that reverting my last commit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before you consider doing more work to change again, I'd love to get another opinion if possible. We won't stall too long on this in any event. I suppose I preferred your original change, yes, to this one, just because it avoids any new subtle dependency entanglement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider explicitly adding that dependency? If it's already done somewhere else then it would be best that we don't re-implement the same thing and potentially introduce our own bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okey. Explicitly added it.

@SparkQA
Copy link

SparkQA commented Dec 16, 2015

Test build #47767 has finished for PR 10208 at commit 28fa737.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2015

Test build #47786 has finished for PR 10208 at commit 28fa737.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Dec 16, 2015

The problem is, nothing else in Spark uses HttpClient. I don't think it's worth it just for this, especially since there's essentially a one-liner using JDK methods. I agree that, if we could reuse a significant set of methods from HttpClient and retire some Spark code, that'd make it worth it. That's a bit of a separate issue. Still there, I know HttpClient versions can be a bit of a problem since lots of things depend on various versions of 3.x and 4.x, and it's easy to hit version compatibility problems.

@andrewor14
Copy link
Contributor

I suppose the decision is what we want to maintain more: our own encode/decode logic, or potential dependency conflicts. If we test the encode/decode logic well then we probably don't need to maintain it much, but at the same time it's good to use a trusted library in case there's something we missed. Though I agree the dependency thing could be more difficult to maintain. I'm OK one way or the other, so we can just do it ourselves as @srowen prefers.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 16, 2015

Okey. I just removed my last 2 commits.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 16, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2015

Test build #47847 has finished for PR 10208 at commit 2c31643.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

@srowen do the latest changes LGTY?

@srowen
Copy link
Member

srowen commented Dec 17, 2015

OK. I think this pretty fine.

@zsxwing
Copy link
Member Author

zsxwing commented Dec 17, 2015

Great. Merging to master and 1.6

@asfgit asfgit closed this in 86e405f Dec 17, 2015
asfgit pushed a commit that referenced this pull request Dec 17, 2015
…pecial characters

This PR encodes and decodes the file name to fix the issue.

Author: Shixiong Zhu <[email protected]>

Closes #10208 from zsxwing/uri.

(cherry picked from commit 86e405f)
Signed-off-by: Shixiong Zhu <[email protected]>
@zsxwing zsxwing deleted the uri branch December 17, 2015 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants