[SPARK-23434][SQL][BRANCH-2.3] Spark should not warn `metadata directory` for a HDFS file path #20713

dongjoon-hyun · 2018-03-02T04:28:00Z

What changes were proposed in this pull request?

In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it logs a misleading warning while looking up `people.json/_spark_metadata`. The root cause is a difference between `LocalFileSystem` and `DistributedFileSystem`: `LocalFileSystem.exists()` returns `false`, but `DistributedFileSystem.exists()` raises `org.apache.hadoop.security.AccessControlException`.

scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.read.json("hdfs:///tmp/people.json")
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.
18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory.

After this PR,

scala> spark.read.json("hdfs:///tmp/people.json").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
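
The difference between the two `exists()` behaviours described above can be sketched without any Hadoop dependency. Everything in this block (`SimpleFs`, `AccessDenied`, `LocalLikeFs`, `HdfsLikeFs`, `MetadataProbe`) is a hypothetical stand-in invented for illustration, not the Hadoop or Spark API, and the guard shown is one possible way to avoid the probe, not necessarily what this patch does.

```scala
import scala.util.Try

// Hypothetical minimal file-system interface (NOT org.apache.hadoop.fs.FileSystem).
trait SimpleFs {
  def isDirectory(path: String): Boolean
  def exists(path: String): Boolean
}

// Hypothetical stand-in for org.apache.hadoop.security.AccessControlException.
final class AccessDenied(path: String)
  extends RuntimeException(s"Permission denied: $path")

// Models the local behaviour: probing a child of a plain file returns false.
object LocalLikeFs extends SimpleFs {
  def isDirectory(path: String): Boolean = false
  def exists(path: String): Boolean = path == "/tmp/people.json"
}

// Models the kerberized HDFS behaviour: probing a child of a plain file throws.
object HdfsLikeFs extends SimpleFs {
  def isDirectory(path: String): Boolean = false
  def exists(path: String): Boolean =
    if (path.startsWith("/tmp/people.json/")) throw new AccessDenied(path)
    else path == "/tmp/people.json"
}

object MetadataProbe {
  val MetadataDir = "_spark_metadata"

  // Unguarded lookup: always probes <path>/_spark_metadata, so the outcome
  // depends on how the file system reacts to a child path under a file.
  def unguarded(fs: SimpleFs, path: String): Try[Boolean] =
    Try(fs.exists(s"$path/$MetadataDir"))

  // Guarded lookup: only probes when the path is a directory, so a plain
  // file never triggers the failing call.
  def guarded(fs: SimpleFs, path: String): Boolean =
    fs.isDirectory(path) && fs.exists(s"$path/$MetadataDir")
}
```

With these stand-ins, `MetadataProbe.unguarded(HdfsLikeFs, "/tmp/people.json")` is a `Failure`, while `MetadataProbe.guarded` returns `false` for both file systems because the child path is never probed.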

How was this patch tested?

Manual.

…DFS file path

Author: Dongjoon Hyun

Closes #20616 from dongjoon-hyun/SPARK-23434.

SparkQA · 2018-03-02T04:44:46Z

Test build #87872 has finished for PR 20713 at commit fd538ca.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-03-02T04:47:37Z

Retest this please.

SparkQA · 2018-03-02T07:50:45Z

Test build #87875 has finished for PR 20713 at commit fd538ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-03-03T05:19:06Z

@cloud-fan and @zsxwing, this is a backport of #20616.

dongjoon-hyun · 2018-03-05T16:40:40Z

Retest this please.

SparkQA · 2018-03-05T19:46:53Z

Test build #87966 has finished for PR 20713 at commit fd538ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…ory` for a HDFS file path

Author: Dongjoon Hyun

Closes #20713 from dongjoon-hyun/SPARK-23434-2.3.

cloud-fan · 2018-03-05T22:25:52Z

thanks, merging to 2.3!

dongjoon-hyun · 2018-03-06T00:03:00Z

Thank you!

dongjoon-hyun closed this Mar 6, 2018

dongjoon-hyun deleted the SPARK-23434-2.3 branch March 6, 2018 00:03
