Skip to content

Conversation

@vitillo
Copy link
Contributor

@vitillo vitillo commented Feb 21, 2017

What changes were proposed in this pull request?

HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the implementation notes of rename(), the behavior of the local filesystem and HDFS varies:

Destination exists and is a file
Renaming a file atop an existing file is specified as failing, raising an exception.

  • Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
  • HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.

This patch ensures that rename() isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.

How was this patch tested?

This patch was tested by running StateStoreSuite.

@vitillo
Copy link
Contributor Author

vitillo commented Feb 21, 2017

@tdas @zsxwing

@zsxwing
Copy link
Member

zsxwing commented Feb 22, 2017

The fix looks good to me. But the test doesn't test anything. Right?

@vitillo
Copy link
Contributor Author

vitillo commented Feb 22, 2017

@zsxwing The test is there just to make sure the new code path is executed during tests. It probably makes more sense to create a fake filesystem type that adheres to the rename semantic of HDFS to test this.

@vitillo
Copy link
Contributor Author

vitillo commented Feb 23, 2017

@zsxwing I have rewritten the test code.

@hejix
Copy link

hejix commented Feb 24, 2017

@zsxwing Is it possible for you to give some approximate timeframe for when this might be accepted? A week or month? Since we are blocked in our spark streaming adoption (this HDFS error happens not only in restart but also in first run of our more complex aggregating drivers), I don't know if I should wait for this fix to make it into the nightly build soon or build a custom dist ourselves and update our automation. Any input as to the fix timeline would be greatly appreciated!

Also this fix handles the same issue as SPARK-19645 which has a different solution. Is this simple solution sufficient or the more complex fix in #16980 required?

Copy link
Member

@zsxwing zsxwing Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we should not delete the file. Otherwise, it will break speculation. E.g., there is also a speculation ask running. It wrote successfully and just reported to the driver that the task was successful. At the same time, another task just deleted the final file wrongly because the file exists.

The correct way is just skipping rename. We can depend on the basic assumption of Structured Streaming that the same batch should contain the same data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@vitillo vitillo force-pushed the fix_rename branch 2 times, most recently from 7d44680 to 530c027 Compare February 27, 2017 14:54
@hejix
Copy link

hejix commented Feb 28, 2017

Just some feedback that I did some initial regression testing with this pull request on a full YARN (v2.7.3) 4 node cluster on GCP and it appears to have fixed the two issues we had- our structured streaming drivers now restart normally and our complex aggregation driver runs for the first time.

@mridulm
Copy link
Contributor

mridulm commented Feb 28, 2017

What is the proposed semantics from this PR now ?

  • If file exists, ignore.
  • If file does not exist, try to rename - if fails, throw exception.

Is this right ? If yes, the PR title/description should be changed.

@zsxwing
Copy link
Member

zsxwing commented Feb 28, 2017

ok to test

@zsxwing
Copy link
Member

zsxwing commented Feb 28, 2017

Is this right ? If yes, the PR title/description should be changed.

+1

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73548 has finished for PR 17012 at commit 530c027.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add our discussion to the comment? Such as

      // scalastyle:off
      // Renaming a file atop an existing one fails on HDFS
      // (http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html).
      // Hence we should either skip the rename step or delete the target file. Because deleting the
      // target file will beak speculation, skipping the rename step is the only choice. It's still
      // semantically correct because Structured Streaming requires rerunning a batch should
      // generate the same output. (SPARK-19677)
      // scalastyle:on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: SPARK-19677: Renaming a file atop an existing one on HDFS

@zsxwing
Copy link
Member

zsxwing commented Feb 28, 2017

Just two minor comments. Otherwise, LGTM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

@vitillo vitillo changed the title [SPARK-19677][SS] Renaming a file atop an existing one should not fail on HDFS [SPARK-19677][SS] Committing a delta file atop an existing one should not fail on HDFS Feb 28, 2017
@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73589 has finished for PR 17012 at commit de228a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member

zsxwing commented Feb 28, 2017

LGTM. Thanks! Merging to master, 2.1 and 2.0.

asfgit pushed a commit that referenced this pull request Feb 28, 2017
… not fail on HDFS

## What changes were proposed in this pull request?

HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies:

> Destination exists and is a file
> Renaming a file atop an existing file is specified as failing, raising an exception.
>    - Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
>    - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.

This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.

## How was this patch tested?

This patch was tested by running `StateStoreSuite`.

Author: Roberto Agostino Vitillo <[email protected]>

Closes #17012 from vitillo/fix_rename.
@asfgit asfgit closed this in 9734a92 Feb 28, 2017
asfgit pushed a commit that referenced this pull request Feb 28, 2017
… not fail on HDFS

## What changes were proposed in this pull request?

HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies:

> Destination exists and is a file
> Renaming a file atop an existing file is specified as failing, raising an exception.
>    - Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
>    - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.

This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output.

## How was this patch tested?

This patch was tested by running `StateStoreSuite`.

Author: Roberto Agostino Vitillo <[email protected]>

Closes #17012 from vitillo/fix_rename.

(cherry picked from commit 9734a92)
Signed-off-by: Shixiong Zhu <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 3, 2017
…treaming job

## What changes were proposed in this pull request?

[SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)

The PR (#17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but also exist a problem that a tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file generated when restart streaming job in future.

## How was this patch tested?
 unit tests

Author: guifeng <[email protected]>

Closes #17124 from gf53520/SPARK-19779.

(cherry picked from commit e24f21b)
Signed-off-by: Shixiong Zhu <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 3, 2017
…treaming job

## What changes were proposed in this pull request?

[SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)

The PR (#17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but also exist a problem that a tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file generated when restart streaming job in future.

## How was this patch tested?
 unit tests

Author: guifeng <[email protected]>

Closes #17124 from gf53520/SPARK-19779.

(cherry picked from commit e24f21b)
Signed-off-by: Shixiong Zhu <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 3, 2017
…treaming job

## What changes were proposed in this pull request?

[SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)

The PR (#17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but also exist a problem that a tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file generated when restart streaming job in future.

## How was this patch tested?
 unit tests

Author: guifeng <[email protected]>

Closes #17124 from gf53520/SPARK-19779.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants