
Conversation

@rxin (Contributor) commented Apr 7, 2016

What changes were proposed in this pull request?

This patch removes DirectParquetOutputCommitter. It was originally created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer is only safe when there are no failures: if there are multiple attempts of the same task (e.g. due to speculation, task failures, or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.

How was this patch tested?

Also removed the related tests.
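The correctness hazard described in the PR summary can be illustrated with a small, purely hypothetical sketch (plain Scala, no Spark APIs): a committer that writes part files directly to the final output location leaves orphaned files behind when a first task attempt fails mid-write and a retried attempt then succeeds.

```scala
import scala.collection.mutable

// Hypothetical sketch, not Spark code: simulates a "direct" committer that
// writes part files straight to the final output location, with no staging
// directory that could be discarded on task failure.
object DirectCommitSketch {
  // Writes up to `failAfter` part files for one task attempt, then "fails".
  def directWrite(finalDir: mutable.Set[String], attempt: Int,
                  parts: Seq[Int], failAfter: Int): Unit = {
    var i = 0
    while (i < parts.length && i < failAfter) {
      finalDir += s"part-${parts(i)}-attempt-$attempt"
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val finalDir = mutable.Set.empty[String]
    // Attempt 0 dies after writing 2 of 3 part files; attempt 1 succeeds.
    directWrite(finalDir, attempt = 0, parts = Seq(0, 1, 2), failAfter = 2)
    directWrite(finalDir, attempt = 1, parts = Seq(0, 1, 2), failAfter = 3)
    // The final directory now mixes attempt 0's orphans with attempt 1's
    // output: 5 part files where a reader expects 3.
    println(finalDir.size)
  }
}
```

A staging-based committer avoids this by writing each attempt into its own temporary directory and renaming only the winning attempt's files into place at commit time; the catch on S3 is that "rename" is itself a slow copy, which is what motivated the direct committer in the first place.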

@rxin (Contributor, Author) commented Apr 7, 2016

cc @davies

@davies (Contributor) commented Apr 7, 2016

LGTM

@SparkQA commented Apr 7, 2016

Test build #55182 has finished for PR 12229 at commit 8719c26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 7, 2016

Test build #55192 has finished for PR 12229 at commit c5de86b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Apr 7, 2016

Merging in master.

@asfgit closed this in 9ca0760 on Apr 7, 2016
    case e: org.apache.hadoop.fs.FileAlreadyExistsException =>
      if (outputCommitter.isInstanceOf[parquet.DirectParquetOutputCommitter]) {
      // SPARK-11382: DirectParquetOutputCommitter is not idempotent, meaning on retry
      if (outputCommitter.getClass.getName.contains("Direct")) {
A contributor commented on the diff above:

This is pretty brittle/ugly. Is there any other way, such as having an interface covering commit semantics? Then the code could just do .isInstanceOf[NonAtomicCommitter].
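The interface the reviewer suggests could look roughly like the sketch below. All of these type names (NonAtomicCommitter and the stub committers) are hypothetical illustrations, not actual Spark or Hadoop APIs:

```scala
// Hypothetical sketch of the suggestion above: express commit semantics as a
// marker trait so call sites test for a capability instead of matching on
// class-name strings. None of these types exist in Spark or Hadoop.
object CommitSemantics {
  trait OutputCommitter { def commitTask(): Unit }

  // Marker: task commit is not atomic, so retried or speculative attempts
  // can leave corrupt or duplicate output behind.
  trait NonAtomicCommitter extends OutputCommitter

  class DirectCommitter extends NonAtomicCommitter {
    def commitTask(): Unit = () // data already sits in the final location
  }

  class StagingCommitter extends OutputCommitter {
    def commitTask(): Unit = () // rename staged files into place
  }

  // Replaces the brittle getClass.getName.contains("Direct") check.
  def safeToRetry(c: OutputCommitter): Boolean =
    !c.isInstanceOf[NonAtomicCommitter]
}
```

The design point is that the capability travels with the committer class itself, so new third-party committers opt in to the marker rather than hoping their class name matches a substring.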

@mortada (Contributor) commented Aug 22, 2016

@rxin so it seems DirectParquetOutputCommitter has been removed in Spark 2.0; is there a recommended replacement?

(I'm in the process of migrating from Spark 1.6 to 2.0.)

