
Conversation

@MaxGekk (Member) commented Nov 15, 2018

What changes were proposed in this pull request?

In this PR, I propose to postpone creation of the `OutputStream`/`Univocity`/`JacksonGenerator` until the first row needs to be written. This prevents the creation of empty files for empty partitions, so there is no need to open and read such files back while loading data from the location.
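A minimal sketch of the lazy-initialization idea, assuming a simplified writer rather than Spark's actual `OutputWriter` classes:

```scala
import java.io.PrintWriter

// Hypothetical simplification: the underlying writer is created only when
// the first row arrives, so an empty partition never produces a file.
class LazyTextWriter(path: String) {
  private var writer: Option[PrintWriter] = None

  def write(line: String): Unit = {
    val w = writer.getOrElse {
      val newWriter = new PrintWriter(path) // the file is created here, not in the constructor
      writer = Some(newWriter)
      newWriter
    }
    w.println(line)
  }

  // Closing a never-opened writer is a no-op: no file, nothing to clean up.
  def close(): Unit = writer.foreach(_.close())
}
```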

How was this patch tested?

Added tests for the Text, JSON and CSV datasources in which an empty dataset is written and should not produce any files.

@SparkQA commented Nov 16, 2018

Test build #98887 has finished for PR 23052 at commit 6f3cb18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

@MaxGekk, actually this is kind of an important behaviour change. It basically means we're unable to read the empty files back. Similar changes were proposed for Parquet a few years ago (by me) and reverted.

We should investigate and match the behaviours across datasources first. IIRC, ORC does not create files (if that hasn't changed since I last checked long ago) but Parquet does.


```diff
- private val gen = new UnivocityGenerator(dataSchema, writer, params)
+ override def write(row: InternalRow): Unit = {
+   val gen = univocityGenerator.getOrElse {
```
Member:

Also, one thing we should not forget is that CSV can have headers even when there are no records.

Member Author:
I do think it is fine to write only the headers if a user wants to have them. Filtering the header out at this level could be slightly difficult.

@MaxGekk (Member, Author) commented Nov 16, 2018

> Similar changes were proposed for Parquet a few years ago (by me) and reverted.

What was the main reason for reverting it? If possible, could you give me a link to your PR?

@HyukjinKwon (Member)

Which should be ... this: #12855

@HyukjinKwon (Member)

One attempt to add some tests for reading/writing empty dataframes was here: #13253, FYI.

@HyukjinKwon (Member)

Another related attempt: #13252

@HyukjinKwon (Member)

I think now is a good time to match the behaviours.

@SparkQA commented Nov 21, 2018

Test build #99107 has finished for PR 23052 at commit 9501d01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 21, 2018

I have read the tickets you pointed out, but I haven't found anything that could potentially block the changes. One of the corner cases is saving an empty dataframe. In this case no files would be written, but that is ok for text-based datasources because, in any case, we cannot fully restore the schema from empty files (compared to Parquet files, where we can). So a schema must be provided by the user.
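A hedged illustration of that point through the public API (the path and schema are placeholders):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// With no files left behind, the reader cannot infer anything from the
// location; the schema has to be supplied explicitly.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val df = spark.read.schema(schema).json("/tmp/empty-out") // hypothetical path
assert(df.isEmpty) // still a valid, empty dataframe with the given schema
```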

@HyukjinKwon (Member)

@MaxGekk, I didn't mean to block this PR. Since we're going ahead to 3.0, it would be good to match and fix the behaviours across data sources. For instance, CSV should still be able to read the header. Shall we clarify each data source's behaviour?

@HyukjinKwon (Member)

Also, Parquet does not always write empty files. It does not write them when dataframes are created from an emptyRDD (the case pointed out in the PR link I gave). We should match this behaviour as well.

@HyukjinKwon (Member)

cc @cloud-fan as well

@cloud-fan (Contributor) commented Nov 22, 2018

First of all, sometimes we do need to write "empty" files so that we can infer the schema of a Parquet directory. An empty Parquet file is not really empty, as it has a header/footer. #20525 guarantees that we always write out at least one such file.

One important thing is that when we write an empty dataframe out to files and read it back, it should still be an empty dataframe. I'd suggest we skip reading empty files in text-based data sources, and later send a follow-up PR that stops writing empty text files, as a performance improvement.
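A sketch of that round-trip invariant in terms of the public API (the path is hypothetical):

```scala
// Writing an empty dataframe and reading the location back should still
// yield an empty dataframe, whether or not any files were created.
val empty = spark.range(0).selectExpr("id", "CAST(id AS STRING) AS name")
empty.write.mode("overwrite").json("/tmp/empty-roundtrip") // hypothetical path

val back = spark.read.schema(empty.schema).json("/tmp/empty-roundtrip")
assert(back.isEmpty)
```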

@HyukjinKwon (Member) commented Nov 22, 2018

There are two more things to deal with:

The #23052 (comment) comment is still valid; at the very least it should be double-checked, because dataframes originating from an emptyRDD never write anything.

The other is CSV among the text-based datasources, because it can write out headers. The thing is, the header is currently written only when the first row is written. I think the header should be written even for an empty dataset, because it is a kind of schema. This is what PR #13252 targeted before. I closed it because there was no interest, but we should fix it.

(Note that we can only read the header back with header=true and inferSchema=false.)
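An illustrative snippet of that read path (the path is hypothetical):

```scala
// Only this option combination reads the header row back as column names
// without inferring types from the (possibly absent) data rows.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .csv("/tmp/with-header.csv") // hypothetical path

df.printSchema() // column names come from the header; all types default to string
```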

@SparkQA commented Nov 24, 2018

Test build #99221 has finished for PR 23052 at commit 76e1466.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RuleSummary(
  • class QueryPlanningTracker
  • class QueryExecution(

```scala
    params: CSVOptions) extends OutputWriter with Logging {

  private val charset = Charset.forName(params.charset)

  private var univocityGenerator: Option[UnivocityGenerator] = None
```
Member:

Do we have a race condition below, then, where multiple generators can be created?

Member Author:

We have not observed any race conditions so far. Instances of UnivocityGenerator are created per task, as are OutputStreamWriters. They share instances of the schema and CSVOptions, but we do not modify those while writing. Inside each UnivocityGenerator we create an instance of CsvWriter, but I am almost absolutely sure they do not share anything internally.

Member:

I don't mean that it would cause an error, but that it could create many generators and writers that aren't closed, and it may not be obvious that it's happening. Unless we know writes will only happen in one thread, what about breaking out and synchronizing the get/create part of this method?
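A hedged sketch of that suggestion; the names mirror the snippet above, but this is an illustration, not the merged code:

```scala
// Synchronize the get-or-create step so that concurrent callers cannot
// each construct (and then leak) their own generator.
private def getOrCreateGen(): UnivocityGenerator = synchronized {
  univocityGenerator.getOrElse {
    val newGen = new UnivocityGenerator(dataSchema, writer, params)
    univocityGenerator = Some(newGen)
    newGen
  }
}
```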

Member Author:

> ... but that it could create many generators and writers that aren't closed.

Writers/generators are created inside tasks:

```scala
val dataWriter =
  if (sparkPartitionId != 0 && !iterator.hasNext) {
    // In case of empty job, leave first partition to save meta for file format like parquet.
    new EmptyDirectoryDataWriter(description, taskAttemptContext, committer)
  } else if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) {
    new SingleDirectoryDataWriter(description, taskAttemptContext, committer)
  } else {
    new DynamicPartitionDataWriter(description, taskAttemptContext, committer)
  }
try {
  Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
    // Execute the task to write rows out and commit the task.
    while (iterator.hasNext) {
      dataWriter.write(iterator.next())
    }
    dataWriter.commit()
  })(catchBlock = {
    // If there is an error, abort the task
    dataWriter.abort()
    logError(s"Job $jobId aborted.")
  })
} catch {
  case e: FetchFailedException =>
    throw e
  case t: Throwable =>
    throw new SparkException("Task failed while writing rows.", t)
}
```
where dataWriter.commit() and dataWriter.abort() close the writers/generators. So, at any moment, the number of unclosed generators is at most the size of the task thread pool on the executors.

> Unless we know writes will only happen in one thread ...

According to the comments below, this is our assumption:

```scala
/**
 * Abstract class for writing out data in a single Spark task.
 * Exceptions thrown by the implementation of this trait will automatically trigger task aborts.
 */
abstract class FileFormatDataWriter(
```

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 28, 2018

Test build #99354 has finished for PR 23052 at commit 76e1466.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 28, 2018

Test build #99361 has finished for PR 23052 at commit 76e1466.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

seems like a real failure

@MaxGekk (Member, Author) commented Nov 28, 2018

> seems like a real failure

I am looking at it. It seems the test is not deterministic.

@MaxGekk (Member, Author) commented Nov 28, 2018

Actually, it needs changes similar to those in #23130.

@SparkQA commented Nov 28, 2018

Test build #99372 has finished for PR 23052 at commit 586ab31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 28, 2018

jenkins, retest this, please

@SparkQA commented Nov 28, 2018

Test build #99380 has finished for PR 23052 at commit 586ab31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 29, 2018

Test build #99407 has finished for PR 23052 at commit 083d411.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 29, 2018

Merged to master

@asfgit closed this in 31c4fab on Nov 29, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…atasources

## What changes were proposed in this pull request?

In this PR, I propose to postpone creation of the `OutputStream`/`Univocity`/`JacksonGenerator` until the first row needs to be written. This prevents the creation of empty files for empty partitions, so there is no need to open and read such files back while loading data from the location.

## How was this patch tested?

Added tests for the Text, JSON and CSV datasources in which an empty dataset is written and should not produce any files.

Closes apache#23052 from MaxGekk/text-empty-files.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@MaxGekk deleted the text-empty-files branch on August 17, 2019 at 13:33
gengliangwang pushed a commit that referenced this pull request Nov 27, 2019
### What changes were proposed in this pull request?

This reverts commit 31c4fab (#23052) to make sure the partition calling `ManifestFileCommitProtocol.newTaskTempFile` creates an actual file.

This also reverts part of commit 0d3d46d (#26639), since that commit fixes the issue raised by 31c4fab and we're reverting back. The reason for the partial revert is that we found the UT worth keeping as it is, to prevent regression, given that the UT can detect the issue (empty partition -> no actual file). This makes one more change to the UT; it was intentionally moved to test both DSv1 and DSv2.

### Why are the changes needed?

After the changes in SPARK-26081 (commit 31c4fab / #23052), CSV/JSON/TEXT don't create an actual file if the partition is empty. This optimization causes a problem in `ManifestFileCommitProtocol`: the API `newTaskTempFile` is called without an actual file being created, and `fs.getFileStatus` then throws `FileNotFoundException` since the file does not exist.

SPARK-29999 (commit 0d3d46d / #26639) fixes the problem, but it is too costly to check file existence on every task commit. We should simply restore the behaviour from before SPARK-26081.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Jenkins build will follow.

Closes #26671 from HeartSaVioR/revert-SPARK-26081-SPARK-29999.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019