
Conversation

@rajeshbalamohan

…les being opened

What changes were proposed in this pull request?

In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are only closed later as part of the close() call.
However, when a large number of spill files are present, the number of open files grows substantially and ends up throwing a "Too many open files" exception.
This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor.

There are ways to reduce the number of spill files generated in Q67, e.g. increasing "spark.sql.windowExec.buffer.spill.threshold" (the default is 4096). Another option is to raise the ulimit to a much higher value.
But those are workarounds.

This PR reduces the number of files that are kept open at a time in UnsafeSorterSpillReader, by deferring the file open until records are actually read; a rough sketch of the idea is shown below.
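For illustration only, here is a minimal sketch of that lazy-open pattern, with simplified and hypothetical names (not the actual UnsafeSorterSpillReader code): the reader remembers the spill file but opens it only when the first record is requested, and releases it on close().

```java
// Hypothetical sketch of lazy file opening in a spill reader (not Spark code).
import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class LazySpillReaderSketch implements Closeable {
  private final File file;
  private DataInputStream din;          // stays null until the first read
  private int numRecordsRemaining;

  LazySpillReaderSketch(File file, int numRecords) {
    this.file = file;
    this.numRecordsRemaining = numRecords;
  }

  boolean hasNext() {
    return numRecordsRemaining > 0;
  }

  int loadNext() throws IOException {
    if (din == null) {
      // Open the underlying file only when the first record is actually needed.
      din = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
    }
    numRecordsRemaining--;
    return din.readInt();               // placeholder for reading one record
  }

  @Override
  public void close() throws IOException {
    if (din != null) {
      din.close();
      din = null;
    }
  }
}
```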

How was this patch tested?

Manual testing of Q67 in 1 TB and 10 TB scale on multi node cluster.

@viirya
Member

viirya commented Sep 11, 2017

Hmm, shouldn't we just change the system config to increase the open file limit?

@rajeshbalamohan
Author

I ran into this with a limit of 32K. Setting it to "unlimited" is another possible workaround, but that may not be preferable in production systems. For example, with Q67 I observed 9000+ spill files in a single task, and with multiple tasks per executor it easily reached the limit.

if (this.din == null) {
// Good time to init (if all files are opened, we can get Too Many files exception)
initStreams();
}
Member

Can this solve the too many open files issue? When we merge the readers, it is possible that all the readers in the priority queue still have records and are asked for records (so their files are open). You can still encounter the too many open files issue.
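(For illustration, a minimal sketch of the k-way merge pattern being described, using a hypothetical reader interface rather than Spark's actual classes: every reader sits in the priority queue holding a current record, so every reader's file ends up open at the same time.)

```java
// Hypothetical sketch: a priority-queue k-way merge keeps a current record per
// reader, so every participating spill file ends up open simultaneously.
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

interface SpillReaderSketch {
  boolean hasNext();
  long peekKey();   // key of the current record; requires the file to be open
  long next();      // return the current record and advance
}

class PriorityQueueMergeSketch {
  static void mergeAll(List<SpillReaderSketch> readers) {
    PriorityQueue<SpillReaderSketch> queue =
        new PriorityQueue<>(Comparator.comparingLong(SpillReaderSketch::peekKey));
    for (SpillReaderSketch r : readers) {
      if (r.hasNext()) {
        queue.add(r);              // inserting forces each reader (and its file) open
      }
    }
    while (!queue.isEmpty()) {
      SpillReaderSketch smallest = queue.poll();
      long record = smallest.next();   // consume the smallest current record
      if (smallest.hasNext()) {
        queue.add(smallest);       // reader re-enters the queue, its file stays open
      }
    }
  }
}
```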

Author

Good point. The PR has been tried with queries involving window functions (e.g. Q67), for which it worked fine.

During spill merges (especially getSortedIterator), it is still possible to hit the too many open files issue.

Member

I think you first need to describe in more detail how this issue is fixed in the description.

Contributor

I agree with @viirya. We're using a priority queue to do the merge sort, so all the readers in the priority queue end up being opened; this still cannot solve the issue.

I think a valid fix is to control the number of concurrently merged files, like MR's io.sort.factor (a rough sketch of that approach follows below).

We also still need to address the similar issue in ExternalSorter and other places in shuffle.
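(As an editorial illustration of the io.sort.factor-style approach mentioned above, here is a rough sketch with hypothetical helper names, not Spark code: merge at most `mergeFactor` spill files at a time into an intermediate file, so the number of concurrently open files stays bounded no matter how many spills exist.)

```java
// Hypothetical sketch of a bounded-fan-in, multi-pass merge (not Spark code).
import java.io.File;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MultiPassMergeSketch {
  static File mergeWithFactor(List<File> spills, int mergeFactor) throws IOException {
    Deque<File> pending = new ArrayDeque<>(spills);
    while (pending.size() > 1) {
      List<File> batch = new ArrayList<>();
      while (batch.size() < mergeFactor && !pending.isEmpty()) {
        batch.add(pending.poll());
      }
      // Only batch.size() (<= mergeFactor) files are open during this pass.
      pending.add(mergeOnce(batch));
    }
    return pending.poll();
  }

  // Placeholder: k-way merge of a small batch into one sorted intermediate file.
  private static File mergeOnce(List<File> batch) throws IOException {
    File out = File.createTempFile("merged-spill", ".tmp");
    // ... open the batch, merge records with a priority queue, write `out` ...
    return out;
  }
}
```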

Member

IIUC, this PR does not reduce the total number of open files. Since it opens files only when they are required, it may reduce the possibility of hitting a too many open files error.

As @viirya pointed out, it is necessary to provide a feature to control the number of files open at one point (e.g. in the priority queue).

Contributor

A proper fix should introduce a new config to control the number of concurrently opened spill files; that also means using some data structure to keep track of requests to open spill files.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81614 has finished for PR 19184 at commit dcc2960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @cloud-fan @jiangxb1987

bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
}

try (InputStream bs = new NioBufferedFileInputStream(file, (int) bufferSizeBytes);
Member

@viirya viirya Sep 11, 2017

Please add a comment here to say that we don't need to hold the file open until we actually want to load the records, so we can partially prevent the too many open files issue.

this.blockId = blockId;
this.serializerManager = serializerManager;

logger.debug("bufSize: {}, file: {}, records: {}", buffSize, file, this.numRecords);
Member

Is this log useful? If there are that many spill readers, I guess we don't want to see that many log entries.

taskContext.killTaskIfInterrupted();
}
if (this.din == null) {
// Good time to init (if all files are opened, we can get Too Many files exception)
Member

This comment reads confusingly. Maybe something like: "It is time to initialize and hold the input stream of the spill file for loading records. Keeping the input stream open too early will very likely hit the too many open files issue."

@rajeshbalamohan
Author

Thanks @viirya. I have updated the patch to address your comments.

This fixes the "too many open files" issue for queries involving window functions (e.g. Q67, Q72, Q14), but the issue still needs to be addressed for the merger. Agreed that this would be a partial patch.

@viirya
Member

viirya commented Sep 11, 2017

@rajeshbalamohan Thanks for updating. I think we need a complete fix instead of a partial one, as the previous comments from reviewers @jerryshao @kiszk @jiangxb1987 suggested. Can you try to fix this according to those comments? Thanks.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81628 has finished for PR 19184 at commit ea5f9d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Sep 24, 2017

@viirya @jerryshao To take a step back here.

This specific issue is applicable to window operations and not to shuffle.

In shuffle, a much larger volume of data is written per file, vs. 4k records per file for a window operation.

To get to 9k files with shuffle, you are typically processing a TB or more of data per shuffle task (unless the executor is starved for memory and spilled a large number of files).

On the other hand, with a 4k window size (the default in Spark), getting to 9k files is possible within a single task.

From what I see, there is actually no functional/performance reason to keep all the files open, unlike in shuffle.
Having said that, there is an additional cost we pay with this approach.
With N files, we incur an additional cost of N * cost(open + read int + close).
Practically though, the impact is much lower compared to the current code (since re-open + read will be disk-read friendly).

While getting it fixed for all cases would be ideal, the solution for window operations does not transfer to shuffle (and vice versa) due to the difference in how files are used in each.

In case I missed something here, please let me know.

@jerryshao
Contributor

Hi @mridulm, sorry for the late response. I agree with you that the scenario here is different from shuffle, but the underlying structure and the way spilled data is handled are the same, so the problem is the same. On the shuffle side we can control the memory size to hold more data before spilling and so avoid too many spills, but as you mentioned, here we cannot do that.

Yes, it is not necessary to open all the files beforehand. But since we're using a priority queue to do the merge sort, all the file handles will very likely end up opened, so this fix only reduces the chances of hitting the too many files issue. Maybe we can call this an interim fix, what do you think?

@mridulm
Contributor

mridulm commented Sep 27, 2017

@jerryshao Actually the second half of your comment does not apply in this case.
The PR is not targeting the merge sort here; it is relevant when iterating over all tuples.

UnsafeExternalSorter has two methods to iterate over the tuples.
You are referring to getSortedIterator, which uses a PriorityQueue and requires all files to be open at the same time (so that it can return a sorted iterator).

The primary use case of this PR is getIterator, where we are simply iterating over all tuples, as ExternalAppendOnlyUnsafeRowArray does for example; there is no need to sort here.
This is used by various WindowFunctionFrame implementations, for example.

So this fix is orthogonal to whether we improve sort shuffle or not - the requirement is just to get at all tuples. If/when we do improve merge sort, ExternalAppendOnlyUnsafeRowArray would still work differently.
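(For illustration, a minimal sketch of the getIterator-style access pattern with a hypothetical reader interface, not Spark's actual classes: spill readers are drained one after another, so with lazy opening only the reader currently being consumed needs its file open.)

```java
// Hypothetical sketch: chained sequential iteration over spill readers keeps at
// most one spill file open at a time (assuming each reader opens its file lazily).
import java.util.Iterator;
import java.util.Queue;

class ChainedIteratorSketch implements Iterator<Long> {
  interface SpillRecords extends Iterator<Long>, AutoCloseable {
    @Override void close();           // closes the underlying spill file, if opened
  }

  private final Queue<SpillRecords> readers;   // one entry per spill file
  private SpillRecords current;

  ChainedIteratorSketch(Queue<SpillRecords> readers) {
    this.readers = readers;
    this.current = readers.poll();
  }

  @Override
  public boolean hasNext() {
    while (current != null && !current.hasNext()) {
      current.close();                // done with this spill file; release its handle
      current = readers.poll();       // the next reader opens its file lazily
    }
    return current != null;
  }

  @Override
  public Long next() {
    hasNext();                        // advance to a reader that still has records
    return current.next();
  }
}
```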

@jerryshao
Contributor

After discussing with @mridulm offline: though the patch here cannot address the issue for getSortedIterator, which uses a PriorityQueue, it does solve the problem for getIterator(...), which doesn't require a merge. In this specific case, getIterator is used in ExternalAppendOnlyUnsafeRowArray and hits the too many open files issue.

So this fix solves the problem for getIterator, and does no harm to getSortedIterator. Maybe we could accept it as a partial/point fix here.

What do you think @viirya @kiszk @maropu @jiangxb1987 ?

@mridulm
Contributor

mridulm commented Sep 27, 2017

Thanks to @jerryshao for pointing me to SPARK-21595.
The tests @rajeshbalamohan ran were against a version that did not include the changes from SPARK-21595; unfortunately my local repo was not updated to reflect this either.

Given the changes in #18843, this PR is no longer relevant.
Can you close this, @rajeshbalamohan? Thanks.

In your tests, you can set the threshold to 512M, since that is the value going forward in Spark 2.3.

@viirya
Member

viirya commented Sep 27, 2017

Thanks @jerryshao and @mridulm for investigating this further. That is very reasonable. I think we don't need this fix, as spills won't be too frequent in window operations now.

@rajeshbalamohan
Author

Thanks @mridulm, @jerryshao, @viirya. Closing this PR.
