
Conversation

@rajeshbalamohan

…les being opened

What changes were proposed in this pull request?

In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are only closed later as part of the close() call.
However, when a large number of spill files are present, the number of open files grows substantially and ends up throwing a "Too many open files" exception.
This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor.

There are ways to reduce the number of spill files generated in Q67, e.g. increasing "spark.sql.windowExec.buffer.spill.threshold" (the default is 4096). Another option is to raise the ulimit to a much higher value.
But those are workarounds.

This PR reduces the number of files that are kept open at a time in UnsafeSorterSpillReader, by deferring the file open until records are actually read; a rough sketch of the idea is shown below.
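For illustration only, here is a minimal sketch of that lazy-open pattern, with simplified and hypothetical names (not the actual UnsafeSorterSpillReader code): the reader remembers the spill file but opens it only when the first record is requested, and releases it on close().

```java
// Hypothetical sketch of lazy file opening in a spill reader (not Spark code).
import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class LazySpillReaderSketch implements Closeable {
  private final File file;
  private DataInputStream din;          // stays null until the first read
  private int numRecordsRemaining;

  LazySpillReaderSketch(File file, int numRecords) {
    this.file = file;
    this.numRecordsRemaining = numRecords;
  }

  boolean hasNext() {
    return numRecordsRemaining > 0;
  }

  int loadNext() throws IOException {
    if (din == null) {
      // Open the underlying file only when the first record is actually needed.
      din = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
    }
    numRecordsRemaining--;
    return din.readInt();               // placeholder for reading one record
  }

  @Override
  public void close() throws IOException {
    if (din != null) {
      din.close();
      din = null;
    }
  }
}
```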

How was this patch tested?

Manual testing of Q67 in 1 TB and 10 TB scale on multi node cluster.

@viirya
Member

viirya commented Sep 11, 2017

Hmm, shouldn't we just change the system config to increase the open file limit?

@rajeshbalamohan
Author

I ran into this with a limit of 32K. Setting it to "unlimited" is another possible workaround, but that may not be preferable in production systems. For example, with Q67 I observed 9000+ spill files in a single task, and with multiple tasks per executor it easily reached the limit.

if (this.din == null) {
// Good time to init (if all files are opened, we can get Too Many files exception)
initStreams();
}
Member

Can this solve the too many open files issue? When we merge the readers, it is possible that all the readers in the priority queue still have records and are asked for records (so their files are open). You can still encounter the too many open files issue.
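(For illustration, a minimal sketch of the k-way merge pattern being described, using a hypothetical reader interface rather than Spark's actual classes: every reader sits in the priority queue holding a current record, so every reader's file ends up open at the same time.)

```java
// Hypothetical sketch: a priority-queue k-way merge keeps a current record per
// reader, so every participating spill file ends up open simultaneously.
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

interface SpillReaderSketch {
  boolean hasNext();
  long peekKey();   // key of the current record; requires the file to be open
  long next();      // return the current record and advance
}

class PriorityQueueMergeSketch {
  static void mergeAll(List<SpillReaderSketch> readers) {
    PriorityQueue<SpillReaderSketch> queue =
        new PriorityQueue<>(Comparator.comparingLong(SpillReaderSketch::peekKey));
    for (SpillReaderSketch r : readers) {
      if (r.hasNext()) {
        queue.add(r);              // inserting forces each reader (and its file) open
      }
    }
    while (!queue.isEmpty()) {
      SpillReaderSketch smallest = queue.poll();
      long record = smallest.next();   // consume the smallest current record
      if (smallest.hasNext()) {
        queue.add(smallest);       // reader re-enters the queue, its file stays open
      }
    }
  }
}
```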

Author

Good point. The PR has been tried with queries involving window functions (e.g. Q67), for which it worked fine.

During spill merges (especially getSortedIterator), it is still possible to hit the too many open files issue.

Member

I think you first need to describe in more detail how this issue is fixed in the description.

Contributor

I agree with @viirya. We're using a priority queue to do the merge sort, so all the readers in the priority queue end up being opened; this still cannot solve the issue.

I think a valid fix is to control the number of concurrently merged files, like MR's io.sort.factor (a rough sketch of that approach follows below).

We also still need to address the similar issue in ExternalSorter and other places in shuffle.
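(As an editorial illustration of the io.sort.factor-style approach mentioned above, here is a rough sketch with hypothetical helper names, not Spark code: merge at most `mergeFactor` spill files at a time into an intermediate file, so the number of concurrently open files stays bounded no matter how many spills exist.)

```java
// Hypothetical sketch of a bounded-fan-in, multi-pass merge (not Spark code).
import java.io.File;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MultiPassMergeSketch {
  static File mergeWithFactor(List<File> spills, int mergeFactor) throws IOException {
    Deque<File> pending = new ArrayDeque<>(spills);
    while (pending.size() > 1) {
      List<File> batch = new ArrayList<>();
      while (batch.size() < mergeFactor && !pending.isEmpty()) {
        batch.add(pending.poll());
      }
      // Only batch.size() (<= mergeFactor) files are open during this pass.
      pending.add(mergeOnce(batch));
    }
    return pending.poll();
  }

  // Placeholder: k-way merge of a small batch into one sorted intermediate file.
  private static File mergeOnce(List<File> batch) throws IOException {
    File out = File.createTempFile("merged-spill", ".tmp");
    // ... open the batch, merge records with a priority queue, write `out` ...
    return out;
  }
}
```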

Member

IIUC, this PR does not reduce the total number of open files. Since it opens files only when they are required, it may reduce the possibility of hitting a too many open files error.

As @viirya pointed out, it is necessary to provide a feature to control the number of files open at one point (e.g. in the priority queue).

Contributor

A proper fix should introduce a new config to control the number of concurrently opened spill files; that also means using some data structure to keep track of requests to open spill files.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81614 has finished for PR 19184 at commit dcc2960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @cloud-fan @jiangxb1987

bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
}

try (InputStream bs = new NioBufferedFileInputStream(file, (int) bufferSizeBytes);
Member

@viirya viirya Sep 11, 2017

Please add a comment here to say that we don't need to hold the file open until we actually want to load the records, so we can partially prevent the too many open files issue.

this.blockId = blockId;
this.serializerManager = serializerManager;

logger.debug("bufSize: {}, file: {}, records: {}", buffSize, file, this.numRecords);
Member

Is this log useful? If there are that many spill readers, I guess we don't want to see that many log entries.

taskContext.killTaskIfInterrupted();
}
if (this.din == null) {
// Good time to init (if all files are opened, we can get Too Many files exception)
Member

This comment reads confusingly. Maybe something like: "It is time to initialize and hold the input stream of the spill file for loading records. Keeping the input stream open too early will very likely hit the too many open files issue."

@rajeshbalamohan
Author

Thanks @viirya. I have updated the patch to address your comments.

This fixes the "too many open files" issue for queries involving window functions (e.g. Q67, Q72, Q14), but the issue still needs to be addressed for the merger. Agreed that this would be a partial patch.

@viirya
Member

viirya commented Sep 11, 2017

@rajeshbalamohan Thanks for updating. I think we need a complete fix instead of a partial one, as the previous comments from reviewers @jerryshao @kiszk @jiangxb1987 suggested. Can you try to fix this according to those comments? Thanks.

@SparkQA

SparkQA commented Sep 11, 2017

Test build #81628 has finished for PR 19184 at commit ea5f9d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Sep 24, 2017

@viirya @jerryshao To take a step back here.

This specific issue is applicable to window operations and not to shuffle.

In shuffle, a much larger volume of data is written per file, vs. 4k records per file for a window operation.

To get to 9k files with shuffle, you are typically processing a TB or more of data per shuffle task (unless the executor is starved for memory and spilled a large number of files).

On the other hand, with a 4k window size (the default in Spark), getting to 9k files is possible within a single task.

From what I see, there is actually no functional/performance reason to keep all the files open, unlike in shuffle.
Having said that, there is an additional cost we pay with this approach.
With N files, we incur an additional cost of N * cost(open + read int + close).
Practically though, the impact is much lower compared to the current code (since re-open + read will be disk-read friendly).

While getting it fixed for all cases would be ideal, the solution for window operations does not transfer to shuffle (and vice versa) due to the difference in how files are used in each.

In case I missed something here, please let me know.

@jerryshao
Contributor

Hi @mridulm, sorry for the late response. I agree with you that the scenario here is different from shuffle, but the underlying structure and the way spilled data is handled are the same, so the problem is the same. On the shuffle side we can control the memory size to hold more data before spilling and so avoid too many spills, but as you mentioned, here we cannot do that.

Yes, it is not necessary to open all the files beforehand. But since we're using a priority queue to do the merge sort, all the file handles will very likely end up opened, so this fix only reduces the chances of hitting the too many files issue. Maybe we can call this an interim fix, what do you think?

@mridulm
Contributor

mridulm commented Sep 27, 2017

@jerryshao Actually the second half of your comment does not apply in this case.
The PR is not targeting the merge sort here; it is relevant when iterating over all tuples.

UnsafeExternalSorter has two methods to iterate over the tuples.
You are referring to getSortedIterator, which uses a PriorityQueue and requires all files to be open at the same time (so that it can return a sorted iterator).

The primary use case of this PR is getIterator, where we are simply iterating over all tuples, as ExternalAppendOnlyUnsafeRowArray does for example; there is no need to sort here.
This is used by various WindowFunctionFrame implementations, for example.

So this fix is orthogonal to whether we improve sort shuffle or not - the requirement is just to get at all tuples. If/when we do improve merge sort, ExternalAppendOnlyUnsafeRowArray would still work differently.
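(For illustration, a minimal sketch of the getIterator-style access pattern with a hypothetical reader interface, not Spark's actual classes: spill readers are drained one after another, so with lazy opening only the reader currently being consumed needs its file open.)

```java
// Hypothetical sketch: chained sequential iteration over spill readers keeps at
// most one spill file open at a time (assuming each reader opens its file lazily).
import java.util.Iterator;
import java.util.Queue;

class ChainedIteratorSketch implements Iterator<Long> {
  interface SpillRecords extends Iterator<Long>, AutoCloseable {
    @Override void close();           // closes the underlying spill file, if opened
  }

  private final Queue<SpillRecords> readers;   // one entry per spill file
  private SpillRecords current;

  ChainedIteratorSketch(Queue<SpillRecords> readers) {
    this.readers = readers;
    this.current = readers.poll();
  }

  @Override
  public boolean hasNext() {
    while (current != null && !current.hasNext()) {
      current.close();                // done with this spill file; release its handle
      current = readers.poll();       // the next reader opens its file lazily
    }
    return current != null;
  }

  @Override
  public Long next() {
    hasNext();                        // advance to a reader that still has records
    return current.next();
  }
}
```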

@jerryshao
Contributor

After discussing with @mridulm offline: though the patch here cannot address the issue for getSortedIterator, which uses a PriorityQueue, it does solve the problem for getIterator(...), which doesn't require a merge. In this specific case, getIterator is used in ExternalAppendOnlyUnsafeRowArray and hits the too many open files issue.

So this fix solves the problem for getIterator, and does no harm to getSortedIterator. Maybe we could accept it as a partial/point fix here.

What do you think @viirya @kiszk @maropu @jiangxb1987 ?

@mridulm
Contributor

mridulm commented Sep 27, 2017

Thanks to @jerryshao for pointing me to SPARK-21595.
The tests @rajeshbalamohan ran were against a version that did not include the changes from SPARK-21595; unfortunately my local repo was not updated to reflect this either.

Given the changes in #18843, this PR is no longer relevant.
Can you close this, @rajeshbalamohan? Thanks.

In your tests, you can set the threshold to 512M, since that is the value going forward in Spark 2.3.

@viirya
Member

viirya commented Sep 27, 2017

Thanks @jerryshao and @mridulm for investigating this further. That is very reasonable. I think we don't need this fix, as spills won't be too frequent in window operations now.

@rajeshbalamohan
Author

Thanks @mridulm, @jerryshao, @viirya. Closing this PR.
