[SPARK-3570] Include time to open files in shuffle write time. #4550
Conversation
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one recent job, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (apache#2504), we should ensure that the file open time is also included there.
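For illustration, here is a minimal, self-contained sketch of the idea: time the open separately from the write and count both toward shuffle write time. This is plain Scala, not the actual Spark code; the metric variable and file names are made up.

```scala
import java.io.{File, FileOutputStream}

object OpenTimeSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the per-task shuffle write time metric that Spark tracks.
    var shuffleWriteTimeNanos = 0L

    val file = File.createTempFile("shuffle", ".data")

    // Opening the file writes filesystem metadata, so it can hit disk even
    // when the data writes themselves only hit the buffer cache.
    val openStart = System.nanoTime()
    val out = new FileOutputStream(file)
    shuffleWriteTimeNanos += System.nanoTime() - openStart

    // The data write itself, which was already counted before this patch.
    val writeStart = System.nanoTime()
    out.write(new Array[Byte](4096))
    out.close()
    shuffleWriteTimeNanos += System.nanoTime() - writeStart

    println(s"shuffle write time including open: ${shuffleWriteTimeNanos / 1000} us")
    file.delete()
  }
}
```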
Test build #27317 has finished for PR 4550 at commit
We should be consistent with parens
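(Presumably this refers to mixing the paren and paren-less forms of the call; the squashed commit below notes "Fixed parens for nanotime". A trivial, made-up illustration of picking one form and sticking with it:)

```scala
// Consistent: always call nanoTime with parentheses ...
val openStartTime = System.nanoTime()
// ... rather than mixing in the paren-less form elsewhere in the same file:
// val writeStartTime = System.nanoTime
```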
Haha @rxin just chatted me the same thing. FYI I realized I missed some file accesses here to add to the shuffle write time; currently testing on a cluster to make sure I got them all and will update this patch accordingly.
The immediate goal here is to only track time spent opening the final shuffle files, i.e. not to include time spent opening spill files that will later be merged, right? If that's the case, we should track the time in
@sryza spillToPartitionFiles is this fake spill case: when there are at most bypassMergeThreshold (default 200) partitions, the sort shuffle "pretends" to be like the hash shuffle by first writing all of the data to per-partition files, and then combining all of the files at the end. My understanding is that we don't consider this a "spill" because it's different from the other spill case (where we actually need to spill because we've run out of space in memory) in that we always "spill" all of the data when there are at most bypassMergeThreshold partitions (and this is kind of a hack that we don't want to expose to the user). We already include the time to write this "spill" data in the shuffle write time, so including the time to open those files is consistent with that. I was first thinking that we should also track the time to open the file in writePartitionedFile, but I did some measurement on a small cluster and found that the time to open the file in writePartitionedFile ends up being basically insignificant, because it's only one file (compared to spillToPartitionedFiles, where we may open up to 200 files, and it can take a long time when the disk is busy). I did some measurements of this that I wrote up in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570
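A rough, self-contained sketch of why that per-partition path is the one worth instrumenting (plain Scala, illustrative file names; the 200 here just mirrors the default bypassMergeThreshold and this is not the Spark code itself):

```scala
import java.io.{File, FileOutputStream}

object PerPartitionOpenSketch {
  def main(args: Array[String]): Unit = {
    // Mirrors the default bypassMergeThreshold; in the "fake spill" path one
    // file per partition is opened, so open times accumulate per task.
    val numPartitions = 200
    var openTimeNanos = 0L

    val files = (0 until numPartitions).map { p =>
      val f = File.createTempFile(s"shuffle-part-$p-", ".data")
      val start = System.nanoTime()
      val out = new FileOutputStream(f)   // one open per partition, timed individually
      openTimeNanos += System.nanoTime() - start
      out.close()
      f
    }

    // On a contended disk each open can take milliseconds, so 200 of them can
    // dominate the shuffle write time even when the writes themselves are cheap.
    println(s"total open time for $numPartitions files: ${openTimeNanos / 1e6} ms")
    files.foreach(_.delete())
  }
}
```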
That makes sense. When we're at or below the bypass merge threshold, measuring the time spent opening the per-partition files instead of the merged file seems reasonable. However, this patch still doesn't add support for measuring the time spent opening files in the main path, where we're above the threshold (>200 partitions). Unless I'm misunderstanding?
Yeah, the reason I didn't add it in the main path is that there we only open one file at a time, so the open time tends to be so fast that it's not even clear we can measure it accurately (it's on the order of tens of microseconds; this is what's discussed in the JIRA I linked above). I can add it if you think it's confusing not to have it (or maybe I should just add a comment in the code?), but because it's so fast, I don't think it makes sense to add to the time.
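To make the trade-off concrete, one possible shape for such a comment at the point where the single merged output file is opened (paraphrased, not the exact wording that ended up in the patch; the snippet is a standalone illustration rather than the real writePartitionedFile code):

```scala
import java.io.{File, FileOutputStream}

object MainPathOpenSketch {
  def main(args: Array[String]): Unit = {
    val outputFile = File.createTempFile("shuffle-merged-", ".data")

    // Intentionally not counted in shuffle write time: only one file is opened
    // here per task, so the open is on the order of tens of microseconds and is
    // negligible next to the per-partition path, which may open up to
    // bypassMergeThreshold files.
    val out = new FileOutputStream(outputFile, /* append = */ true)

    out.close()
    outputFile.delete()
  }
}
```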
Ah I see. Is it possible that we might hit higher latencies for the main path when there's high contention (more cores than disks)? If that seems unlikely, sounds fine to me to just add a comment.
Test build #27543 has finished for PR 4550 at commit
LGTM
@rxin would you mind taking a look at this, as someone familiar with the shuffle? I noted some measurements of the effect this has on Shuffle Write Time in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570 Thanks for taking a look @sryza!
PS: see the more straightforward counterpart #4965, which I think this could be merged to. LGTM.
Merging into master and 1.3. Thanks Kay.
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one job I ran recently, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (#2504), we should ensure that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>