[SPARK-3570] Include time to open files in shuffle write time. #4550
Conversation
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one recent job, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (apache#2504), we should ensure that the file open time is also included there.
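For illustration, here is a minimal, self-contained sketch of the idea: time the open separately from the write and count both toward shuffle write time. This is plain Scala, not the actual Spark code; the metric variable and file names are made up.

```scala
import java.io.{File, FileOutputStream}

object OpenTimeSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the per-task shuffle write time metric that Spark tracks.
    var shuffleWriteTimeNanos = 0L

    val file = File.createTempFile("shuffle", ".data")

    // Opening the file writes filesystem metadata, so it can hit disk even
    // when the data writes themselves only hit the buffer cache.
    val openStart = System.nanoTime()
    val out = new FileOutputStream(file)
    shuffleWriteTimeNanos += System.nanoTime() - openStart

    // The data write itself, which was already counted before this patch.
    val writeStart = System.nanoTime()
    out.write(new Array[Byte](4096))
    out.close()
    shuffleWriteTimeNanos += System.nanoTime() - writeStart

    println(s"shuffle write time including open: ${shuffleWriteTimeNanos / 1000} us")
    file.delete()
  }
}
```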
Test build #27317 has finished for PR 4550 at commit
We should be consistent with parens
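(Presumably this refers to mixing the paren and paren-less forms of the call; the squashed commit below notes "Fixed parens for nanotime". A trivial, made-up illustration of picking one form and sticking with it:)

```scala
// Consistent: always call nanoTime with parentheses ...
val openStartTime = System.nanoTime()
// ... rather than mixing in the paren-less form elsewhere in the same file:
// val writeStartTime = System.nanoTime
```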
Haha @rxin just chatted me the same thing. FYI I realized I missed some file accesses here to add to the shuffle write time; currently testing on a cluster to make sure I got them all and will update this patch accordingly.
The immediate goal here is to only track time spent opening the final shuffle files, i.e. not to include time spent opening spill files that will later be merged, right? If that's the case, we should track the time in
@sryza spillToPartitionFiles is this fake spill case: when there are at most bypassMergeThreshold (default 200) partitions, the sort shuffle "pretends" to be like the hash shuffle by first writing all of the data to per-partition files, and then combining all of the files at the end. My understanding is that we don't consider this a "spill" because it's different from the other spill case (where we actually need to spill because we've run out of space in memory) in that we always "spill" all of the data when there are at most bypassMergeThreshold partitions (and this is kind of a hack that we don't want to expose to the user). We already include the time to write this "spill" data in the shuffle write time, so including the time to open those files is consistent with that. I was first thinking that we should also track the time to open the file in writePartitionedFile, but I did some measurement on a small cluster and found that the time to open the file in writePartitionedFile ends up being basically insignificant, because it's only one file (compared to spillToPartitionedFiles, where we may open up to 200 files, and it can take a long time when the disk is busy). I did some measurements of this that I wrote up in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570
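A rough, self-contained sketch of why that per-partition path is the one worth instrumenting (plain Scala, illustrative file names; the 200 here just mirrors the default bypassMergeThreshold and this is not the Spark code itself):

```scala
import java.io.{File, FileOutputStream}

object PerPartitionOpenSketch {
  def main(args: Array[String]): Unit = {
    // Mirrors the default bypassMergeThreshold; in the "fake spill" path one
    // file per partition is opened, so open times accumulate per task.
    val numPartitions = 200
    var openTimeNanos = 0L

    val files = (0 until numPartitions).map { p =>
      val f = File.createTempFile(s"shuffle-part-$p-", ".data")
      val start = System.nanoTime()
      val out = new FileOutputStream(f)   // one open per partition, timed individually
      openTimeNanos += System.nanoTime() - start
      out.close()
      f
    }

    // On a contended disk each open can take milliseconds, so 200 of them can
    // dominate the shuffle write time even when the writes themselves are cheap.
    println(s"total open time for $numPartitions files: ${openTimeNanos / 1e6} ms")
    files.foreach(_.delete())
  }
}
```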
That makes sense. When we're at or below the bypass merge threshold, measuring the time spent opening the per-partition files instead of the merged file seems reasonable. However, this patch still doesn't add support for measuring the time spent opening files in the main path, where we're above the threshold (>200 partitions). Unless I'm misunderstanding?
Yeah, the reason I didn't add it in the main path is that there we only open one file at a time, so the open time tends to be so fast that it's not even clear we can measure it accurately (it's on the order of tens of microseconds; this is what's discussed in the JIRA I linked above). I can add it if you think it's confusing not to have it (or maybe I should just add a comment in the code?), but because it's so fast, I don't think it makes sense to add to the time.
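To make the trade-off concrete, one possible shape for such a comment at the point where the single merged output file is opened (paraphrased, not the exact wording that ended up in the patch; the snippet is a standalone illustration rather than the real writePartitionedFile code):

```scala
import java.io.{File, FileOutputStream}

object MainPathOpenSketch {
  def main(args: Array[String]): Unit = {
    val outputFile = File.createTempFile("shuffle-merged-", ".data")

    // Intentionally not counted in shuffle write time: only one file is opened
    // here per task, so the open is on the order of tens of microseconds and is
    // negligible next to the per-partition path, which may open up to
    // bypassMergeThreshold files.
    val out = new FileOutputStream(outputFile, /* append = */ true)

    out.close()
    outputFile.delete()
  }
}
```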
Ah I see. Is it possible that we might hit higher latencies for the main path when there's high contention (more cores than disks)? If that seems unlikely, sounds fine to me to just add a comment.
Test build #27543 has finished for PR 4550 at commit
LGTM
@rxin would you mind taking a look at this, as someone familiar with the shuffle? I noted some measurements of the effect this has on Shuffle Write Time in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570 Thanks for taking a look @sryza!
PS: see the more straightforward counterpart #4965, which I think this could be merged to. LGTM.
Merging into master and 1.3. Thanks Kay.
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one job I ran recently, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (#2504), we should ensure that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>