
Conversation

@kayousterhout
Contributor

The time spent opening shuffle files can be very significant when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one job I ran recently, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (#2504), we should ensure
that the file open time is also included there.
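
As a rough illustration of the idea (not the actual patch), timing the open and folding it into the write-time metric might look something like the sketch below. The `WriteMetrics` class and method names here are hypothetical stand-ins for Spark's real ShuffleWriteMetrics plumbing:

```scala
import java.io.{File, FileOutputStream}

object OpenTiming {
  // Hypothetical stand-in for Spark's ShuffleWriteMetrics; the real patch
  // increments the task's shuffle write time metric instead.
  final class WriteMetrics { var shuffleWriteTimeNanos: Long = 0L }

  def openWithTiming(file: File, metrics: WriteMetrics): FileOutputStream = {
    val openStart = System.nanoTime()
    // Opening the stream creates the file and touches filesystem metadata,
    // which is the part that can block when the disk is contended.
    val stream = new FileOutputStream(file)
    metrics.shuffleWriteTimeNanos += System.nanoTime() - openStart
    stream
  }
}
```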

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27317 has finished for PR 4550 at commit 2423555.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We should be consistent with parens

Contributor Author

Haha @rxin just chatted me the same thing. FYI I realized I missed some file accesses here to add to the shuffle write time; currently testing on a cluster to make sure I got them all and will update this patch accordingly.

@sryza
Contributor

sryza commented Feb 12, 2015

The immediate goal here is to track only the time spent opening the final shuffle files, i.e. not to include time spent opening spill files that will later be merged, right?

If that's the case, we should track the time in writePartitionedFile, not spillToPartitionFiles.

@kayousterhout
Contributor Author

@sryza spillToPartitionFiles handles this fake spill case: when there are <= bypassMergeThreshold (default 200) partitions, the sort shuffle "pretends" to be like the hash shuffle by first writing all of the data to per-partition files, and then combining all of the files at the end. My understanding is that we don't consider this "spill" because it's different from the other spill case (where we actually need to spill because we've run out of space in memory) in that we always "spill" all of the data when there are <= bypassMergeThreshold partitions (and this is kind of a hack that we don't want to expose to the user). We already include the time to write this "spill" data in the shuffle write time, so including the time to open those files is consistent with that.

I was first thinking that we should also track the time to open the file in writePartitionedFile, but I did some measurements on a small cluster and found that the time to open the file in writePartitionedFile ends up being basically insignificant, because it's only one file (compared to spillToPartitionFiles, where we may open up to 200 files, and that can take a long time when the disk is busy). I wrote up those measurements in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570
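
For readers unfamiliar with these code paths, here is a rough, hypothetical sketch of the two write paths being discussed. The real logic lives in Spark's ExternalSorter; the object, method, and file names below are illustrative only, and the config key is noted for reference:

```scala
import java.io.File

// Simplified sketch of the two sort-shuffle write paths being discussed;
// names and structure are illustrative, not the actual ExternalSorter code.
object ShuffleWritePaths {
  val bypassMergeThreshold = 200 // spark.shuffle.sort.bypassMergeThreshold

  def shuffleOutputFiles(numPartitions: Int, dir: File): Seq[File] =
    if (numPartitions <= bypassMergeThreshold) {
      // "Fake spill" path: one file per reduce partition is opened up front.
      // With up to 200 opens per task, open time adds up on a busy disk,
      // which is why this patch counts it toward shuffle write time here.
      (0 until numPartitions).map(p => new File(dir, s"part-$p"))
    } else {
      // Main path: records are sorted and merged into a single output file,
      // so only one open happens per task and its cost is negligible.
      Seq(new File(dir, "merged-output"))
    }
}
```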

@sryza
Contributor

sryza commented Feb 12, 2015

That makes sense. When we're not exceeding the merge threshold, measuring the time spent opening the partition files instead of the merged file seems reasonable.

However, this patch still doesn't add support for measuring the time spent opening files in the main path, where we exceed the merge threshold (>200 partitions). Unless I'm misunderstanding?

@kayousterhout
Contributor Author

Yeah, the reason I didn't add it in the main path is because there we only open one file at a time, so the open time tends to be so fast that it's not even clear we can measure it accurately (it's on the order of tens of microseconds; this is what's discussed in the JIRA I linked above). I can add it if you think it's confusing not to have it (or maybe I should just add a comment in the code?), but because it's so fast, I don't think it makes sense to add it to the time.
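
For what it's worth, a quick, unscientific way to sanity-check single-file open latency on an otherwise idle disk might look like the sketch below; this is not part of the patch, and the file names and sample count are arbitrary:

```scala
import java.io.{File, FileOutputStream}

// Rough check of single-file open latency on an idle disk (illustrative only).
object OpenLatencyCheck {
  def main(args: Array[String]): Unit = {
    val dir = new File(System.getProperty("java.io.tmpdir"))
    val samplesMicros = (1 to 100).map { i =>
      val f = new File(dir, s"open-latency-test-$i")
      val start = System.nanoTime()
      new FileOutputStream(f).close() // open (create) the file, then close it
      val elapsed = (System.nanoTime() - start) / 1000
      f.delete()
      elapsed
    }
    println(s"median open time: ${samplesMicros.sorted.apply(samplesMicros.size / 2)} microseconds")
  }
}
```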

@sryza
Contributor

sryza commented Feb 14, 2015

Ah I see. Is it possible that we might hit higher latencies for the main path when there's high contention (more cores than disks)?

If that seems unlikely, sounds fine to me to just add a comment.

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27543 has finished for PR 4550 at commit ea3a4ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sryza
Contributor

sryza commented Feb 16, 2015

LGTM

@kayousterhout
Contributor Author

@rxin would you mind taking a look at this, as someone familiar with the shuffle? I noted some measurements of the effect this has on Shuffle Write Time in the JIRA: https://issues.apache.org/jira/browse/SPARK-3570

Thanks for taking a look @sryza!

@srowen
Member

srowen commented Mar 13, 2015

PS: see the more straightforward counterpart #4965, which I think this could be merged with. LGTM.

@andrewor14
Contributor

Merging into master and 1.3, thanks Kay.

asfgit pushed a commit that referenced this pull request Mar 24, 2015
Opening shuffle files can be very significant when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one job I ran recently, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (#2504), we should ensure
that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
@asfgit asfgit closed this in d8ccf65 Mar 24, 2015
@kayousterhout kayousterhout deleted the SPARK-3570 branch April 12, 2017 00:41