Conversation


@jinxing64 jinxing64 commented Mar 13, 2017

What changes were proposed in this pull request?

Metrics of block sizes (during shuffle) should be collected for later analysis. This is helpful for analyzing data-skew situations or OOMs that happen even though maxBytesInFlight is set.
This is a preparation for SPARK-19659.

How was this patch tested?

Unit tests in HistoryServerSuite and JsonProtocolSuite.

@jinxing64 jinxing64 force-pushed the SPARK-19937 branch 2 times, most recently from d69f8f9 to 648ceaa on March 13, 2017 13:52

SparkQA commented Mar 13, 2017

Test build #74449 has started for PR 17276 at commit 648ceaa.


SparkQA commented Mar 13, 2017

Test build #74448 has started for PR 17276 at commit 430ec95.

@jinxing64 jinxing64 force-pushed the SPARK-19937 branch 2 times, most recently from 27d9ed5 to e2e56d3 on March 14, 2017 12:01

SparkQA commented Mar 14, 2017

Test build #74515 has finished for PR 17276 at commit e2e56d3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 14, 2017

Test build #74509 has finished for PR 17276 at commit d0932ed.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 15, 2017

Test build #74605 has finished for PR 17276 at commit 2ccde1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 15, 2017

Test build #74607 has finished for PR 17276 at commit 91e338b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 16, 2017

Test build #74665 has finished for PR 17276 at commit 7cd290d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64 jinxing64 changed the title [WIP][SPARK-19937] Collect metrics of block sizes when shuffle. [SPARK-19937] Collect metrics of block sizes when shuffle. Mar 16, 2017
Author

jinxing64 commented Mar 16, 2017

@squito @kayousterhout
Would you mind commenting on this when you have time? It would be great if you could help :)


SparkQA commented Mar 21, 2017

Test build #74917 has finished for PR 17276 at commit 0e85332.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 21, 2017

Test build #74939 has started for PR 17276 at commit a88e12e.


SparkQA commented Mar 21, 2017

Test build #74931 has finished for PR 17276 at commit 7ac639f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 21, 2017

Test build #74938 has finished for PR 17276 at commit e6091b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

squito commented Mar 23, 2017

No worries, I'm just not sure when to look again, with all the notifications from your commits. Committers tend to think that something is ready to review if it's passing tests, so it's helpful to add those labels if that's not the case.

@jinxing64 jinxing64 changed the title [SPARK-19937] Collect metrics of block sizes when shuffle. [WIP][SPARK-19937] Collect metrics of block sizes when shuffle. Mar 23, 2017
Author

jinxing64 commented Mar 23, 2017

You are such a kind person. Thanks a lot again.


SparkQA commented Mar 23, 2017

Test build #75098 has finished for PR 17276 at commit 6a96c3b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 24, 2017

Test build #75145 has finished for PR 17276 at commit 4f992fc.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 24, 2017

Test build #75146 has finished for PR 17276 at commit c58cb7e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 24, 2017

Test build #75163 has finished for PR 17276 at commit 0efa348.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 25, 2017

Test build #75220 has finished for PR 17276 at commit c26ea56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Author

jinxing64 commented Mar 26, 2017

@squito
Thanks a lot for taking the time to look into this PR.
Yes, we should add metrics to Spark very carefully. I updated the PR; currently it adds just two metrics: a) the total size by which blocks are underestimated, and b) the size of blocks shuffled to memory.
For a), the executor uses maxBytesInFlight to control the speed of shuffle-read. I agree with your comment

another metric that may be nice to capture here is maximum underestimate.

Consider this scenario: the maximum is small, but thousands of blocks are underestimated, so maxBytesInFlight cannot help avoid an OOM during shuffle-read. That's why I proposed tracking the total size by which blocks are underestimated.
For b), currently all data is shuffle-read into memory. If we add the feature of shuffling to disk when memory is short, we need to evaluate the performance. I think two more metrics need to be taken into account: the size of blocks shuffled to disk (to be added in another PR) and the task's running time (which already exists). The more data shuffled to memory, the better the performance; the shorter the time cost, the better the performance.

I also added some debug logging in the ShuffleWriters, including the number of underestimated blocks and the size distribution.
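For reference, here is a minimal sketch of what that writer-side debug logging could look like: it derives the number of underestimated blocks, the total underestimate, and a rough size distribution from the per-partition lengths and the average size that a HighlyCompressedMapStatus would report. The object and method names below are illustrative only, not the exact code in this PR:

```scala
// Illustrative sketch only (not the exact code in this PR): summarize the
// per-partition block lengths against the average size reported by a
// HighlyCompressedMapStatus.
object BlockSizeDebug {
  def format(partitionLengths: Array[Long], avgSize: Long): String = {
    if (partitionLengths.isEmpty) {
      "no blocks written"
    } else {
      val underestimated = partitionLengths.filter(_ > avgSize)
      val underestimatedBlocksNum = underestimated.length
      // Total number of bytes by which the reported average understates the real sizes.
      val underestimatedBlocksSize = underestimated.map(_ - avgSize).sum

      // Block sizes at the (0, 0.25, 0.5, 0.75, 1.0) quantiles.
      val sorted = partitionLengths.sorted
      val distributionStr = Seq(0.0, 0.25, 0.5, 0.75, 1.0)
        .map(q => sorted(((sorted.length - 1) * q).toInt))
        .mkString(", ")

      s"avgSize=$avgSize, underestimatedBlocksNum=$underestimatedBlocksNum, " +
        s"underestimatedBlocksSize=$underestimatedBlocksSize, " +
        s"size distribution at (0, 0.25, 0.5, 0.75, 1.0) is ($distributionStr)"
    }
  }
}
```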

@jinxing64 jinxing64 changed the title [WIP][SPARK-19937] Collect metrics of block sizes when shuffle. [SPARK-19937] Collect metrics of block sizes when shuffle. Mar 26, 2017

SparkQA commented Mar 26, 2017

Test build #75229 has finished for PR 17276 at commit 8801fc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

underestimatedBlocksSize += partitionLengths[i];
}
}
writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
Contributor

This will essentially be the sum of every block above the average size - how is this supposed to be leveraged?
For example:
1, 2, 3, 4, 5, 6 => 15
1, 2, 3, 4, 5, 10 => 15
(This ended up being a degenerate example - but in general, I am curious what the value is for this metric).

taskContext.taskAttemptId(), hc.getAvgSize(),
underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
}
}
Contributor

We also need to handle the case where mapStatus is not a HighlyCompressedMapStatus.

Author

In CompressedMapStatus the block sizes are accurate, so I hesitate to add that log there.

Contributor

@mridulm mridulm Mar 27, 2017

The value is not accurate - it is a log-1.1 'compression' which converts the long to a byte and caps the value at 255.

So there are two errors introduced: it over-estimates the actual block size when the compressed value is < 255 [1] (which is something this PR currently ignores), and when the block size goes above 34k MB or so it under-estimates the block size (which is higher than what Spark currently supports due to the 2G limitation).

[1] I did not realize it always over-estimates; if the current PR is targeting only blocks which are under-estimated, I would agree that not handling CompressedMapStatus for the time being might be OK - though it would be good to add a comment to that effect explaining 'why' we don't need to handle it.
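As a side note, a minimal sketch of the log-1.1 encoding described here, modeled on MapStatus.compressSize / decompressSize (the constants and branches below are illustrative rather than a verbatim copy of the Spark source):

```scala
// Sketch of the log-1.1 size encoding (modeled on MapStatus.compressSize /
// decompressSize; illustrative, not a verbatim copy of the Spark source).
object SizeEncoding {
  private val LogBase = 1.1

  // Encode a block size into one byte: ceil(log_1.1(size)), capped at 255.
  def compressSize(size: Long): Byte = {
    if (size == 0L) 0.toByte
    else if (size <= 1L) 1.toByte
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LogBase)).toInt).toByte
  }

  // Decode: 1.1^encoded. Because of the ceil, the decoded value is >= the real
  // size (an over-estimate) until the 255 cap (~35 GB), beyond which it under-estimates.
  def decompressSize(compressed: Byte): Long = {
    if (compressed == 0) 0L
    else math.pow(LogBase, compressed & 0xFF).toLong
  }
}
```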

taskContext.taskAttemptId(), hc.getAvgSize(),
underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
}
}
Contributor

This computation seems repeated - we should refactor it out into a method of its own and not duplicate it across classes.

s" (0, 0.25, 0.5, 0.75, 1.0) is $distributionStr.")
case None => // no-op
}
}
Contributor

Isn't this similar to what is in core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java, etc. above? Or is it different?
The code looks the same, but it is written differently (and more expensively here).

private[spark] def incRemoteBlocksFetched(v: Long): Unit = _remoteBlocksFetched.add(v)
private[spark] def incLocalBlocksFetched(v: Long): Unit = _localBlocksFetched.add(v)
private[spark] def incRemoteBytesRead(v: Long): Unit = _remoteBytesRead.add(v)
private[spark] def incRemoteBytesReadToMem(v: Long): Unit = _remoteBytesReadToMem.add(v)
Contributor

The way it seems to be coded up, this will end up being everything fetched from shuffle - and we can already infer that: remote bytes read + local bytes read.
Or did I miss something here?

Author

jinxing64 commented Mar 26, 2017

@mridulm
Thanks a lot for taking the time to look into this, and thanks for the comments :)

  1. I changed the size of underestimated blocks to be partitionLengths.filter(_ > hc.getAvgSize).map(_ - hc.getAvgSize).sum (see the sketch below).
  2. I added a method genBlocksDistributionStr and call it from the ShuffleWriters, thus avoiding duplicated code.
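To make point 1 concrete, a tiny illustration with made-up numbers (the computed average here simply stands in for hc.getAvgSize):

```scala
// Made-up example of the updated metric from point 1 (illustrative values only).
val partitionLengths: Array[Long] = Array(100L, 120L, 80L, 4000L, 150L)
val avgSize: Long = partitionLengths.sum / partitionLengths.length  // 890

// Only the amount *above* the reported average counts as underestimation.
val underestimatedBlocksSize =
  partitionLengths.filter(_ > avgSize).map(_ - avgSize).sum          // 4000 - 890 = 3110
```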

The value of underestimatedBlocksSize is that we need to know by how much the block sizes are underestimated (only an average size is provided in HighlyCompressedMapStatus). I proposed shuffle-reading big blocks to disk when memory is short (SPARK-19659), and this is a preparation for that PR. It is hard to record all of the block sizes when the number of partitions is very large, but I propose to make them more accurate (we can store the average plus the blocks whose size is bigger than n*average, or some other implementation). underestimatedBlocksSize is a way to evaluate that implementation.

underestimatedBlocksSize evaluates stability when we shuffle data to disk. On the other hand, remoteBytesReadToMem evaluates performance. Currently it is the same as remoteBytesRead; when I add the feature of shuffling data to disk, remoteBytesReadToDisk will be added as well.

Basically, these metrics are for evaluating the stability and performance of shuffle-read. I want to achieve a smaller underestimatedBlocksSize and a bigger remoteBytesReadToMem.


SparkQA commented Mar 26, 2017

Test build #75238 has finished for PR 17276 at commit cf5de4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 26, 2017

Test build #75239 has finished for PR 17276 at commit 873129f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

mridulm commented Mar 27, 2017

@jinxing64
If the intent behind these metrics is to help with SPARK-19659, it would be good to either add them as part of SPARK-19659 or subsequently (once the feature is merged).
This ensures that the metrics added are actually relevant to the existing Spark core, and not to an expected future evolution of the code - for example, the review of SPARK-19659 might significantly change its design/implementation, making some of these either irrelevant or requiring other, more informative metrics to be introduced.

I am unclear about the intention, by the way - do you expect shuffle reads to be informed by metrics from the mapper side? I probably got that wrong.

@jinxing64
Author

@mridulm
Sorry for the late reply. I opened the PR for SPARK-19659 (#16989) and made these two PRs independent. Basically, this PR is to evaluate the performance (blocks are shuffled to disk) and the stability (sizes in MapStatus are inaccurate and OOM can happen) of the implementation proposed in SPARK-19659.
I'd be very thankful if you have time to comment on these two PRs.

@jinxing64 jinxing64 changed the title [SPARK-19937] Collect metrics of block sizes when shuffle. [WIP][SPARK-19937] Collect metrics of block sizes when shuffle. Apr 10, 2017

SparkQA commented May 4, 2017

Test build #76454 has finished for PR 17276 at commit 873129f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@mridulm @squito
Thanks a lot for taking the time to review this PR.
I will close it for now and open another one if there is progress.

@jinxing64 jinxing64 closed this Jun 8, 2017