[WIP][SPARK-19937] Collect metrics of block sizes when shuffle. #17276
Conversation
Force-pushed from d69f8f9 to 648ceaa.
Test build #74449 has started for PR 17276 at commit

Test build #74448 has started for PR 17276 at commit
Force-pushed from 27d9ed5 to e2e56d3.
Test build #74515 has finished for PR 17276 at commit

Test build #74509 has finished for PR 17276 at commit

Test build #74605 has finished for PR 17276 at commit

Test build #74607 has finished for PR 17276 at commit

Test build #74665 has finished for PR 17276 at commit
@squito @kayousterhout
Test build #74917 has finished for PR 17276 at commit

Test build #74939 has started for PR 17276 at commit

Test build #74931 has finished for PR 17276 at commit

Test build #74938 has finished for PR 17276 at commit
no worries, I'm just not sure when to look again, with all the notifications from your commits. Committers tend to think that something is ready to review if it's passing tests, so it's helpful to add those labels if that's not the case.
You are such a kind person. Thanks a lot again.
Test build #75098 has finished for PR 17276 at commit

Test build #75145 has finished for PR 17276 at commit

Test build #75146 has finished for PR 17276 at commit

Test build #75163 has finished for PR 17276 at commit

Test build #75220 has finished for PR 17276 at commit
@squito
Think about this scenario: the maximum is small, but thousands of blocks are underestimated; that is why I also added some logging for debugging in
Test build #75229 has finished for PR 17276 at commit
    underestimatedBlocksSize += partitionLengths[i];
  }
}
writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
This will essentially be the sum of every block above the average size - how is this supposed to be leveraged?
For example:
1, 2, 3, 4, 5, 6 => 15
1, 2, 3, 4, 5, 10 => 15
(This ended up being a degenerate example - but in general, I am curious what the value of this metric is.)
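A quick sketch of the arithmetic behind the degenerate example above (the helper name is illustrative, not the PR's actual code): summing every block strictly above the average gives 15 in both cases.

// Illustrative only: sum of partition lengths strictly above the average.
object UnderestimatedSumExample {
  def sumAboveAverage(lengths: Seq[Long]): Long = {
    val avg = lengths.sum.toDouble / lengths.size
    lengths.filter(_ > avg).sum
  }

  def main(args: Array[String]): Unit = {
    println(sumAboveAverage(Seq(1L, 2L, 3L, 4L, 5L, 6L)))   // avg 3.5  -> 4 + 5 + 6 = 15
    println(sumAboveAverage(Seq(1L, 2L, 3L, 4L, 5L, 10L)))  // avg ~4.2 -> 5 + 10    = 15
  }
}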
      taskContext.taskAttemptId(), hc.getAvgSize(),
      underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
  }
}
We need to handle the case of mapStatus not being HighlyCompressedMapStatus as well.
In CompressedMapStatus, the block sizes are accurate, so I would hesitate to add that log.
The value is not accurate - it is a log-1.1 'compression' which converts the long to a byte and caps the value at 255.
So there are two errors introduced: it over-estimates the actual block size when the compressed value is < 255 [1] (which is something this PR currently ignores), and when the block size goes above roughly 34,000 MB it under-estimates the block size (which is higher than what Spark currently supports due to the 2G limitation).
[1] I did not realize it always over-estimates; if the current PR is targeting only blocks which are under-estimated, I would agree that not handling CompressedMapStatus for the time being might be ok - though it would be good to add a comment to that effect explaining why we don't need to handle it.
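For context, a minimal sketch of the log-1.1 encoding described above, assuming ceil-based rounding and a 255 cap (modelled on the map-status size compression, but simplified here; treat the exact constants and rounding as assumptions):

object SizeEncodingSketch {
  // Encode a block size into one byte: ceil(log_1.1(size)), capped at 255.
  def compress(size: Long): Byte = {
    if (size == 0) 0
    else if (size <= 1L) 1
    else math.min(255, math.ceil(math.log(size) / math.log(1.1)).toInt).toByte
  }

  // Decode: 1.1^encoded. Because of the ceil above, the decoded value is >= the
  // original size whenever the encoding is below the cap, i.e. it over-estimates.
  def decompress(encoded: Byte): Long = {
    if (encoded == 0) 0L else math.pow(1.1, encoded & 0xFF).toLong
  }

  def main(args: Array[String]): Unit = {
    val tenMb = 10L * 1024 * 1024
    println(decompress(compress(tenMb)))          // slightly above 10 MB (over-estimate)
    println(decompress(255.toByte) / (1L << 20))  // the cap: roughly 34,000 MB
  }
}

Any block larger than the cap decodes to that same ~34,000 MB value, which is where the under-estimation of very large blocks comes from.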
      taskContext.taskAttemptId(), hc.getAvgSize(),
      underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
  }
}
This computation seems repeated - we should refactor it out into a method of its own and not duplicate it across classes.
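For instance, the shared computation could be pulled into a small helper along these lines (purely a hypothetical sketch; the names are illustrative, not the PR's actual API):

// Hypothetical helper so UnsafeShuffleWriter, SortShuffleWriter, etc. can share the logic.
object ShuffleBlockSizeStats {
  /** Returns (count, total size) of blocks whose real length exceeds the reported average. */
  def underestimated(partitionLengths: Array[Long], avgSize: Long): (Int, Long) = {
    var num = 0
    var size = 0L
    partitionLengths.foreach { len =>
      if (len > avgSize) {
        num += 1
        size += len
      }
    }
    (num, size)
  }
}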
| s" (0, 0.25, 0.5, 0.75, 1.0) is $distributionStr.") | ||
| case None => // no-op | ||
| } | ||
| } |
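As a side note on the distribution string logged in the snippet above, here is one hypothetical way a quantile summary at (0, 0.25, 0.5, 0.75, 1.0) could be computed from the partition lengths (names and formatting are assumptions, not the PR's code):

// Illustrative only: nearest-rank quantiles of the block sizes.
object BlockSizeDistributionSketch {
  def distributionStr(lengths: Array[Long]): String = {
    val sorted = lengths.sorted
    Seq(0.0, 0.25, 0.5, 0.75, 1.0)
      .map(q => sorted((q * (sorted.length - 1)).round.toInt))
      .mkString(", ")
  }

  def main(args: Array[String]): Unit = {
    println(distributionStr(Array(1L, 2L, 3L, 4L, 5L, 10L))) // "1, 2, 4, 5, 10"
  }
}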
Isn't this similar to what is in core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java, etc. above? Or is it different?
The code looks the same, but it is written differently (and is more expensive here).
private[spark] def incRemoteBlocksFetched(v: Long): Unit = _remoteBlocksFetched.add(v)
private[spark] def incLocalBlocksFetched(v: Long): Unit = _localBlocksFetched.add(v)
private[spark] def incRemoteBytesRead(v: Long): Unit = _remoteBytesRead.add(v)
private[spark] def incRemoteBytesReadToMem(v: Long): Unit = _remoteBytesReadToMem.add(v)
The way it seems to be coded up, this will end up being everything fetched from shuffle - and we can already infer that: remote bytes read + local bytes read.
Or did I miss something here?
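To illustrate the concern: as long as every remote block is fetched into memory, the new counter duplicates the existing one, and the shuffle total is already derivable (a toy sketch with illustrative names, not the PR's code):

// Toy model of the reviewer's point.
final case class ReadMetricsSketch(
    remoteBytesRead: Long,
    localBytesRead: Long,
    remoteBytesReadToMem: Long) {
  def totalBytesRead: Long = remoteBytesRead + localBytesRead
}

object ReadMetricsSketchDemo {
  def main(args: Array[String]): Unit = {
    // If all remote fetches land in memory, the new counter equals remoteBytesRead.
    val m = ReadMetricsSketch(remoteBytesRead = 100, localBytesRead = 40, remoteBytesReadToMem = 100)
    println(m.remoteBytesReadToMem == m.remoteBytesRead) // true
    println(m.totalBytesRead)                            // 140
  }
}

The separate counter only becomes informative once some remote bytes can be fetched to disk instead of memory, presumably the point of the follow-up work in SPARK-19659.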
@mridulm
The value of
Basically, the metrics are for evaluating stability and performance during shuffle read. I want to achieve that smaller
Test build #75238 has finished for PR 17276 at commit

Test build #75239 has finished for PR 17276 at commit
@jinxing64 I am unclear about the intention, by the way - do you expect shuffle reads to be informed by metrics from the mapper side? I probably got that wrong.
@mridulm
Test build #76454 has finished for PR 17276 at commit
What changes were proposed in this pull request?
Metrics of block sizes (during shuffle) should be collected for later analysis. This is helpful for analysis when data skew or OOM happens (even though maxBytesInFlight is set).
This is a preparation for SPARK-19659.
How was this patch tested?
Unit tests in HistoryServerSuite and JsonProtocolSuite.