[SPARK-9104][SPARK-9105][SPARK-9106][SPARK-9107][CORE] Netty network layer memory usage on webUI #7753
Conversation
Test build #38858 has finished for PR 7753 at commit
not maxNettyOnHeapSizeTime?
I think we'd better create a named case class like:
case class NettyMemoryMetric(value: Long, currentTime: Long)
to replace the returned tuple. I think it will be clearer to understand.
Use maxNettyReadOnheapSizeTime to make the metric more detailed: since there is also a shuffle-write Netty part (the TransportServer side), maxNettyOnHeapSizeTime should be the sum of both sides.
I think I agree with @jerryshao .
(a) The time here is just the timestamp at which we observed the max; it doesn't make sense to take a sum of them, and the case class will make that clearer. But instead of currentTime I'd use timestamp.
(b) I'm not sure there is any value in including "read" in the name. It's true that Netty is mostly using this memory for reading data off the network, but we're not actually doing anything special to make sure we only count memory used during reads. We are just counting the total memory used by Netty, and it so happens that the write side is streaming. (I think I may have been confused during earlier discussions -- the write side uses FileSegmentManagedBuffer.convertToNetty, which should be streaming.) We should mention in the docs that the primary source of this memory is the shuffle read, that Netty needs memory for the blocks fetched from each executor thread, etc. But I think we can only guarantee statistics on Netty's memory usage in general; we can't necessarily separate out the write side.
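The named case class discussed above could look like the following. This is only an illustrative sketch based on the review comments, using timestamp instead of currentTime as suggested; the helper for keeping the larger sample is hypothetical, not part of the PR.

```scala
// Hypothetical sketch of the suggested case class. `timestamp` records
// when the sample was taken, replacing the unnamed (Long, Long) tuple.
case class NettyMemoryMetric(value: Long, timestamp: Long)

object NettyMemoryMetric {
  // Keep whichever sample has the larger value, preserving when it was seen.
  def max(a: NettyMemoryMetric, b: NettyMemoryMetric): NettyMemoryMetric =
    if (a.value >= b.value) a else b
}
```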
A simple question: is it enough to only expose the maximum memory usage of the Netty layer? Besides, IMHO it would be better to separate getting the Netty memory usage and displaying it on the web portal into two different PRs.
@jerryshao, thanks for your review and feedback; I'll separate it into different PRs. I will keep this PR open to wait for others, to see whether they have different thoughts on the whole implementation.
@squito mind helping to review this?
I don't think this is the "right" port -- e.g., it's not the port that users will see elsewhere in the UI for the executor. I think that will be the port in env.address.port (which is what is used in the onExecutorAdded event).
Jenkins, retest this please
nit: ExecutorMetrics is imported twice
Since we'll eventually want to add more metrics, can you put all the Netty metrics into another case class inside? Also, I'm wondering if we want to use "netty" in the name -- I think most users won't know or care about Netty in particular. Should it just be named "network" or "transport"? The nio implementation should indicate that metrics are missing. I guess altogether this means doing something like:

    class ExecutorMetrics {
      var transportMetrics: Option[TransportMetrics] = ...
    }
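Fleshing out the suggestion above, a runnable sketch might look as follows. The field names of TransportMetrics are illustrative assumptions, not the PR's actual definition; the point is the transport-neutral naming and the Option so the nio implementation can report None.

```scala
// Hypothetical shape of the transport-agnostic metrics container.
// Field names (onHeapSize, offHeapSize, timestamp) are assumptions.
case class TransportMetrics(
    onHeapSize: Long,
    offHeapSize: Long,
    timestamp: Long)

class ExecutorMetrics {
  // None means the transport layer (e.g. nio) does not report metrics.
  var transportMetrics: Option[TransportMetrics] = None
}
```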
@jerryshao I'm not entirely sure I know what you mean by:
can you elaborate? Obviously we'd always like more metrics, but are you saying this isn't useful?
Test build #164 has finished for PR 7753 at commit
Test build #39044 has finished for PR 7753 at commit
Hi @squito, IIUC the code here always keeps the max memory usage: if the current memory usage is smaller than the previous one, it will simply be discarded. From my understanding it would be better to transfer the raw metrics to the driver and have each listener process them further as it wants; that would be more flexible. If people want the minimum memory usage, for example, they could compute it with their own logic in a listener. Only keeping the max memory usage here restricts flexibility for different users.
@jerryshao I don't think that's true. The executor metrics that get sent back to the driver have both the max value and the current value, so a custom listener could already track the minimum memory usage if it wanted. I wonder if we should even calculate the max at all inside the executor, or just leave it to the listeners. On one hand, max is what you will usually want, so we might as well make it more convenient. OTOH, maybe it would be cleaner to leave it to the listeners -- e.g., I can already imagine that in the future we may want to track max memory usage during each stage, and I definitely think that logic should be left to the listeners. What do you think @liyezhang556520? I will think about it more, but right now I'm leaning towards just leaving the max to the listeners.
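To illustrate the "leave it to the listeners" idea above, here is a minimal sketch of listener-side max tracking over raw metric updates. The class and method names are simplified stand-ins, not the actual Spark listener API.

```scala
import scala.collection.mutable

// Hypothetical listener-side aggregator: the executor sends only raw
// (current) values, and the listener derives the max per executor itself.
class MaxMemoryTracker {
  private val maxByExecutor = mutable.Map.empty[String, Long]

  // Called on every raw metrics update received from an executor.
  def onMetricsUpdate(executorId: String, currentOnHeap: Long): Unit = {
    val prev = maxByExecutor.getOrElse(executorId, 0L)
    if (currentOnHeap > prev) maxByExecutor(executorId) = currentOnHeap
  }

  def maxFor(executorId: String): Option[Long] = maxByExecutor.get(executorId)
}
```

The same pattern extends naturally to per-stage maxima by keying the map on (stageId, executorId), which is why computing aggregates in listeners rather than in the executor keeps the design flexible.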
slightly higher-level question -- is anything gained by breaking this into the different test cases? I feel like you could just take this final test case, add some more stages at the end with no metric updates, and also add some asserts on listener.activeStagesToMem etc., and that would cover all the other test cases.
I'm all for having more complete coverage with more test cases, but not sure we gain anything here. what do you think?
I just think we should separate the cases so that we can make sure all situations are covered. Actually yes, some of the current cases have duplicated parts; I'll try to remove some.
The comment is wrong (off by one). Maybe just make the comment "one extra line for SparkListenerLogStart".
Yes, I forgot to update the comment, thanks.
Test build #47305 has finished for PR 7753 at commit
Test build #47415 has finished for PR 7753 at commit
Rather than putting this method here, just define it inside the test case where you use it (though I think you probably don't even need it there).
One thought here: it'd probably be nice to have a JSON history with the new events as part of the history server suite regression tests; that'll catch any changes to the new events which would break backwards compatibility.
Test build #49235 has finished for PR 7753 at commit
Hi @steveloughran, thank you for your attention and your comments; further comments on this PR are much appreciated. Your advice is quite correct: we need a regression test for the history server suite. I think this is related to what @squito mentioned about adding JSON API support. I'm wondering whether we can do it in another PR after this PR is ready to merge, or once this feature is accepted by the community. And of course, REST API support is supposed to be one part of this feature.
Test build #49239 has finished for PR 7753 at commit
@liyezhang556520 can you remove the "WIP" from the title? cc @andrewor14
Thanks for the pull request. I'm going through a list of pull requests to cut them down, since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one; we can also continue the discussion on the JIRA ticket. Note: I actually think this is a pretty useful feature, and in the future maybe we should switch over to consolidating memory management entirely with Netty.
This is a sub-task of SPARK-9103. We'd like to expose memory usage at Spark run time; this is the first step, exposing the Netty buffers used in both on-heap and off-heap memory. The metrics are also shown on the WebUI: in this PR, a new web tab named Memory is added, which shows the memory usage of each executor (it can show more details in the future). The screenshots are like the following.
This is each stage's memory info:

This is the History view:

This is WIP because unit tests are still to be done.