Conversation

@liyezhang556520
Contributor

This is a sub-task of SPARK-9103. We'd like to expose memory usage at Spark runtime; this first step exposes the Netty buffers used, both on-heap and off-heap. The metrics are also shown on the WebUI. In this PR, a new web tab named "Memory" is added, which shows the memory usage of each executor (it can be extended with more detail in the future). The screenshot is as follows:

[screenshot: Memory tab showing per-executor memory usage]

This is the per-stage memory info:
[screenshot: per-stage memory info]

This is the history view:
[screenshot: history view]

This is marked WIP because unit tests are still left to add.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38858 has finished for PR 7753 at commit 17e5b97.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorMetrics extends Serializable
    • class MemoryListener extends SparkListener
    • class MemoryUIInfo

Contributor

not maxNettyOnHeapSizeTime?

Contributor

I think we'd better create a named case class like:

case class NettyMemoryMetric(value: Long, currentTime: Long)

to replace the returned tuple. I think it will be clearer to understand.
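As a sketch of that suggestion (illustrative only, keeping the comment's `currentTime` field name; the update logic is an assumption, not the PR's actual code):

```scala
// A named case class makes the (Long, Long) tuple self-documenting:
// tuple._1 / tuple._2 become .value / .currentTime.
case class NettyMemoryMetric(value: Long, currentTime: Long)

object MetricExample {
  var maxMetric: NettyMemoryMetric = NettyMemoryMetric(0L, 0L)

  // Record a new sample, keeping the maximum seen so far with its timestamp.
  def update(value: Long, now: Long): Unit = {
    if (value > maxMetric.value) {
      maxMetric = NettyMemoryMetric(value, now)
    }
  }
}
```

For example, after `update(512L, 1000L)` followed by `update(256L, 2000L)`, the smaller sample is discarded and `maxMetric` remains `NettyMemoryMetric(512L, 1000L)`.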

Contributor Author

use maxNettyReadOnheapSizeTime to make the metrics more detailed; since there is also a shuffle-write Netty part (the TransportServer side), maxNettyOnHeapSizeTime would be the sum of both sides

Contributor

I think I agree with @jerryshao.

(a) The time here is just a timestamp of when we observed the max, so it doesn't make sense to take a sum of them, and the case class will make that clearer. But instead of currentTime I'd use timestamp.

(b) I'm not sure there is any value in including "read" in the name. It's true that Netty mostly uses this memory for reading data off the network, but we're not actually doing anything special to make sure we only count memory used during reads. We are just counting the total memory used by Netty, and it so happens that the write side is streaming. (I think I may have been confused during earlier discussions -- the write side uses FileSegmentManagedBuffer.convertToNetty, which should be streaming.) We should mention in the docs that the primary source of this memory is the shuffle read, that Netty will need memory for the blocks fetched by each executor thread, etc. But I think we can only guarantee statistics on Netty's memory usage in general; we can't necessarily separate out the write side.

@jerryshao
Contributor

A simple question: is it enough to only expose the maximum memory usage of the Netty layer?

Besides, IMHO it would be better to separate getting the Netty memory usage and displaying it in the web portal into two different PRs.

@liyezhang556520
Contributor Author

@jerryshao, thanks for your review and feedback; I'll separate it into different PRs. I will keep this PR open and wait to see whether others have different thoughts on the overall implementation.

@jerryshao
Contributor

@squito mind helping to review this?

Contributor

I don't think this is the "right" port; e.g., it's not the port that users will see elsewhere in the UI for the executor. I think that will be the port in env.address.port (which is what is used in the onExecutorAdded event).

@squito
Contributor

squito commented Jul 30, 2015

Jenkins, retest this please

Contributor

nit: ExecutorMetrics is imported twice

@squito
Contributor

squito commented Jul 30, 2015

Since we'll eventually want to add more metrics, can you put all the netty metrics into another case class inside ExecutorMetrics?

Also, I'm wondering if we want to use "netty" in the name -- I think most users won't know or care about Netty in particular. Should it just be named "network" or "transport", with the nio implementation indicating that its metrics are missing?

I guess altogether this means doing something like:

class ExecutorMetrics {
  var transportMetrics: Option[TransportMetrics] = ...
}
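A minimal self-contained sketch of that layout, with a hypothetical TransportMetrics and field names assumed for illustration (not the final API):

```scala
// Hypothetical container for the transport-layer metrics; the field names
// here are assumptions for illustration only.
case class TransportMetrics(onHeapSize: Long, offHeapSize: Long, timeStamp: Long)

class ExecutorMetrics extends Serializable {
  // None when the transport implementation (e.g. nio) reports no metrics,
  // which is how "metrics are missing" could be indicated.
  var transportMetrics: Option[TransportMetrics] = None
}
```

The Option lets consumers distinguish "no metrics reported" from a zero reading, and grouping the transport fields into one case class keeps ExecutorMetrics open for other metric families later.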

@squito
Contributor

squito commented Jul 30, 2015

@jerryshao I'm not entirely sure I know what you mean by:

A simple question, is it enough to only expose the maximum memory usage of Netty layer?

can you elaborate? Obviously we'd always like more metrics, but are you saying this isn't useful?

@SparkQA

SparkQA commented Jul 30, 2015

Test build #164 has finished for PR 7753 at commit 17e5b97.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorMetrics extends Serializable
    • class MemoryListener extends SparkListener
    • class MemoryUIInfo

@SparkQA

SparkQA commented Jul 30, 2015

Test build #39044 has finished for PR 7753 at commit 17e5b97.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorMetrics extends Serializable
    • class MemoryListener extends SparkListener
    • class MemoryUIInfo

@jerryshao
Contributor

Hi @squito, IIUC the code here always keeps only the max memory usage:

@volatile private var _nettyReadOnheapSize: Long = _

def nettyReadOnheapSize: Long = _nettyReadOnheapSize

private[spark] def setNettyReadOnheapSize(value: Long, time: SysTime): Unit = {
  _nettyReadOnheapSize = value
  if (value > _maxNettyReadOnheapSizeTime._1) {
    _maxNettyReadOnheapSizeTime = (value, time)
  }
}

If the current memory usage is smaller than the previous maximum, it is simply discarded.

From my understanding it would be better to transfer the raw metrics to the driver and let each listener process them further as it wants; that would be more flexible. If people want the minimum memory usage, for example, they could compute it in their own listener logic. Reporting only the max memory usage here restricts that flexibility.

@squito
Contributor

squito commented Jul 31, 2015

@jerryshao I don't think that's true. The executor metrics that get sent back to the driver have both the max value and the current value, so a custom listener could already track the minimum memory usage if it wanted.

I wonder if we should even calculate the max at all inside the executor, or just leave it to the listeners. On one hand, max is what you will usually want, so we might as well make it convenient. On the other hand, maybe it would be cleaner to leave it to the listeners; e.g., I can already imagine that in the future we may want to track max memory usage during each stage, and I definitely think that logic should be left to the listeners.

What do you think, @liyezhang556520? I will think about it more, but right now I'm leaning towards leaving the max to the listeners ...
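To illustrate the point that listeners can derive their own aggregates from raw updates, here is a minimal sketch using simplified stand-in types (assumptions for illustration, not Spark's actual listener API):

```scala
import scala.collection.mutable

// Hypothetical stand-in for a raw per-executor metrics update event.
case class MetricsUpdate(execId: String, nettyOnHeapSize: Long)

// A listener that derives its own aggregate (here the minimum) from the
// raw updates, showing why the executor need not pre-compute the max:
// any aggregate (min, max, per-stage max) can be built driver-side.
class MinMemoryListener {
  val minPerExecutor: mutable.Map[String, Long] = mutable.Map.empty

  def onMetricsUpdate(u: MetricsUpdate): Unit = {
    val cur = minPerExecutor.getOrElse(u.execId, Long.MaxValue)
    minPerExecutor(u.execId) = math.min(cur, u.nettyOnHeapSize)
  }
}
```

Feeding updates of 300, 100, and 200 bytes for one executor leaves its tracked minimum at 100; a MaxMemoryListener would be the same code with math.max.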

Contributor

slightly higher-level question -- is anything gained by breaking this into different test cases? I feel like you could just take this final test case, add some more stages at the end with no metric updates, and add some asserts on listener.activeStagesToMem etc., and that would cover all the other test cases.

I'm all for having more complete coverage with more test cases, but not sure we gain anything here. what do you think?

Contributor Author

I just think we should separate the cases so that we can make sure all situations are covered. Actually, some of the current cases do have duplicated parts; I'll try to remove some.

@squito
Contributor

squito commented Nov 25, 2015

JsonProtocolSuite should also have a backward compatibility test for the new fields

Contributor

the comment is wrong (off by one); maybe just make it "one extra line for SparkListenerLogStart"

Contributor Author

Yes, I forgot to update the comment, thanks.

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47305 has finished for PR 7753 at commit 4123ac7.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorMetrics extends Serializable
    • class TransportMetrics (
    • class MemoryListener extends SparkListener
    • class MemoryUIInfo
    • class TransportMemSize
    • case class MemTime(memorySize: Long = 0L, timeStamp: Long = System.currentTimeMillis)

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47415 has finished for PR 7753 at commit 4b3dbe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorMetrics extends Serializable
    • class TransportMetrics (
    • class MemoryListener extends SparkListener
    • class MemoryUIInfo
    • class TransportMemSize
    • case class MemTime(memorySize: Long = 0L, timeStamp: Long = System.currentTimeMillis)

Contributor

rather than putting this method here, just define it inside the test case where you use it (though I think you probably don't even need it there)

@steveloughran
Contributor

One thought here: it'd probably be nice to have a JSON history with the new events as part of the history server suite regression tests; that'll catch any changes to the new events that would break backwards compatibility.

@SparkQA

SparkQA commented Jan 12, 2016

Test build #49235 has finished for PR 7753 at commit 5e031ce.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liyezhang556520
Contributor Author

Hi @steveloughran, thank you for your attention and your comments; any further comments on this PR are much appreciated. You're quite right that we need a regression test for the history server suite. I think this is related to what @squito mentioned about adding JSON API support. I'm wondering whether we can do that in another PR, after this PR is ready to merge or the feature is accepted by the community. And of course, REST API support should be one part of this feature.
@squito, what's your opinion on this?

@SparkQA

SparkQA commented Jan 12, 2016

Test build #49239 has finished for PR 7753 at commit 87f8172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Contributor

@liyezhang556520 can you remove the "WIP" from the title?

cc @andrewor14

@liyezhang556520 liyezhang556520 changed the title [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-9107][CORE][WIP] Netty network layer memory usage on webUI [SPARK-9104][SPARK-9105][SPARK-9106][SPARK-9107][CORE] Netty network layer memory usage on webUI Jan 27, 2016
@rxin
Contributor

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one. We can also continue the discussion on the JIRA ticket.

Note - I actually think this is a pretty useful feature, and in the future maybe we should switch over to consolidate memory management entirely with Netty.
