
Conversation

@rezasafi (Contributor) commented Dec 13, 2018

This change exposes the executors' procfs metrics, introduced in SPARK-24958, to the Metrics system. To do this, a new metric source is defined and a new config is added. Using the config, users can choose whether they want to see procfs metrics through the Metrics system. To avoid overhead, a cache is added in ProcfsMetricsGetter so that metrics are not recomputed if they have been computed within the past second.
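The caching behavior described above can be sketched roughly as follows. This is a minimal, self-contained sketch: the class and parameter names are illustrative, not the actual Spark implementation, and the real computeAllMetrics walks the /proc process tree.

```scala
// Minimal sketch of a one-second metrics cache. All names here are
// illustrative; the real ProcfsMetricsGetter reads /proc for the process tree.
case class ProcfsMetrics(jvmVmemTotal: Long, jvmRSSTotal: Long, timeStamp: Long)

class CachingMetricsGetter(
    cacheValidityMs: Long = 1000L,
    now: () => Long = () => System.currentTimeMillis()) {

  private var cached: Option[ProcfsMetrics] = None

  // Stand-in for the expensive computation (walking /proc in the real code).
  protected def computeAllMetrics(): ProcfsMetrics =
    ProcfsMetrics(jvmVmemTotal = 0L, jvmRSSTotal = 0L, timeStamp = now())

  def getMetrics(): ProcfsMetrics = cached match {
    // Fresh enough: serve the cached snapshot without recomputing.
    case Some(m) if now() - m.timeStamp <= cacheValidityMs => m
    // Stale or absent: recompute and remember when we did it.
    case _ =>
      val fresh = computeAllMetrics()
      cached = Some(fresh)
      fresh
  }
}
```

With a 1000 ms validity window, any number of gauge reads within the same second hit the cache and trigger only one expensive computation.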

This was tested manually and I verified that procfs metrics are reported to the Metrics system using the Console sink:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 696242176
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4959170560
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104

And they got updated as well:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 732999680
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4977057792
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104

@rezasafi (Contributor, Author)

@squito

@squito (Contributor) commented Dec 13, 2018

Jenkins, add to whitelist

@squito (Contributor) commented Dec 13, 2018

I don't think you have to disallow both the executor metrics & the Metrics system; it's fine to allow both. If we're concerned about overhead, then it would make sense to have the metrics getter cache the previously computed values, and only recompute every N millis (configurable). Seems like overkill at this point, though.

SparkQA commented Dec 13, 2018

Test build #100102 has finished for PR 23306 at commit aadd699.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 15, 2018

Test build #100171 has finished for PR 23306 at commit e01779c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezasafi (Contributor, Author)

Flaky test. Jenkins retest this please.

SparkQA commented Dec 15, 2018

Test build #100175 has finished for PR 23306 at commit e01779c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezasafi (Contributor, Author)

Jenkins retest this please.

SparkQA commented Dec 17, 2018

Test build #100220 has finished for PR 23306 at commit e01779c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

  otherVmemTotal: Long,
- otherRSSTotal: Long)
+ otherRSSTotal: Long,
+ timeStamp: Long)
Contributor:

I think it will be better if we keep the timestamp outside of ProcfsMetrics, as it is not a metric and I could not find any benefit to keeping it inside ProcfsMetrics.

Contributor Author:

I considered the timestamp a property of the set of metrics, so it is probably better to keep it like this.

Contributor:

I agree with Ankur, what's the point of including the timestamp here?

Contributor Author:

I think I responded to that. What is the point of separating it out when it is used as a property of the set of metrics?

Contributor:

I think the point that both Imran and I are trying to make here is that the timestamp is not a metric, and thus it should not be kept inside ProcfsMetrics. The purpose of the timestamp is to evict older cached metrics, so it should be a part of ProcfsMetricsGetter along with cachedAllMetric.

@rezasafi (Contributor, Author) Dec 17, 2018:

I considered ProcfsMetrics as a class whose objects have properties which are metrics, and now a timestamp. So I think it is better to have the timestamp defined along with the other properties. This makes the code cleaner as well. Sorry, I just can't understand why it shouldn't be here.
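For contrast, the alternative the reviewers are suggesting can be sketched like this (illustrative names, not the actual patch): the timestamp becomes private bookkeeping of the getter, next to the cached value, and the metrics case class carries only metrics.

```scala
// Sketch of keeping the eviction timestamp in the getter rather than in the
// metrics object. All names are illustrative.
case class ProcfsSnapshot(jvmVmemTotal: Long, jvmRSSTotal: Long) // metrics only

class ProcfsMetricsGetterSketch(cacheValidityMs: Long = 1000L) {
  private var cachedAllMetrics: Option[ProcfsSnapshot] = None
  private var lastComputeMs: Long = 0L // cache bookkeeping lives here

  // Stand-in for the expensive /proc walk.
  private def computeAllMetrics(): ProcfsSnapshot = ProcfsSnapshot(0L, 0L)

  def allMetrics(nowMs: Long = System.currentTimeMillis()): ProcfsSnapshot = {
    if (cachedAllMetrics.isEmpty || nowMs - lastComputeMs > cacheValidityMs) {
      cachedAllMetrics = Some(computeAllMetrics())
      lastComputeMs = nowMs
    }
    cachedAllMetrics.get
  }
}
```

Either placement produces the same caching behavior; the difference is purely whether consumers of ProcfsMetrics ever see the timestamp.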

val lastMetricComputation = System.currentTimeMillis() - cachedAllMetric.timeStamp
// Check whether we have computed the metrics in the past 1s
// TODO: Should we make this configurable?
if (lastMetricComputation > Math.min(1000, HEARTBEAT_INTERVAL_MS)) {
Contributor:

I think it makes sense to make it configurable, defaulting to 10s maybe

Contributor Author:

Yeah, I also think it makes sense to be configurable. What do you think @squito?

Contributor Author:

Just to add more context about having 1000 ms here: the Metrics system's polling period can't be less than 1 second, so users can configure the caching period via the heartbeat interval if they want to cache for less than 1 second. The configuration option lets them have a cache that is valid for more than 1 second.
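Under that reasoning, the effective cache window in the snippet above is the smaller of one second and the heartbeat interval. A tiny sketch of that relationship, assuming HEARTBEAT_INTERVAL_MS is passed in as a plain value:

```scala
// The Metrics system polls at most once per second, so a heartbeat interval
// shorter than 1s shrinks the cache window further, ensuring heartbeats
// always see reasonably fresh values; a longer heartbeat leaves it at 1s.
def effectiveCacheWindowMs(heartbeatIntervalMs: Long): Long =
  math.min(1000L, heartbeatIntervalMs)
```

With Spark's default 10 s heartbeat the window stays at 1000 ms; configuring a 500 ms heartbeat shrinks it to 500 ms.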

private[executor] class ProcfsMetricsSource extends Source {
override val sourceName = "procfs"
override val metricRegistry = new MetricRegistry()
// We use numMetrics for tracking to only call computeAllMetrics once per set of metrics
Contributor:

Can we remove this now that we have caching? This looks like a hacky way to achieve it anyway; it would be better to have some alternate way.

Contributor Author:

I thought that this can still save us from unnecessary calls, so I kept it. Why do you think it is hacky? The way the Metrics system is designed, it just returns a single value from a gauge. There are other methods to return a set of metrics, but to use them we would need to make more changes to the procfs getter to implement the Dropwizard Metric interface for each metric we are going to report. I don't think that is necessary, and it would make the code uglier.

Contributor:

The reason I think this is hacky is that getProcfsMetrics relies on internal state (numMetrics) to determine whether to call computeAllMetrics or not. I would prefer that getProcfsMetrics be a stateless method, though.

I understand the limitations imposed by the MetricsSystem but would still like you to evaluate alternate approaches. If you think this is the best way to achieve this, then it is fine.

Contributor:

I'm also with Ankur here; I don't understand the point of this. Other metrics, e.g. NettyMemoryMetrics, don't seem to need to do the same thing with numMetrics, and nothing about the metrics API makes it look like you'd need to. Why do you think this is necessary? It seems you just need to return a MetricRegistry with all of the metrics registered there.

@rezasafi (Contributor, Author) Dec 17, 2018:

I think this is good to have here to avoid calling procfsMetricsGetter.computeAllMetrics multiple times to compute the same set of metrics. I think we had this discussion in the other review as well, but there we removed the need for this by changing the ExecutorMetricType API. Here we can't change the Dropwizard API.

Contributor Author:

BTW, NettyMemoryMetrics implements MetricSet, and each metric there also implements the Metric interface. As I said in my earlier comment, if I go that route I can avoid this here, but then the code in ProcfsMetricsGetter will be much uglier, and to be honest I don't want to change that since it took us 5 months to reach an agreement there. The gain also wouldn't be much. The purpose of this code is to reduce the performance impact by removing unnecessary calls.
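The concern both sides are circling can be shown with a small self-contained sketch (illustrative names, with a stand-in Gauge trait rather than the real Dropwizard one): each gauge reports a single value, so without a shared snapshot, reporting six procfs values would mean six computeAllMetrics calls per poll.

```scala
// Stand-in for a Dropwizard-style single-value gauge.
trait Gauge[T] { def getValue: T }

class ProcfsSourceSketch {
  var computeCalls = 0 // exposed only to demonstrate the call count

  case class Snapshot(jvmVmem: Long, jvmRSS: Long)
  private var cached: Option[Snapshot] = None

  // Stand-in for the expensive procfs walk; counts how often it runs.
  private def computeAllMetrics(): Snapshot = {
    computeCalls += 1
    Snapshot(1L, 2L)
  }

  // All gauges read through one cached snapshot, so a poll cycle that reads
  // every gauge still triggers only one computation.
  private def snapshot(): Snapshot = {
    if (cached.isEmpty) cached = Some(computeAllMetrics())
    cached.get
  }

  val jvmVmemGauge: Gauge[Long] = new Gauge[Long] { def getValue: Long = snapshot().jvmVmem }
  val jvmRSSGauge: Gauge[Long] = new Gauge[Long] { def getValue: Long = snapshot().jvmRSS }
}
```

This sketch never invalidates its cache; in the PR, the timestamp check plays that role, refreshing the snapshot between poll cycles.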

SparkQA commented Dec 17, 2018

Test build #100251 has finished for PR 23306 at commit abe89f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2018

Test build #100260 has finished for PR 23306 at commit 7ee6965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2018

Test build #100288 has finished for PR 23306 at commit 7a21efc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 19, 2018

Test build #100304 has finished for PR 23306 at commit 18fe510.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 25, 2019

Test build #101686 has finished for PR 23306 at commit e3b23b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

github-actions bot added the Stale label Jan 3, 2020
github-actions bot closed this Jan 4, 2020