
Conversation

@rezasafi (Contributor) commented Dec 13, 2018

This change exposes the executors' procfs metrics, introduced in SPARK-24958, to the Metrics system. To do this, a new metric source is defined and a new config is added. Using the config, users can choose whether they want to see procfs metrics through the Metrics system. To avoid overhead, a cache is added in ProcfsMetricsGetter so that metrics are not recomputed if they have been computed within the past second.
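The caching behavior described above can be sketched roughly as follows. This is a minimal, self-contained sketch: the class and parameter names are illustrative, not the actual Spark implementation, and the real computeAllMetrics walks the /proc process tree.

```scala
// Minimal sketch of a one-second metrics cache. All names here are
// illustrative; the real ProcfsMetricsGetter reads /proc for the process tree.
case class ProcfsMetrics(jvmVmemTotal: Long, jvmRSSTotal: Long, timeStamp: Long)

class CachingMetricsGetter(
    cacheValidityMs: Long = 1000L,
    now: () => Long = () => System.currentTimeMillis()) {

  private var cached: Option[ProcfsMetrics] = None

  // Stand-in for the expensive computation (walking /proc in the real code).
  protected def computeAllMetrics(): ProcfsMetrics =
    ProcfsMetrics(jvmVmemTotal = 0L, jvmRSSTotal = 0L, timeStamp = now())

  def getMetrics(): ProcfsMetrics = cached match {
    // Fresh enough: serve the cached snapshot without recomputing.
    case Some(m) if now() - m.timeStamp <= cacheValidityMs => m
    // Stale or absent: recompute and remember when we did it.
    case _ =>
      val fresh = computeAllMetrics()
      cached = Some(fresh)
      fresh
  }
}
```

With a 1000 ms validity window, any number of gauge reads within the same second hit the cache and trigger only one expensive computation.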

This was tested manually and I verified that procfs metrics are reported to the Metrics system using the Console sink:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 696242176
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4959170560
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104

And they got updated as well:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 732999680
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4977057792
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104

@rezasafi (Contributor, Author)

@squito

@squito (Contributor) commented Dec 13, 2018

Jenkins, add to whitelist

@squito (Contributor) commented Dec 13, 2018

I don't think you have to disallow both the executor metrics & the Metrics system; it's fine to allow both. If we're concerned about overhead, then it would make sense to have the metrics getter cache the previously computed values, and only recompute every N millis (configurable). Seems like overkill at this point, though.

SparkQA commented Dec 13, 2018

Test build #100102 has finished for PR 23306 at commit aadd699.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 15, 2018

Test build #100171 has finished for PR 23306 at commit e01779c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezasafi (Contributor, Author)

Flaky test. Jenkins retest this please.

SparkQA commented Dec 15, 2018

Test build #100175 has finished for PR 23306 at commit e01779c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezasafi (Contributor, Author)

Jenkins retest this please.

SparkQA commented Dec 17, 2018

Test build #100220 has finished for PR 23306 at commit e01779c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

  otherVmemTotal: Long,
- otherRSSTotal: Long)
+ otherRSSTotal: Long,
+ timeStamp: Long)
Contributor:

I think it will be better if we keep the timestamp outside of ProcfsMetrics, as it is not a metric and I could not find any benefit to keeping it inside ProcfsMetrics.

Contributor Author:

I considered the timestamp a property of the set of metrics, so it is probably better to keep it like this.

Contributor:

I agree with Ankur, what's the point of including the timestamp here?

Contributor Author:

I think I responded to that. What is the point of separating it out when it is used as a property of the set of metrics?

Contributor:

I think the point that both Imran and I are trying to make here is that the timestamp is not a metric, and thus it should not be kept inside ProcfsMetrics. The purpose of the timestamp is to evict older cached metrics, so it should be a part of ProcfsMetricsGetter along with cachedAllMetric.

@rezasafi (Contributor, Author) Dec 17, 2018:

I considered ProcfsMetrics as a class whose objects have properties which are metrics, and now a timestamp. So I think it is better to have the timestamp defined along with the other properties. This makes the code cleaner as well. Sorry, I just can't understand why it shouldn't be here.
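For contrast, the alternative the reviewers are suggesting can be sketched like this (illustrative names, not the actual patch): the timestamp becomes private bookkeeping of the getter, next to the cached value, and the metrics case class carries only metrics.

```scala
// Sketch of keeping the eviction timestamp in the getter rather than in the
// metrics object. All names are illustrative.
case class ProcfsSnapshot(jvmVmemTotal: Long, jvmRSSTotal: Long) // metrics only

class ProcfsMetricsGetterSketch(cacheValidityMs: Long = 1000L) {
  private var cachedAllMetrics: Option[ProcfsSnapshot] = None
  private var lastComputeMs: Long = 0L // cache bookkeeping lives here

  // Stand-in for the expensive /proc walk.
  private def computeAllMetrics(): ProcfsSnapshot = ProcfsSnapshot(0L, 0L)

  def allMetrics(nowMs: Long = System.currentTimeMillis()): ProcfsSnapshot = {
    if (cachedAllMetrics.isEmpty || nowMs - lastComputeMs > cacheValidityMs) {
      cachedAllMetrics = Some(computeAllMetrics())
      lastComputeMs = nowMs
    }
    cachedAllMetrics.get
  }
}
```

Either placement produces the same caching behavior; the difference is purely whether consumers of ProcfsMetrics ever see the timestamp.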

val lastMetricComputation = System.currentTimeMillis() - cachedAllMetric.timeStamp
// Check whether we have computed the metrics in the past 1s
// TODO: Should we make this configurable?
if (lastMetricComputation > Math.min(1000, HEARTBEAT_INTERVAL_MS)) {
Contributor:

I think it makes sense to make it configurable, defaulting to 10s maybe

Contributor Author:

Yeah, I also think it makes sense to be configurable. What do you think @squito?

Contributor Author:

Just to add more context about having 1000 ms here: the Metrics system's polling period can't be less than 1 second, so users can configure the caching period via the heartbeat interval if they want to cache for less than 1 second. The configuration option lets them have a cache that is valid for more than 1 second.
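Under that reasoning, the effective cache window in the snippet above is the smaller of one second and the heartbeat interval. A tiny sketch of that relationship, assuming HEARTBEAT_INTERVAL_MS is passed in as a plain value:

```scala
// The Metrics system polls at most once per second, so a heartbeat interval
// shorter than 1s shrinks the cache window further, ensuring heartbeats
// always see reasonably fresh values; a longer heartbeat leaves it at 1s.
def effectiveCacheWindowMs(heartbeatIntervalMs: Long): Long =
  math.min(1000L, heartbeatIntervalMs)
```

With Spark's default 10 s heartbeat the window stays at 1000 ms; configuring a 500 ms heartbeat shrinks it to 500 ms.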

private[executor] class ProcfsMetricsSource extends Source {
override val sourceName = "procfs"
override val metricRegistry = new MetricRegistry()
// We use numMetrics for tracking to only call computeAllMetrics once per set of metrics
Contributor:

Can we remove this now that we have caching? This looks like a hacky way to achieve it anyway; it would be better to have some alternate way.

Contributor Author:

I thought that this can still save us from unnecessary calls, so I kept it. Why do you think it is hacky? The way the Metrics system is designed, it just returns a single value from a gauge. There are other methods to return a set of metrics, but to use them we would need to make more changes to the procfs getter to implement the Dropwizard Metric interface for each metric we are going to report. I don't think that is necessary, and it would make the code uglier.

Contributor:

The reason I think this is hacky is that getProcfsMetrics relies on internal state (numMetrics) to determine whether to call computeAllMetrics or not. I would prefer that getProcfsMetrics be a stateless method, though.

I understand the limitations imposed by the MetricsSystem but would still like you to evaluate alternate approaches. If you think this is the best way to achieve this, then it is fine.

Contributor:

I'm also with Ankur here; I don't understand the point of this. Other metrics, e.g. NettyMemoryMetrics, don't seem to need to do the same thing with numMetrics, and nothing about the metrics API makes it look like you'd need to. Why do you think this is necessary? It seems you just need to return a MetricRegistry with all of the metrics registered there.

@rezasafi (Contributor, Author) Dec 17, 2018:

I think this is good to have here to avoid calling procfsMetricsGetter.computeAllMetrics multiple times to compute the same set of metrics. I think we had this discussion in the other review as well, but there we removed the need for this by changing the ExecutorMetricType API. Here we can't change the Dropwizard API.

Contributor Author:

BTW, NettyMemoryMetrics implements MetricSet, and each metric there also implements the Metric interface. As I said in my earlier comment, if I go that route I can avoid this here, but then the code in ProcfsMetricsGetter will be much uglier, and to be honest I don't want to change that since it took us 5 months to reach an agreement there. The gain also wouldn't be much. The purpose of this code is to reduce the performance impact by removing unnecessary calls.
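The concern both sides are circling can be shown with a small self-contained sketch (illustrative names, with a stand-in Gauge trait rather than the real Dropwizard one): each gauge reports a single value, so without a shared snapshot, reporting six procfs values would mean six computeAllMetrics calls per poll.

```scala
// Stand-in for a Dropwizard-style single-value gauge.
trait Gauge[T] { def getValue: T }

class ProcfsSourceSketch {
  var computeCalls = 0 // exposed only to demonstrate the call count

  case class Snapshot(jvmVmem: Long, jvmRSS: Long)
  private var cached: Option[Snapshot] = None

  // Stand-in for the expensive procfs walk; counts how often it runs.
  private def computeAllMetrics(): Snapshot = {
    computeCalls += 1
    Snapshot(1L, 2L)
  }

  // All gauges read through one cached snapshot, so a poll cycle that reads
  // every gauge still triggers only one computation.
  private def snapshot(): Snapshot = {
    if (cached.isEmpty) cached = Some(computeAllMetrics())
    cached.get
  }

  val jvmVmemGauge: Gauge[Long] = new Gauge[Long] { def getValue: Long = snapshot().jvmVmem }
  val jvmRSSGauge: Gauge[Long] = new Gauge[Long] { def getValue: Long = snapshot().jvmRSS }
}
```

This sketch never invalidates its cache; in the PR, the timestamp check plays that role, refreshing the snapshot between poll cycles.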

SparkQA commented Dec 17, 2018

Test build #100251 has finished for PR 23306 at commit abe89f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2018

Test build #100260 has finished for PR 23306 at commit 7ee6965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2018

Test build #100288 has finished for PR 23306 at commit 7a21efc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 19, 2018

Test build #100304 has finished for PR 23306 at commit 18fe510.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 25, 2019

Test build #101686 has finished for PR 23306 at commit e3b23b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

github-actions bot added the Stale label Jan 3, 2020
github-actions bot closed this Jan 4, 2020