[SPARK-26357][Core] Expose executors' procfs metrics to Metrics system #23306
Conversation
Force-pushed from b1aaca0 to aadd699.
Jenkins, add to whitelist

I don't think you have to disallow both the executor metrics & the metrics system; it's fine to allow both. If we're concerned about overhead, then it would make sense to have the metrics getter cache the previously computed values and only recompute every N millis (configurable). Seems like overkill at this point.

Test build #100102 has finished for PR 23306 at commit
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala (outdated review comments, resolved)
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsSource.scala (outdated review comments, resolved)
core/src/main/scala/org/apache/spark/internal/config/package.scala (outdated review comment, resolved)
Test build #100171 has finished for PR 23306 at commit

Flaky test. Jenkins retest this please.

Test build #100175 has finished for PR 23306 at commit

Jenkins retest this please.

Test build #100220 has finished for PR 23306 at commit
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala (outdated review comment, resolved)
    otherVmemTotal: Long,
    otherRSSTotal: Long,
    timeStamp: Long)
I think it will be better if we keep the timestamp outside of ProcfsMetrics, as it is not a metric and I could not find any benefit of keeping it inside ProcfsMetrics.
I considered the timestamp a property of the set of metrics, so it is probably better to keep it like this.
I agree with Ankur, what's the point of including the timestamp here?
I think I responded to that. What is the point of separating it out when it is used as a property of the set of metrics?
I think the point that both Imran and I are trying to make here is that the timestamp is not a metric and thus should not be kept inside ProcfsMetrics. The purpose of the timestamp is to evict older cached metrics, so it should be part of ProcfsMetricsGetter along with cachedAllMetric.
I considered ProcfsMetrics as a class whose objects have properties which are metrics, and now a timestamp, so I think it is better to have the timestamp defined along with the other properties. This makes the code cleaner as well. Sorry, but I can't understand why it shouldn't be here.
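To make the shape under discussion concrete, here is a hedged sketch of what the case class could look like with the timestamp folded in. Only otherVmemTotal, otherRSSTotal and timeStamp appear in the diff fragment above; the remaining field names are guesses inferred from the metric names in the test output, not copied from the PR.

```scala
package org.apache.spark.executor

// Hypothetical reconstruction for illustration only: field names other than the
// ones visible in the quoted diff are inferred from the reported metric names.
private[spark] case class ProcfsMetrics(
    jvmVmemTotal: Long,
    jvmRSSTotal: Long,
    pythonVmemTotal: Long,
    pythonRSSTotal: Long,
    otherVmemTotal: Long,
    otherRSSTotal: Long,
    timeStamp: Long)
```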
val lastMetricComputation = System.currentTimeMillis() - cachedAllMetric.timeStamp
// Check whether we have computed the metrics in the past 1s
// ToDo: Should we make this configurable?
if (lastMetricComputation > Math.min(1000, HEARTBEAT_INTERVAL_MS)) {
I think it makes sense to make it configurable, defaulting to 10s maybe
Yeah, I also think it makes sense to be configurable. What do you think @squito?
Just to add more context about having 1000ms here: the polling period for the Metrics system can't be less than 1 second, so users can configure the caching period through the heartbeat interval if they want to cache for less than 1 second. A configuration option would let them have a cache that is valid for more than 1 second.
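For orientation, a minimal sketch of the caching scheme being debated, reusing the ProcfsMetrics sketch above. The class name, the heartbeatIntervalMs parameter (standing in for HEARTBEAT_INTERVAL_MS), and the stubbed computeAllMetrics are assumptions for illustration, not the PR's actual code.

```scala
package org.apache.spark.executor

// Sketch only: caches the last ProcfsMetrics snapshot and recomputes it at most
// once per min(1 second, heartbeat interval), mirroring the condition quoted above.
private[spark] class ProcfsMetricsGetterSketch(heartbeatIntervalMs: Long) {

  // Timestamp 0 guarantees the very first call recomputes.
  @volatile private var cachedAllMetric: ProcfsMetrics =
    ProcfsMetrics(0L, 0L, 0L, 0L, 0L, 0L, timeStamp = 0L)

  def allMetrics: ProcfsMetrics = synchronized {
    val lastMetricComputation = System.currentTimeMillis() - cachedAllMetric.timeStamp
    // The Metrics system cannot poll more often than once per second, so anything
    // newer than min(1000, heartbeat interval) is treated as fresh enough.
    if (lastMetricComputation > math.min(1000L, heartbeatIntervalMs)) {
      cachedAllMetric = computeAllMetrics()
    }
    cachedAllMetric
  }

  // Stand-in for the real /proc walk; here it only stamps an empty snapshot.
  private def computeAllMetrics(): ProcfsMetrics =
    ProcfsMetrics(0L, 0L, 0L, 0L, 0L, 0L, timeStamp = System.currentTimeMillis())
}
```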
private[executor] class ProcfsMetricsSource extends Source {
  override val sourceName = "procfs"
  override val metricRegistry = new MetricRegistry()
  // We use numMetrics for tracking to only call computeAllMetrics once per set of metrics
Can we remove this now that we have caching? This looks like a hacky way to achieve it anyway; it would be better to have some alternate way.
I thought that this can still save us from unnecessary calls, so I kept it. Why do you think it is hacky? The way the Metrics system is designed, it just returns a single value from a gauge. There are other methods to return a set of metrics, but to use those we would need to make more changes to the procfs getter and implement the Dropwizard Metric interface for each metric that we are going to report. I think that isn't necessary and it makes the code uglier.
The reason I think this is hacky is that getProcfsMetrics relies on internal state (numMetrics) to determine whether to call computeAllMetrics or not. I would prefer that getProcfsMetrics be a stateless method, though.
I understand the limitations imposed by the MetricsSystem but would still like you to evaluate alternate approaches. If you think this is the best way to achieve this, then it is fine.
I'm also with Ankur here; I don't understand the point of this. Other metrics, e.g. NettyMemoryMetrics, don't seem to need to do the same thing with numMetrics, and nothing about the metrics API makes it look like you'd need to. Why do you think this is necessary? It seems you just need to return a MetricRegistry with all of the metrics registered there.
I think this is good to do here to avoid calling procfsMetricsGetter.computeAllMetrics multiple times for the same set of metrics. I think we had this discussion in another review as well, but there we removed the need for this by changing the ExecutorMetricType API. Here we can't change the Dropwizard API.
BTW, NettyMemoryMetrics implements MetricSet and each metric there also implements the Metric interface. As I responded in my earlier comment, if I went that route I would avoid this here, but then the code in ProcfsMetricsGetter would be much uglier, and to be honest I don't want to change that since it took 5 months for us to reach an agreement there. The gain also wouldn't be that much. The purpose of this code is to have less impact on performance by removing unnecessary calls.
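For comparison, here is a hedged sketch of the gauge-per-metric style being discussed, wired to the cached getter sketch from above. The class name, the registerGauge helper, and the constructor parameter are illustrative assumptions; only the source name "procfs", the "processTree" namespace, and the metric names come from the thread and the test output.

```scala
package org.apache.spark.executor

import com.codahale.metrics.{Gauge, MetricRegistry}

import org.apache.spark.metrics.source.Source

// Sketch only: one Dropwizard Gauge per metric, all reading from the cached
// getter so a sink poll does not trigger more than one /proc walk per cache window.
private[executor] class ProcfsMetricsSourceSketch(getter: ProcfsMetricsGetterSketch)
  extends Source {

  override val sourceName = "procfs"
  override val metricRegistry = new MetricRegistry()

  // Register a gauge under procfs.processTree.<name> that extracts one field
  // from the cached snapshot on every poll.
  private def registerGauge(name: String)(value: ProcfsMetrics => Long): Unit = {
    metricRegistry.register(MetricRegistry.name("processTree", name),
      new Gauge[Long] {
        override def getValue: Long = value(getter.allMetrics)
      })
  }

  registerGauge("JVMVMemory")(_.jvmVmemTotal)
  registerGauge("JVMRSSMemory")(_.jvmRSSTotal)
  registerGauge("PythonVMemory")(_.pythonVmemTotal)
  registerGauge("PythonRSSMemory")(_.pythonRSSTotal)
  registerGauge("OtherVMemory")(_.otherVmemTotal)
  registerGauge("OtherRSSMemory")(_.otherRSSTotal)
}
```

The MetricSet alternative mentioned above would instead expose the whole set through a single getMetrics map, at the cost of wrapping each reported value in its own small Metric implementation.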
core/src/main/scala/org/apache/spark/internal/config/package.scala (outdated review comment, resolved)
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala (outdated review comment, resolved)
Test build #100251 has finished for PR 23306 at commit

Test build #100260 has finished for PR 23306 at commit

Test build #100288 has finished for PR 23306 at commit
… synchronized block
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala (outdated review comments, resolved)
Test build #100304 has finished for PR 23306 at commit

Test build #101686 has finished for PR 23306 at commit

We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
This change exposes the executors' procfs metrics that were introduced in SPARK-24958 to the Metrics system. To do this, a new metric source is defined and a new config is added. Using the config, users can choose whether they want to see procfs metrics through the Metrics system. To avoid overhead, a cache is added in ProcfsMetricsGetter so metrics are not recomputed if they have been computed in the past one second.
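For reference, the console-sink setup used in the verification below can be configured in conf/metrics.properties; these sink keys follow Spark's metrics.properties.template. The exact opt-in config key added by this PR is not quoted in the thread, so it is intentionally not shown here.

```properties
# conf/metrics.properties: send all metric sources, including the new procfs
# source, to the console sink. The polling period cannot go below 1 second.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```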
This was tested manually and I verified that procfs metrics are reported in the Metrics system using the Console sink:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 696242176
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4959170560
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104
And they got updated as well:
application_1544653637885_0020.driver.procfs.processTree.JVMRSSMemory
value = 732999680
application_1544653637885_0020.driver.procfs.processTree.JVMVMemory
value = 4977057792
application_1544653637885_0020.driver.procfs.processTree.OtherRSSMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.OtherVMemory
value = 0
application_1544653637885_0020.driver.procfs.processTree.PythonRSSMemory
value = 33714176
application_1544653637885_0020.driver.procfs.processTree.PythonVMemory
value = 401711104