@LucaCanali
Contributor

What changes were proposed in this pull request?

The YARN applicationMaster metrics registration introduced in SPARK-24594 also registers the static metrics sources (CodeGenerator and HiveExternalCatalog) and the JVM metrics, which I believe do not belong in this context.
This looks like an unintended side effect of using the start method of [[MetricsSystem]].
A possible solution, proposed here, is to introduce startNoRegisterSources to avoid these additional registrations of static sources and JVM sources in the case of YARN applicationMaster metrics (this could also be useful for other metrics that may be added in the future).

How was this patch tested?

Manually tested on a YARN cluster.

@jerryshao
Contributor

Hi @LucaCanali, do you have sample output of the current AM metrics? I would like to know what kinds of metrics are reported at the moment.

sinks.foreach(_.start)
}

// Same as start but this method only registers sinks
Contributor

Would you please explain why only registering sinks could solve the problem here?

Contributor Author

What I am trying to do is avoid registering the "static metrics" sources (CodeGenerator and HiveExternalCatalog) as well as the JVM source.

@LucaCanali
Contributor Author

Hi @jerryshao, below is an example of the metrics currently reported by the applicationMaster, illustrating the issue. The list includes the AM metrics, which by default carry the application ID as a prefix. Metrics for CodeGenerator and HiveExternalCatalog are also reported; these do not make sense in this context and, in addition, have no prefix. JVM metrics are reported too (without the application ID prefix), and I am not sure that is wanted either.

bin/spark-shell --master yarn \
--conf "spark.metrics.conf.applicationMaster.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"=lc-mytest5 \
--conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
--conf "spark.metrics.conf.*.sink.graphite.period"=10 \
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
--conf "spark.metrics.conf.*.sink.graphite.prefix"="luca" \
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"

I have used InfluxDB to collect the metrics. This is the output of "show measurements" in InfluxDB:

name: measurements
name
----
CodeGenerator.compilationTime.count
CodeGenerator.compilationTime.max
CodeGenerator.compilationTime.mean
CodeGenerator.compilationTime.min
CodeGenerator.compilationTime.p50
CodeGenerator.compilationTime.p75
CodeGenerator.compilationTime.p95
CodeGenerator.compilationTime.p98
CodeGenerator.compilationTime.p99
CodeGenerator.compilationTime.p999
CodeGenerator.compilationTime.stddev
CodeGenerator.generatedClassSize.count
CodeGenerator.generatedClassSize.max
CodeGenerator.generatedClassSize.mean
CodeGenerator.generatedClassSize.min
CodeGenerator.generatedClassSize.p50
CodeGenerator.generatedClassSize.p75
CodeGenerator.generatedClassSize.p95
CodeGenerator.generatedClassSize.p98
CodeGenerator.generatedClassSize.p99
CodeGenerator.generatedClassSize.p999
CodeGenerator.generatedClassSize.stddev
CodeGenerator.generatedMethodSize.count
CodeGenerator.generatedMethodSize.max
CodeGenerator.generatedMethodSize.mean
CodeGenerator.generatedMethodSize.min
CodeGenerator.generatedMethodSize.p50
CodeGenerator.generatedMethodSize.p75
CodeGenerator.generatedMethodSize.p95
CodeGenerator.generatedMethodSize.p98
CodeGenerator.generatedMethodSize.p99
CodeGenerator.generatedMethodSize.p999
CodeGenerator.generatedMethodSize.stddev
CodeGenerator.sourceCodeSize.count
CodeGenerator.sourceCodeSize.max
CodeGenerator.sourceCodeSize.mean
CodeGenerator.sourceCodeSize.min
CodeGenerator.sourceCodeSize.p50
CodeGenerator.sourceCodeSize.p75
CodeGenerator.sourceCodeSize.p95
CodeGenerator.sourceCodeSize.p98
CodeGenerator.sourceCodeSize.p99
CodeGenerator.sourceCodeSize.p999
CodeGenerator.sourceCodeSize.stddev
HiveExternalCatalog.fileCacheHits.count
HiveExternalCatalog.filesDiscovered.count
HiveExternalCatalog.hiveClientCalls.count
HiveExternalCatalog.parallelListingJobCount.count
HiveExternalCatalog.partitionsFetched.count
application_1516620698330_110908.applicationMaster.numContainersPendingAllocate
application_1516620698330_110908.applicationMaster.numExecutorsFailed
application_1516620698330_110908.applicationMaster.numExecutorsRunning
application_1516620698330_110908.applicationMaster.numLocalityAwareTasks
application_1516620698330_110908.applicationMaster.numReleasedContainers
jvm.PS-MarkSweep.count
jvm.PS-MarkSweep.time
jvm.PS-Scavenge.count
jvm.PS-Scavenge.time
jvm.direct.capacity
jvm.direct.count
jvm.direct.used
jvm.heap.committed
jvm.heap.init
jvm.heap.max
jvm.heap.usage
jvm.heap.used
jvm.mapped.capacity
jvm.mapped.count
jvm.mapped.used
jvm.non-heap.committed
jvm.non-heap.init
jvm.non-heap.max
jvm.non-heap.usage
jvm.non-heap.used
jvm.pools.Code-Cache.committed
jvm.pools.Code-Cache.init
jvm.pools.Code-Cache.max
jvm.pools.Code-Cache.usage
jvm.pools.Code-Cache.used
jvm.pools.Compressed-Class-Space.committed
jvm.pools.Compressed-Class-Space.init
jvm.pools.Compressed-Class-Space.max
jvm.pools.Compressed-Class-Space.usage
jvm.pools.Compressed-Class-Space.used
jvm.pools.Metaspace.committed
jvm.pools.Metaspace.init
jvm.pools.Metaspace.max
jvm.pools.Metaspace.usage
jvm.pools.Metaspace.used
jvm.pools.PS-Eden-Space.committed
jvm.pools.PS-Eden-Space.init
jvm.pools.PS-Eden-Space.max
jvm.pools.PS-Eden-Space.usage
jvm.pools.PS-Eden-Space.used
jvm.pools.PS-Old-Gen.committed
jvm.pools.PS-Old-Gen.init
jvm.pools.PS-Old-Gen.max
jvm.pools.PS-Old-Gen.usage
jvm.pools.PS-Old-Gen.used
jvm.pools.PS-Survivor-Space.committed
jvm.pools.PS-Survivor-Space.init
jvm.pools.PS-Survivor-Space.max
jvm.pools.PS-Survivor-Space.usage
jvm.pools.PS-Survivor-Space.used
jvm.total.committed
jvm.total.init
jvm.total.max
jvm.total.used
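
To make the split in the list above easier to see, here is a minimal sketch (not part of the patch) that partitions measurement names by whether they start with a YARN application ID. The object name MetricPrefixCheck and the method splitByAppIdPrefix are hypothetical, and the pattern assumes IDs of the form application_<digits>_<digits>, as in the output above.

```scala
object MetricPrefixCheck {
  // Assumed shape of a YARN application ID prefix, e.g. "application_1516620698330_110908."
  private val AppIdPrefix = """application_\d+_\d+\.""".r

  /** Splits metric names into (names prefixed with an application ID, names without one). */
  def splitByAppIdPrefix(names: Seq[String]): (Seq[String], Seq[String]) =
    names.partition(name => AppIdPrefix.findPrefixOf(name).isDefined)
}
```

Applied to the measurements listed above, only the five applicationMaster metrics would land in the first group; the CodeGenerator, HiveExternalCatalog and jvm entries would all end up in the second.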

@LucaCanali
Contributor Author

@jerryshao would you have any additional comments on this?

@LucaCanali
Contributor Author

@attilapiros would you be interested in reviewing this as a follow-up to your work on [SPARK-24594][YARN] Introducing metrics for YARN?

sinks.foreach(_.start)
}

// Same as start but this method only registers sinks
Contributor

@attilapiros Oct 15, 2018

As I see it, this duplicates the start() method without the registerSources() call. What about extending start() with a new boolean flag (like registerStaticSources: Boolean = true) and using the flag to decide whether registerSources() should be called? Then in the AM you can call it with false. This way there is no code duplication, and the difference between the two usages is clearer.
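
A minimal sketch of the flag-based approach, using a toy stand-in rather than Spark's actual MetricsSystem. The class ToyMetricsSystem, its constructor parameters, and registeredSources are hypothetical; only the names start, registerSource, registerSources, and registerStaticSources come from this discussion. The sketch assumes that a false flag skips both the static sources and the sources declared in the metrics configuration.

```scala
import scala.collection.mutable

// Toy model: sources are plain names instead of real metric sources.
final class ToyMetricsSystem(staticSources: Seq[String], configuredSources: Seq[String]) {
  private val registered = mutable.ArrayBuffer.empty[String]
  private var running = false

  /** Names of all sources registered so far, in registration order. */
  def registeredSources: List[String] = registered.toList

  /** Registers a single source explicitly (e.g. the AM's own source). */
  def registerSource(name: String): Unit = { registered += name }

  // Registers the sources declared in the metrics configuration (e.g. a JVM source).
  private def registerSources(): Unit = configuredSources.foreach(registerSource)

  def start(registerStaticSources: Boolean = true): Unit = {
    require(!running, "already running")
    running = true
    if (registerStaticSources) {
      staticSources.foreach(registerSource)
      registerSources()
    }
    // In the real class, sinks would be registered and started here in both cases.
  }
}
```

With this shape, existing callers keep the default behaviour through start(), while the AM would call start(registerStaticSources = false) after registering its own source.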

Contributor Author

Thanks @attilapiros for looking at this. I agree with your proposal. I'll provide an update to the PR.

Contributor

I was thinking a bit more about this PR: in client mode it could make sense to keep some of these static metrics, since the driver and the AM are separate. What about (assuming the boolean flag is there) calling the start method with false only in cluster mode?

Contributor Author

I would split this into two topics:

  1. I believe that CodeGenerator and HiveExternalCatalog metrics don't make sense in the context of the AM, so they can be safely removed.
  2. The JVM metrics may be relevant, as you mentioned. However, in the current version the JVM metrics for the AM appear without any application ID or prefix, so they are difficult to process. I guess this part can be improved if we think JVM metrics for the AM are of interest.

Contributor

I fully agree with you.

Contributor

Just to close the loop here: I like the extra flag to start instead of the separate method.

As for AM vs. driver, the way I understand things, in cluster mode there will be two MetricsSystem instances - one for the AM and one for the driver, so you shouldn't lose any driver metrics.

JVM metrics for the client-mode AM can be interesting, but that's generally not a source of problems that I've noticed, so we can probably punt on it for now.

Contributor Author

I have implemented the extra flag to start and tested the patch on a YARN cluster using a graphite sink for the metrics.

@vanzin
Contributor

vanzin commented Dec 10, 2018

ok to test

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99939 has finished for PR 22279 at commit a990758.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LucaCanali
Contributor Author

Thanks @vanzin for looking at this.

@SparkQA

SparkQA commented Dec 12, 2018

Test build #100017 has finished for PR 22279 at commit fd11730.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 12, 2018

retest this please

running = true
StaticSources.allSources.foreach(registerSource)
registerSources()
if(registerStaticSources) {
Contributor

nit: space before (

@SparkQA

SparkQA commented Dec 12, 2018

Test build #100038 has finished for PR 22279 at commit fd11730.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100044 has finished for PR 22279 at commit 9c21f16.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 13, 2018

Giving up on flaky tests... interesting ones all passed. Merging to master.

@asfgit asfgit closed this in 2920438 Dec 13, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…r static metrics

## What changes were proposed in this pull request?

YARN applicationMaster metrics registration introduced in SPARK-24594 causes further registration of static metrics (Codegenerator and HiveExternalCatalog) and of JVM metrics, which I believe do not belong in this context.
This looks like an unintended side effect of using the start method of [[MetricsSystem]].
A possible solution proposed here, is to introduce startNoRegisterSources to avoid these additional registrations of static sources and of JVM sources in the case of YARN applicationMaster metrics (this could be useful for other metrics that may be added in the future).

## How was this patch tested?

Manually tested on a YARN cluster,

Closes apache#22279 from LucaCanali/YarnMetricsRemoveExtraSourceRegistration.

Lead-authored-by: Luca Canali <[email protected]>
Co-authored-by: LucaCanali <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@dongjoon-hyun
Member

Hi, all. Sorry for commenting on this old PR. This looks safe and worth backporting.
Since branch-2.4 is our LTS branch for 2.4.x, I'll test and backport this to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Sep 16, 2019