@pgandhi999 pgandhi999 commented Sep 20, 2018

The ability to expose metrics for the YARN Shuffle Service was recently added as part of SPARK-18364. We need additional metrics to determine the number of active connections and the number of open connections to the external shuffle service, so that network and connection issues can be benchmarked in large cluster environments.

What changes were proposed in this pull request?

Added two more shuffle server metrics for the Spark YARN shuffle service: numRegisteredConnections, which indicates the number of registered connections to the shuffle service, and numActiveConnections, which indicates the number of active connections to the shuffle service at any given point in time.

How was this patch tested?

When these metrics are written to a file, the output looks something like this:

1533674653489 default.shuffleService: Hostname=server1.abc.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706,
numRegisteredConnections=35,
blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35
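
For anyone who wants to check the two new values in a file like the one above, here is a minimal sketch of reading such a record. It assumes the record keeps the shape shown here (a "timestamp context.record:" prefix followed by comma-separated key=value pairs). The class name ShuffleMetricsLineParser and its parse method are hypothetical and are not part of this patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: reads one metrics record
// in the "prefix: k1=v1, k2=v2, ..." shape shown in the sample output.
public class ShuffleMetricsLineParser {

    // Returns the key=value pairs that follow the first ": " in the record.
    // Pairs may be separated by commas and any whitespace, including newlines.
    static Map<String, String> parse(String record) {
        Map<String, String> values = new LinkedHashMap<>();
        int prefixEnd = record.indexOf(": ");
        String body = prefixEnd >= 0 ? record.substring(prefixEnd + 2) : record;
        for (String pair : body.split(",")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                values.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened copy of the sample record from the description.
        String record = "1533674653489 default.shuffleService: Hostname=server1.abc.com,\n"
            + "numRegisteredConnections=35,\n"
            + "registeredExecutorsSize=5, numActiveConnections=35";
        Map<String, String> metrics = parse(record);
        long registered = Long.parseLong(metrics.get("numRegisteredConnections"));
        long active = Long.parseLong(metrics.get("numActiveConnections"));
        System.out.println("registered=" + registered + " active=" + active);
    }
}
```

The parser keeps values as strings so that rates such as 2.8270296777387467E-6 pass through unchanged; callers convert only the fields they need.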

Added shuffle server metrics for the Spark YARN shuffle service. I have made my changes on top of Andrew Ash's PR and have added two more metrics: numRegisteredConnections, which indicates the number of registered connections to the shuffle service, and numActiveConnections, which indicates the number of active connections to the shuffle service at any given point in time. When these metrics are written to a file, the output looks something like this:

1533674653489 default.shuffleService: Hostname=openqe26blue-n9.blue.ygrid.yahoo.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706,
numRegisteredConnections=35,
blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35
@tgravescs
Contributor

ok to test

@SparkQA
SparkQA commented Sep 20, 2018

Test build #96380 has finished for PR 22498 at commit cf74f36.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ShuffleMetrics implements MetricSet
  • public class YarnShuffleServiceMetrics implements MetricsSource

@SparkQA
SparkQA commented Sep 21, 2018

Test build #96382 has finished for PR 22498 at commit 1ac18d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

@pgandhi999 thanks for the PR. Since you have additional changes, and the original author has also made additional changes here: palantir#236, perhaps for now we should review the lowest-common-denominator version and get that in. Additional metrics can then be added on top of it in separate PRs.

@pgandhi999
Author

@tgravescs Fine with me, thank you.

@pgandhi999 pgandhi999 changed the title from "[SPARK-18364] : Expose metrics for YarnShuffleService" to "[SPARK-25642] : Adding two new metrics to record the number of registered connections as well as the number of active connections to YARN Shuffle Service" Oct 4, 2018
@SparkQA
SparkQA commented Oct 4, 2018

Test build #96950 has finished for PR 22498 at commit edd355e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Oct 6, 2018

Test build #97013 has finished for PR 22498 at commit 70472a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// register metrics on the block handler into the Node Manager's metrics system.
blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
shuffleServer.getRegisteredConnections());
Contributor

nit: indented too far, here and in others.

@vanzin
Contributor

vanzin commented Dec 11, 2018

retest this please

@SparkQA
SparkQA commented Dec 12, 2018

Test build #99996 has finished for PR 22498 at commit 70472a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 12, 2018

@tgravescs it's not clear from your last comment whether you wanted more things to be added here? otherwise I was planning to merge it.

@tgravescs
Contributor

No, the other ones can be done separately. Sorry I haven't had time to review, so thanks for doing it @vanzin.


@Override
public void channelRegistered(ChannelHandlerContext ctx) {
transportContext.getRegisteredConnections().inc();
Contributor

I think you need to call super.blah in these overrides so that other handlers in the pipeline also get the notification.
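
To illustrate the point about propagation, here is a small plain-Java model of the pattern. It is a sketch and not Netty code: Handler, CountingHandler and RecordingHandler are hypothetical stand-ins, and LongAdder stands in for the metrics Counter used in the patch. In Netty itself, the adapter's default channelRegistered forwards the event with ctx.fireChannelRegistered(), so an override that skips the super call stops the event at that handler.

```java
import java.util.concurrent.atomic.LongAdder;

// Plain-Java model of event propagation through a handler chain.
// All class names here are hypothetical stand-ins, not Netty or Spark types.
public class SuperCallSketch {

    // Base handler: the default implementation forwards the event
    // to the next handler in the chain, if there is one.
    static class Handler {
        Handler next;

        void channelRegistered() {
            if (next != null) {
                next.channelRegistered();
            }
        }
    }

    // Counts the registration, then calls super so the event keeps flowing.
    static class CountingHandler extends Handler {
        final LongAdder registeredConnections = new LongAdder();

        @Override
        void channelRegistered() {
            registeredConnections.increment();
            super.channelRegistered();
        }
    }

    // Downstream handler that records whether it was notified.
    static class RecordingHandler extends Handler {
        boolean notified;

        @Override
        void channelRegistered() {
            notified = true;
            super.channelRegistered();
        }
    }

    public static void main(String[] args) {
        CountingHandler counting = new CountingHandler();
        RecordingHandler downstream = new RecordingHandler();
        counting.next = downstream;

        counting.channelRegistered();

        System.out.println("registered=" + counting.registeredConnections.sum()
            + " downstreamNotified=" + downstream.notified);
    }
}
```

If the super call in CountingHandler is removed, the counter still increments but the downstream handler never sees the event, which is the failure mode this comment is about.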

Calling superclass methods from overridden methods and fixing indentation.
@pgandhi999
Author

Sorry, I was on vacation, so I am only getting back to the PR today. Thank you for reviewing the PR, @vanzin. I have pushed the required changes.

@SparkQA
SparkQA commented Dec 20, 2018

Test build #100337 has finished for PR 22498 at commit 3c5ef99.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…RK-18364

[SPARK-25642] : Upmerging with master branch
@SparkQA
SparkQA commented Dec 20, 2018

Test build #100340 has finished for PR 22498 at commit b5559e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 20, 2018

retest this please

Counter c = (Counter) metric;
long counterValue = c.getCount();
metricsRecordBuilder.addGauge(new ShuffleServiceMetricsInfo(name, "Number of " +
"connections to shuffle service " + name), counterValue);
Contributor

indentation

Author

Done. Thank you.

@SparkQA
SparkQA commented Dec 21, 2018

Test build #100345 has finished for PR 22498 at commit b5559e3.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Dec 21, 2018

Test build #100362 has finished for PR 22498 at commit e2414ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Merging to master. I'll remove the import during merge.


import java.io.File;
import java.io.IOException;
import java.lang.reflect.Method;
Contributor

Not used.

Author

Thank you so much. Sorry about that.

@asfgit asfgit closed this in 8dd29fe Dec 21, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…gistered connections as well as the number of active connections to YARN Shuffle Service

Recently, the ability to expose the metrics for YARN Shuffle Service was added as part of [SPARK-18364](apache#22485). We need to add some metrics to be able to determine the number of active connections as well as open connections to the external shuffle service to benchmark network and connection issues on large cluster environments.

Added two more shuffle server metrics for Spark Yarn shuffle service: numRegisteredConnections which indicate the number of registered connections to the shuffle service and numActiveConnections which indicate the number of active connections to the shuffle service at any given point in time.

If these metrics are outputted to a file, we get something like this:

1533674653489 default.shuffleService: Hostname=server1.abc.com, openBlockRequestLatencyMillis_count=729, openBlockRequestLatencyMillis_rate15=0.7110833548897356, openBlockRequestLatencyMillis_rate5=1.657808981793011, openBlockRequestLatencyMillis_rate1=2.2404486061620474, openBlockRequestLatencyMillis_rateMean=0.9242558551196706,
numRegisteredConnections=35,
blockTransferRateBytes_count=2635880512, blockTransferRateBytes_rate15=2578547.6094160094, blockTransferRateBytes_rate5=6048721.726302424, blockTransferRateBytes_rate1=8548922.518223226, blockTransferRateBytes_rateMean=3341878.633637769, registeredExecutorsSize=5, registerExecutorRequestLatencyMillis_count=5, registerExecutorRequestLatencyMillis_rate15=0.0027973949328659836, registerExecutorRequestLatencyMillis_rate5=0.0021278007987206426, registerExecutorRequestLatencyMillis_rate1=2.8270296777387467E-6, registerExecutorRequestLatencyMillis_rateMean=0.006339206380043053, numActiveConnections=35

Closes apache#22498 from pgandhi999/SPARK-18364.

Authored-by: pgandhi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
prakharjain09 pushed a commit to prakharjain09/spark that referenced this pull request Nov 29, 2019
(cherry picked from commit 8dd29fe)
XUJiahua pushed a commit to XUJiahua/spark that referenced this pull request Apr 9, 2020
(cherry picked from commit 8dd29fe)
(cherry picked from commit 7a6d4d476a81419be0561d0cb98ad5b89de0782d)

Change-Id: I682129f3c372aa83e382f159423b440344eee266
(cherry picked from commit 905ab9e0b4f67b226ef5ce4e3b2088d0d5cd43aa)