[FLINK-24775][coordination] move JobStatus-related metrics out of ExecutionGraph #17735

zentol · 2021-11-09T12:24:07Z

Based on #17722.

The down-/up-/restartTime metrics are now set up in the schedulers instead of the ExecutionGraph, similar to the numRestart metrics.
To this end the AdaptiveScheduler now maintains its own set of state timestamps, according to the job state transitions that the scheduler advertises (i.e., they are not based on the ExecutionGraph).

This prevents collisions upon rescaling, as the metrics are now registered only once.
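The idea described above can be illustrated with a minimal sketch. It is not the code from this PR: `SketchJobStatus`, `SketchStatusTimestamps` and `SketchStatusGauges` are hypothetical names, the state set is reduced, and the gauge semantics (0 when the job is not in the relevant state) are a simplifying assumption. The point is only that the timestamps follow the transitions the scheduler advertises, and the gauges are created once per job rather than once per ExecutionGraph.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Simplified stand-in for the job states; not Flink's actual JobStatus enum. */
enum SketchJobStatus { CREATED, RUNNING, RESTARTING, FINISHED }

/** Hypothetical store that remembers when the scheduler last advertised each status. */
final class SketchStatusTimestamps {
    private final Map<SketchJobStatus, Long> timestamps = new EnumMap<>(SketchJobStatus.class);
    private SketchJobStatus current = SketchJobStatus.CREATED;

    SketchStatusTimestamps(long creationTimestamp) {
        timestamps.put(SketchJobStatus.CREATED, creationTimestamp);
    }

    /** Invoked by the scheduler on every status transition it advertises. */
    void onStatusChange(SketchJobStatus newStatus, long timestamp) {
        current = newStatus;
        timestamps.put(newStatus, timestamp);
    }

    SketchJobStatus current() {
        return current;
    }

    /** Returns 0 if the status has never been reached. */
    long timestampOf(SketchJobStatus status) {
        return timestamps.getOrDefault(status, 0L);
    }
}

/** Gauges are built once per job, so swapping the ExecutionGraph does not re-register them. */
final class SketchStatusGauges {
    /** Milliseconds since the job entered RUNNING, or 0 if it is not running (assumed semantics). */
    static LongSupplier upTime(SketchStatusTimestamps store, LongSupplier now) {
        return () -> store.current() == SketchJobStatus.RUNNING
                ? Math.max(now.getAsLong() - store.timestampOf(SketchJobStatus.RUNNING), 0L)
                : 0L;
    }

    /** Milliseconds since the job entered RESTARTING, or 0 if it is not restarting (assumed semantics). */
    static LongSupplier downTime(SketchStatusTimestamps store, LongSupplier now) {
        return () -> store.current() == SketchJobStatus.RESTARTING
                ? Math.max(now.getAsLong() - store.timestampOf(SketchJobStatus.RESTARTING), 0L)
                : 0L;
    }
}
```

Because the gauges only hold a reference to the store, a rescale that replaces the underlying graph leaves the registered metrics untouched in this sketch.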

flinkbot · 2021-11-09T12:28:16Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 075f8ee (Tue Nov 09 12:28:16 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-11-09T12:29:24Z

CI report:

4db2506 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

dmvk

The change LGTM overall 👍 My main question would be whether we could unify the JobStatus metrics between DefaultScheduler / AdaptiveScheduler / ExecutionGraph a bit more, by reusing the new JobStatusStore and registerMetrics().

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/JobStatusStore.java

dmvk · 2021-11-12T11:31:54Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

+        // wait for the second task submissions
+        taskManagerGateway.waitForSubmissions(2, Duration.ofSeconds(5));
+
+        // sleep a bit to ensure uptime is > 0


What do you think about passing a Clock instance to Adaptive scheduler instead of relying on System.currentTimeMillis() to simplify the test?

zentol · 2021-11-12

I generally like doing that, but I'm wondering whether this would work properly for the AdaptiveScheduler, in the sense that truly all time measurements go through the clock. For smaller self-contained components it is easy to ensure that, but this isn't the case here because we re-use some parts of the SchedulerBase/DefaultScheduler, there are multiple state classes, then internally there is the EG, ...
It would be a bit unsatisfactory to introduce a clock but only use it in one place :/
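For readers unfamiliar with the suggestion, clock injection could look roughly like the sketch below. It uses the JDK's `java.time.Clock` only to stay self-contained; `ManualTestClock` and `UptimeTracker` are hypothetical names and not classes from this PR. It also shows the caveat raised in the reply: the test is only deterministic if every time read in the component goes through the injected clock.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Hypothetical test clock whose time only moves when the test advances it. */
final class ManualTestClock extends Clock {
    private Instant now;

    ManualTestClock(Instant start) {
        this.now = start;
    }

    void advance(Duration duration) {
        now = now.plus(duration);
    }

    @Override
    public ZoneId getZone() {
        return ZoneOffset.UTC;
    }

    @Override
    public Clock withZone(ZoneId zone) {
        return this; // the zone is irrelevant for millisecond arithmetic in this sketch
    }

    @Override
    public Instant instant() {
        return now;
    }
}

/** Hypothetical component that reads time only through the injected clock. */
final class UptimeTracker {
    private final Clock clock;
    private long runningSince = -1L;

    UptimeTracker(Clock clock) {
        this.clock = clock;
    }

    void markRunning() {
        runningSince = clock.millis();
    }

    /** Returns 0 until markRunning() has been called. */
    long upTimeMillis() {
        return runningSince < 0 ? 0L : clock.millis() - runningSince;
    }
}
```

With such a clock a test can call `advance(Duration.ofMillis(250))` and assert an exact uptime instead of sleeping to ensure uptime is > 0.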

dmvk

LGTM, great job ;)

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

dmvk · 2021-11-22T10:02:58Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java


-        registerMetrics();
+        SchedulerBase.registerJobMetrics(
+                jobManagerJobMetricGroup, jobStatusStore, () -> (long) numRestarts);


Does numRestarts need to be volatile? If I understand correctly, when the metric is accessed e.g. via JMX, it is read by a different thread. Or is there some synchronization in the metrics system that I'm missing?

zentol · 2021-11-22

> Does numRestarts need to be volatile?

Technically yes, but we generally don't do it. It would be too expensive on the hot code paths (aka, we can't be consistent about it anyway), and we haven't had issues so far 🤷

> Or is there some synchronization in the metrics system that I'm missing?

There is none.
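To make the visibility question concrete, here is a minimal sketch of what the "technically yes" variant could look like. `RestartCounter` is a hypothetical class, not code from this PR, and the PR itself deliberately does not do this. The sketch assumes a single writer thread; `volatile` makes its writes visible to a reporter thread but does not make the increment atomic.

```java
import java.util.function.LongSupplier;

/** Hypothetical counter written by one scheduler thread and read by metric reporter threads. */
final class RestartCounter {
    // volatile: a value written by the scheduler thread is visible to later reads
    // from other threads. Without it, a reporter may observe a stale value.
    private volatile int numRestarts;

    /** Assumed to be called from a single thread only; ++ on a volatile field is not atomic. */
    void onRestart() {
        numRestarts++;
    }

    /** Gauge-style view, mirroring the () -> (long) numRestarts lambda in the diff. */
    LongSupplier asGauge() {
        return () -> (long) numRestarts;
    }
}
```

If several threads could increment the counter, an `AtomicInteger` would be the safer choice; for a gauge that is only read for monitoring, a briefly stale value is usually harmless, which matches the trade-off described in the reply.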

rmetzger added the component=Runtime/Coordination label Nov 9, 2021

zentol force-pushed the 24775 branch from eeb376f to 075f8ee Compare November 10, 2021 10:49

zentol mentioned this pull request Nov 11, 2021

[FLINK-24876][docs] Remove metrics limitation of Adaptive Scheduler #17766

Merged

dmvk reviewed Nov 12, 2021

zentol force-pushed the 24775 branch from c216393 to 6b1effa Compare November 15, 2021 10:32

dmvk approved these changes Nov 22, 2021

zentol added 3 commits November 22, 2021 15:38

[FLINK-24775][coordination] move JobStatus-related metrics out of Exe…

ac84e20

…cutionGraph

[hotfix] Remove MetricGroup parameter from EG builder

2d0db12

[FLINK-24903][coordination] Harden AdaptiveSchedulerTest

4db2506

zentol force-pushed the 24775 branch from 41a86c7 to 4db2506 Compare November 22, 2021 14:39

zentol merged commit 60f2f3c into apache:master Nov 22, 2021

zentol deleted the 24775 branch November 25, 2021 11:22

4 participants
