[SPARK-48984] Add Controller Metrics System and Utils #23
Conversation
dongjoon-hyun
left a comment
Thank you for making a spin-off.
BTW, please write the PR description as completely as for a standalone PR next time, @jiangzho .
.enableDynamicOverride(true)
.description(
    "Comma-separated list of namespaces that the operator would be watching for "
        + "Spark resources. If set to '*' or unset, operator would watch all namespaces.")
Thank you for supporting this syntax.
I saw the implementation.
if ("*".equals(str)) {
return Collections.emptySet();
}
Although I understand that you meant the following, the description about unset could be a little misleading because this configuration has a default value ('default').
spark.kubernetes.operator.watchedNamespaces=*
spark.kubernetes.operator.watchedNamespaces= # blank
Yep! I was hesitating about 'unset' (which under the hood can be considered as not setting any namespaces), but indeed it can be quite confusing at the properties level. I'll update this to advise users to use '*' only instead of unset.
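
To make the semantics above concrete, here is a minimal sketch of a hypothetical parse helper. The class and method names are made up, and treating a blank value as watch-all is an assumption for illustration; it is not the PR's actual implementation:

import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

public final class WatchedNamespacesSketch {
  // Hypothetical helper: '*' (and, for this sketch only, a blank value) means
  // "watch all namespaces", modeled as an empty set as in the snippet above.
  static Set<String> parseWatchedNamespaces(String value) {
    if (value == null || value.trim().isEmpty() || "*".equals(value.trim())) {
      return Collections.emptySet();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    System.out.println(parseWatchedNamespaces("*"));          // [] -> watch everything
    System.out.println(parseWatchedNamespaces("ns-a, ns-b")); // [ns-a, ns-b]
  }
}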
| .key("spark.kubernetes.operator.reconciler.foregroundRequestTimeoutSeconds") | ||
| .enableDynamicOverride(true) | ||
| .description( | ||
| "Timeout (in seconds) to for requests made to API server. this " |
this -> This
.defaultValue(30L)
.build();

public static final ConfigOption<Long> SPARK_APP_RECONCILE_INTERVAL_SECONDS =
Can we keep the pattern consistent? For example, since the name is spark.kubernetes.operator.reconciler,
SPARK_APP_RECONCILE_INTERVAL_SECONDS -> RECONCILER_SPARK_APP_INTERVAL_SECONDS?
| + "updated. This interval controls the reconcile behavior of operator " | ||
| + "reconciliation even when there's no update on SparkApplication, e.g. to " | ||
| + "determine whether a hanging app needs to be proactively terminated. Thus " | ||
| + "this is recommended to set to above 2 min to avoid unnecessary no-op " |
min -> minutes
public static final ConfigOption<Boolean> TRIM_ATTEMPT_STATE_TRANSITION_HISTORY =
    ConfigOption.<Boolean>builder()
        .key("spark.kubernetes.operator.reconciler.trimStateTransitionHistoryEnabled")
To use Enabled here and spark.kubernetes.operator.metrics.clientMetricsEnabled, we need to change the following for consistency.
- spark.kubernetes.operator.terminateOnInformerFailure
+ spark.kubernetes.operator.terminateOnInformerFailureEnabled
    }
  }
  return result;
} catch (Exception e) {
Is this the narrowest exception?
This one was intentionally not narrowed down, for metrics purposes (so that any failure would be captured by the metrics). The exception is rethrown at the end to ensure it's still handled. There's a unit test added for this purpose as well.
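
As an illustration of the pattern described here (catch broadly so the metrics see any failure, then rethrow), a minimal sketch using Dropwizard Metrics; the class, method and metric names are hypothetical and this is not the PR's code:

import java.util.concurrent.Callable;
import com.codahale.metrics.MetricRegistry;

public final class MeteredCallSketch {
  private final MetricRegistry registry = new MetricRegistry();

  // Intentionally broad catch: any failure is counted, then rethrown so the
  // caller still has to handle it (which a unit test can assert).
  <T> T callAndMeter(String metricName, Callable<T> action) throws Exception {
    try {
      T result = action.call();
      registry.counter(metricName + ".success").inc();
      return result;
    } catch (Exception e) {
      registry.counter(metricName + ".failure").inc();
      throw e;
    }
  }
}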
      listeners.add((T) listenerClass.getConstructor().newInstance());
    }
  }
} catch (Exception e) {
Is this the narrowest one?
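
For reference, if narrowing were desired for this reflective instantiation, ReflectiveOperationException covers ClassNotFoundException, NoSuchMethodException, InstantiationException, IllegalAccessException and InvocationTargetException. A hypothetical sketch of that shape, not the PR's code:

import java.util.ArrayList;
import java.util.List;

public final class ListenerLoaderSketch {
  // Hypothetical: instantiate listener classes by name with a narrowed catch.
  @SuppressWarnings("unchecked")
  static <T> List<T> loadListeners(List<String> classNames) {
    List<T> listeners = new ArrayList<>();
    for (String className : classNames) {
      try {
        Class<?> listenerClass = Class.forName(className);
        listeners.add((T) listenerClass.getConstructor().newInstance());
      } catch (ReflectiveOperationException e) {
        // Parent of ClassNotFoundException, NoSuchMethodException,
        // InstantiationException, IllegalAccessException, InvocationTargetException.
        throw new IllegalStateException("Failed to instantiate listener: " + className, e);
      }
    }
    return listeners;
  }
}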
  sinks.forEach(Sink::stop);
  registry.removeMatching(MetricFilter.ALL);
} else {
  log.error("Stopping a MetricsSystem that is not running");
I'm wondering when you use log.isErrorEnabled? If we want to check log.isErrorEnabled, we had better use it consistently.
As recommended by PMD, we enabled the corresponding Gradle task to enforce guarded log statements when they involve String creation and manipulation. I think this one is not reported because there's no string manipulation here.
Ack. It makes sense to me too.
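
For context, a small sketch of what PMD's GuardLogStatement rule distinguishes, assuming an SLF4J logger; this is illustrative, not code from this PR:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class GuardedLoggingSketch {
  private static final Logger log = LoggerFactory.getLogger(GuardedLoggingSketch.class);

  void report(Object state) {
    // Constant message, no string manipulation: no guard needed, not flagged.
    log.error("Stopping a MetricsSystem that is not running");

    // String concatenation involved: guard it so the concatenation is skipped
    // when the ERROR level is disabled (this is what the rule enforces).
    if (log.isErrorEnabled()) {
      log.error("Unexpected state: " + state.toString().toUpperCase());
    }
  }
}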
}

@Data
public static class SinkProps {
Can we rename SinkProps -> SinkProperties?
| && e.getCode() == 409 | ||
| && e.getStatus() != null | ||
| && StringUtils.isNotEmpty(e.getStatus().toString()) | ||
| && e.getStatus().toString().toLowerCase().contains("alreadyexists"); |
Just a question: when would we have another string here for the 409 code?
This is used to handle only specific 409 cases (e.g. a request timed out but the resource was actually created; this helps the reconciler to proceed).
Other 409 conflicts - quota, status patch, etc. - are handled separately in the corresponding reconcile logic.
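
A minimal sketch of the check being discussed, written against what appears to be the fabric8 KubernetesClientException API used in the diff above (getCode / getStatus); the wrapper class and method names are hypothetical:

import io.fabric8.kubernetes.client.KubernetesClientException;
import org.apache.commons.lang3.StringUtils;

public final class ConflictCheckSketch {
  // Treat a 409 whose status mentions "alreadyexists" as "the resource is already there",
  // e.g. a create request timed out but actually succeeded, so reconciliation can proceed.
  // Other 409 conflicts (quota, status patch, ...) are left to the regular reconcile logic.
  static boolean isConflictForExistingResource(KubernetesClientException e) {
    return e != null
        && e.getCode() == 409
        && e.getStatus() != null
        && StringUtils.isNotEmpty(e.getStatus().toString())
        && e.getStatus().toString().toLowerCase().contains("alreadyexists");
  }
}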
Thank you for updating.
| public static final ConfigOption<Double> RECONCILER_RETRY_INTERVAL_MULTIPLIER = | ||
| ConfigOption.<Double>builder() | ||
| .key("spark.kubernetes.operator.retry.intervalMultiplier") | ||
| .key("spark.kubernetes.operator.reconciler.retry.intervalMultiplier") |
Thank you for changing the namespace of retry.*.
| public static final ConfigOption<Integer> API_RETRY_MAX_ATTEMPTS = | ||
| ConfigOption.<Integer>builder() | ||
| .key("spark.kubernetes.operator.retry.maxAttempts") | ||
| .key("spark.kubernetes.operator.api.retryMaxAttempts") |
New namespace spark.kubernetes.operator.api could be misleading.
This is the backend K8s API-related configuration, right? If so, can we rename this more specifically in order to leave room for Apache Spark Kubernetes Operator's API-related configurations?
      listeners.add((T) listenerClass.getConstructor().newInstance());
    }
  }
} catch (Exception e) {
Thanks.
dongjoon-hyun
left a comment
+1, LGTM except one comment about spark.kubernetes.operator.api
However, let's handle it separately, @jiangzho .
I'm going to merge this in the AS-IS status.
Merged to main.
What changes were proposed in this pull request?
This is a breakdown PR of #12 - it defines the metrics system and utils classes to be used by the reconcilers.
It also refactors a few methods from the previous utils class org.apache.spark.k8s.operator.reconciler.SparkReconcilerUtils into a common utils package.

Why are the changes needed?
Breakdown PRs help us to move with more flexibility.

Does this PR introduce any user-facing change?
No

How was this patch tested?
CIs

Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#23 from jiangzho/controller_utils.
Authored-by: zhou-jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>