Conversation

@jiangzho (Contributor)

What changes were proposed in this pull request?

This is a breakdown PR of #12: it defines the metrics system and utility classes to be used by the reconcilers.

It also refactors a few methods from the previous utility class org.apache.spark.k8s.operator.reconciler.SparkReconcilerUtils into the common utils package.

Why are the changes needed?

Breakdown PRs help us move with more flexibility.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CIs

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun (Member) left a comment

Thank you for making a spin-off.

BTW, please write the PR description to be as complete as a standalone PR's next time, @jiangzho.

.enableDynamicOverride(true)
.description(
"Comma-separated list of namespaces that the operator would be watching for "
+ "Spark resources. If set to '*' or unset, operator would watch all namespaces.")
Member

Thank you for supporting this syntax.

I saw the implementation.

    if ("*".equals(str)) {
      return Collections.emptySet();
    }

Although I understand that you meant the following, the description about unset could be a little misleading, because this configuration has a default value (default).

spark.kubernetes.operator.watchedNamespaces=*
spark.kubernetes.operator.watchedNamespaces=     # blank

Contributor Author

Yep! I was hesitant about 'unset' (which, under the hood, can be considered as not setting any namespaces), but indeed it can be quite confusing at the properties level. I'll update this to advise users to use * only, instead of unset.
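The '*' semantics discussed above could be sketched roughly as follows. This is a hypothetical illustration (the class and method names are not from the operator's codebase); it mirrors the quoted snippet in which an empty set signals "watch all namespaces":

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the watchedNamespaces parsing discussed above.
// An empty result set means "watch all namespaces".
public class WatchedNamespaces {
  public static Set<String> parse(String str) {
    // '*' (or a blank value) is treated as "watch all namespaces".
    if (str == null || str.isBlank() || "*".equals(str.trim())) {
      return Collections.emptySet();
    }
    // Otherwise, split the comma-separated list into a trimmed set.
    return Arrays.stream(str.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet());
  }
}
```

Under this sketch, both `spark.kubernetes.operator.watchedNamespaces=*` and a blank value resolve to the empty set, which is why recommending only `*` avoids the ambiguity with the default value.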

.key("spark.kubernetes.operator.reconciler.foregroundRequestTimeoutSeconds")
.enableDynamicOverride(true)
.description(
"Timeout (in seconds) to for requests made to API server. this "
Member

this -> This

.defaultValue(30L)
.build();

public static final ConfigOption<Long> SPARK_APP_RECONCILE_INTERVAL_SECONDS =
Member

Can we keep the pattern consistent? For example, since the prefix is spark.kubernetes.operator.reconciler,
SPARK_APP_RECONCILE_INTERVAL_SECONDS -> RECONCILER_SPARK_APP_INTERVAL_SECONDS?

+ "updated. This interval controls the reconcile behavior of operator "
+ "reconciliation even when there's no update on SparkApplication, e.g. to "
+ "determine whether a hanging app needs to be proactively terminated. Thus "
+ "this is recommended to set to above 2 min to avoid unnecessary no-op "
Member

min -> minutes


public static final ConfigOption<Boolean> TRIM_ATTEMPT_STATE_TRANSITION_HISTORY =
ConfigOption.<Boolean>builder()
.key("spark.kubernetes.operator.reconciler.trimStateTransitionHistoryEnabled")
Member

Since Enabled is used here and in spark.kubernetes.operator.metrics.clientMetricsEnabled, we need to change the following for consistency.

- spark.kubernetes.operator.terminateOnInformerFailure
+ spark.kubernetes.operator.terminateOnInformerFailureEnabled

}
}
return result;
} catch (Exception e) {
Member

Is this the narrowest exception?

Contributor Author

This one was intentionally not narrowed down, for metrics purposes (so that any exception is captured by the metrics). The exception is rethrown at the end to ensure it's still handled. There's a unit test added for this purpose as well.
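The catch-record-rethrow pattern described here could be sketched as below. This is a minimal, hypothetical illustration (MeteredCall and its counter are invented names, not the operator's actual metrics classes): catch broadly so every failure type is counted, then rethrow so the caller still handles it.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical sketch of the broad-catch-for-metrics pattern discussed above.
public class MeteredCall {
  private final AtomicLong failureCounter = new AtomicLong();

  public <T> T recordAndRethrow(Supplier<T> action) {
    try {
      return action.get();
    } catch (Exception e) {
      // Broad catch is intentional: every failure is counted in metrics.
      failureCounter.incrementAndGet();
      // Rethrown so the exception is still handled upstream
      // (Java's precise rethrow preserves the original type).
      throw e;
    }
  }

  public long failures() {
    return failureCounter.get();
  }
}
```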

listeners.add((T) listenerClass.getConstructor().newInstance());
}
}
} catch (Exception e) {
Member

Is this the narrowest one?

sinks.forEach(Sink::stop);
registry.removeMatching(MetricFilter.ALL);
} else {
log.error("Stopping a MetricsSystem that is not running");
Member

I'm wondering when you use log.isErrorEnabled. If we want to check log.isErrorEnabled, we had better use it consistently.

Contributor Author
@jiangzho Jul 25, 2024

As recommended by PMD, we enabled the corresponding Gradle task to enforce guarded log statements when they involve String creation and manipulation. I think this one is not reported because there is no string manipulation.
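The distinction PMD's GuardLogStatement rule draws can be illustrated as follows. This sketch uses java.util.logging purely for self-containment (the project itself uses SLF4J-style logging); the principle is the same: a guard only pays off when the message requires string construction.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustration of the guard-log-statement idea discussed above.
public class GuardedLogging {
  private static final Logger log = Logger.getLogger(GuardedLogging.class.getName());

  static String expensiveSummary() {
    // Imagine this builds a large string from cluster state.
    return "state=" + System.nanoTime();
  }

  public static void logState() {
    // Guard needed: the concatenation below would run even when the
    // level is disabled, wasting work.
    if (log.isLoggable(Level.FINE)) {
      log.fine("Reconciler state: " + expensiveSummary());
    }
    // No guard needed: a constant message costs nothing to pass,
    // so PMD does not flag it.
    log.severe("Stopping a MetricsSystem that is not running");
  }
}
```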

Member

Ack. It makes sense to me too.

}

@Data
public static class SinkProps {
Member

Can we rename SinkProps -> SinkProperties?

&& e.getCode() == 409
&& e.getStatus() != null
&& StringUtils.isNotEmpty(e.getStatus().toString())
&& e.getStatus().toString().toLowerCase().contains("alreadyexists");
Member

Just a question: when would we have another string here for the 409 code?

Contributor Author

This is used to handle only specific 409 cases (e.g. a request timed out but the resource was actually created; this helps the reconciler proceed).

Other 409 conflicts (quota, status patch, etc.) are handled separately in the corresponding reconcile logic.
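Isolating the check from the quoted snippet makes the intent easier to see. This is a hypothetical, simplified sketch (the real code inspects a Kubernetes client exception's status object, not a plain string): a 409 is treated as "already exists" only when the status message says so.

```java
// Hypothetical sketch of the 409 handling discussed above: a conflict is
// treated as "already exists" only when the status text confirms it, so a
// timed-out create that actually succeeded lets the reconciler proceed.
public class ConflictCheck {
  public static boolean isAlreadyExistsConflict(int code, String statusMessage) {
    return code == 409
        && statusMessage != null
        && !statusMessage.isEmpty()
        && statusMessage.toLowerCase().contains("alreadyexists");
  }
}
```

Other 409 reasons (e.g. quota or status-patch conflicts) fail this check and fall through to their own handling.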

@dongjoon-hyun (Member)

Thank you for updating.

public static final ConfigOption<Double> RECONCILER_RETRY_INTERVAL_MULTIPLIER =
ConfigOption.<Double>builder()
- .key("spark.kubernetes.operator.retry.intervalMultiplier")
+ .key("spark.kubernetes.operator.reconciler.retry.intervalMultiplier")
Member

Thank you for changing the namespace of retry.*.

public static final ConfigOption<Integer> API_RETRY_MAX_ATTEMPTS =
ConfigOption.<Integer>builder()
- .key("spark.kubernetes.operator.retry.maxAttempts")
+ .key("spark.kubernetes.operator.api.retryMaxAttempts")
Member

New namespace spark.kubernetes.operator.api could be misleading.

This is the backend K8s API-related configuration, right? If so, can we rename it more specifically, in order to leave room for the Apache Spark Kubernetes Operator's own API-related configurations?

listeners.add((T) listenerClass.getConstructor().newInstance());
}
}
} catch (Exception e) {
Member

Thanks.

@dongjoon-hyun (Member) left a comment

+1, LGTM except one comment about spark.kubernetes.operator.api

However, let's handle it separately, @jiangzho.

I'm going to merge this in the AS-IS status.

@dongjoon-hyun (Member)

Merged to main.

@dongjoon-hyun dongjoon-hyun mentioned this pull request Jul 29, 2024
jiangzho added a commit to jiangzho/spark-kubernetes-operator that referenced this pull request Jul 17, 2025

Closes apache#23 from jiangzho/controller_utils.

Authored-by: zhou-jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>