Skip to content

Conversation

@benwtrent
Copy link
Member

This adds the optional boolean parameter defer_validation to the PUT transform API. This parameter will defer any validations that are considered addressable AFTER the configuration has been created.

We do many validations eagerly against the current state of the cluster on PUT. This is so any issues can be captured early to prevent failures later when the data frame transform starts running. But, certain validations could be deferred until _start as the overall cluster state could have mutated between PUT and _start.

The default behavior is to run the validations eagerly.

I did not add any reference or HLRC documentation yet. But will if desired.
@droberts195
Do we want to document this parameter? It seems like a super user parameter and should almost never be used.

closes #43439

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

public final class SourceDestValidator {

private SourceDestValidator() {}
interface SourceDestValidation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments.

I think we should document this new parameter but make very clear what the usecase is. Nit: It would be good to add some logging if the parameter is used (debug level).

public Request(StreamInput in) throws IOException {
super(in);
this.config = new DataFrameTransformConfig(in);
if (in.getVersion().onOrAfter(Version.CURRENT)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for check in BWC code like this: you can alternatively set the version, disable BWC in build.gradle, do the backports and then re-enable BWC. You need the same number of commits in the end, but using the "Version.CURRENT approach" still broke CI, disabling BWC seems to be the better approach now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmuhs I usually mute the tests before I backport in a separate PR. This way BWC can actually run against my changes and I can find any issues earlier rather than later.

private SourceDestValidator() {}
interface SourceDestValidation {
boolean isDeferrable();
void check(DataFrameTransformConfig config, ClusterState clusterState, IndexNameExpressionResolver indexNameExpressionResolver);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I still find it confusing that I have to call check and not validate on a "validator". (Vice versa it could also be a SourceDestChecker if you prefer to keep check)

@davidkyle
Copy link
Member

I think we should document this new parameter but make very clear what the usecase

Yes, the source is open after all we should document and explain when it should be use. Remember it only defers so any error will be hit when validations are run on start anyway.

Also if the has privilege validation is deferrable it should checked on start

@droberts195
Copy link

Also if the has privilege validation is deferrable it should checked on start

I don't think this is, because the rule is that the transform accesses the source and destination indices with the security privileges of the user who put it, not the user who starts it. The has-privileges API can only return the privileges of the current user, so could only check if the user calling start had the necessary privileges, not the user who called put.

This doesn't create a security issue - in the end if the user who put the transform doesn't have permission to read the source index and write to the destination then it won't do these things. But it does make the error reporting less friendly.

Maybe this can just be documented as a caveat of deferring validation: if you defer validation we assume you've thought about the security aspects of the transform and the transform will run but fail if you haven't.

I think this has actually opened up a can of worms though relating to the security aspects of the envisaged use case for this functionality: none of the built-in Beats roles have permission to put a data frame transform nor create an index other than those currently used by Beats. If Beats is going to use data frame transforms then the user that Beats connects as needs to have permission to manage data frame transforms and create and write to the destination indices configured in those transforms. Presumably to restrict abuse we'd have to set aside a pattern for those destination indices and only give the Beats user index permissions on that pattern. Then we'd have to document what that pattern is and warn users not to let indices storing unrelated confidential data match it.

@davidkyle
Copy link
Member

davidkyle commented Jul 17, 2019

👍 makes sense.

Assuming insufficient privileges causes a detectable exception we should handle that as an irrecoverable failure without retry. That change is not necessary for this PR

return e instanceof IndexNotFoundException || e instanceof AggregationResultUtils.AggregationExtractionException;

new RuntimeException(DataFrameMessages.REST_PUT_DATA_FRAME_FAILED_TO_VALIDATE_DATA_FRAME_CONFIGURATION,
validationException))
validationException -> {
if (validationException instanceof Pivot.TestQueryException && request.isDeferValidation()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pivot.validateQuery() is called only when isDeferValidation() == false

This can be simplified by removing the if...else leaving the call to listener.onFailure

* @param indexNameExpressionResolver A valid IndexNameExpressionResolver object
* @throws ElasticsearchStatusException when a validation fails
*/
public static void check(DataFrameTransformConfig config,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per Hendrik's comment this should also be validate()

==== {api-query-parms-title}

`defer_validations`::
(Optional, boolean) When `true`, this will cause deferrable validations to not run.
Copy link
Member

@davidkyle davidkyle Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include more detail about deferred validations before the second sentence.

Suggested change
(Optional, boolean) When `true`, this will cause deferrable validations to not run.
(Optional, boolean) When `true`, this will cause deferrable validations to not run.
Deferrable validations include checks for the existence of the source index and that the
user has sufficient privileges, skipping these validations at creation is useful for
situations where the source index does not exist but will be created later. Deferred
validations are always run when the {dataframe-transform} is started.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the very last sentence of that suggested change should be changed to: "Deferred validations are always run when the {dataframe-transform} is started, with the exception of privilege checks. If the user who created the transform does not have the required privileges on the source and destination indices then the transform will start but then fail when it attempts the unauthorized operation."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These details are great, thanks!

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Request request = DataFrameRequestConverters.putDataFrameTransform(putRequest);

assertThat(request.getParameters(), not(hasKey("defer_validation")));
assertEquals(HttpPut.METHOD_NAME, request.getMethod());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are, could you change "assertEquals" to "assertThat(..., equalTo(...)) so that the usage of assertions is consistent in the whole method.

include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> The configuration of the {dataframe-transform} to create
<2> Whether or not to wait to run deferrable validations until `_start` is called.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] "Whether or not to wait to run deferrable validations" sounds a bit complicated, but I guess you wanted to avoid "Whether or not to defer deferrable validations until ...". Anyway, I would try to somehow simplify it.

Copy link
Member

@davidkyle davidkyle Jul 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Postpone deferrable validations until _start is called.

[[put-data-frame-transform-query-parms]]
==== {api-query-parms-title}

`defer_validations`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is now singular everywhere.

@benwtrent benwtrent requested a review from davidkyle July 19, 2019 13:28
Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit a9cc0e1 into elastic:master Jul 22, 2019
@benwtrent benwtrent deleted the feature/ml-df-add-defer-validation-option branch July 22, 2019 14:07
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jul 22, 2019
…44455)

* [ML][Data Frame] Add optional defer_validation param to PUT

* addressing PR comments

* reverting bad replace

* addressing pr comments

* Update put-transform.asciidoc

* Update put-transform.asciidoc

* Update put-transform.asciidoc
benwtrent added a commit that referenced this pull request Jul 22, 2019
…4455) (#44697)

* [ML][Data Frame] Add optional defer_validation param to PUT (#44455)

* [ML][Data Frame] Add optional defer_validation param to PUT

* addressing PR comments

* reverting bad replace

* addressing pr comments

* Update put-transform.asciidoc

* Update put-transform.asciidoc

* Update put-transform.asciidoc

* adjusting for backport

* fixing imports

* [DOCS] Fixes formatting in  create data frame transform API
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] Make source checks more lax for dataframe transforms

8 participants