Conversation

@benwtrent (Member):

This separates out the definition object from being stored within the configuration object in the index.

This allows us to gather the config object without decompressing a potentially large definition.

Additionally, input is moved to the TrainedModelConfig object and out of the definition. This is so the trained input fields are accessible outside the potentially large model definition.

@elasticmachine (Collaborator):

Pinging @elastic/ml-core (:ml)

this.trainedModel = ExceptionsHelper.requireNonNull(trainedModel, TRAINED_MODEL);
this.preProcessors = preProcessors == null ? Collections.emptyList() : Collections.unmodifiableList(preProcessors);
this.input = ExceptionsHelper.requireNonNull(input, INPUT);
this.modelId = modelId;
@benwtrent (Member Author):

I added the modelId here so that it is trivially queryable on config deletion. That way we can just run a delete-by-query (DBQ) on the model_id field for the deleted model.
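As a rough illustration of the delete-by-query this enables, the following sketch builds the request body; the index pattern and exact field name are assumptions based on this discussion, not verified against the final mapping:

```shell
# Hypothetical DBQ for cleaning up all documents belonging to a deleted model.
MODEL_ID="my-model"
DBQ_BODY='{"query":{"term":{"model_id":"'"$MODEL_ID"'"}}}'
echo "$DBQ_BODY"
# Against a running cluster this would be sent as:
# curl -XPOST "localhost:9200/.ml-inference-*/_delete_by_query" \
#   -H 'Content-Type: application/json' -d "$DBQ_BODY"
```

Because the modelId is stored on both the config and the definition document, one term query removes every document for the model.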

PARSER.declareObject(optionalConstructorArg(), RowResults.PARSER, RowResults.TYPE);
PARSER.declareInt(optionalConstructorArg(), PROGRESS_PERCENT);
PARSER.declareObject(optionalConstructorArg(), (p, c) -> TrainedModelDefinition.STRICT_PARSER.apply(p, null).build(),
PARSER.declareObject(optionalConstructorArg(), (p, c) -> TrainedModelDefinition.LENIENT_PARSER.apply(p, null),
@benwtrent (Member Author):

I changed this to the LENIENT_PARSER because the native code still sends back the input for the definition. Once that is corrected, I will return this to the STRICT_PARSER.

return PARSER.parse(parser, null);
}

private final List<String> fieldNames;
@dimitris-athanasiou (Contributor):

What is the reason for making this a class instead of having the List<String> straight on the config object? Do we envision adding more things here? If yes, I would suggest extracting this into its own file.

@benwtrent (Member Author):

@dimitris-athanasiou I will extract it to its own file. One could see us possibly having input be a transform instead of just a list of strings.
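A minimal sketch of what the extracted class could look like (the name TrainedModelInput follows the naming suggestion later in this review; the real class would also implement ToXContentObject and Writeable, which are omitted here):

```java
import java.util.List;
import java.util.Objects;

// Hypothetical extracted input class: a thin wrapper around the field names,
// leaving room to evolve the input (e.g. toward a transform) without
// changing the config object's shape.
public class TrainedModelInput {
    private final List<String> fieldNames;

    public TrainedModelInput(List<String> fieldNames) {
        this.fieldNames = Objects.requireNonNull(fieldNames, "field_names");
    }

    public List<String> getFieldNames() {
        return fieldNames;
    }
}
```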

private final String modelId;

TrainedModelDefinition(TrainedModel trainedModel, List<PreProcessor> preProcessors, Input input) {
TrainedModelDefinition(TrainedModel trainedModel, List<PreProcessor> preProcessors, @Nullable String modelId) {
Contributor:

Could we make this private?

private boolean processorsInOrder;
private Input input;

private static Builder builderForParser() {
@dimitris-athanasiou (Contributor):

I think this is not used anywhere.

@benwtrent (Member Author):

@dimitris-athanasiou it should be :). I will fix that promptly.

private List<PreProcessor> preProcessors;
private TrainedModel trainedModel;
private String modelId;
private boolean processorsInOrder;
Contributor:

Unrelated to this change, but could you please explain the need for processorsInOrder?

@benwtrent (Member Author):

Deserializing multiple named objects indicates that they were provided either in a map (out of order) or in an array (in order). We require that processors be provided in order, so if they are provided in a map, we blow up.

This boolean is set by the parser as it handles the named objects. If it is not set, the parser treated them as a map; if there was more than one processor, we should blow up.
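To make the invariant concrete, here is a self-contained sketch of the check; this is an illustration of the idea only, not the actual parser code, which relies on ObjectParser's named-object handling to set the flag:

```java
import java.util.List;

public class ProcessorOrderCheck {

    // Named objects parsed from a JSON array arrive in a defined order; objects
    // parsed from a map do not. With more than one pre-processor and no
    // ordering guarantee, the definition is ambiguous, so we reject it.
    static void validate(List<String> preProcessorNames, boolean processorsInOrder) {
        if (preProcessorNames.size() > 1 && processorsInOrder == false) {
            throw new IllegalArgumentException(
                "pre-processors must be supplied in an array so their order is defined");
        }
    }

    public static void main(String[] args) {
        validate(List.of("one_hot"), false);              // single processor: order irrelevant
        validate(List.of("one_hot", "frequency"), true);  // array form: ordered, accepted
        try {
            validate(List.of("one_hot", "frequency"), false); // map form: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```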

PARSER.declareObject(optionalConstructorArg(), RowResults.PARSER, RowResults.TYPE);
PARSER.declareInt(optionalConstructorArg(), PROGRESS_PERCENT);
PARSER.declareObject(optionalConstructorArg(), (p, c) -> TrainedModelDefinition.STRICT_PARSER.apply(p, null).build(),
PARSER.declareObject(optionalConstructorArg(), (p, c) -> TrainedModelDefinition.LENIENT_PARSER.apply(p, null),
Contributor:

Now that we no longer need to call build() we can switch to this way of declaring the parser:

PARSER.declareObject(optionalConstructorArg(), TrainedModelDefinition.LENIENT_PARSER, INFERENCE_MODEL);

.startObject(TrainedModelConfig.CREATE_TIME.getPreferredName())
.field(TYPE, DATE)
.endObject()
.startObject(TrainedModelConfig.DEFINITION.getPreferredName())
Contributor:

Don't we need to disable indexing for the definition somewhere else?

@benwtrent (Member Author):

By default, we don't dynamically index new fields (dynamic = false).
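For reference, a minimal sketch of how that behaves (not the actual index template from this PR): with dynamic set to false, fields absent from properties, such as the large definition payload, are kept in _source but never dynamically mapped or indexed.

```json
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "model_id": { "type": "keyword" },
      "create_time": { "type": "date" }
    }
  }
}
```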


ActionListener<IndexResponse> putConfigListener = ActionListener.wrap(
r -> {
if (trainedModelConfig.getDefinition() != null) {
Contributor:

Should we allow for definition to ever be null when we're storing a model? I can't think of a reason to do so. Even if we add support for updates, we should have a different method for it. Unless there's a reason I'm missing, I think we could check that the definition is not null early in this method and throw if it is.

@benwtrent (Member Author):

I suppose we can make it more flexible in the future if ever necessary (or provide special handling for internal use cases).

}
);

indexObject(trainedModelConfig.getModelId(), trainedModelConfig, putConfigListener);
Contributor:

I think we should consider indexing both objects in a bulk request. That way we'd be refreshing the index once and we can also handle errors in one place.

.add(client.prepareSearch(InferenceIndexConstants.INDEX_PATTERN)
.setQuery(queryBuilder)
// use sort to get the last
.addSort("_index", SortOrder.DESC)
Contributor:

Unrelated to this change but why are we sorting on _index here instead of create_time?

@benwtrent (Member Author):

I think we should prefer the latest index version regardless of when the model was created. If we support updating, an updated model may go into the new index but have an older create time than another model.

This is an open question for sure, but this is consistent with how we handle transforms.

TrainedModelDefinition definition;
try {
builder = handleSearchItem(multiSearchResponse.getResponses()[0], modelId, this::parseInferenceDocLenientlyFromSource);
} catch(ResourceNotFoundException ex) {
Contributor:

nit: space after catch

@dimitris-athanasiou (Contributor) left a comment:

Looks good. Just a comment about whether we should rename Input to TrainedModelInput.

import java.util.Objects;


public class Input implements ToXContentObject, Writeable {
@dimitris-athanasiou (Contributor):

Should we call it TrainedModelInput? Input is too generic and we use the TrainedModel prefix for the config and the definition too.

indexListener.onFailure(ex);
// This should never happen. If we were able to deserialize the object (from Native or REST) and then fail to serialize it again
// that is not the user's fault. We did something wrong and should throw.
throw new ElasticsearchStatusException("Unexpected serialization exception for [{}]",
Contributor:

nit: Could use ExceptionsHelper.serverError. Not necessary but for future reference.

@dimitris-athanasiou (Contributor) left a comment:

LGTM

@benwtrent benwtrent merged commit 1b3c1d2 into elastic:master Oct 30, 2019
@benwtrent benwtrent deleted the feature/ml-inference-separate-definition-and-config branch October 30, 2019 15:34
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Oct 30, 2019
…tic#48651)

benwtrent added a commit that referenced this pull request Oct 30, 2019
…#48651) (#48695)

* [ML][Inference] separating definition and config object storage (#48651)


4 participants