Skip to content

Conversation

@stu-elastic
Copy link
Contributor

@stu-elastic stu-elastic commented Jun 1, 2022

Adds the Metadata class and metadata() method to the ingest context.

Metadata has getters and setters for index, id, routing, version and versionType.
It also has a getter for timestamp.

Refs: #86472

Adds the `Metadata` class and `meta()` method to the ingest context.

Metadata has getters and setters for index, id, routing, version and versionType.
It also has a getter for timestamp.

The `VersionType` enum is available for use with the versionType getter and setter.

Refs: elastic#86472
@stu-elastic stu-elastic added >enhancement :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Jun 1, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @stu-elastic, I've created a changelog YAML for you.

@stu-elastic stu-elastic marked this pull request as ready for review June 2, 2022 01:07
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jun 2, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@stu-elastic stu-elastic requested a review from rjernst June 2, 2022 01:07
@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Jun 2, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, I have some comments.

)
);
String contextWhitelistName = "org.elasticsearch.script." + context.name.replace('-', '_');
if (PainlessPlugin.class.getResourceAsStream(contextWhitelistName + ".txt") != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this existing, but using getResourceAsStream here is incorrect, it means we open an inputstream that is never closed? We should use getResource instead (this should be a separate change so we can backport).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#

# This file only whitelists the script class and the Metadata class for ingest scripts.
# Due to name conflicts, this whitelist cannot be loaded with other scripts with conflicting Metadata types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I understand the conflict described here, there are already context specific whitelist files, eg org.elasticsearch.ingest.txt, can't this live there? I think it would be awkward to introduce a new special extension, "meta.txt" that is done here. This Metadata should be unique for a given context right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The painless test context loads all script whitelists. It's not an issue in this change, but when there are multiple Metadata classes, there is a name collision.

The solution of this PR is to avoid loading meta.txt whitelists in painless_test. Another option is to explicitly name the script whitelists needed by painless test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}

class org.elasticsearch.script.IngestScript {
Metadata meta()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should call this metadata(). "meta" is too meta (it doesn't mean anything). We also already call these metadata fields in the plugin api, as well as in documentation (eg https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Metadata meta()
}

class org.elasticsearch.index.VersionType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will using these constants look like? Do users have to do eg VersionType.INTERNAL? Could we do anything to make that more seamless, eg metadata().version_type = internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, VersionType.INTERNAL.

We could possibly bring them in as constants on the script class, which should make that possible. It would impact any variable named internal, however.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make those available as getters on the script object but that would be a back compat issue with existing variables of the same name. Scripts would get:

variable [external] is already defined

We'd have to have a way to shadow to avoid that issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to taking string.

}

public abstract void execute(Map<String, Object> ctx);
public Map<String, Object> getCtx() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some javadocs would be nice on the new methods here, eg explaing this method provides backcompat for ctx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

this.map = Maps.newMapWithExpectedSize(size);
}

public String getIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the specific getters and setters on the backing, shouldn't it just be generic get/set with whatever key's are exposed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I wonder if the Metadata aliasing might not be necessary. We can have a single Metadata class, basically this class, but we don't actually expose any methods on it (in scripts). Then for each metadata field, there are static augmentation methods for getting/setting. In each context then the appropriate augmentation methods would be called.

if (number.doubleValue() != version) {
// did we round?
throw new IllegalArgumentException(
"version may only be set to an int or a long but was [" + number + "] with type [" + obj.getClass().getName() + "]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we get into this situation? Could we instead catch this when setting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script may put anything it wants in the backing map via ctx.

return version;
}
throw new IllegalArgumentException(
"version may only be set to an int or a long but was [" + obj + "] with type [" + obj.getClass().getName() + "]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing this exception on a get is confusing. We should perhaps validate the ctx map when it is passed into the ctor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script still has access to ctx and may change it there. Unfortunately, that means we cannot trust the contents of the map at any point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can control that if we wrap the ctx map back in ingest. Then we can do this validation on set in the map, so it affects both access types, old and new.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we'd have to override merge, put, putAll, putIfAbsent, replace(k,v), replace(k,v,v) and replaceAll.

Would that be a backwards incompatible change? It would be rejected after the script runs, but during script execution the map can hold anything as long as it's fixed up before script termination.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backwards compatibility aspect is the only challenge I think. IMO it is debatable whether this is a break since our docs have never stated ctx (or any of our other maps like params) can be used as general purpose stashes for data in scripts. It is something we don't want, and it has hamstrung our efforts to improve scripts in the past. But we have seen some users use at least params as a mechanism to stash data across script invocations. This is very fragile, and can break just based on the order documents are iterated in. For ingest, it's even more fragile, but again I think I have seen uses where data is stashed across script processors.

So in summary, I think this is something we should "break", but I don't view it as a breaking change, it is just locking down something that was never defined. Scripting is a tricky area because there is no ability to deprecate really, at least not in a meaningful way (we have done this in the past and spammed the deprecation logs causing io pressure, etc). @tvernum @qhoxie do you have any thoughts here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the wrapped map block/validate all changes, or just do this same type of validation for the known metadata fields?

Copy link
Member

@rjernst rjernst Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would block writing to unknown keys, as well as validate the type of known keys (like version type enum). ie the only valid keys are the known metadata fields (the list you linked isn't all of them, it depends on the script context, but those I think are the ones for ingest, or most of them).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on further inspection of IngestDocument, I think we can do this as I originally suggested @stu-elastic . The sourceAndMetadata map is what we need to wrap. On put, if the key is one of the known metadata fields, we validate the value (this would get validated later anyways when we try to index, this is just moving it more upfront). Unknown keys would just be delegated without validation because those are actually document keys.

Copy link
Contributor Author

@stu-elastic stu-elastic Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help illustrate the BWC issue, the following valid script would become invalid if we validated the metadata keys in the map.

ctx._version = 'foo';
...
ctx._version = (ctx._version == 'foo') ? 10: 20;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we're only talking about validating the known metadata fields and the user-facing change would be that they're validated every time they're set rather than only being validated at the end of the script (or, when they're read from the context as part of document storage).

That feels OK to me. If we only validate on getVersion (etc) then there's a cost to the user that they cannot easily trace where the version was set. They know that something in their script set the version to a string, but that could potentially happen in more than 1 place.

So, yes, there's the possibility that it might break something for someone, but that's true of many bug fixes. This feels like a reasonable feature to validate on set rather than validate on first use. I agree it's borderline, and someone might have a compelling argument for why we should consider it "breaking" rather than "fixing", but I think I'm OK with it.

return set(VERSION, version);
}

public void removeVersion() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is it ok to remove version? It's odd having this and getRawVersion, which is different from the rest of the getters/setters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IngestService expects this to be non-null and throws a 500 if it is null.

return getString(ROUTING);
}

public MapBackedMetadata setRouting(String routing) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do without chaining? Shouldn't users be doing = with these so returning the metadata for chaining seems unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in a future change, removing until then.

@stu-elastic
Copy link
Contributor Author

After discussion with @rjernst, the plan is to

  • Remove org.elasticsearch.script.ingest.meta.txt and put the metadata stuff in org.elasticsearch.script.ingest.txt
  • Figure out how to handle painless_test loading all whitelists. This will become an issue for when other metadata classes are whitelisted
  • Remove all enums, take strings instead (VersionType.EXTERNAL becomes 'external')
  • MapBackedMetadata will implement the Map<String, Object> interface and do validation on put for ctx backwards compatibility. Only metadata keys will be validated.

stu-elastic added a commit to stu-elastic/elasticsearch that referenced this pull request Jun 14, 2022
Changes the type of the version paramter in IngestDocument from
Long to long and moves it to the third argument, so all required
values occur before nullable arguments.

The IngestService expects a non-null version for a document and will
throw an NullPointerException if one is not provided.

Related: elastic#87309
stu-elastic added a commit that referenced this pull request Jun 15, 2022
Changes the type of the version parameter in `IngestDocument` from
`Long` to `long` and moves it to the third argument, so all required
values occur before nullable arguments.

The `IngestService` expects a non-null version for a document and will
throw an `NullPointerException` if one is not provided.

Related: #87309
@stu-elastic
Copy link
Contributor Author

Added #87673 for wrapping the sourceAndMetadata map.

stu-elastic added a commit to stu-elastic/elasticsearch that referenced this pull request Jun 22, 2022
stu-elastic added a commit to stu-elastic/elasticsearch that referenced this pull request Jun 22, 2022
stu-elastic added a commit that referenced this pull request Jun 27, 2022
)

Adds `IngestSourceAndMetdata` to replace the sourceAndMetadata map. 

This validates metadata values when they are added to the map for use in
scripts and other process as well as provides typed getters and for use
inside of server.

This change lays the foundation for strongly typed Metadata access in scripting.

Related: #87309
stu-elastic added a commit to stu-elastic/elasticsearch that referenced this pull request Jun 28, 2022
The `_version` metadata field should always exist in the sourceAndMetadata
map, this change enforces that invariant but allows tests to avoid it
if necessary.

Refs: elastic#87309
stu-elastic added a commit to stu-elastic/elasticsearch that referenced this pull request Jun 28, 2022
The `_version` metadata field should always exist in the sourceAndMetadata
map, this change enforces that invariant but allows tests to avoid it
if necessary.

Refs: elastic#87309
@stu-elastic
Copy link
Contributor Author

@rjernst I've added #88102 to enforce non-null _version. This PR has been updated to use the data structure in #87673.

It's ready for another pass when you have the time.

stu-elastic added a commit that referenced this pull request Jun 29, 2022
…88102)

* Ingest: Enforce _version metadata not null in sourceAndMetadata map

The `_version` metadata field should always exist in the sourceAndMetadata
map, this change enforces that invariant but allows tests to avoid it
if necessary.

Refs: #87309
Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much better. A few more thoughts.

ingestScript = factory.newInstance(script.getParams());
ingestScriptFactory = scriptService.compile(script, IngestScript.CONTEXT);
if (ScriptType.STORED.equals(script.getType())) {
// do not cache stored scripts lest they change and invalidate the cached value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a change in current behavior? It might be worth pulling this out into a separate PR, only so that it can be noted (perhaps as a bugfix) in the release notes, whereas if it goes in within this enhancement/addition to the fields API it would be hidden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the current behavior, it's just the instance used to be cached. This change is due to a failed unit test when I switched to the compiled script rather than the instance.

/**
* Ingest and update metadata available to write scripts
*/
public interface Metadata extends Map<String, Object> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid Metadata extending Map? I thought the intent was to have Metadata be strongly typed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public abstract void execute(Map<String, Object> ctx);
/** Provides backwards compatibility access to ctx */
public Map<String, Object> getCtx() {
return metadata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass ctx separately to ingest script, so that Metadata can remain strongly typed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement Team:Clients Meta label for clients team Team:Core/Infra Meta label for core/infra team v8.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants