-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Script: Add Metadata to ingest context. #87309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Script: Add Metadata to ingest context. #87309
Conversation
Adds the `Metadata` class and `meta()` method to the ingest context. Metadata has getters and setters for index, id, routing, version and versionType. It also has a getter for timestamp. The `VersionType` enum is available for use with the versionType getter and setter. Refs: elastic#86472
|
Hi @stu-elastic, I've created a changelog YAML for you. |
…elasticsearch into pain_wfield-ingest-doc-meta
…field-ingest-doc-meta
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
|
Pinging @elastic/clients-team (Team:Clients) |
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good, I have some comments.
| ) | ||
| ); | ||
| String contextWhitelistName = "org.elasticsearch.script." + context.name.replace('-', '_'); | ||
| if (PainlessPlugin.class.getResourceAsStream(contextWhitelistName + ".txt") != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this existing, but using getResourceAsStream here is incorrect, it means we open an inputstream that is never closed? We should use getResource instead (this should be a separate change so we can backport).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # | ||
|
|
||
| # This file only whitelists the script class and the Metadata class for ingest scripts. | ||
| # Due to name conflicts, this whitelist cannot be loaded with other scripts with conflicting Metadata types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I understand the conflict described here, there are already context specific whitelist files, eg org.elasticsearch.ingest.txt, can't this live there? I think it would be awkward to introduce a new special extension, "meta.txt" that is done here. This Metadata should be unique for a given context right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The painless test context loads all script whitelists. It's not an issue in this change, but when there are multiple Metadata classes, there is a name collision.
The solution of this PR is to avoid loading meta.txt whitelists in painless_test. Another option is to explicitly name the script whitelists needed by painless test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
| } | ||
|
|
||
| class org.elasticsearch.script.IngestScript { | ||
| Metadata meta() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should call this metadata(). "meta" is too meta (it doesn't mean anything). We also already call these metadata fields in the plugin api, as well as in documentation (eg https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| Metadata meta() | ||
| } | ||
|
|
||
| class org.elasticsearch.index.VersionType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will using these constants look like? Do users have to do eg VersionType.INTERNAL? Could we do anything to make that more seamless, eg metadata().version_type = internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, VersionType.INTERNAL.
We could possibly bring them in as constants on the script class, which should make that possible. It would impact any variable named internal, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make those available as getters on the script object but that would be a back compat issue with existing variables of the same name. Scripts would get:
variable [external] is already defined
We'd have to have a way to shadow to avoid that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to taking string.
| } | ||
|
|
||
| public abstract void execute(Map<String, Object> ctx); | ||
| public Map<String, Object> getCtx() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some javadocs would be nice on the new methods here, eg explaing this method provides backcompat for ctx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| this.map = Maps.newMapWithExpectedSize(size); | ||
| } | ||
|
|
||
| public String getIndex() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the specific getters and setters on the backing, shouldn't it just be generic get/set with whatever key's are exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I wonder if the Metadata aliasing might not be necessary. We can have a single Metadata class, basically this class, but we don't actually expose any methods on it (in scripts). Then for each metadata field, there are static augmentation methods for getting/setting. In each context then the appropriate augmentation methods would be called.
| if (number.doubleValue() != version) { | ||
| // did we round? | ||
| throw new IllegalArgumentException( | ||
| "version may only be set to an int or a long but was [" + number + "] with type [" + obj.getClass().getName() + "]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we get into this situation? Could we instead catch this when setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script may put anything it wants in the backing map via ctx.
| return version; | ||
| } | ||
| throw new IllegalArgumentException( | ||
| "version may only be set to an int or a long but was [" + obj + "] with type [" + obj.getClass().getName() + "]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this exception on a get is confusing. We should perhaps validate the ctx map when it is passed into the ctor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script still has access to ctx and may change it there. Unfortunately, that means we cannot trust the contents of the map at any point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can control that if we wrap the ctx map back in ingest. Then we can do this validation on set in the map, so it affects both access types, old and new.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, we'd have to override merge, put, putAll, putIfAbsent, replace(k,v), replace(k,v,v) and replaceAll.
Would that be a backwards incompatible change? It would be rejected after the script runs, but during script execution the map can hold anything as long as it's fixed up before script termination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backwards compatibility aspect is the only challenge I think. IMO it is debatable whether this is a break since our docs have never stated ctx (or any of our other maps like params) can be used as general purpose stashes for data in scripts. It is something we don't want, and it has hamstrung our efforts to improve scripts in the past. But we have seen some users use at least params as a mechanism to stash data across script invocations. This is very fragile, and can break just based on the order documents are iterated in. For ingest, it's even more fragile, but again I think I have seen uses where data is stashed across script processors.
So in summary, I think this is something we should "break", but I don't view it as a breaking change, it is just locking down something that was never defined. Scripting is a tricky area because there is no ability to deprecate really, at least not in a meaningful way (we have done this in the past and spammed the deprecation logs causing io pressure, etc). @tvernum @qhoxie do you have any thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the wrapped map block/validate all changes, or just do this same type of validation for the known metadata fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would block writing to unknown keys, as well as validate the type of known keys (like version type enum). ie the only valid keys are the known metadata fields (the list you linked isn't all of them, it depends on the script context, but those I think are the ones for ingest, or most of them).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on further inspection of IngestDocument, I think we can do this as I originally suggested @stu-elastic . The sourceAndMetadata map is what we need to wrap. On put, if the key is one of the known metadata fields, we validate the value (this would get validated later anyways when we try to index, this is just moving it more upfront). Unknown keys would just be delegated without validation because those are actually document keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help illustrate the BWC issue, the following valid script would become invalid if we validated the metadata keys in the map.
ctx._version = 'foo';
...
ctx._version = (ctx._version == 'foo') ? 10: 20;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, we're only talking about validating the known metadata fields and the user-facing change would be that they're validated every time they're set rather than only being validated at the end of the script (or, when they're read from the context as part of document storage).
That feels OK to me. If we only validate on getVersion (etc) then there's a cost to the user that they cannot easily trace where the version was set. They know that something in their script set the version to a string, but that could potentially happen in more than 1 place.
So, yes, there's the possibility that it might break something for someone, but that's true of many bug fixes. This feels like a reasonable feature to validate on set rather than validate on first use. I agree it's borderline, and someone might have a compelling argument for why we should consider it "breaking" rather than "fixing", but I think I'm OK with it.
| return set(VERSION, version); | ||
| } | ||
|
|
||
| public void removeVersion() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is it ok to remove version? It's odd having this and getRawVersion, which is different from the rest of the getters/setters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IngestService expects this to be non-null and throws a 500 if it is null.
| return getString(ROUTING); | ||
| } | ||
|
|
||
| public MapBackedMetadata setRouting(String routing) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do without chaining? Shouldn't users be doing = with these so returning the metadata for chaining seems unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used in a future change, removing until then.
|
After discussion with @rjernst, the plan is to
|
…field-ingest-doc-meta
…field-ingest-doc-meta
Changes the type of the version paramter in IngestDocument from Long to long and moves it to the third argument, so all required values occur before nullable arguments. The IngestService expects a non-null version for a document and will throw an NullPointerException if one is not provided. Related: elastic#87309
Changes the type of the version parameter in `IngestDocument` from `Long` to `long` and moves it to the third argument, so all required values occur before nullable arguments. The `IngestService` expects a non-null version for a document and will throw an `NullPointerException` if one is not provided. Related: #87309
|
Added #87673 for wrapping the |
) Adds `IngestSourceAndMetdata` to replace the sourceAndMetadata map. This validates metadata values when they are added to the map for use in scripts and other process as well as provides typed getters and for use inside of server. This change lays the foundation for strongly typed Metadata access in scripting. Related: #87309
The `_version` metadata field should always exist in the sourceAndMetadata map, this change enforces that invariant but allows tests to avoid it if necessary. Refs: elastic#87309
The `_version` metadata field should always exist in the sourceAndMetadata map, this change enforces that invariant but allows tests to avoid it if necessary. Refs: elastic#87309
…in_wfield-ingest-doc-meta
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much better. A few more thoughts.
| ingestScript = factory.newInstance(script.getParams()); | ||
| ingestScriptFactory = scriptService.compile(script, IngestScript.CONTEXT); | ||
| if (ScriptType.STORED.equals(script.getType())) { | ||
| // do not cache stored scripts lest they change and invalidate the cached value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a change in current behavior? It might be worth pulling this out into a separate PR, only so that it can be noted (perhaps as a bugfix) in the release notes, whereas if it goes in within this enhancement/addition to the fields API it would be hidden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the current behavior, it's just the instance used to be cached. This change is due to a failed unit test when I switched to the compiled script rather than the instance.
| /** | ||
| * Ingest and update metadata available to write scripts | ||
| */ | ||
| public interface Metadata extends Map<String, Object> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid Metadata extending Map? I thought the intent was to have Metadata be strongly typed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| public abstract void execute(Map<String, Object> ctx); | ||
| /** Provides backwards compatibility access to ctx */ | ||
| public Map<String, Object> getCtx() { | ||
| return metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass ctx separately to ingest script, so that Metadata can remain strongly typed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, done.
Adds the
Metadataclass andmetadata()method to the ingest context.Metadata has getters and setters for index, id, routing, version and versionType.
It also has a getter for timestamp.
Refs: #86472