-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Ingest: Add validation and strong typing to sourceAndMetdata map #87673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest: Add validation and strong typing to sourceAndMetdata map #87673
Conversation
Changes the type of the version paramter in IngestDocument from Long to long and moves it to the third argument, so all required values occur before nullable arguments. The IngestService expects a non-null version for a document and will throw an NullPointerException if one is not provided. Related: elastic#87309
Adds IngestSourceAndMetdata to replace the sourceAndMetadata map. This class validates metadata values when they are added to the map for use in scripts and other process as well as provides typed getters and for use inside of server.
…earch into ingest_validating-map
… checking against Map.Entry
…IngestSourceAndMetadata javadocs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a full review, but a few quick thoughts
| } | ||
| } | ||
|
|
||
| public void testAppendMetadataExceptVersion() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this test removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not valid to append to any metadata field (and thus create an array list) based on my tests in 8.2. This was correctly failing validation.
| processor = new DotExpanderProcessor("_tag", null, null, "foo.bar"); | ||
| processor.execute(document); | ||
| assertThat(document.getSourceAndMetadata().size(), equalTo(1)); | ||
| assertThat(document.getSourceAndMetadata().size(), equalTo(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the map size change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_version is added in the test constructor so it doesn't have to be added in every test that uses IngestDocument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems trappy. Just above here the IngestDocument is created. If we need helpers to not have to specify version or other metadata fields, let's add helper methods, but not make it magical. If I were debugging this code, I would be very confused why I see 2 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just version because it always has to be there. If we internally separate out source and metadata then we can have this check source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will validate that version exists in another PR in a way that allows these tests to stay the same.
| return "IngestDocument{" + " sourceAndMetadata=" + sourceAndMetadata + ", ingestMetadata=" + ingestMetadata + '}'; | ||
| } | ||
|
|
||
| public enum Metadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep this here? It creates a lot of churn, and I don't think there is any real reason it needs to be inside the implementation specific map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * The map is expected to be used by processors where-as the typed getter and setters should | ||
| * be used by server code where possible. | ||
| */ | ||
| public class IngestSourceAndMetadata extends ValidatingMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to make this an impl detail, the type doesn't need to be exposed, so it can be package private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * {@link #merge(Object, Object, BiFunction)} family of methods, via {@link Map.Entry#setValue(Object)}, | ||
| * via the linked Set from {@link #entrySet()} or {@link Collection#remove(Object)} from the linked collection via {@link #values()} | ||
| */ | ||
| public class ValidatingMap extends AbstractMap<String, Object> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be package private? Or perhaps this doesn't need to be a separate class, we just need it for ingest right now, let's fold this into the real impl class and split if there is a need in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed as a separate class.
| * @throws IllegalArgumentException if the validator does not allow the Entry to be removed | ||
| */ | ||
| @Override | ||
| public void remove() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the iterator remove eventually delegate to a remove on the map, which will run the validator? Or why is a validator run at all? It seems like a hacky way to disallow removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is this map and the wrapped map. We're trying to proxy changes to the wrapped map.
This iterator is needed because AbstractSet requires subclasses to implement iterator(). If we just take the wrapped maps entry set, the validators will not be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but why are the validators run on removal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot remove _version. If scripts remove _version then either we need to return a sentinel value from getVersion() (Long.MIN_VALUE or similar) or we have to throw an IllegalArgumentException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validators are still run on removal because a validator may raise on removal, however this change does not have any such validator.
| /** | ||
| * One time operation that removes all metadata values and their validators and returns the underlying map, containing only source keys | ||
| */ | ||
| public Map<String, Object> extractSource() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this build a copy? Alternatively, what if source was kept separate, and delegated to. So we have separate source and metadata maps internal to this meta-map. It makes the lookup slightly slower, but the metadata one would be first and very quick since it is so small a number of fields, and it makes the extraction at the end trivial (none necessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing pattern of use in IngestDocument is to remove the Metadata and then pass resulting map as the source. The IngestDocument is thrown away so there's no need to allocate another, potentially large, map to hold the source to avoid modification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, what if source was kept separate, and delegated to.
We can do that, it would slightly complicate EntrySet but that's not a problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source and metadata maps are separated.
…_document-test-constructor
…stic/elasticsearch into ingest_validating-map
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
|
Pinging @elastic/es-data-management (Team:Data Management) |
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good. A few more comments
| } | ||
| } | ||
|
|
||
| public void testAppendMetadataExceptVersion() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this test just abusing metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the only metadata that doesn't fail is _type and I think that's because it's missing validation.
| /** | ||
| * Build an IngestDocument from values read via deserialization | ||
| */ | ||
| public static IngestDocument of(Map<String, Object> source, Map<String, Object> metadata, Map<String, Object> ingestMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a ton of different constructors (package private?) and then also these of methods. Can we consolidate these? It seems like we shouldn't need like 6 different ways to create an IngestDocument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take another pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down to the same number as before.
| metadataMap.put(metadata, sourceAndMetadata.remove(metadata.getFieldName())); | ||
| } | ||
| return metadataMap; | ||
| public Map<String, Object> getSourceAndMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this separately? IngestSourceAndMetadata implements Map, so we should be able to combine this method and the one below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We lose the fact that it's an IngestSourceAndMetadata if we make the signature a Map<String, Object> (which loses access to the typed getters) and if we return an IngestSourceAndMetadata then many of the tests for the ingest processors fail to compile because IngestSourceAndMetadata is package private, eg
modules/ingest-geoip/src/test/java/org/elasticsearch/ingest/geoip/GeoIpProcessorFactoryTests.java:497:
error: IngestSourceAndMetadata.get(Object) is defined in an inaccessible class or interface
Map<?, ?> geoData = (Map<?, ?>) ingestDocument.getIngestSourceAndMetadata().get("geoip");
Similar in org.elasticsearch.action.ingest.AsyncIngestProcessorIT.
| /** | ||
| * Get all Metadata values in a Map | ||
| */ | ||
| public Map<String, Object> getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can users get call getSourceAndMetadata() and call the appropriate method instead of adding access here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.elasticsearch.action.ingest.WriteableIngestDocument can't see the package private org.elasticsearch.ingest.IngestSourceAndMetadata
| */ | ||
| public Map<String, Object> getSourceAndMetadata() { | ||
| return this.sourceAndMetadata; | ||
| public static ZonedDateTime getTimestamp(Map<String, Object> ingestMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an odd place for a utility method. Is there another place it could live, perhaps on metadata itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to IngestSourceAndMetadata.
| IngestSourceAndMetadata map; | ||
|
|
||
| @Override | ||
| public void setUp() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems unnecesary, maybe leftover from debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
|
||
| /** | ||
| * map of key to validating function. Should throw {@link IllegalArgumentException} on invalid value, otherwise return identity | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spacing on javadoc off :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| protected final ZonedDateTime timestamp; | ||
|
|
||
| /** | ||
| * map of key to validating function. Should throw {@link IllegalArgumentException} on invalid value, otherwise return identity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it return anything? If validators throw IAE, then they could just be Consumers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to BiConsumers
| * @param timestamp the timestamp of ingestion | ||
| * @throws IllegalArgumentException if a validator fails for a given key | ||
| */ | ||
| public static IngestSourceAndMetadata ofMixedSourceAndMetadata(Map<String, Object> sourceAndMetadata, ZonedDateTime timestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to IngestDocument, there seem to be a lot of different ways to construct an IngestSourceAndMetadata. Can these be consolidated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced to two.
| put(IngestDocument.Metadata.VERSION.getFieldName(), version); | ||
| } | ||
|
|
||
| // timestamp isn't backed by the map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why isn't timestamp part of the metadata map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a new field (#80038) so I didn't think it was necessary to backport it into the ctx map and create a potential bwc issue.
|
@elasticmachine run elasticsearch-ci/part-2 |
|
@elasticmachine run elasticsearch-ci/packaging-tests-unix-sample |
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine. I left a couple more thoughts.
| metadataMap.put(metadata, sourceAndMetadata.get(metadata.getFieldName())); | ||
| } | ||
| return metadataMap; | ||
| public IngestSourceAndMetadata getIngestSourceAndMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since IngestSourceAndMetadata is package private, this method can be as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made package private.
| IngestSourceAndMetadata.getTimestamp(ingestMetadata), | ||
| IngestSourceAndMetadata.VALIDATORS | ||
| ); | ||
| this.ingestMetadata = ingestMetadata != null ? ingestMetadata : new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this need to be the metadata from sourceAndMetadata? If the assumption is it should already exist in the passed in sourceAndMetadata, then shouldn't ingestMetadata always be non null? The naming here is a bit confusing because "sourceAndMetadata" implies it contains the metadata, but then ingestMetadata is also passed in. When this was package private it's scoped more narrowly, but making this ctor public opens this up to trappy usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two metadatas, there's the metadata available via ctx and separate "ingest" metadata that is used by processors and stores processor-only information like the pipeline and the key/value for the ForEachProcessor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed HashMap instantiation on null.
| return timestamp; | ||
| } | ||
|
|
||
| // These are not available to scripts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are not available to scripts, why are they in the metadata map? It doesn't appear to be added when the IngestSourceAndMetadata is created, so when and how does it get in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are set in the SimluatePipelineRequest and can be set via the SetProcessor
| public static IngestDocument ofSourceAndIngest(Map<String, Object> sourceAndMetadata, Map<String, Object> ingestMetadata) { | ||
| return new IngestDocument(sourceAndMetadata, ingestMetadata); | ||
| public static IngestDocument ofSourceAndMetadata(Map<String, Object> sourceAndMetadata) { | ||
| return new IngestDocument(sourceAndMetadata, new HashMap<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this need to be the metadata from the sourceAndMetadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there's a separate IngestMetadata for internal processor use.
Adds
IngestSourceAndMetdatato replace the sourceAndMetadata map.This validates metadata values when they are added to the map for use in
scripts and other process as well as provides typed getters and for use
inside of server.
This change lays the foundation for strongly typed Metadata access in scripting.