Contributor

@stu-elastic stu-elastic commented Jun 14, 2022

Adds IngestSourceAndMetadata to replace the sourceAndMetadata map.

This class validates metadata values when they are added to the map for use in
scripts and other processes, and provides typed getters and setters for use
inside the server.

This change lays the foundation for strongly typed Metadata access in scripting.

Changes the type of the version parameter in IngestDocument from
Long to long and moves it to the third argument, so that all required
values occur before nullable arguments.

The IngestService expects a non-null version for a document and will
throw a NullPointerException if one is not provided.
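To illustrate why the primitive type matters, here is a minimal Java sketch. `VersionArgumentSketch` and its `describe` method are hypothetical names and are not part of this PR. The sketch relies only on standard Java behavior: auto-unboxing a null `Long` into a `long` parameter throws a `NullPointerException`. Whether IngestService fails at exactly this point is an assumption, not something the description confirms.

```java
// Hypothetical sketch, not code from this PR.
final class VersionArgumentSketch {

    // Required values (index, id, version) come first; the nullable routing comes last.
    static String describe(String index, String id, long version, String routing) {
        String suffix = routing == null ? "" : "?routing=" + routing;
        return index + "/" + id + "@" + version + suffix;
    }

    // Returns true if passing a null Long where a long is expected throws an NPE.
    static boolean nullVersionThrows() {
        Long missing = null;
        try {
            describe("my-index", "1", missing, null); // unboxing happens at the call site
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

With a primitive parameter, a missing version fails at the call site rather than somewhere later in processing.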

Related: elastic#87309
Member

@rjernst rjernst left a comment

Not a full review, but a few quick thoughts

}
}

public void testAppendMetadataExceptVersion() throws Exception {
Member

why is this test removed?

Contributor Author

It's not valid to append to any metadata field (and thus create an array list) based on my tests in 8.2. This was correctly failing validation.

processor = new DotExpanderProcessor("_tag", null, null, "foo.bar");
processor.execute(document);
- assertThat(document.getSourceAndMetadata().size(), equalTo(1));
+ assertThat(document.getSourceAndMetadata().size(), equalTo(2));
Member

Why did the map size change?

Contributor Author

_version is added in the test constructor so it doesn't have to be added in every test that uses IngestDocument.

Member

This seems trappy. Just above here the IngestDocument is created. If we need helpers to not have to specify version or other metadata fields, let's add helper methods, but not make it magical. If I were debugging this code, I would be very confused why I see 2 here.

Contributor Author

It's just version because it always has to be there. If we internally separate out source and metadata then we can have this check source.

Contributor Author

@stu-elastic stu-elastic Jun 22, 2022

Will validate that version exists in another PR in a way that allows these tests to stay the same.

return "IngestDocument{" + " sourceAndMetadata=" + sourceAndMetadata + ", ingestMetadata=" + ingestMetadata + '}';
}

public enum Metadata {
Member

Can we keep this here? It creates a lot of churn, and I don't think there is any real reason it needs to be inside the implementation specific map.

Contributor Author

Will do.

Contributor Author

Done.

* The map is expected to be used by processors where-as the typed getter and setters should
* be used by server code where possible.
*/
public class IngestSourceAndMetadata extends ValidatingMap {
Member

We should be able to make this an impl detail, the type doesn't need to be exposed, so it can be package private?

Contributor Author

Done.

* {@link #merge(Object, Object, BiFunction)} family of methods, via {@link Map.Entry#setValue(Object)},
* via the linked Set from {@link #entrySet()} or {@link Collection#remove(Object)} from the linked collection via {@link #values()}
*/
public class ValidatingMap extends AbstractMap<String, Object> {
Member

This can be package private? Or perhaps this doesn't need to be a separate class, we just need it for ingest right now, let's fold this into the real impl class and split if there is a need in the future?

Contributor Author

Will do.

Contributor Author

Removed as a separate class.

* @throws IllegalArgumentException if the validator does not allow the Entry to be removed
*/
@Override
public void remove() {
Member

Wouldn't the iterator remove eventually delegate to a remove on the map, which will run the validator? Or why is a validator run at all? It seems like a hacky way to disallow removal.

Contributor Author

@stu-elastic stu-elastic Jun 15, 2022

There is this map and the wrapped map. We're trying to proxy changes to the wrapped map.

This iterator is needed because AbstractSet requires subclasses to implement iterator(). If we just take the wrapped map's entry set, the validators will not be called.

Member

Ok, but why are the validators run on removal?

Contributor Author

You cannot remove _version. If scripts remove _version, then either we need to return a sentinel value from getVersion() (Long.MIN_VALUE or similar) or we have to throw an IllegalArgumentException.

Contributor Author

Validators are still run on removal because a validator may raise on removal; however, this change does not have any such validator.
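The removal path discussed in this thread can be sketched as follows. This is a simplified, hypothetical reconstruction and not the code in the PR: `RemovalValidatingMapSketch`, `validateRemoval`, and the rule that `_version` may never be removed are assumptions for illustration only (as noted above, the change itself has no removal validator). The sketch relies on `AbstractMap.remove(Object)` being implemented on top of the entry set's iterator, so a check in `Iterator.remove()` covers `map.remove(key)` as well.

```java
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a map that proxies a wrapped map and validates removals.
// Entry.setValue is not validated here; only the removal path is shown.
final class RemovalValidatingMapSketch extends AbstractMap<String, Object> {
    private final Map<String, Object> wrapped = new HashMap<>();

    // Assumed rule for illustration: _version cannot be removed.
    private static void validateRemoval(String key) {
        if ("_version".equals(key)) {
            throw new IllegalArgumentException("[_version] cannot be removed");
        }
    }

    @Override
    public Object put(String key, Object value) {
        return wrapped.put(key, value);
    }

    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        return new AbstractSet<Map.Entry<String, Object>>() {
            @Override
            public Iterator<Map.Entry<String, Object>> iterator() {
                final Iterator<Map.Entry<String, Object>> it = wrapped.entrySet().iterator();
                return new Iterator<Map.Entry<String, Object>>() {
                    private Map.Entry<String, Object> current;

                    @Override
                    public boolean hasNext() {
                        return it.hasNext();
                    }

                    @Override
                    public Map.Entry<String, Object> next() {
                        current = it.next();
                        return current;
                    }

                    @Override
                    public void remove() {
                        if (current == null) {
                            throw new IllegalStateException();
                        }
                        validateRemoval(current.getKey()); // runs before the wrapped map changes
                        it.remove();
                        current = null;
                    }
                };
            }

            @Override
            public int size() {
                return wrapped.size();
            }
        };
    }
}
```

Using the wrapped map's entry set directly would skip `validateRemoval`, which is the reason given above for the custom iterator.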

/**
* One time operation that removes all metadata values and their validators and returns the underlying map, containing only source keys
*/
public Map<String, Object> extractSource() {
Member

Shouldn't this build a copy? Alternatively, what if source was kept separate and delegated to? Then we would have separate source and metadata maps internal to this meta-map. It makes the lookup slightly slower, but the metadata map would be checked first and is very quick since it has so few fields, and it makes the extraction at the end trivial (none necessary).

Contributor Author

The existing pattern of use in IngestDocument is to remove the Metadata and then pass the resulting map as the source. The IngestDocument is thrown away, so there's no need to allocate another, potentially large, map to hold the source to avoid modification.

Contributor Author

> Alternatively, what if source was kept separate, and delegated to.

We can do that. It would slightly complicate EntrySet, but that's not a problem.

Contributor Author

source and metadata maps are separated.
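A rough sketch of the separated layout suggested in this thread, under stated assumptions: `SplitSourceAndMetadataSketch` and the `METADATA_KEYS` set are hypothetical and do not reproduce the class in the PR. Lookups check the small metadata key set first and otherwise delegate to the source map, so no extraction step is needed at the end.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a combined view over separate source and metadata maps.
final class SplitSourceAndMetadataSketch {
    // Assumed metadata keys, for illustration only.
    private static final Set<String> METADATA_KEYS = Set.of("_index", "_id", "_version");

    private final Map<String, Object> source;   // held by reference, never copied
    private final Map<String, Object> metadata; // small, checked first

    SplitSourceAndMetadataSketch(Map<String, Object> source, Map<String, Object> metadata) {
        this.source = source;
        this.metadata = new HashMap<>(metadata);
    }

    Object get(String key) {
        return METADATA_KEYS.contains(key) ? metadata.get(key) : source.get(key);
    }

    Object put(String key, Object value) {
        return METADATA_KEYS.contains(key) ? metadata.put(key, value) : source.put(key, value);
    }

    int size() {
        return source.size() + metadata.size();
    }

    // Metadata never lands in the source map, so the source can be handed out as-is.
    Map<String, Object> getSource() {
        return source;
    }
}
```

The cost is one extra set lookup per access; the benefit is that the source map is never polluted with metadata keys and never needs to be copied or stripped.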

@stu-elastic stu-elastic added the :Core/Infra/Scripting, :Data Management/Ingest Node, and >refactoring labels and removed the WIP label Jun 22, 2022
@stu-elastic stu-elastic marked this pull request as ready for review June 22, 2022 18:55
@elasticmachine elasticmachine added the Team:Core/Infra and Team:Data Management labels Jun 22, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Member

@rjernst rjernst left a comment

This is looking good. A few more comments

}
}

public void testAppendMetadataExceptVersion() throws Exception {
Member

Was this test just abusing metadata?

Contributor Author

Yes, the only metadata that doesn't fail is _type and I think that's because it's missing validation.

/**
* Build an IngestDocument from values read via deserialization
*/
public static IngestDocument of(Map<String, Object> source, Map<String, Object> metadata, Map<String, Object> ingestMetadata) {
Member

There are a ton of different constructors (package private?) and then also these "of" methods. Can we consolidate these? It seems like we shouldn't need like 6 different ways to create an IngestDocument?

Contributor Author

I will take another pass.

Contributor Author

Down to the same number as before.

metadataMap.put(metadata, sourceAndMetadata.remove(metadata.getFieldName()));
}
return metadataMap;
public Map<String, Object> getSourceAndMetadata() {
Member

Why do we need this separately? IngestSourceAndMetadata implements Map, so we should be able to combine this method and the one below?

Contributor Author

@stu-elastic stu-elastic Jun 24, 2022

If the signature returns Map<String, Object>, we lose the fact that it's an IngestSourceAndMetadata, and with it access to the typed getters. If we return an IngestSourceAndMetadata instead, many of the tests for the ingest processors fail to compile because IngestSourceAndMetadata is package private, e.g.:

modules/ingest-geoip/src/test/java/org/elasticsearch/ingest/geoip/GeoIpProcessorFactoryTests.java:497:
 error: IngestSourceAndMetadata.get(Object) is defined in an inaccessible class or interface                                          
            Map<?, ?> geoData = (Map<?, ?>) ingestDocument.getIngestSourceAndMetadata().get("geoip");  

Similar in org.elasticsearch.action.ingest.AsyncIngestProcessorIT.

/**
* Get all Metadata values in a Map
*/
public Map<String, Object> getMetadata() {
Member

Can users call getSourceAndMetadata() and then call the appropriate method instead of adding access here?

Contributor Author

org.elasticsearch.action.ingest.WriteableIngestDocument can't see the package private org.elasticsearch.ingest.IngestSourceAndMetadata

*/
public Map<String, Object> getSourceAndMetadata() {
return this.sourceAndMetadata;
public static ZonedDateTime getTimestamp(Map<String, Object> ingestMetadata) {
Member

This seems like an odd place for a utility method. Is there another place it could live, perhaps on metadata itself?

Contributor Author

Moved to IngestSourceAndMetadata.

IngestSourceAndMetadata map;

@Override
public void setUp() throws Exception {
Member

This seems unnecessary, maybe leftover from debugging?

Contributor Author

Removed.


/**
* map of key to validating function. Should throw {@link IllegalArgumentException} on invalid value, otherwise return identity
*/
Member

nit: spacing on javadoc off :)

Contributor Author

Fixed.

protected final ZonedDateTime timestamp;

/**
* map of key to validating function. Should throw {@link IllegalArgumentException} on invalid value, otherwise return identity
Member

Why make it return anything? If validators throw IAE, then they could just be Consumers?

Contributor Author

Changed to BiConsumers
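For readers following along, a validator table built from `BiConsumer`s might look roughly like the sketch below. Everything in it is an assumption for illustration: the class name `MetadataValidatorsSketch`, the two validators, and the error messages are hypothetical and are not taken from the PR. The point is only that a validator which throws `IllegalArgumentException` on bad input does not need a return value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of key -> validator, where validators throw instead of returning.
final class MetadataValidatorsSketch {
    static final Map<String, BiConsumer<String, Object>> VALIDATORS = Map.of(
        "_id", MetadataValidatorsSketch::stringOrNull,
        "_version", MetadataValidatorsSketch::nonNullLong
    );

    private final Map<String, Object> values = new HashMap<>();

    Object put(String key, Object value) {
        BiConsumer<String, Object> validator = VALIDATORS.get(key);
        if (validator != null) {
            validator.accept(key, value); // throws before the map is modified
        }
        return values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    static void stringOrNull(String key, Object value) {
        if (value != null && (value instanceof String) == false) {
            throw new IllegalArgumentException(key + " must be null or a String but was [" + value + "]");
        }
    }

    static void nonNullLong(String key, Object value) {
        if ((value instanceof Long) == false) {
            throw new IllegalArgumentException(key + " must be a non-null long but was [" + value + "]");
        }
    }
}
```

Keys without an entry in the table are treated as source fields and stored without validation.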

* @param timestamp the timestamp of ingestion
* @throws IllegalArgumentException if a validator fails for a given key
*/
public static IngestSourceAndMetadata ofMixedSourceAndMetadata(Map<String, Object> sourceAndMetadata, ZonedDateTime timestamp) {
Member

Similar to IngestDocument, there seem to be a lot of different ways to construct an IngestSourceAndMetadata. Can these be consolidated?

Contributor Author

Reduced to two.

put(IngestDocument.Metadata.VERSION.getFieldName(), version);
}

// timestamp isn't backed by the map
Member

why isn't timestamp part of the metadata map?

Contributor Author

It's a new field (#80038) so I didn't think it was necessary to backport it into the ctx map and create a potential bwc issue.

@stu-elastic
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

@stu-elastic
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-tests-unix-sample

Member

@rjernst rjernst left a comment

Looks fine. I left a couple more thoughts.

metadataMap.put(metadata, sourceAndMetadata.get(metadata.getFieldName()));
}
return metadataMap;
public IngestSourceAndMetadata getIngestSourceAndMetadata() {
Member

since IngestSourceAndMetadata is package private, this method can be as well?

Contributor Author

Made package private.

IngestSourceAndMetadata.getTimestamp(ingestMetadata),
IngestSourceAndMetadata.VALIDATORS
);
this.ingestMetadata = ingestMetadata != null ? ingestMetadata : new HashMap<>();
Member

Doesn't this need to be the metadata from sourceAndMetadata? If the assumption is it should already exist in the passed in sourceAndMetadata, then shouldn't ingestMetadata always be non null? The naming here is a bit confusing because "sourceAndMetadata" implies it contains the metadata, but then ingestMetadata is also passed in. When this was package private it's scoped more narrowly, but making this ctor public opens this up to trappy usage.

Contributor Author

There are two kinds of metadata: the metadata available via ctx, and a separate "ingest" metadata that is used by processors and stores processor-only information, like the pipeline and the key/value for the ForEachProcessor.

Contributor Author

Removed HashMap instantiation on null.

return timestamp;
}

// These are not available to scripts
Member

If they are not available to scripts, why are they in the metadata map? It doesn't appear to be added when the IngestSourceAndMetadata is created, so when and how does it get in there?

Contributor Author

They are set in the SimulatePipelineRequest and can be set via the SetProcessor.

public static IngestDocument ofSourceAndIngest(Map<String, Object> sourceAndMetadata, Map<String, Object> ingestMetadata) {
return new IngestDocument(sourceAndMetadata, ingestMetadata);
public static IngestDocument ofSourceAndMetadata(Map<String, Object> sourceAndMetadata) {
return new IngestDocument(sourceAndMetadata, new HashMap<>());
Member

Doesn't this need to be the metadata from the sourceAndMetadata?

Contributor Author

No, there's a separate IngestMetadata for internal processor use.

@stu-elastic stu-elastic merged commit 5a2d91c into elastic:master Jun 27, 2022
Labels

:Core/Infra/Scripting, :Data Management/Ingest Node, >refactoring, Team:Core/Infra, Team:Data Management, v8.4.0

4 participants