Closed
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >enhancement, Meta
Description
Enrichment at ingest
This issue describes a project that will leverage the ingest node to allow for enrichment of documents before they are indexed.
Below is a diagram that highlights the workflow. The red parts are new components.
- .enrich-* index(es) - managed by Elasticsearch; contain a highly optimized subset of the source data used for enrichment.
- source index - a normal index managed externally (i.e. not by Elasticsearch) that contains the data used for enrichment.
- enrich policy - a policy that describes how to synchronize the source index with the .enrich-* index. The policy will describe which fields to copy and how often to copy the fields.
- decorate processor - an ingest node processor that reads from a .enrich-* index to mutate the raw data before it is indexed. The .enrich-* index will be data local to the decorate processor.
There are many moving parts so this issue will serve as a central place to track them.
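As a rough illustration of the synchronization step described above, here is a minimal in-memory sketch in Python. It is not Elasticsearch code: the function name `build_enrich_index`, the flat field names, and the use of a plain dict as the "enrich index" are all assumptions made for illustration. It only shows the idea of copying a subset of source fields into a compact lookup structure.

```python
def build_enrich_index(source_docs, key_field, fields_to_copy):
    """Copy only the listed fields from each source doc into a lookup keyed by key_field.

    Hypothetical helper: stands in for executing an enrich policy against a source index.
    """
    enrich_index = {}
    for doc in source_docs:
        if key_field not in doc:
            continue  # a document without the key can never be looked up
        enrich_index[doc[key_field]] = {
            name: doc[name] for name in fields_to_copy if name in doc
        }
    return enrich_index


# Example source data (made up for this sketch).
source = [
    {"id": "u1", "first": "Ada", "last": "Lovelace", "notes": "not needed"},
    {"id": "u2", "first": "Alan", "last": "Turing"},
]
enrich_index = build_enrich_index(source, "id", ["first", "last"])
```

Only the fields named by the policy end up in the lookup, which is what keeps the enrich data small compared to the source index.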
Tasks
Enrich policy definition
- Define the enrich policy (@martijnvg) Added enrich policy definition. #41003
- Rename enrich_key to match_field and enrich_values to enrich_fields.
- Remove type field and make the type a top-level JSON object that contains all the configuration of an enrich policy. Change how type is stored in an enrich policy. #45789
{
  "exact_match": {
    "match_field": "prsnl.id",
    "enrich_fields": [
      "prsnl.name.first",
      "prsnl.name.last"
    ],
    "indices": [
      "bar*",
      "foo"
    ],
    "query": {}
  }
}
instead of:
{
  "type": "exact_match",
  "indices": [
    "bar*",
    "foo"
  ],
  "match_field": "prsnl.id",
  "enrich_fields": [
    "prsnl.name.first",
    "prsnl.name.last"
  ],
  "query": {}
}
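The difference between the two shapes can be sketched as a small transformation. The Python below is illustrative only; the helper name `to_top_level_type` is hypothetical and is not part of any Elasticsearch API. It takes the old flat form (with a `type` field) and produces the new form, where the type is the single top-level key.

```python
import copy


def to_top_level_type(old_policy):
    """Return the new policy shape: {<type>: {...remaining configuration...}}.

    Hypothetical helper for illustration; the input is left unmodified.
    """
    body = copy.deepcopy(old_policy)
    policy_type = body.pop("type")
    return {policy_type: body}


old = {
    "type": "exact_match",
    "indices": ["bar*", "foo"],
    "match_field": "prsnl.id",
    "enrich_fields": ["prsnl.name.first", "prsnl.name.last"],
    "query": {},
}
new = to_top_level_type(old)
```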
Enrich processor
- Write rally track for exact match processor.
- Add an enrich processor that uses the search api via node client in order to do the enrichment.
- Optimize the way msearch is executed for enrich processor lookups. Enrich indices always have a single shard, which allows us to easily optimize the execution of multiple search requests bundled together in a bulk. Added a custom api to perform the msearch more efficiently for enrich processor #43965
- Ensure that EnrichProcessorFactory always has access to the latest enrich policies. (Currently, if multiple cluster state updates are combined, enrich policy changes may not be visible.)
- Allow IngestService to register components that are updated before the processor factories.
- Register EnrichProcessorFactory as a component that keeps track of the policies.
- Rename the enrich_key option to field in enrich processor configuration. Enrich processor configuration changes #45466
- Remove set_from and targets options and introduce a target_field option that is in line with what the geoip processor is doing. The entire looked-up document is placed as a JSON object under the target_field. Enrich processor configuration changes #45466
- Change the enrich processor to not depend on the actual EnrichPolicy instance, just on the policy name. From the policy name, the enrich index alias can be resolved, and from that the currently active enrich index. The enrich index should have the match_field of the policy stored in the meta mapping; this is the only piece of information required to do the enrichment at ingest time. Decouple enrich processor factory from enrich policy #45826
- Add overwrite parameter to enrich processor. Add support for overwrite parameter in the enrich processor. #45029
- Add template support to field and target_field parameters.
- Include match count into document being enriched to see whether there were no matches or multiple matches.
- Add a LRU cache that is only used when enrich processor needs to make a remote call to do the lookup.
- Add support for match policy type.
- Add support for geo_shape_match policy type. Add support for geo_shape_match enrich policy type #42639
- Add support for ip_range_match policy type.
- Explore warming the LRU cache based on entries from the previous enrich index.
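The processor options discussed in this list (field, target_field, overwrite) can be illustrated with a small in-memory sketch. This is Python, not the actual processor: the function `enrich` and the dict-based lookup are assumptions, and the exact semantics of overwrite are assumed here to mean "replace an existing target_field only when overwrite is true".

```python
def enrich(document, enrich_index, field, target_field, overwrite=True):
    """Place the whole looked-up document under target_field.

    Hypothetical sketch: enrich_index is a plain dict standing in for an enrich index.
    """
    match = enrich_index.get(document.get(field))
    if match is None:
        return document  # no match: the document is left as it was
    if target_field in document and not overwrite:
        return document  # assumed semantics: keep the existing value
    document[target_field] = dict(match)
    return document


enrich_index = {"u1": {"first": "Ada", "last": "Lovelace"}}

enriched = enrich({"user_id": "u1", "message": "login"}, enrich_index, "user_id", "user")
kept = enrich({"user_id": "u1", "user": "keep"}, enrich_index, "user_id", "user", overwrite=False)
unmatched = enrich({"user_id": "zz"}, enrich_index, "user_id", "user")
```

Placing the whole match under one target_field mirrors the description above of following what the geoip processor does, rather than copying individual values to separately configured targets.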
Policy management
- Think about bwc around enrich policy types. (add created version to EnrichPolicy?) (@jbaiera) Add the cluster version to enrich policies #45021
- Execute force merge when running policy. (@jbaiera) Add force merge step to Enrich Policy execution #41969
- Introduce background process that removes enrich indices that are not referenced by an alias or for which no policy exists. The background process should mark indices for deletion first, and remove them in the next execution (to avoid deleting indices that have been freshly retired from the enrich alias and are potentially still in use). Also, the background process should not delete any indices that are tied to policies currently being executed: we don't want to throw out new indices that are currently being populated by a policy execution. (@jbaiera) Add Enrich index background task to cleanup old indices #43746
- Add validation that enrich key fields / enrich values fields are not inside an array of objects (nested). (@jbaiera) Enrich validate nested mappings #42452
- De-normalize nested data inside source index when executing policy.
- Stats (in memory)
- Error Handling
- Add description to .enrich index as _meta mapping to indicate that this index is managed by ES and shouldn't be modified in any way. (@jbaiera)
- Always drop the _id and _routing field from documents originating from source indices. This to ensure the uniqueness of documents. (@jbaiera)
- Overwrite specific index settings on enrich index: disable field data, global ordinals loading, shard allocation filtering, automatic refresh.
- If the force merge during policy execution results in more than one segment, should we retry the force merge or fail the execute policy request?
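The mark-then-delete behaviour described for the background cleanup process can be sketched as follows. This is a simplified Python model, not the actual task: the function `cleanup_pass`, the set-based inputs, and the index names are hypothetical, and the "no policy exists" condition is folded into "not referenced by an alias" for brevity.

```python
def cleanup_pass(enrich_indices, aliased, being_populated, marked):
    """Run one cleanup pass and return (indices_to_delete, newly_marked).

    An index is deleted only if it was already marked in the previous pass,
    so a freshly retired index survives at least one full interval.
    """
    candidates = {
        name
        for name in enrich_indices
        if name not in aliased and name not in being_populated
    }
    to_delete = candidates & marked   # unreferenced for two passes in a row
    newly_marked = candidates - marked  # first sighting: mark, do not delete
    return to_delete, newly_marked


# Hypothetical index names for the sketch.
indices = {".enrich-users-1", ".enrich-users-2", ".enrich-hosts-1"}
aliased = {".enrich-users-2"}          # currently active behind the alias
being_populated = {".enrich-hosts-1"}  # a policy execution is still writing it

first_delete, marked = cleanup_pass(indices, aliased, being_populated, set())
second_delete, marked_after = cleanup_pass(indices, aliased, being_populated, marked)
```

In this model the retired index `.enrich-users-1` is only marked on the first pass and deleted on the second, while the index still being populated is never a candidate.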
APIs
- Get policy API
- Execute policy API.
- Add manage_enrich privilege.
- Make policies immutable. The PUT policy api should fail when a policy already exists, so effectively this api can only return a 200 response code. If a policy needs to be changed then it first needs to be removed, or alternatively, a new policy under a different name should be added. (@hub-cap) Ensure enrich policy is immutable #43604
- A policy should not be removed if a pipeline is still referencing it. (@hub-cap) Fail delete policy if pipeline exists #44438
- The delete policy api should first remove all enrich indices of a policy, before removing the policy from the cluster state. (@hub-cap) Remove enrich indices on delete policy #45870
- Use has_privilege api as part of put policy api to check whether the user has sufficient privileges in source index. (@hub-cap) Validate read priv of enrich source indices #43595
- Policy name validation. The validation should be similar to index name validation, because the policy name is used to create an index (same validation as in MetaDataCreateIndexService#validateIndexOrAliasName). (@martijnvg)
- Replace current get and list APIs with another API that returns both a single policy and all policies. For example GET _enrich/policy/users-policy (specific policy) and GET _enrich/policy (all policies). Both variants should always return a list of objects. And later also support: GET _enrich/policy/users-* and GET _enrich/policy/users-policy,users2-policy. (@hub-cap) Consolidate enrich list all and get by name APIs #45705
- CRUD for enrich policy (@hub-cap) _enrich/policy/name
- Store enrich policy in an index (.enrich-policies?) instead of in the cluster state. (@hub-cap) Use an index to store enrich policies #47475
- Stats API
- Integrate stats api with monitoring
- Telemetry support
- task api for execute ?wait_for_completion=false (@hub-cap)
- GET wildcard and comma separated policy names (@hub-cap)
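Several of the API rules above (immutable policies, a get that always returns a list, wildcard and comma-separated names) can be modelled with a small in-memory store. The Python below is a sketch under assumptions: the class `PolicyStore`, the response shape, and the use of `fnmatchcase` for wildcard matching are illustrative choices, not the Elasticsearch implementation.

```python
from fnmatch import fnmatchcase


class PolicyStore:
    """Hypothetical in-memory stand-in for the enrich policy APIs."""

    def __init__(self):
        self._policies = {}

    def put(self, name, policy):
        # Policies are immutable: a second put under the same name fails.
        if name in self._policies:
            raise ValueError(f"policy [{name}] already exists")
        self._policies[name] = policy

    def get(self, expression=None):
        # Always returns a list, whether one, several, or all policies match.
        patterns = ["*"] if expression is None else expression.split(",")
        return [
            {"name": name, "policy": policy}
            for name, policy in sorted(self._policies.items())
            if any(fnmatchcase(name, pattern) for pattern in patterns)
        ]


store = PolicyStore()
store.put("users-policy", {"exact_match": {"match_field": "prsnl.id"}})
store.put("users2-policy", {"exact_match": {"match_field": "prsnl.id"}})

single = store.get("users-policy")
wildcard = store.get("users-*")
several = store.get("users-policy,users2-policy")
everything = store.get()
```

To change a policy in this model, a caller would have to delete it and put it again, or add a new policy under a different name, as described above.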
Misc
- Restart qa test
- Documentation
- Enable / Disable settings
- HLRC
- update Kibana roles for new role, to be done after the feature branch is merged to master (obsoleted by Role Management - use ES Builtin Privilege API to drive list of privileges kibana#40270)
- update stack docs for the new role, to be done after the feature branch is merged to master
- Transport client support. (@hub-cap) Add enrich transport client support #46002
- Integration with xpack usage api.
EDITS:
- 2019-04-08: Changed the original description of this issue to reflect the current direction.
- 2019-05-07: Updated after planning meeting.
