Skip to content

Conversation

@felixbarny
Copy link
Member

@felixbarny felixbarny commented Jun 28, 2021

Adds support for allowing duplicate keys.

Also, if add_to_root is true, it will recursively merge the deserialized JSON object instead of overriding conflicting keys via putAll.

Background:
This is to support creating an ingest pipeline for ECS JSON logs.
Due to how some logging frameworks work, ECS JSON logs may contain duplicate keys. Instead of failing, the JSON parser should be more lenient and prefer the last value.

Merging instead of putAll is important as Filebeat sets values for data_stream.type, data_stream.dataset, and data_stream.namespace. If the logs override just one, such as data_stream.dataset, currently, the whole data_stream namespace would be overridden.

@felixbarny felixbarny added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 28, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 28, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@danhermann danhermann self-requested a review June 28, 2021 10:04
Copy link
Contributor

@danhermann danhermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging instead of putAll is important as Filebeat sets values for data_stream.type, data_stream.dataset, and data_stream.namespace. If the logs override just one, such as data_stream.dataset, currently, the whole data_stream namespace would be overridden.

This is a breaking change in the JSON processor's behavior that we can't add in a minor release unless it's an opt-in feature.

Also, we need some unit tests for the new allowDuplicateKeys option for XContent parsing.

| `add_to_root` | no | false | Flag that forces the serialized json to be injected into the top level of the document. `target_field` must not be set when this option is chosen.
| Name | Required | Default | Description
| `field` | yes | - | The field to be parsed.
| `target_field` | no | `field` | The field that the converted structured object will be written into. Any existing content in this field will be overwritten.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the "any existing content in this field will be overwritten" statement will no longer be necessarily true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is whether or not we want to alter the behavior for target_field. In my initial attempt, I've tried that but it made things quite complex. One reason is that target_field allows the source and target to be a scalar whereas add_to_root requires the source to be a map/object. Also, today the default behavior differs a lot depending on whether target_field or add_to_root is set.

The setting target_field always overrides the target. When add_to_root is set, the fields are kind of merged, but only on the first level. When we introduce a new merge option, it would be easier to only let that apply to add_to_root, however, that would make it somewhat confusing.

For now, I considered the behavior of the partial merge that applies if add_to_root is set to be a bug as it's arguably quite surprising. But I see that this may lead to backward-incompatible behavior if folks rely on the current behavior, as surprising as it may be.

I see three options now:

  • Leave it as is and declare the partial merging behavior a bug.
  • Introduce merge_to_root that only applies for the add_to_root, leaving the existing partial merge of add_to_root and the overriding behavior of target_field intact.
  • Introduce merge that applies to both target_field and add_to_root. If set to false (default), leaving the existing partial merge of add_to_root and the overriding behavior of target_field intact.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I'll bring this up at our team meeting and see if we can settle on a preferred option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the outcome of the discussion in the meeting, I've added an add_to_root_recursive_merge flag.

@felixbarny
Copy link
Member Author

Also, we need some unit tests for the new allowDuplicateKeys option for XContent parsing.

Doing that uncovered that CBOR and Smile currently don't support disabling duplicate detection after creating a parser. Not sure if that's intentional or a bug in Jackson. It doesn't seem impossible as the content gets parsed after creating the parser.

As we only need to disable the duplicate detection for JSON, I'll not go down the rabbit hole of creating a Jackson PR. Btw, we're still using the deprecated jackson-dataformats-cbor instead of the successor jackson-dataformats-binary. Instead, I'll throw an UnsupportedOperationException and override the test case for CBOR and Smile.

@felixbarny felixbarny added auto-backport Automatically create backport pull requests when merged v7.15.0 v8.0.0 labels Jul 1, 2021
@felixbarny felixbarny requested a review from danhermann July 1, 2021 19:11
@felixbarny
Copy link
Member Author

felixbarny commented Jul 6, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team team-discuss v7.15.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants