
Conversation

@yanjunh (Contributor) commented Sep 13, 2015

This is the implementation for #11443. It adds "max_number_of_fields" as an option to object mappings. The document is discarded if the number of direct sub-fields reaches the limit. The default behavior is no limit.
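For illustration, a minimal sketch of how the proposed option might appear in a mapping (hypothetical; the option name "max_number_of_fields" comes from this PR, while the index name, type name, and field names are made up):

  curl -XPUT localhost:9200/logs -d '{
    "mappings": {
      "event": {
        "properties": {
          "fields": {
            "type": "object",
            "max_number_of_fields": 1000
          }
        }
      }
    }
  }'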

@nik9000 (Member) commented Sep 14, 2015

I wonder if this should be an index setting instead of an object mapper one. It's neat that you can get that specificity on the object mapper, but it feels like what matters is the total number of fields on the index, not the number of fields in an object.

@yanjunh (Contributor, Author) commented Sep 15, 2015

Thanks for the feedback. I agree with moving the setting to the index settings; it's rarely useful for people to set a different limit for each object. The total number of fields matters, but I think the limit should still be applied at the object level, so it's possible to skip indexing objects that have reached the limit while still indexing the rest of the document (same as dynamic:false). Dropping the whole document just feels like too much unnecessary information loss. I can see a developer grabbing an in-memory object and adding a few extra fields for logging; the developer may not know that some part of the in-memory object is bad for indexing, and may not even be able to change the object's content. It would be nice to index the "good" part of the document in this case.

@nik9000 (Member) commented Sep 15, 2015

The debate over on #11443 went the other way - be noisy and warn people when they go over the limit and reject the document. Make them fix it.

@clintongormley (Contributor) commented:

I agree with @nik9000 that this should be at the index level. The idea is to warn people who are creating too many fields that they are going to have problems. If an object is set to enabled:false, then it won't create fields and won't count towards the total.

I also think we should impose a limit by default. Not sure what the limit should be, though. 1000?
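For reference, a small sketch of the enabled:false case mentioned above (index, type, and field names are illustrative): the object's contents are kept in _source but are not parsed into fields, so they would not count towards any limit.

  curl -XPUT localhost:9200/logs -d '{
    "mappings": {
      "event": {
        "properties": {
          "debug_payload": {
            "type": "object",
            "enabled": false
          }
        }
      }
    }
  }'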

@jpountz (Contributor) commented Sep 24, 2015

1000 sounds like a reasonable default to me.

@kimchy (Member) commented Sep 24, 2015

I like the idea of rejecting the document. Just a note regarding the exception: in master we are trying to get away from having many dedicated exceptions, so maybe we can use one of the built-in ones?

@yanjunh (Contributor, Author) commented Sep 25, 2015

I will take a look next week; I'm totally occupied this week. I have some concerns about rejecting the document: it basically stops indexing once the limit has been reached, until the logging can be fixed. Our use of ELK is mission critical, especially during outages, so we need to keep indexing going even when part of the document is not searchable. BTW, if we don't reject the document, then we don't need the exception; we just need to log a message.

@yanjunh (Contributor, Author) commented Oct 3, 2015

Added two settings (both runtime adjustable):
"index.subfields.limit" -- the number of fields an object can contain. Positive integer value. Default is no limit.
"index.subfields.dynamic_at_limit" -- the dynamic-field behavior once the limit is reached. It has three values: "strict" drops the document and logs an exception (this is the default); "false" keeps the document but does not index the field; "true" keeps the document and indexes the field. A warning is logged for the "true" and "false" settings.

Here is our use case:
We pushed this to production yesterday with the following settings:
"index.subfields.limit": 1000,
"index.subfields.dynamic_at_limit": false
After a few minutes, we started to see the following in the Elasticsearch log files:
[f812d212-72ee-4099-aae8-6b1503a50d8c_] exceeds the max number of fields configured for [fields.realObj_]
From this we knew that fields.realObj_ contains uuid keys and had reached the configured limit of 1000. We contacted the eng team and they turned off emitting uuid keys. We then increased the limit by 100 using the following:
curl -XPUT localhost:9200/index_name/_settings -d '{"index.subfields.limit":1100}'

So new dynamic fields within "fields.realObj_" can still be indexed.
Throughout the process the cluster stayed up, we didn't lose any documents, and we indexed all possible fields.

@clintongormley (Contributor) commented:

Hi @yanjunh

I have some concerns about rejecting the document: it basically stops indexing once the limit has been reached, until the logging can be fixed. Our use of ELK is mission critical, especially during outages, so we need to keep indexing going even when part of the document is not searchable. BTW, if we don't reject the document, then we don't need the exception; we just need to log a message.

We know from long and bitter experience that most people seldom look at log messages. The point of adding this limit is to:

  • inform the user early that they are doing something inadvisable
  • stop runaway field addition from doing damage
  • make the limit dynamically updatable so that the user can pick up quickly from where they left off while making longer term plans to fix the issue

Logging the message is not enough for this. It needs to be an exception if it is going to have any chance of helping. I also don't like indexing some fields and not others - it just leads to surprises later on. (Why is my data wrong?) If we just reject documents, then it is very clear what is going wrong and which documents have not been indexed.

index.subfields.limit

The important thing here is the number of Lucene fields, not the number of fields within an object. Don't forget multi-fields, copy_to fields etc.
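As an illustrative example (index, type, and field names are made up), a single JSON field mapped with a multi-field and a copy_to target already produces three Lucene fields (message, message.raw, and all_text), so counting per object would undercount:

  curl -XPUT localhost:9200/logs/_mapping/event -d '{
    "properties": {
      "all_text": { "type": "string" },
      "message": {
        "type": "string",
        "copy_to": "all_text",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }'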

I think this should be a single, per-index, dynamic setting: index.fields.limit, defaulting to 1000. If this limit is breached, then we reject any document that tries to add another field by throwing an exception.
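A sketch of what that proposal could look like, assuming the setting name index.fields.limit from this comment (the name was only a proposal at this point; the index name is illustrative):

  # set the limit at index creation time
  curl -XPUT localhost:9200/logs -d '{
    "settings": { "index.fields.limit": 1000 }
  }'

  # raise it later without a restart, since the setting is dynamic
  curl -XPUT localhost:9200/logs/_settings -d '{
    "index.fields.limit": 2000
  }'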

@clintongormley added labels on Oct 6, 2015: discuss, :Search Foundations/Mapping (Index mappings, including merging and defining field types)
@jpountz (Contributor) commented Oct 6, 2015

I think this should be a single, per-index, dynamic setting: index.fields.limit, defaulting to 1000. If this limit is breached, then we reject any document that tries to add another field by throwing an exception.

+1 to this proposal

@nik9000 (Member) commented Oct 6, 2015

Logging the message is not enough for this. It needs to be an exception if it is going to have any chance of helping. I also don't like indexing some fields and not others - it just leads to surprises later on. (Why is my data wrong?) If we just reject documents, then it is very clear what is going wrong and which documents have not been indexed.

@clintongormley, what do you say to making the behavior on violation configurable, defaulting to throwing an error? I hate to add more knobs to tune, but if it makes @yanjunh happy I'm OK with it.

+1 to this proposal

+1

@jpountz (Contributor) commented Oct 6, 2015

@clintongormley, what do you say to making the behavior on violation configurable, defaulting to throwing an error? I hate to add more knobs to tune, but if it makes @yanjunh happy I'm OK with it.

This is a bit scary to me: if someone comes to the forums asking why some fields are behaving weirdly, it might take me a lot of time before realizing that it is because elasticsearch was configured to ignore new fields past a certain field number. I would much rather not allow this so that elasticsearch remains more predictable.

@clintongormley (Contributor) commented:

it might take me a lot of time before realizing that it is because elasticsearch was configured to ignore new fields past a certain field number

Exactly my thoughts. It is just too easy to forget.

@rjernst (Member) commented Oct 6, 2015

If this limit is breached, then we reject any document that tries to add another field by throwing an exception.

And also reject mapping updates with new fields right? Not just dynamic field addition?

+1 to the plan.

@nik9000 (Member) commented Oct 6, 2015

And also reject mapping updates with new fields right? Not just dynamic field addition?

Yes!
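In other words (a hypothetical request against a hypothetical index that is already at its field limit), an explicit mapping update such as the one below would also be rejected with an exception instead of being applied:

  curl -XPUT localhost:9200/logs/_mapping/event -d '{
    "properties": {
      "one_field_too_many": { "type": "string" }
    }
  }'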

@yanjunh (Contributor, Author) commented Oct 7, 2015

Thanks for all the comments. I think the key argument is whether the document should be kept or dropped after the limit is reached.

Our system sends out many more different kinds of logs during an outage. It has happened in the past that Elasticsearch stopped working when it was needed the most because of bad data in the outage logs. If we just drop the document after the field limit is reached, it's the same undesirable situation for the engineers. That's why I added an option to keep the document after the limit is reached: people can still see the logs, and that helps them combat the outage.

And if there is an option to keep the document, applying the limit index-wide makes the result less predictable: we don't know whether a new field will be indexed, since it depends on the order in which it appears in the stream. If the limit is applied per object, the bad object will not be indexed properly but the rest of the document will be fine and still usable. Most likely people don't need to search or aggregate inside the bad object anyway; they just need to see the source.

I admit that even if we drop the document, it's far better than letting the cluster go down. I will make another push if everyone prefers to drop the document.

@clintongormley (Contributor) commented:

I admit that even if we drop the document, it's far better than letting the cluster go down. I will make another push if everyone prefers to drop the document.

Sorry for the delay in feedback. Yes, we are in favour of throwing an exception if too many fields are added to an index. The limit should be controlled by an index level setting.

@yanjunh (Contributor, Author) commented Jan 19, 2016

Sorry, I haven't worked on the promised change yet. I need to find an efficient way to get the total number of fields, not just the number of direct sub-fields. Unfortunately our application doesn't allow us to drop documents even after the limit has been reached, so we still need a setting that lets us keep the document, but we are fine with dropping the document being the default.

@jpountz (Contributor) commented Jan 19, 2016

You might want to look at #15989 that performs a similar validation.

@clintongormley (Contributor) commented:

Hi @yanjunh

Are you still interested in working on this?

@yanjunh (Contributor, Author) commented Mar 10, 2016

@clintongormley Sure. I just started to port this to the master branch. I got pulled into other issues while moving to 2.x. Hopefully I can get another diff up sometime next week.

@jpountz (Contributor) commented Mar 29, 2016

Fixed on master via #17357.
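For reference, the limit added by #17357 is, to the best of my understanding, a dynamic index setting named index.mapping.total_fields.limit with a default of 1000; raising it looks like this (index name illustrative):

  curl -XPUT localhost:9200/index_name/_settings -d '{
    "index.mapping.total_fields.limit": 2000
  }'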

@jpountz closed this Mar 29, 2016