
Conversation

@yanjunh (Contributor) commented Sep 13, 2015

This is the implementation for #11443. It adds "max_number_of_fields" as an option to object mappings. The document is discarded if the number of direct sub-fields reaches the limit. The default behavior is no limit.
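For illustration, a minimal sketch of how the proposed option might appear in a mapping (hypothetical; the option name "max_number_of_fields" comes from this PR, while the index name, type name, and field names are made up):

  curl -XPUT localhost:9200/logs -d '{
    "mappings": {
      "event": {
        "properties": {
          "fields": {
            "type": "object",
            "max_number_of_fields": 1000
          }
        }
      }
    }
  }'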

@nik9000 (Member) commented Sep 14, 2015

I wonder if this should be an index setting instead of an object mapper one. It's neat that you can get that specificity on the object mapper, but it feels like what matters is the total number of fields on the index, not the number of fields in an object.

@yanjunh (Contributor, Author) commented Sep 15, 2015

Thanks for the feedback. I agree with moving the setting to the index settings; it's rarely useful for people to set a different limit for each object. The total number of fields matters, but I think the limit should still be applied at the object level, so it's possible to skip indexing objects that have reached the limit while still indexing the rest of the document (same as dynamic:false). Dropping the whole document just feels like too much unnecessary information loss. I can see a developer grabbing an in-memory object and adding a few extra fields for logging; the developer may not know that some part of the in-memory object is bad for indexing, and may not even be able to change the object's content. It would be nice to index the "good" part of the document in this case.

@nik9000 (Member) commented Sep 15, 2015

The debate over on #11443 went the other way - be noisy and warn people when they go over the limit and reject the document. Make them fix it.

@clintongormley (Contributor) commented:

I agree with @nik9000 that this should be at the index level. The idea is to warn people who are creating too many fields that they are going to have problems. If an object is set to enabled:false, then it won't create fields and won't count towards the total.

I also think we should impose a limit by default. Not sure what the limit should be, though. 1000?
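For reference, a small sketch of the enabled:false case mentioned above (index, type, and field names are illustrative): the object's contents are kept in _source but are not parsed into fields, so they would not count towards any limit.

  curl -XPUT localhost:9200/logs -d '{
    "mappings": {
      "event": {
        "properties": {
          "debug_payload": {
            "type": "object",
            "enabled": false
          }
        }
      }
    }
  }'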

@jpountz (Contributor) commented Sep 24, 2015

1000 sounds like a reasonable default to me.

@kimchy (Member) commented Sep 24, 2015

I like the idea of rejecting the document. Just a note regarding the exception: in master we are trying to get away from having many dedicated exceptions, so maybe we can use one of the built-in ones?

@yanjunh (Contributor, Author) commented Sep 25, 2015

I will take a look next week; I'm totally occupied this week. I have some concerns about rejecting the document: it basically stops indexing once the limit has been reached, until the logging can be fixed. Our use of ELK is mission critical, especially during outages, so we need to keep indexing going even when part of the document is not searchable. BTW, if we don't reject the document, then we don't need the exception; we just need to log a message.

@yanjunh (Contributor, Author) commented Oct 3, 2015

Added two settings (both runtime adjustable):
"index.subfields.limit" -- the number of fields an object can contain. Positive integer value. Default is no limit.
"index.subfields.dynamic_at_limit" -- the dynamic-field behavior once the limit is reached. It has three values: "strict" drops the document and logs an exception (this is the default); "false" keeps the document but does not index the field; "true" keeps the document and indexes the field. A warning is logged for the "true" and "false" settings.

Here is our use case:
We pushed this to production yesterday with the following settings:
"index.subfields.limit": 1000,
"index.subfields.dynamic_at_limit": false
After a few minutes, we started to see the following in the Elasticsearch log files:
[f812d212-72ee-4099-aae8-6b1503a50d8c_] exceeds the max number of fields configured for [fields.realObj_]
From this we knew that fields.realObj_ contains uuid keys and had reached the configured limit of 1000. We contacted the eng team and they turned off emitting uuid keys. We then increased the limit by 100 using the following:
curl -XPUT localhost:9200/index_name/_settings -d '{"index.subfields.limit":1100}'

So new dynamic fields within "fields.realObj_" can still be indexed.
Throughout the process the cluster stayed up, we didn't lose any documents, and we indexed all possible fields.

@clintongormley (Contributor) commented:

Hi @yanjunh

I have some concerns about rejecting the document: it basically stops indexing once the limit has been reached, until the logging can be fixed. Our use of ELK is mission critical, especially during outages, so we need to keep indexing going even when part of the document is not searchable. BTW, if we don't reject the document, then we don't need the exception; we just need to log a message.

We know from long and bitter experience that most people seldom look at log messages. The point of adding this limit is to:

  • inform the user early that they are doing something inadvisable
  • stop runaway field addition from doing damage
  • make the limit dynamically updatable so that the user can pick up quickly from where they left off while making longer term plans to fix the issue

Logging the message is not enough for this. It needs to be an exception if it is going to have any chance of helping. I also don't like indexing some fields and not others - it just leads to surprises later on. (Why is my data wrong?) If we just reject documents, then it is very clear what is going wrong and which documents have not been indexed.

index.subfields.limit

The important thing here is the number of Lucene fields, not the number of fields within an object. Don't forget multi-fields, copy_to fields etc.
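As an illustrative example (index, type, and field names are made up), a single JSON field mapped with a multi-field and a copy_to target already produces three Lucene fields (message, message.raw, and all_text), so counting per object would undercount:

  curl -XPUT localhost:9200/logs/_mapping/event -d '{
    "properties": {
      "all_text": { "type": "string" },
      "message": {
        "type": "string",
        "copy_to": "all_text",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }'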

I think this should be a single, per-index, dynamic setting: index.fields.limit, defaulting to 1000. If this limit is breached, then we reject any document that tries to add another field by throwing an exception.
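A sketch of what that proposal could look like, assuming the setting name index.fields.limit from this comment (the name was only a proposal at this point; the index name is illustrative):

  # set the limit at index creation time
  curl -XPUT localhost:9200/logs -d '{
    "settings": { "index.fields.limit": 1000 }
  }'

  # raise it later without a restart, since the setting is dynamic
  curl -XPUT localhost:9200/logs/_settings -d '{
    "index.fields.limit": 2000
  }'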

@clintongormley added labels on Oct 6, 2015: discuss, :Search Foundations/Mapping (Index mappings, including merging and defining field types)
@jpountz (Contributor) commented Oct 6, 2015

I think this should be a single, per-index, dynamic setting: index.fields.limit, defaulting to 1000. If this limit is breached, then we reject any document that tries to add another field by throwing an exception.

+1 to this proposal

@nik9000 (Member) commented Oct 6, 2015

Logging the message is not enough for this. It needs to be an exception if it is going to have any chance of helping. I also don't like indexing some fields and not others - it just leads to surprises later on. (Why is my data wrong?) If we just reject documents, then it is very clear what is going wrong and which documents have not been indexed.

@clintongormley, what do you say to making the behavior on violation configurable, defaulting to throwing an error? I hate to add more knobs to tune, but if it makes @yanjunh happy I'm OK with it.

+1 to this proposal

+1

@jpountz (Contributor) commented Oct 6, 2015

@clintongormley, what do you say to making the behavior on violation configurable, defaulting to throwing an error? I hate to add more knobs to tune, but if it makes @yanjunh happy I'm OK with it.

This is a bit scary to me: if someone comes to the forums asking why some fields are behaving weirdly, it might take me a lot of time before realizing that it is because elasticsearch was configured to ignore new fields past a certain field number. I would much rather not allow this so that elasticsearch remains more predictable.

@clintongormley (Contributor) commented:

it might take me a lot of time before realizing that it is because elasticsearch was configured to ignore new fields past a certain field number

Exactly my thoughts. It is just too easy to forget.

@rjernst (Member) commented Oct 6, 2015

If this limit is breached, then we reject any document that tries to add another field by throwing an exception.

And also reject mapping updates with new fields right? Not just dynamic field addition?

+1 to the plan.

@nik9000 (Member) commented Oct 6, 2015

And also reject mapping updates with new fields right? Not just dynamic field addition?

Yes!
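In other words (a hypothetical request against a hypothetical index that is already at its field limit), an explicit mapping update such as the one below would also be rejected with an exception instead of being applied:

  curl -XPUT localhost:9200/logs/_mapping/event -d '{
    "properties": {
      "one_field_too_many": { "type": "string" }
    }
  }'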

@yanjunh (Contributor, Author) commented Oct 7, 2015

Thanks for all the comments. I think the key argument is whether the document should be kept or dropped after the limit is reached.

Our system sends out many more different kinds of logs during an outage. It has happened in the past that Elasticsearch stopped working when it was needed the most because of bad data in the outage logs. If we just drop the document after the field limit is reached, it's the same undesirable situation for the engineers. That's why I added an option to keep the document after the limit is reached: people can still see the logs, and that helps them combat the outage.

And if there is an option to keep the document, applying the limit index-wide makes the result less predictable: we don't know whether a new field will be indexed, since it depends on the order in which it appears in the stream. If the limit is applied per object, the bad object will not be indexed properly but the rest of the document will be fine and still usable. Most likely people don't need to search or aggregate inside the bad object anyway; they just need to see the source.

I admit that even if we drop the document, it's far better than letting the cluster go down. I will make another push if everyone prefers to drop the document.

@clintongormley (Contributor) commented:

I admit that even if we drop the document, it's far better than letting the cluster go down. I will make another push if everyone prefers to drop the document.

Sorry for the delay in feedback. Yes, we are in favour of throwing an exception if too many fields are added to an index. The limit should be controlled by an index level setting.

@yanjunh (Contributor, Author) commented Jan 19, 2016

Sorry, I haven't worked on the promised change yet. I need to find an efficient way to get the total number of fields, not just the number of direct sub-fields. Unfortunately our application doesn't allow us to drop documents even after the limit has been reached, so we still need a setting that lets us keep the document, but we are fine with dropping the document being the default.

@jpountz (Contributor) commented Jan 19, 2016

You might want to look at #15989 that performs a similar validation.

@clintongormley (Contributor) commented:

Hi @yanjunh

Are you still interested in working on this?

@yanjunh (Contributor, Author) commented Mar 10, 2016

@clintongormley Sure. I just started to port this to the master branch. I got pulled into other issues while moving to 2.x. Hopefully I can get another diff up sometime next week.

@jpountz (Contributor) commented Mar 29, 2016

Fixed on master via #17357.
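For reference, the limit added by #17357 is, to the best of my understanding, a dynamic index setting named index.mapping.total_fields.limit with a default of 1000; raising it looks like this (index name illustrative):

  curl -XPUT localhost:9200/index_name/_settings -d '{
    "index.mapping.total_fields.limit": 2000
  }'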

@jpountz closed this Mar 29, 2016