add max_number_of_fields to Object mapping #13547
I wonder if this should be an index setting instead of an object mapper one. It's neat that you can get that specificity on the object mapper, but it feels like what matters is the total number of fields on the index, not the number of fields in an object.
Thanks for the feedback. I agree with moving the setting to index settings; I feel it's rarely useful for people to set different limits for different objects. The total number of fields matters, but I think the limit should still be applied at the object level. That makes it possible to skip indexing objects that have reached the limit while still indexing the rest of the document (same as dynamic:false). Dropping the whole document just feels like too much unnecessary information loss. I can see a developer grabbing an in-memory object and adding a few extra fields for logging. The developer may not know that some part of the in-memory object is bad for indexing, and they may not even be able to change the object's content. It would be nice to index the "good" part of the document in this case.
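For context, a minimal sketch of the dynamic:false behavior referenced above (the index, type, and field names here are illustrative). Fields under such an object survive in _source but are never mapped or indexed:

PUT logs
{
  "mappings": {
    "event": {
      "properties": {
        "payload": {
          "type": "object",
          "dynamic": false
        }
      }
    }
  }
}

A document like {"payload": {"anything": 1}} then indexes successfully, but payload.anything is not searchable; it is only visible in _source.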
The debate over on #11443 went the other way: be noisy and warn people when they go over the limit, and reject the document. Make them fix it.
I agree with @nik9000 that this should be at the index level. The idea is to warn people who are creating too many fields that they are going to have problems. If an object is set to enabled:false, then it won't create fields and won't count towards the total. I also think we should impose a limit by default. I'm not sure how high, though. 1000?
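For comparison, a sketch of the enabled:false case mentioned above (names illustrative): the object's content is kept in _source but parsed into no Lucene fields at all, so it creates nothing that would count against the proposed limit:

PUT logs
{
  "mappings": {
    "event": {
      "properties": {
        "debug_blob": {
          "type": "object",
          "enabled": false
        }
      }
    }
  }
}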
1000 sounds like a reasonable default to me.
I like the idea of rejecting the document. Just a note regarding the exception: in master we are trying to get away from having many dedicated exceptions; maybe we can use one of the built-in ones?
I will take a look next week; I'm totally occupied this week. I have some concerns about rejecting the document: it basically stops indexing once the limit has been reached, until the logging can be fixed. Our ELK use case is mission critical, especially during outages, so we need indexing to keep going even when part of the document is not searchable. BTW, if we don't reject the document, then we don't need the exception; we just need to log a message.
Add 2 settings (both runtime adjustable). Here is our use case: with these two settings, new dynamic fields within "fields.realObj_" can still be indexed.
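A purely hypothetical sketch of what such a pair of runtime-adjustable settings could look like; both setting names are invented here for illustration and are not taken from the actual change:

PUT logs/_settings
{
  "index.mapping.max_number_of_fields": 1000,
  "index.mapping.max_fields_action": "skip"
}

The idea would be that "skip" stops mapping new dynamic fields once the limit is hit while still indexing the rest of the document, whereas "reject" fails the whole document.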
Hi @yanjunh
We know from long and bitter experience that most people seldom look at log messages. The point of adding this limit is to warn users, loudly and early, that they are creating too many fields and are heading for problems.
Logging a message is not enough for this; it needs to be an exception if it is going to have any chance of helping. I also don't like indexing some fields and not others; it just leads to surprises later on. (Why is my data wrong?) If we just reject documents, then it is very clear what is going wrong and which documents have not been indexed.
The important thing here is the number of Lucene fields, not the number of fields within an object. Don't forget multi-fields, copy_to fields, etc. I think this should be a single, per-index, dynamic setting.
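For reference, the per-index dynamic setting that ultimately shipped in Elasticsearch (via #17357, noted at the end of this thread) is index.mapping.total_fields.limit, which defaults to 1000 and counts every mapped field, multi-fields included:

PUT logs/_settings
{
  "index.mapping.total_fields.limit": 1000
}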
+1 to this proposal
@clintongormley, what do you say to making the behavior on violation configurable, defaulting to throwing an error? I hate to add more knobs to tune, but if it makes @yanjunh happy I'm OK with it.
+1
This is a bit scary to me: if someone comes to the forums asking why some fields are behaving weirdly, it might take me a long time to realize that it is because elasticsearch was configured to ignore new fields past a certain number. I would much rather not allow this, so that elasticsearch remains more predictable.
Exactly my thoughts. It is just too easy to forget.
And we should also reject mapping updates that add new fields, right? Not just dynamic field additions? +1 to the plan.
Yes! |
Thanks for all the comments. I think the key question is whether the document should be kept or dropped after the limit is reached. Our system sends out many more kinds of logs during an outage, and it has happened in the past that Elasticsearch stopped working when it was needed most because of bad data in the outage logs. If we just drop the document after the field limit is reached, the engineers are in the same undesirable situation. That's why I added an option to keep the document after the limit is reached: people can still see the logs, which helps them combat the outage. And if there is an option to keep the document, applying the limit index-wide makes the result less predictable; we don't know whether a new field will be indexed, since it depends on the order in which it appears in the stream. If the limit is applied per object, the bad object will not be indexed properly, but the rest of the document will be fine and still usable. Most likely people don't need to search or aggregate inside the bad object anyway; they just need to see the source. I admit that even if we drop the document, it's far better than taking the cluster down. I will make another push if everyone prefers to drop the document.
Sorry for the delay in feedback. Yes, we are in favour of throwing an exception if too many fields are added to an index. The limit should be controlled by an index-level setting.
Sorry I haven't worked on the promised change yet. I need to find an efficient way to get the total number of fields, not just the number of direct sub-fields. Unfortunately our application doesn't allow us to drop documents even after the limit has been reached, so we also need a setting that lets us keep the document, but we are fine with dropping the document being the default.
You might want to look at #15989, which performs a similar validation.
Hi @yanjunh, are you still interested in working on this?
@clintongormley Sure. I've just started porting this to the master branch; I got dragged into other issues while moving to 2.x. Hopefully I can get another diff out sometime next week.
Fixed on master via #17357. |
This is the implementation for #11443. It adds "max_number_of_fields" as an option on object mappings and discards the document if the number of direct sub-fields reaches the limit. The default behavior is no limit.
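Based on that description, a sketch of how the proposed option might have looked in a mapping (the syntax here is illustrative, not taken from the patch itself); only direct sub-fields of the annotated object count toward the limit, and omitting the option leaves the object unlimited:

PUT logs
{
  "mappings": {
    "event": {
      "properties": {
        "fields": {
          "type": "object",
          "max_number_of_fields": 100
        }
      }
    }
  }
}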