-
Notifications
You must be signed in to change notification settings - Fork 25.6k
QL: complex nested field types mappings fix #69368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…o temp_nested_fields_bugfix
|
Pinging @elastic/es-ql (Team:QL) |
|
@elasticmachine update branch |
bpintea
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of comments.
| } | ||
|
|
||
| /* | ||
| * For a tree of fields like root.nested1.nested2.leaf where nested1 and nested2 are nested field types, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would "a path of fields" make more sense? (since I guess the doc isn't parsed/accessed as a tree?)
| String tempHitname = hitName; | ||
| List<String> remainingPath = new ArrayList<>(); | ||
| // first, search for the "root" DocumentField under which the remaining path of nested document values is | ||
| while((field = hit.field(tempHitname)) == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spacing after while.
| List<String> remainingPath = new ArrayList<>(); | ||
| // first, search for the "root" DocumentField under which the remaining path of nested document values is | ||
| while((field = hit.field(tempHitname)) == null) { | ||
| int indexOfDot = tempHitname.lastIndexOf("."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just {"h": [{"f":{"b": {"a": "true"}}}]} and querying with SELECT h.f.o FROM test seems to trigger:
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734)
at java.base/java.lang.String.substring(String.java:1903)
at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113)
at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86)
at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129)
at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32)
at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38)
at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19)
at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39)
at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132)
at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28)
at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130)
at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125)
at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioning it here, but happy to open an issue if better (or just drop it, if wrong): selecting h.f.b.a will still return nothing.
Reason being that the inner_hits response from ES comes with with the entries listed as h.f_12 (for example), while we're looking for the inner hit h.f.b in SearchHitRowSet. Seems like this should be doable to work, but didn't look deeper.
This is not new, this didn't work before either (although no error message is returned, as mentioned in the limitations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! That's an older problem - SELECT h.f.b.a where b.a is a hierarchy of simple fields inside another hierarchy of nested fields -, for sure. It's the complexity of the nested documents definition possibilities that shows that our algorithm isn't covering all of them. I'd be inclined in dealing with this bug in a subsequent PR as it doesn't look straightforward and it may need a bit of rethinking the way we deal with nested fields hierarchies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just{"h": [{"f":{"b": {"a": "true"}}}]}and querying withSELECT h.f.o FROM testseems to trigger:Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1 at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734) at java.base/java.lang.String.substring(String.java:1903) at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113) at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86) at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129) at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32) at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38) at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19) at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132) at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97) at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)
We should prevent the error from occurring - looking at the stacktrace this seems to have been addressed yet I'm mentioning to double check this is indeed the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bpintea this change should fix the h.f.b.a returning nothing example you mentioned. Test for this exact case here.
| if (remainingPath.size() > 0) { | ||
| List<Object> values = field.getValues(); | ||
| Iterator<String> pathIterator = remainingPath.iterator(); | ||
| while(pathIterator.hasNext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spacing.
| Iterator<String> pathIterator = remainingPath.iterator(); | ||
| while(pathIterator.hasNext()) { | ||
| String pathElement = pathIterator.next(); | ||
| Map<String, Object> elements = (Map<String, Object>) values.get(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here (and when calling unwrapFieldsMultiValue below), only a first value is selected. Could the looked for path not be accessed through a subsequent element?
matriv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the great randomised test, Imho we should also have some plain tests that target specific cases, as for example those mentioned already by @bpintea
…o temp_nested_fields_bugfix
|
@matriv thank you for the review. I am not sure how useful it is to have a specific test for that scenario. There could be many more scenarios like that one, ones that we didn't even think about, just because the combination of mappings possible is so complex. While we do have specific tests for specific scenarios in @bpintea spotted one scenario where the value of the requested field is |
cbuescher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I didn't go through all the details of the extraction implementation yet, its a bit different to what we do on the fields side of things where we deal with the _source map. I left a question around testing, I tend to agree with @matriv that in addition to the randomized testing it would be great to have a few common cases spelled out explicitely without too much randomness to get a basic coverage of those. But I don't know how well this fits with the general testing philosophy and infra in ql, so I leave this only as a remark.
| } | ||
| index.endObject(); | ||
| // from time to time set a null value instead of an actual value | ||
| if (randomBoolean()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use null less often than 50% of the times in randomization here, it should be the exception I think. Using ' rarely()` this case would still get picked 1% of the time, should be enough?
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null"); | ||
| } else { | ||
| randomValue.set(randomAlphaOfLength(10)); | ||
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this entirely, but to me it seems the test will only add one value for the leaf field thats going to be requested and tested for here? Asking because its very common that there is more than one entry in a nested field, e.g. in a very simple case
"fields": {
"group" : ["fans"],
"user": [{
"first": ["John"],
"last": ["Smith"],
},
{
"first": ["Alice"],
"last": ["White"],
}
]
}
where user is the nested field, and you want to retrieve user.first. I'm not sure this case is tested for, but I expect it to be common. Maybe less common but still relevant would be something like
"fields": {
"group" : ["fans"],
"user": [{
"first": ["John", "Jim", "James"],
"last": ["Smith"],
},
{
"first": ["Alice"],
"last": ["White"],
}
]
}
where I would expect probably expect user.first to extract a list or array of four values.
Maybe this is done in AbstractFieldHitExtractor#unwrapFieldsMultiValue but I think it would be great to have this explicitely covered by tests as well.
palesz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor comments on top of the already existing reviews.
| Map<String, Object> elements = (Map<String, Object>) values.get(0); | ||
| values = (List<Object>) elements.get(pathElement); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: if you use Map<String, List<Object>> elements, you won't have to cast on elements.get()
| * "leaf_field" : [ | ||
| * "abc2" | ||
| * ] | ||
| * So, start re-building the path until the right one is found, ie inner_root_1.inner_root_2...... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you were looking for the DocumentField above you were going from longest to shortest path. Here you go the other way around. Any reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SearchHit structure for a field (as in asking for SearchHit.field(fieldName)) looks like a series of Lists and Maps: List of Maps where maps have as key a string (the name of a field/subfield) and the value can be another set of Lists of Maps.
If we have to look for a field called a.b.c.d.e (this is represented as AbstractFieldHitExtractor.fieldName) and this field sits under a nested data structure (and implicitly has AbstractFieldHitExtractor.hitName != null) with hitName = a.b.c, the algorithm searches for a.b.c (SearchHit.field("a.b.c")), if this is null then we search for SearchHit.field("a.b") and if still not found we search for SearchHit.field("a"). When we find a DocumentField, we take its values and we start searching in its List of Maps for the rest of the field.
If we found a DocumentField for a then we still need to search for the rest of the field name path, which now is b.c.d.e. So, we look in values for b, then for b.c and so on until we find an element. If we find b.c then the rest of the path becomes d.e and we repeat the process.
| if (randomBoolean()) { | ||
| randomValue.set(null); | ||
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null"); | ||
| } else { | ||
| randomValue.set(randomAlphaOfLength(10)); | ||
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\""); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (randomBoolean()) { | |
| randomValue.set(null); | |
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null"); | |
| } else { | |
| randomValue.set(randomAlphaOfLength(10)); | |
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\""); | |
| } | |
| randomValue.set(randomBoolean() ? null : randomAlphaOfLength(10)); | |
| bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is tiny difference between my approach and yours: when setting a null value in json the json looks like this "field_name": null. Your suggestion leads to a json looking like "field_name": "null" which is the string null, not a null value. I know the if else approach is a more verbose one , but otherwise the string being built inside bulkContent would have looked messy and hard to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, missed that. Fair point.
costin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - it's unfortunate the reading of nested docs is messy and there are no built-in utilities that we can leverage and instead need to reconstruct the hierarchy ourselves through strings which are error prone.
| value = unwrapFieldsMultiValue(((Map<String, Object>) values.get(0)).get(fieldName.substring(hitName.length() + 1))); | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the outer code also calls unwrapFieldMultiValue, how about returning its argument instead:
value = ((Map<String, Object>) values.get(0)).get(fieldName.substring(hitName.length() + 1));
| if (field != null) { | ||
| value = unwrapFieldsMultiValue(field.getValues()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be common for both branches (needs changes to extractNestedField) to make it more consistent. Something like:
field = hitName != null ? extractNestedField(hit) : hit.field(fieldName);
if (field != null) {
value = unwrapFieldsMultiValue(fields.getValue());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractNestedField cannot consistently return a DocumentField instance so that we can call allways field.getValues() in extract(SearchHit) method. The data structure inside a DocumentField is more complex in the case of nested fields and it requires further exploration before obtaining an equivalent of field.getValues() (see https://github.com/elastic/elasticsearch/pull/69368/files#diff-099b1feb3ed943ecc359be919a1dd67423938b9b8252e43c9ddd0a72027c1c87R120-R143).
| List<String> remainingPath = new ArrayList<>(); | ||
| // first, search for the "root" DocumentField under which the remaining path of nested document values is | ||
| while((field = hit.field(tempHitname)) == null) { | ||
| int indexOfDot = tempHitname.lastIndexOf("."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just{"h": [{"f":{"b": {"a": "true"}}}]}and querying withSELECT h.f.o FROM testseems to trigger:Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1 at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734) at java.base/java.lang.String.substring(String.java:1903) at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113) at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86) at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129) at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32) at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38) at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19) at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132) at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125) at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97) at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)
We should prevent the error from occurring - looking at the stacktrace this seems to have been addressed yet I'm mentioning to double check this is indeed the case.
…o temp_nested_fields_bugfix
bpintea
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
matriv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you!
* Handle a tree of nested fields or a nested field inside a non-nested field type (cherry picked from commit 28b64c6)
* Handle a tree of nested fields or a nested field inside a non-nested field type (cherry picked from commit 28b64c6)
After the changes in #67432 and #68802, the way nested fields are extracted from the ES query results changed significantly. This PR fixes the more complex multi-nested mappings fields extraction.
Fixes #69175.