Skip to content

Conversation

@astefan
Copy link
Contributor

@astefan astefan commented Feb 22, 2021

After the changes in #67432 and #68802, the way nested fields are extracted from the ES query results changed significantly. This PR fixes the more complex multi-nested mappings fields extraction.

Fixes #69175.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@astefan
Copy link
Contributor Author

astefan commented Feb 22, 2021

@elasticmachine update branch

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of comments.

}

/*
* For a tree of fields like root.nested1.nested2.leaf where nested1 and nested2 are nested field types,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would "a path of fields" make more sense? (since I guess the doc isn't parsed/accessed as a tree?)

String tempHitname = hitName;
List<String> remainingPath = new ArrayList<>();
// first, search for the "root" DocumentField under which the remaining path of nested document values is
while((field = hit.field(tempHitname)) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spacing after while.

List<String> remainingPath = new ArrayList<>();
// first, search for the "root" DocumentField under which the remaining path of nested document values is
while((field = hit.field(tempHitname)) == null) {
int indexOfDot = tempHitname.lastIndexOf(".");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just {"h": [{"f":{"b": {"a": "true"}}}]} and querying with SELECT h.f.o FROM test seems to trigger:

Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734)
	at java.base/java.lang.String.substring(String.java:1903)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32)
	at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38)
	at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19)
	at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132)
	at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)

Copy link
Contributor

@bpintea bpintea Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioning it here, but happy to open an issue if better (or just drop it, if wrong): selecting h.f.b.a will still return nothing.
Reason being that the inner_hits response from ES comes with with the entries listed as h.f_12 (for example), while we're looking for the inner hit h.f.b in SearchHitRowSet. Seems like this should be doable to work, but didn't look deeper.
This is not new, this didn't work before either (although no error message is returned, as mentioned in the limitations).

Copy link
Contributor Author

@astefan astefan Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! That's an older problem - SELECT h.f.b.a where b.a is a hierarchy of simple fields inside another hierarchy of nested fields -, for sure. It's the complexity of the nested documents definition possibilities that shows that our algorithm isn't covering all of them. I'd be inclined in dealing with this bug in a subsequent PR as it doesn't look straightforward and it may need a bit of rethinking the way we deal with nested fields hierarchies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just {"h": [{"f":{"b": {"a": "true"}}}]} and querying with SELECT h.f.o FROM test seems to trigger:

Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734)
	at java.base/java.lang.String.substring(String.java:1903)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32)
	at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38)
	at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19)
	at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132)
	at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)

We should prevent the error from occurring - looking at the stacktrace this seems to have been addressed yet I'm mentioning to double check this is indeed the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bpintea this change should fix the h.f.b.a returning nothing example you mentioned. Test for this exact case here.

if (remainingPath.size() > 0) {
List<Object> values = field.getValues();
Iterator<String> pathIterator = remainingPath.iterator();
while(pathIterator.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spacing.

Iterator<String> pathIterator = remainingPath.iterator();
while(pathIterator.hasNext()) {
String pathElement = pathIterator.next();
Map<String, Object> elements = (Map<String, Object>) values.get(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here (and when calling unwrapFieldsMultiValue below), only a first value is selected. Could the looked for path not be accessed through a subsequent element?

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the great randomised test, Imho we should also have some plain tests that target specific cases, as for example those mentioned already by @bpintea

@cbuescher cbuescher self-requested a review February 24, 2021 09:27
@astefan
Copy link
Contributor Author

astefan commented Feb 25, 2021

@matriv thank you for the review. I am not sure how useful it is to have a specific test for that scenario. There could be many more scenarios like that one, ones that we didn't even think about, just because the combination of mappings possible is so complex. While we do have specific tests for specific scenarios in FieldExtractorTestCase, for nested fields the scenarios are not so few also not so simple and I'm reluctant to add a specific case that couldn't be covered by a randomization algorithm.

@bpintea spotted one scenario where the value of the requested field is null and I've adjusted the randomized test to account for that. He also noticed another scenario that didn't ever work, where a tree of mappings is nested inside a tree of nested fields. Since the fix is not obvious for the moment, I'd suggest dealing with that fix in a separate GH issue where tests should be added as well. If a specific test is proven to be useful in that case vs. a randomized one, I'd let this to be decided then.

Copy link
Member

@cbuescher cbuescher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I didn't go through all the details of the extraction implementation yet, its a bit different to what we do on the fields side of things where we deal with the _source map. I left a question around testing, I tend to agree with @matriv that in addition to the randomized testing it would be great to have a few common cases spelled out explicitely without too much randomness to get a basic coverage of those. But I don't know how well this fits with the general testing philosophy and infra in ql, so I leave this only as a remark.

}
index.endObject();
// from time to time set a null value instead of an actual value
if (randomBoolean()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use null less often than 50% of the times in randomization here, it should be the exception I think. Using ' rarely()` this case would still get picked 1% of the time, should be enough?

bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null");
} else {
randomValue.set(randomAlphaOfLength(10));
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\"");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this entirely, but to me it seems the test will only add one value for the leaf field thats going to be requested and tested for here? Asking because its very common that there is more than one entry in a nested field, e.g. in a very simple case

     "fields": {
        "group" : ["fans"],
        "user": [{
            "first": ["John"],
            "last": ["Smith"],
          },
          {
            "first": ["Alice"],
            "last": ["White"],
          }
        ]
      }

where user is the nested field, and you want to retrieve user.first. I'm not sure this case is tested for, but I expect it to be common. Maybe less common but still relevant would be something like

     "fields": {
        "group" : ["fans"],
        "user": [{
            "first": ["John", "Jim", "James"],
            "last": ["Smith"],
          },
          {
            "first": ["Alice"],
            "last": ["White"],
          }
        ]
      }

where I would expect probably expect user.first to extract a list or array of four values.
Maybe this is done in AbstractFieldHitExtractor#unwrapFieldsMultiValue but I think it would be great to have this explicitely covered by tests as well.

Copy link
Contributor

@palesz palesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor comments on top of the already existing reviews.

Comment on lines 121 to 122
Map<String, Object> elements = (Map<String, Object>) values.get(0);
values = (List<Object>) elements.get(pathElement);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: if you use Map<String, List<Object>> elements, you won't have to cast on elements.get()

* "leaf_field" : [
* "abc2"
* ]
* So, start re-building the path until the right one is found, ie inner_root_1.inner_root_2......
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you were looking for the DocumentField above you were going from longest to shortest path. Here you go the other way around. Any reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SearchHit structure for a field (as in asking for SearchHit.field(fieldName)) looks like a series of Lists and Maps: List of Maps where maps have as key a string (the name of a field/subfield) and the value can be another set of Lists of Maps.

If we have to look for a field called a.b.c.d.e (this is represented as AbstractFieldHitExtractor.fieldName) and this field sits under a nested data structure (and implicitly has AbstractFieldHitExtractor.hitName != null) with hitName = a.b.c, the algorithm searches for a.b.c (SearchHit.field("a.b.c")), if this is null then we search for SearchHit.field("a.b") and if still not found we search for SearchHit.field("a"). When we find a DocumentField, we take its values and we start searching in its List of Maps for the rest of the field.
If we found a DocumentField for a then we still need to search for the rest of the field name path, which now is b.c.d.e. So, we look in values for b, then for b.c and so on until we find an element. If we find b.c then the rest of the path becomes d.e and we repeat the process.

Comment on lines 1173 to 1179
if (randomBoolean()) {
randomValue.set(null);
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null");
} else {
randomValue.set(randomAlphaOfLength(10));
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\"");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (randomBoolean()) {
randomValue.set(null);
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":null");
} else {
randomValue.set(randomAlphaOfLength(10));
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\"");
}
randomValue.set(randomBoolean() ? null : randomAlphaOfLength(10));
bulkContent.append("\"" + fieldName + "\":{\"" + leafFieldName + "\":\"" + randomValue.get() + "\"");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is tiny difference between my approach and yours: when setting a null value in json the json looks like this "field_name": null. Your suggestion leads to a json looking like "field_name": "null" which is the string null, not a null value. I know the if else approach is a more verbose one , but otherwise the string being built inside bulkContent would have looked messy and hard to follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, missed that. Fair point.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - it's unfortunate the reading of nested docs is messy and there are no built-in utilities that we can leverage and instead need to reconstruct the hierarchy ourselves through strings which are error prone.

Comment on lines 143 to 144
value = unwrapFieldsMultiValue(((Map<String, Object>) values.get(0)).get(fieldName.substring(hitName.length() + 1)));
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the outer code also calls unwrapFieldMultiValue, how about returning its argument instead:

value = ((Map<String, Object>) values.get(0)).get(fieldName.substring(hitName.length() + 1));

Comment on lines +89 to +91
if (field != null) {
value = unwrapFieldsMultiValue(field.getValues());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be common for both branches (needs changes to extractNestedField) to make it more consistent. Something like:

field = hitName != null  ? extractNestedField(hit) : hit.field(fieldName);
if (field != null) {
    value = unwrapFieldsMultiValue(fields.getValue());
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractNestedField cannot consistently return a DocumentField instance so that we can call allways field.getValues() in extract(SearchHit) method. The data structure inside a DocumentField is more complex in the case of nested fields and it requires further exploration before obtaining an equivalent of field.getValues() (see https://github.com/elastic/elasticsearch/pull/69368/files#diff-099b1feb3ed943ecc359be919a1dd67423938b9b8252e43c9ddd0a72027c1c87R120-R143).

List<String> remainingPath = new ArrayList<>();
// first, search for the "root" DocumentField under which the remaining path of nested document values is
while((field = hit.field(tempHitname)) == null) {
int indexOfDot = tempHitname.lastIndexOf(".");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to return a non-negative?
Actually, with a mapping similar to the exemplified one in the test's comment (but only using the first lowered letter of the randomly generated nodes), indexing just {"h": [{"f":{"b": {"a": "true"}}}]} and querying with SELECT h.f.o FROM test seems to trigger:

Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 1
	at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3734)
	at java.base/java.lang.String.substring(String.java:1903)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extractNestedField(AbstractFieldHitExtractor.java:113)
	at org.elasticsearch.xpack.ql.execution.search.extractor.AbstractFieldHitExtractor.extract(AbstractFieldHitExtractor.java:86)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:129)
	at org.elasticsearch.xpack.sql.execution.search.SearchHitRowSet.extractValue(SearchHitRowSet.java:32)
	at org.elasticsearch.xpack.sql.execution.search.ResultRowSet.getColumn(ResultRowSet.java:38)
	at org.elasticsearch.xpack.sql.session.AbstractRowSet.column(AbstractRowSet.java:19)
	at org.elasticsearch.xpack.sql.session.RowView.forEachColumn(RowView.java:39)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$createResponse$5(TransportSqlQueryAction.java:132)
	at org.elasticsearch.xpack.sql.session.RowSet.forEachRow(RowSet.java:28)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponse(TransportSqlQueryAction.java:130)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.createResponseWithSchema(TransportSqlQueryAction.java:125)
	at org.elasticsearch.xpack.sql.plugin.TransportSqlQueryAction.lambda$operation$0(TransportSqlQueryAction.java:97)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:139)

We should prevent the error from occurring - looking at the stacktrace this seems to have been addressed yet I'm mentioning to double check this is indeed the case.

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you!

@astefan astefan merged commit 28b64c6 into elastic:master Mar 15, 2021
@astefan astefan deleted the temp_nested_fields_bugfix branch March 15, 2021 13:29
astefan added a commit to astefan/elasticsearch that referenced this pull request Mar 15, 2021
* Handle a tree of nested fields or a nested field inside a non-nested
field type

(cherry picked from commit 28b64c6)
astefan added a commit to astefan/elasticsearch that referenced this pull request Mar 15, 2021
* Handle a tree of nested fields or a nested field inside a non-nested
field type

(cherry picked from commit 28b64c6)
astefan added a commit that referenced this pull request Mar 15, 2021
* Handle a tree of nested fields or a nested field inside a non-nested
field type

(cherry picked from commit 28b64c6)
astefan added a commit that referenced this pull request Mar 15, 2021
* Handle a tree of nested fields or a nested field inside a non-nested
field type

(cherry picked from commit 28b64c6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQL: double nested fields return null values

8 participants