Skip to content

Conversation

@dnhatn
Copy link
Member

@dnhatn dnhatn commented Oct 17, 2021

We introduced invalid accesses for sorted set doc values in LUCENE-9613. However, the issue has been unnoticed because the ordinals in doc values tests aren't complex enough to use high packed bits, and the 3 padding bytes make these invalid accesses perfectly fine. To reproduce this issue, we need to use at least 20 bits per value for the ordinals.

The Machine Learning team at Elastic uses a complex doc values setup and uncovers this bug with a Elasticsearch 8.0 nightly build. Fortunately, this is an unreleased bug in both Lucene and Elasticsearch.

count = ords.docValueCount();
} else {
count = 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if it is possible for us to write an extra trailing 0 for this case? Then we could avoid the conditional on every next/advance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rmuir. Yes, it should be okay if we write 8 extra trailing 0 (for the worst case where we use DirectPackedReader64 to store ordinals). WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: It requires 11 extra trailing bytes. The worst case is DirectPackedReader40 where we need 3 extra bytes for the last value and 8 extra bytes for a dummy value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmuir But we still need to make sure we don't call docValueCount() when the iterator is not positioned since this is illegal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz that isn't the problem. the problem is that we call docValueCount "eagerly" ourselves inside of next/advance and cache it. I don't know why that is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we need to change this. We could use the same approach that we use in getSortedNumeric by having a boolean set flag to only read values if they are actually needed.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 18, 2021

@rmuir I pushed 4ee7a6b to write more trailing bytes. Would you mind taking another look? Thank you!

@dnhatn dnhatn requested a review from rmuir October 18, 2021 01:40
@rmuir
Copy link
Member

rmuir commented Oct 18, 2021

Hmm, sorry if I'm being overly nitpicky, but it doesn't seem good to mix this concern into DirectWriter? It has nothing to do with DirectWriter. Seems like writing the extra 0 should be instead done in Lucene90DocValuesConsumer?

@rmuir
Copy link
Member

rmuir commented Oct 18, 2021

Also, given that writing a sortedset is really just a sortednumeric, isn't sortednumeric impacted by the bug too? would it easier to test for this bug with a sortednumeric unit test? The unit test is quite complicated.

@jpountz
Copy link
Contributor

jpountz commented Oct 18, 2021

I think the test is useful to double check that the change is fixing the issue but I'm not sure if we should merge it: it wouldn't be manageable to have tests at the doc-values format test case's level for every possible number of bits per ord. This invalid access issue is something that I would have expected AssertingDocValuesFormat to catch, maybe we should review the assertions we have there around methods that shouldn't be called when iterators are not positioned?

@rmuir
Copy link
Member

rmuir commented Oct 18, 2021

Seems a lot easier to just test this via SORTED_NUMERIC. There isn't any point in testing SORTED_SET as it only makes things more difficult.

The test should fail without the fix and pass with the fix, so there is no "use" to be had otherwise.

As far as Asserting, maybe you misunderstand the issue. It does not happen via the public API, so such a class cannot catch it? It happens via implementation detail:

NumericDocValues ords = ... // fetch internal 
return new SortedNumericDocvalues() { ... } // buggy delegator

Why do we have all this delegation?
If we truly must delegate, why not delegate docValueCount too?
I don't understand why we eagerly invoke it in next/advance and cache it.

@jpountz
Copy link
Contributor

jpountz commented Oct 18, 2021

Why do we have all this delegation?

It was a way to have sorted set doc values automatically benefit from the specialization of sorted numeric doc values without duplicating all the specialized cases.

If we truly must delegate, why not delegate docValueCount too?

Sorted set doc values don't have a docValueCount API, they're just expected to return NO_MORE_ORDS when all ords have been exhausted.

I don't understand why we eagerly invoke it in next/advance and cache it.

+1 to change this. This would fix the bug, and probably also help a bit when the iterator is used to iterate over docs that have a value while never reading the values, such as with DocValueFieldExistsQuery.

@jpountz
Copy link
Contributor

jpountz commented Oct 18, 2021

As far as Asserting, maybe you misunderstand the issue

Sorry you are right indeed, AssertingDVFormat cannot help here.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 18, 2021

Seems a lot easier to just test this via SORTED_NUMERIC. There isn't any point in testing SORTED_SET as it only makes things more difficult.

@rmuir @jpountz Thanks for your reviews and discussion. I think the issue only affects the sorted_set dv. I've simplified the unit test and applied the solution that you both discussed. Can you give it another look.

@rmuir
Copy link
Member

rmuir commented Oct 18, 2021

Sorted set doc values don't have a docValueCount API, they're just expected to return NO_MORE_ORDS when all ords have been exhausted.

Thanks, sorry I had completely forgotten that, and that's the inconsistency that is root cause of the trouble here (padding/alignment that hid the bug didn't help). SortedSet was added first, and not having a count method may not have been the best decision. I am not sure it is even slightly helpful to save space if you want to implement as a vint-list, because you still need to store "some kind of length" to have per-document random access.

With the SortedNumeric, there is no available sentinel value that can be used (without boxing or something nasty), so we had to do a count method.

Maybe it is worth a second thought, if the SortedSet could get a count method to be more consistent and efficient like the numeric one. It would have costs (e.g. we'd need to hard-break the api in a way that it isnt trappy on users), but it would also have benefits: e.g. none of this state-keeping inside the codec, instead based on a more natural loop that happens outside of the codec code. Then AssertingCodec would really detect issues, maybe the compiler can do a better job with it, etc.

@dnhatn
Copy link
Member Author

dnhatn commented Oct 19, 2021

@rmuir @jpountz Thanks for reviewing.

@dnhatn dnhatn merged commit 8b68bf6 into apache:main Oct 19, 2021
@dnhatn dnhatn deleted the LUCENE-10159 branch October 19, 2021 12:00
@jpountz
Copy link
Contributor

jpountz commented Oct 19, 2021

@rmuir Agreed with your thoughts. I wonder what you would think of reducing the padding to the strict minimum depending on the number of bits per value. This would make it harder to change how we read values in the future, but this would also make it more likely to detect out-of-bounds access in the future as we could also detect this on low numbers of bits per value.

@rmuir
Copy link
Member

rmuir commented Oct 19, 2021

+1 to that idea as a followup issue too. it is really bad that it masked the bug here! Thanks @dnhatn for all the debugging and the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants