LUCENE-10159: Fix invalid access in sorted set dv #389

dnhatn · 2021-10-17T03:29:37Z

We introduced invalid accesses for sorted set doc values in LUCENE-9613. However, the issue has been unnoticed because the ordinals in doc values tests aren't complex enough to use high packed bits, and the 3 padding bytes make these invalid accesses perfectly fine. To reproduce this issue, we need to use at least 20 bits per value for the ordinals.

The Machine Learning team at Elastic uses a complex doc values setup and uncovers this bug with a Elasticsearch 8.0 nightly build. Fortunately, this is an unreleased bug in both Lucene and Elasticsearch.

rmuir · 2021-10-17T17:47:20Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

+          count = ords.docValueCount();
+        } else {
+          count = 0;
+        }


do you know if it is possible for us to write an extra trailing 0 for this case? Then we could avoid the conditional on every next/advance?

Thanks @rmuir. Yes, it should be okay if we write 8 extra trailing 0 (for the worst case where we use DirectPackedReader64 to store ordinals). WDYT?

Correction: It requires 11 extra trailing bytes. The worst case is DirectPackedReader40 where we need 3 extra bytes for the last value and 8 extra bytes for a dummy value.

@rmuir But we still need to make sure we don't call docValueCount() when the iterator is not positioned since this is illegal?

@jpountz that isn't the problem. the problem is that we call docValueCount "eagerly" ourselves inside of next/advance and cache it. I don't know why that is.

I agree we need to change this. We could use the same approach that we use in getSortedNumeric by having a boolean set flag to only read values if they are actually needed.

dnhatn · 2021-10-18T01:40:10Z

@rmuir I pushed 4ee7a6b to write more trailing bytes. Would you mind taking another look? Thank you!

rmuir · 2021-10-18T03:04:19Z

Hmm, sorry if I'm being overly nitpicky, but it doesn't seem good to mix this concern into DirectWriter? It has nothing to do with DirectWriter. Seems like writing the extra 0 should be instead done in Lucene90DocValuesConsumer?

rmuir · 2021-10-18T03:08:35Z

Also, given that writing a sortedset is really just a sortednumeric, isn't sortednumeric impacted by the bug too? would it easier to test for this bug with a sortednumeric unit test? The unit test is quite complicated.

jpountz · 2021-10-18T05:20:29Z

I think the test is useful to double check that the change is fixing the issue but I'm not sure if we should merge it: it wouldn't be manageable to have tests at the doc-values format test case's level for every possible number of bits per ord. This invalid access issue is something that I would have expected AssertingDocValuesFormat to catch, maybe we should review the assertions we have there around methods that shouldn't be called when iterators are not positioned?

rmuir · 2021-10-18T09:34:44Z

Seems a lot easier to just test this via SORTED_NUMERIC. There isn't any point in testing SORTED_SET as it only makes things more difficult.

The test should fail without the fix and pass with the fix, so there is no "use" to be had otherwise.

As far as Asserting, maybe you misunderstand the issue. It does not happen via the public API, so such a class cannot catch it? It happens via implementation detail:

NumericDocValues ords = ... // fetch internal 
return new SortedNumericDocvalues() { ... } // buggy delegator

Why do we have all this delegation?
If we truly must delegate, why not delegate docValueCount too?
I don't understand why we eagerly invoke it in next/advance and cache it.

jpountz · 2021-10-18T09:44:48Z

Why do we have all this delegation?

It was a way to have sorted set doc values automatically benefit from the specialization of sorted numeric doc values without duplicating all the specialized cases.

If we truly must delegate, why not delegate docValueCount too?

Sorted set doc values don't have a docValueCount API, they're just expected to return NO_MORE_ORDS when all ords have been exhausted.

I don't understand why we eagerly invoke it in next/advance and cache it.

+1 to change this. This would fix the bug, and probably also help a bit when the iterator is used to iterate over docs that have a value while never reading the values, such as with DocValueFieldExistsQuery.

jpountz · 2021-10-18T11:58:37Z

As far as Asserting, maybe you misunderstand the issue

Sorry you are right indeed, AssertingDVFormat cannot help here.

dnhatn · 2021-10-18T16:31:15Z

Seems a lot easier to just test this via SORTED_NUMERIC. There isn't any point in testing SORTED_SET as it only makes things more difficult.

@rmuir @jpountz Thanks for your reviews and discussion. I think the issue only affects the sorted_set dv. I've simplified the unit test and applied the solution that you both discussed. Can you give it another look.

rmuir · 2021-10-18T23:37:36Z

Sorted set doc values don't have a docValueCount API, they're just expected to return NO_MORE_ORDS when all ords have been exhausted.

Thanks, sorry I had completely forgotten that, and that's the inconsistency that is root cause of the trouble here (padding/alignment that hid the bug didn't help). SortedSet was added first, and not having a count method may not have been the best decision. I am not sure it is even slightly helpful to save space if you want to implement as a vint-list, because you still need to store "some kind of length" to have per-document random access.

With the SortedNumeric, there is no available sentinel value that can be used (without boxing or something nasty), so we had to do a count method.

Maybe it is worth a second thought, if the SortedSet could get a count method to be more consistent and efficient like the numeric one. It would have costs (e.g. we'd need to hard-break the api in a way that it isnt trappy on users), but it would also have benefits: e.g. none of this state-keeping inside the codec, instead based on a more natural loop that happens outside of the codec code. Then AssertingCodec would really detect issues, maybe the compiler can do a better job with it, etc.

dnhatn · 2021-10-19T11:59:17Z

@rmuir @jpountz Thanks for reviewing.

jpountz · 2021-10-19T12:41:16Z

@rmuir Agreed with your thoughts. I wonder what you would think of reducing the padding to the strict minimum depending on the number of bits per value. This would make it harder to change how we read values in the future, but this would also make it more likely to detect out-of-bounds access in the future as we could also detect this on low numbers of bits per value.

rmuir · 2021-10-19T12:43:04Z

+1 to that idea as a followup issue too. it is really bad that it masked the bug here! Thanks @dnhatn for all the debugging and the fix!

LUCENE-10159: Fix invalid access in sorted set dv

5f3520a

dnhatn requested review from jpountz and rmuir October 17, 2021 03:29

dnhatn mentioned this pull request Oct 17, 2021

Hit java.lang.IndexOutOfBoundsException in es log after recent Lucene9 snapshot upgrade elastic/elasticsearch#78785

Closed

dnhatn added 5 commits October 16, 2021 23:47

Limit IW memory usage

b83463a

less memory

ee6cfaa

make it nightly

1c2cfc2

smaller

d10c551

make it nightly again

08eccc8

rmuir reviewed Oct 17, 2021

View reviewed changes

Writing trailing bytes instead

4ee7a6b

dnhatn requested a review from rmuir October 18, 2021 01:40

Do not load docValueCount eagerly

2bf7eeb

dnhatn added 3 commits October 18, 2021 12:40

make it nightly again

d1e30f8

simpler test

777bec0

import

dd29ec3

rmuir approved these changes Oct 18, 2021

View reviewed changes

dnhatn merged commit 8b68bf6 into apache:main Oct 19, 2021

dnhatn deleted the LUCENE-10159 branch October 19, 2021 12:00

dnhatn mentioned this pull request Oct 19, 2021

Upgrade lucene version 8b68bf60c98 elastic/elasticsearch#79461

Merged

This was referenced Aug 24, 2022

Create blocks for ords when it helps in Lucene80DocValuesFormat [LUCENE-9613] #10653

Closed

Index corruption: IndexOutOfBoundsException for doc values [LUCENE-10159] #11196

Closed

LUCENE-10159: Fix invalid access in sorted set dv #389

LUCENE-10159: Fix invalid access in sorted set dv #389

Uh oh!

Conversation

dnhatn commented Oct 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rmuir Oct 17, 2021

Choose a reason for hiding this comment

Uh oh!

dnhatn Oct 17, 2021

Choose a reason for hiding this comment

Uh oh!

dnhatn Oct 17, 2021

Choose a reason for hiding this comment

Uh oh!

jpountz Oct 18, 2021

Choose a reason for hiding this comment

Uh oh!

rmuir Oct 18, 2021

Choose a reason for hiding this comment

Uh oh!

jpountz Oct 18, 2021

Choose a reason for hiding this comment

Uh oh!

dnhatn commented Oct 18, 2021

Uh oh!

rmuir commented Oct 18, 2021

Uh oh!

rmuir commented Oct 18, 2021

Uh oh!

jpountz commented Oct 18, 2021

Uh oh!

rmuir commented Oct 18, 2021

Uh oh!

jpountz commented Oct 18, 2021

Uh oh!

jpountz commented Oct 18, 2021

Uh oh!

dnhatn commented Oct 18, 2021

Uh oh!

rmuir commented Oct 18, 2021

Uh oh!

dnhatn commented Oct 19, 2021

Uh oh!

jpountz commented Oct 19, 2021

Uh oh!

rmuir commented Oct 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dnhatn commented Oct 17, 2021 •

edited

Loading