Use the bulk SimScorer#score API to compute impact scores. #15151

jpountz · 2025-09-03T16:11:52Z

In #15039 we introduced a bulk SimScorer#score API and used it to compute scores with the leading conjunctive clause and "essential" clauses of disjunctive queries. With this PR, we are now also using this bulk API when translating (term frequency, length normalization factor) pairs into the maximum possible score that a block of postings may produce.

To do it right, I had to change the impacts API to no longer return a List of (term freq, norm) pairs, but instead two parallel arrays of term frequencies and norms that could (almost) directly be passed to the SimScorer#score bulk API. Unfortunately this makes the change quite big since many backward formats had to be touched.
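
To make the shape of the change concrete, here is a minimal self-contained sketch rather than the actual Lucene code. The `Scorer` interface, the `Impact` class, and both `maxScore…` helpers are hypothetical stand-ins. The bulk method signature (a size plus parallel `float[]` freqs, `long[]` norms and an output `float[]`) is an assumption for illustration. The int-to-float copy of the frequencies is one guess at why the arrays can only "almost" be passed directly.

```java
import java.util.List;

public class ImpactScoreSketch {

  /** Hypothetical stand-in for a similarity scorer; NOT Lucene's actual SimScorer. */
  interface Scorer {
    float score(float freq, long norm);

    /** Assumed bulk variant: fills scores[i] for every i in [0, size). */
    default void score(int size, float[] freqs, long[] norms, float[] scores) {
      for (int i = 0; i < size; i++) {
        scores[i] = score(freqs[i], norms[i]);
      }
    }
  }

  /** Old shape (hypothetical): one object per (term freq, norm) pair. */
  static final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }
  }

  /** Old style: one scorer call per pair in the list. */
  static float maxScoreFromList(Scorer scorer, List<Impact> impacts) {
    float max = 0f;
    for (Impact impact : impacts) {
      max = Math.max(max, scorer.score(impact.freq, impact.norm));
    }
    return max;
  }

  /** New style: parallel arrays handed to the bulk method in a single call. */
  static float maxScoreFromArrays(Scorer scorer, int size, int[] freqs, long[] norms) {
    // Assumption: impacts store int frequencies while the bulk method takes floats.
    float[] floatFreqs = new float[size];
    for (int i = 0; i < size; i++) {
      floatFreqs[i] = freqs[i];
    }
    float[] scores = new float[size];
    scorer.score(size, floatFreqs, norms, scores);
    float max = 0f;
    for (int i = 0; i < size; i++) {
      max = Math.max(max, scores[i]);
    }
    return max;
  }

  public static void main(String[] args) {
    // Toy scoring function for the demo only; it is not BM25 or any real similarity.
    Scorer toy = (freq, norm) -> freq / (freq + norm);
    List<Impact> asList = List.of(new Impact(1, 1), new Impact(3, 1), new Impact(2, 4));
    System.out.println(maxScoreFromList(toy, asList));
    System.out.println(maxScoreFromArrays(toy, 3, new int[] {1, 3, 2}, new long[] {1, 1, 4}));
  }
}
```

Both helpers return the same maximum. The array form replaces many small scorer calls with a single bulk call, which leaves more room for a similarity to score the whole block in a tight loop.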

github-actions · 2025-09-03T16:12:46Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

jpountz · 2025-09-03T17:42:59Z

wikibigall on my machine gives the following results:

                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
             FilteredOrStopWords       44.09      (2.0%)       43.68      (1.2%)   -0.9% (  -4% -    2%) 0.070
                 FilteredPrefix3      147.21      (2.1%)      146.02      (2.0%)   -0.8% (  -4% -    3%) 0.208
                   TermTitleSort       84.02      (3.4%)       83.45      (3.7%)   -0.7% (  -7% -    6%) 0.543
              CombinedOrHighHigh       21.82      (2.9%)       21.67      (3.2%)   -0.7% (  -6% -    5%) 0.477
              FilteredOrHighHigh       64.80      (1.9%)       64.38      (1.0%)   -0.6% (  -3% -    2%) 0.170
                     CountPhrase        4.16      (1.8%)        4.13      (1.7%)   -0.6% (  -4% -    2%) 0.269
                    CombinedTerm       37.20      (2.0%)       37.03      (2.6%)   -0.5% (  -4% -    4%) 0.533
               CombinedOrHighMed       82.29      (2.7%)       81.99      (3.3%)   -0.4% (  -6% -    5%) 0.704
                      TermDTSort      383.61      (2.3%)      382.24      (2.7%)   -0.4% (  -5% -    4%) 0.651
      FilteredOr2Terms2StopWords      142.98      (1.4%)      142.64      (0.7%)   -0.2% (  -2% -    1%) 0.498
                   TermMonthSort     3268.83      (2.6%)     3263.88      (2.8%)   -0.2% (  -5% -    5%) 0.860
                FilteredOr3Terms      161.33      (1.1%)      161.11      (0.8%)   -0.1% (  -2% -    1%) 0.659
                     CountOrMany       28.41      (1.9%)       28.39      (3.2%)   -0.1% (  -5% -    5%) 0.936
                  FilteredIntNRQ      296.73      (1.0%)      296.54      (1.2%)   -0.1% (  -2% -    2%) 0.855
         CountFilteredOrHighHigh      134.69      (1.0%)      134.61      (1.7%)   -0.1% (  -2% -    2%) 0.901
                 CountAndHighMed      297.37      (1.3%)      297.31      (1.2%)   -0.0% (  -2% -    2%) 0.958
                CountAndHighHigh      350.18      (2.0%)      350.11      (3.8%)   -0.0% (  -5% -    5%) 0.983
               TermDayOfYearSort      282.87      (2.0%)      282.81      (1.4%)   -0.0% (  -3% -    3%) 0.974
             CountFilteredOrMany       26.60      (2.0%)       26.60      (3.1%)   -0.0% (  -4% -    5%) 0.999
                 AndHighOrMedMed       48.78      (1.3%)       48.79      (1.2%)    0.0% (  -2% -    2%) 0.978
          CountFilteredOrHighMed      146.46      (0.9%)      146.48      (1.2%)    0.0% (  -2% -    2%) 0.966
                    FilteredTerm      153.98      (2.2%)      154.04      (1.5%)    0.0% (  -3% -    3%) 0.941
               FilteredOrHighMed      148.23      (1.3%)      148.37      (0.8%)    0.1% (  -1% -    2%) 0.777
                      OrHighRare      281.29      (8.3%)      281.70      (3.8%)    0.1% ( -10% -   13%) 0.943
                  FilteredOrMany       15.81      (2.0%)       15.83      (1.6%)    0.1% (  -3% -    3%) 0.799
                 CountOrHighHigh      332.30      (2.1%)      332.99      (4.0%)    0.2% (  -5% -    6%) 0.838
             CountFilteredPhrase       24.86      (2.4%)       24.98      (1.3%)    0.5% (  -3% -    4%) 0.417
             FilteredAndHighHigh       77.70      (1.8%)       78.09      (1.2%)    0.5% (  -2% -    3%) 0.293
                          OrMany       22.44      (4.3%)       22.55      (3.8%)    0.5% (  -7% -    8%) 0.682
             CombinedAndHighHigh       22.19      (1.3%)       22.32      (1.6%)    0.6% (  -2% -    3%) 0.204
                  CountOrHighMed      339.36      (2.4%)      341.52      (1.9%)    0.6% (  -3% -    4%) 0.346
     FilteredAnd2Terms2StopWords      212.32      (1.9%)      213.73      (1.4%)    0.7% (  -2% -    3%) 0.196
              CombinedAndHighMed       84.39      (1.2%)       85.02      (1.6%)    0.7% (  -2% -    3%) 0.091
               FilteredAnd3Terms      185.09      (2.9%)      186.51      (1.9%)    0.8% (  -3% -    5%) 0.321
                  FilteredPhrase       31.03      (1.5%)       31.27      (1.1%)    0.8% (  -1% -    3%) 0.059
            FilteredAndStopWords       64.07      (2.0%)       64.58      (1.2%)    0.8% (  -2% -    4%) 0.120
                        Or3Terms      222.41      (6.2%)      224.23      (5.2%)    0.8% ( -10% -   13%) 0.653
                       CountTerm     8641.68      (4.1%)     8727.93      (4.3%)    1.0% (  -7% -    9%) 0.449
              Or2Terms2StopWords      197.83      (5.0%)      199.85      (4.5%)    1.0% (  -8% -   11%) 0.496
                       OrHighMed      243.82      (7.2%)      246.48      (6.0%)    1.1% ( -11% -   15%) 0.602
              FilteredAndHighMed      151.68      (3.8%)      153.36      (2.5%)    1.1% (  -5% -    7%) 0.279
                AndMedOrHighHigh       80.46      (5.0%)       81.47      (4.1%)    1.3% (  -7% -   10%) 0.384
                     OrStopWords       46.75      (8.1%)       47.37      (6.8%)    1.3% ( -12% -   17%) 0.575
                      OrHighHigh       73.81      (9.1%)       74.95      (7.5%)    1.5% ( -13% -   19%) 0.558
             And2Terms2StopWords      196.07      (6.2%)      199.96      (4.1%)    2.0% (  -7% -   13%) 0.235
                       And3Terms      228.82      (7.7%)      233.77      (5.2%)    2.2% (  -9% -   16%) 0.297
                    AndStopWords       44.61      (9.5%)       45.68      (6.4%)    2.4% ( -12% -   20%) 0.348
                            Term      617.63      (8.3%)      636.08      (5.3%)    3.0% (  -9% -   18%) 0.176
                      AndHighMed      188.71     (10.7%)      194.64      (7.2%)    3.1% ( -13% -   23%) 0.275
                     AndHighHigh       64.17     (11.8%)       66.29      (8.0%)    3.3% ( -14% -   26%) 0.300

p-values are high because of fairly high run-to-run variance, but the queries that we would expect to speed up sit at the bottom of the table, where the Pct diff is largest, so this change may give a small speedup in practice.

gf2121

This looks like the right direction to me, though the improvement does not seem very significant. Thank you!

lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatcher.java

lucene/core/src/test/org/apache/lucene/search/TestPhraseQuery.java

lucene/core/src/test/org/apache/lucene/search/TestSynonymQuery.java

github-actions bot added module:core/index module:core/search module:core/codecs module:test-framework labels Sep 3, 2025

CHANGES

e8fbad7

github-actions bot added this to the 10.4.0 milestone Sep 3, 2025

jpountz requested a review from gf2121 September 3, 2025 16:16

gf2121 approved these changes Sep 7, 2025

jpountz and others added 5 commits September 7, 2025 20:50

Update lucene/core/src/java/org/apache/lucene/search/SloppyPhraseMatc…

6c93fb7

…her.java Co-authored-by: Guo Feng <[email protected]>

Update lucene/core/src/test/org/apache/lucene/search/TestPhraseQuery.…

2eb6737

…java Co-authored-by: Guo Feng <[email protected]>

Update lucene/core/src/test/org/apache/lucene/search/TestSynonymQuery…

41d9c9d

….java Co-authored-by: Guo Feng <[email protected]>

compilation

eb16871

Merge branch 'main' into vectorize_impact_score_computation

f494b6b

jpountz merged commit 2dcfd89 into apache:main Sep 8, 2025
8 checks passed

iverase mentioned this pull request Sep 30, 2025

[10.3] Fix returned Impacts when frequencies are not indexed #15263

Merged
