
@jpountz jpountz commented Sep 3, 2025

In #15039 we introduced a bulk `SimScorer#score` API and used it to compute scores for the leading conjunctive clause and the "essential" clauses of disjunctive queries. With this PR, we also use this bulk API when translating (term frequency, length normalization factor) pairs into the maximum possible score that a block of postings may produce.

To do this properly, I had to change the impacts API so that it no longer returns a List of (term freq, norm) pairs but instead returns two parallel arrays of term frequencies and norms, which can be passed (almost) directly to the bulk `SimScorer#score` API. Unfortunately, this makes the change quite big, since many backward-compatibility formats had to be touched.
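To make the shape of the change concrete, here is a minimal sketch of the two ways of turning impacts into a maximum score. It is not the code from this PR: `SketchScorer`, `Impact`, `maxScoreFromList`, `maxScoreFromArrays` and the toy scoring formula are all hypothetical names made up for illustration, and the bulk method signature is an assumption rather than the actual `SimScorer#score` signature.

```java
import java.util.List;

class ImpactsSketch {

  /** Hypothetical stand-in for a similarity scorer; NOT Lucene's actual SimScorer. */
  interface SketchScorer {
    /** Scores a single (term frequency, encoded norm) pair. */
    float score(float freq, long norm);

    /** Assumed bulk variant: scores the first {@code size} pairs into {@code scores}. */
    default void score(int size, int[] freqs, long[] norms, float[] scores) {
      for (int i = 0; i < size; i++) {
        scores[i] = score(freqs[i], norms[i]);
      }
    }
  }

  /** Old shape (illustrative): one object per (freq, norm) pair. */
  static final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }
  }

  /** Pair-at-a-time: one scorer call per impact. */
  static float maxScoreFromList(List<Impact> impacts, SketchScorer scorer) {
    float max = 0f;
    for (Impact impact : impacts) {
      max = Math.max(max, scorer.score(impact.freq, impact.norm));
    }
    return max;
  }

  /** Parallel arrays: a single bulk scorer call, then a max-reduction over the scores. */
  static float maxScoreFromArrays(int size, int[] freqs, long[] norms, SketchScorer scorer) {
    float[] scores = new float[size];
    scorer.score(size, freqs, norms, scores);
    float max = 0f;
    for (int i = 0; i < size; i++) {
      max = Math.max(max, scores[i]);
    }
    return max;
  }

  /** Toy saturation formula, used only to exercise the sketch. */
  static final SketchScorer TOY = (freq, norm) -> freq / (freq + (float) norm);
}
```

Both methods return the same value for the same impacts. The array form simply hands the scorer a whole batch at once, which is what lets a bulk implementation amortize per-call overhead.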


github-actions bot commented Sep 3, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 10.4.0 milestone Sep 3, 2025
@jpountz jpountz requested a review from gf2121 September 3, 2025 16:16

jpountz commented Sep 3, 2025

wikibigall on my machine gives the following results:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
             FilteredOrStopWords       44.09      (2.0%)       43.68      (1.2%)   -0.9% (  -4% -    2%) 0.070
                 FilteredPrefix3      147.21      (2.1%)      146.02      (2.0%)   -0.8% (  -4% -    3%) 0.208
                   TermTitleSort       84.02      (3.4%)       83.45      (3.7%)   -0.7% (  -7% -    6%) 0.543
              CombinedOrHighHigh       21.82      (2.9%)       21.67      (3.2%)   -0.7% (  -6% -    5%) 0.477
              FilteredOrHighHigh       64.80      (1.9%)       64.38      (1.0%)   -0.6% (  -3% -    2%) 0.170
                     CountPhrase        4.16      (1.8%)        4.13      (1.7%)   -0.6% (  -4% -    2%) 0.269
                    CombinedTerm       37.20      (2.0%)       37.03      (2.6%)   -0.5% (  -4% -    4%) 0.533
               CombinedOrHighMed       82.29      (2.7%)       81.99      (3.3%)   -0.4% (  -6% -    5%) 0.704
                      TermDTSort      383.61      (2.3%)      382.24      (2.7%)   -0.4% (  -5% -    4%) 0.651
      FilteredOr2Terms2StopWords      142.98      (1.4%)      142.64      (0.7%)   -0.2% (  -2% -    1%) 0.498
                   TermMonthSort     3268.83      (2.6%)     3263.88      (2.8%)   -0.2% (  -5% -    5%) 0.860
                FilteredOr3Terms      161.33      (1.1%)      161.11      (0.8%)   -0.1% (  -2% -    1%) 0.659
                     CountOrMany       28.41      (1.9%)       28.39      (3.2%)   -0.1% (  -5% -    5%) 0.936
                  FilteredIntNRQ      296.73      (1.0%)      296.54      (1.2%)   -0.1% (  -2% -    2%) 0.855
         CountFilteredOrHighHigh      134.69      (1.0%)      134.61      (1.7%)   -0.1% (  -2% -    2%) 0.901
                 CountAndHighMed      297.37      (1.3%)      297.31      (1.2%)   -0.0% (  -2% -    2%) 0.958
                CountAndHighHigh      350.18      (2.0%)      350.11      (3.8%)   -0.0% (  -5% -    5%) 0.983
               TermDayOfYearSort      282.87      (2.0%)      282.81      (1.4%)   -0.0% (  -3% -    3%) 0.974
             CountFilteredOrMany       26.60      (2.0%)       26.60      (3.1%)   -0.0% (  -4% -    5%) 0.999
                 AndHighOrMedMed       48.78      (1.3%)       48.79      (1.2%)    0.0% (  -2% -    2%) 0.978
          CountFilteredOrHighMed      146.46      (0.9%)      146.48      (1.2%)    0.0% (  -2% -    2%) 0.966
                    FilteredTerm      153.98      (2.2%)      154.04      (1.5%)    0.0% (  -3% -    3%) 0.941
               FilteredOrHighMed      148.23      (1.3%)      148.37      (0.8%)    0.1% (  -1% -    2%) 0.777
                      OrHighRare      281.29      (8.3%)      281.70      (3.8%)    0.1% ( -10% -   13%) 0.943
                  FilteredOrMany       15.81      (2.0%)       15.83      (1.6%)    0.1% (  -3% -    3%) 0.799
                 CountOrHighHigh      332.30      (2.1%)      332.99      (4.0%)    0.2% (  -5% -    6%) 0.838
             CountFilteredPhrase       24.86      (2.4%)       24.98      (1.3%)    0.5% (  -3% -    4%) 0.417
             FilteredAndHighHigh       77.70      (1.8%)       78.09      (1.2%)    0.5% (  -2% -    3%) 0.293
                          OrMany       22.44      (4.3%)       22.55      (3.8%)    0.5% (  -7% -    8%) 0.682
             CombinedAndHighHigh       22.19      (1.3%)       22.32      (1.6%)    0.6% (  -2% -    3%) 0.204
                  CountOrHighMed      339.36      (2.4%)      341.52      (1.9%)    0.6% (  -3% -    4%) 0.346
     FilteredAnd2Terms2StopWords      212.32      (1.9%)      213.73      (1.4%)    0.7% (  -2% -    3%) 0.196
              CombinedAndHighMed       84.39      (1.2%)       85.02      (1.6%)    0.7% (  -2% -    3%) 0.091
               FilteredAnd3Terms      185.09      (2.9%)      186.51      (1.9%)    0.8% (  -3% -    5%) 0.321
                  FilteredPhrase       31.03      (1.5%)       31.27      (1.1%)    0.8% (  -1% -    3%) 0.059
            FilteredAndStopWords       64.07      (2.0%)       64.58      (1.2%)    0.8% (  -2% -    4%) 0.120
                        Or3Terms      222.41      (6.2%)      224.23      (5.2%)    0.8% ( -10% -   13%) 0.653
                       CountTerm     8641.68      (4.1%)     8727.93      (4.3%)    1.0% (  -7% -    9%) 0.449
              Or2Terms2StopWords      197.83      (5.0%)      199.85      (4.5%)    1.0% (  -8% -   11%) 0.496
                       OrHighMed      243.82      (7.2%)      246.48      (6.0%)    1.1% ( -11% -   15%) 0.602
              FilteredAndHighMed      151.68      (3.8%)      153.36      (2.5%)    1.1% (  -5% -    7%) 0.279
                AndMedOrHighHigh       80.46      (5.0%)       81.47      (4.1%)    1.3% (  -7% -   10%) 0.384
                     OrStopWords       46.75      (8.1%)       47.37      (6.8%)    1.3% ( -12% -   17%) 0.575
                      OrHighHigh       73.81      (9.1%)       74.95      (7.5%)    1.5% ( -13% -   19%) 0.558
             And2Terms2StopWords      196.07      (6.2%)      199.96      (4.1%)    2.0% (  -7% -   13%) 0.235
                       And3Terms      228.82      (7.7%)      233.77      (5.2%)    2.2% (  -9% -   16%) 0.297
                    AndStopWords       44.61      (9.5%)       45.68      (6.4%)    2.4% ( -12% -   20%) 0.348
                            Term      617.63      (8.3%)      636.08      (5.3%)    3.0% (  -9% -   18%) 0.176
                      AndHighMed      188.71     (10.7%)      194.64      (7.2%)    3.1% ( -13% -   23%) 0.275
                     AndHighHigh       64.17     (11.8%)       66.29      (8.0%)    3.3% ( -14% -   26%) 0.300

The p-values are high because of substantial run-to-run variance, but the queries we would expect to speed up (Term, AndHighMed, AndHighHigh) sit at the bottom of the table with the largest gains, around 3%, so the change may give a tiny speedup in practice.


@gf2121 gf2121 left a comment

This looks like the right direction to me, though the improvement does not seem very significant. Thank you!

@jpountz jpountz merged commit 2dcfd89 into apache:main Sep 8, 2025
8 checks passed
jpountz added a commit that referenced this pull request Sep 8, 2025

Co-authored-by: Guo Feng <[email protected]>