span_not's documentation is misleading and "post" may not work as expected #27134

@gcampbell-epiq

Elasticsearch version documentation 5.6, runtime 5.3.2
Plugins installed: none
JVM version 1.8.0_71
OS version Windows Server 2008

span_not Documentation Issue

Description of the problem including expected versus actual behavior:

Documentation on all span queries is very sparse, particularly for span_not. Deviating even slightly from the documented example produces unexpected results.

Steps to reproduce:

  1. Create default index
    PUT span-not-test
  2. Insert docs containing "la hoya", "hoya la hoya", and "la hoya hoya"
PUT span-not-test/doc/1
{
    "field1": "la hoya"
}
PUT span-not-test/doc/2
{
    "field1": "hoya la hoya"
}
PUT span-not-test/doc/3
{
    "field1": "la hoya hoya"
}
  3. Run the documented example search, modified slightly to increase slop and set in_order to false
POST span-not-test/_search
{
    "query": {
        "span_not" : {
            "include" : {
                "span_term" : { "field1" : "hoya" } 
            },
            "exclude" : {
                "span_near" : {
                    "clauses" : [
                        { "span_term" : { "field1" : "la" } },
                        { "span_term" : { "field1" : "hoya" } }
                    ],
                    "slop" : 50,
                    "in_order" : false
                }
            }
        }
    }
}

Expected results:
No docs should be returned, because all of the test docs contain "la" and "hoya" within 50 tokens of each other.

Actual results:
Doc 3 is returned.

I assume the issue is that the "included" runs of tokens are not checked for "exclusion" if they overlap with another run of included tokens.

Whatever the case, I understand that using dist/pre/post would yield correct results, but the documentation does not help me understand why this is the case.

SpanNotQuery post Issue

I have some single-valued fields whose length is unbounded. To avoid silently imposing a length limit, and so that a span_not query always returns valid results, I attempted to use int.max (2^31 - 1) as the dist in a span_not query. The problem is that SpanNotQuery takes that dist value and adds it to the end position of an "included" token, which overflows and yields roughly -2 billion. That negative position is then compared against an "excluded" term's position to determine whether the excluded term's position is smaller, i.e. within dist of the included term. So, even though my dist was high, an excluded term in a position within range of my included term was not detected.

You can see the relevant code here in the Lucene repo.
~\core\src\java\org\apache\lucene\search\spans\SpanNotQuery.java
line 181
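
The arithmetic described above can be reproduced in isolation. The following is a minimal sketch, not Lucene's actual code: the class and method names (`SpanNotOverflowDemo`, `naiveIsClear`, `safeIsClear`) and the sample positions are hypothetical, and the comparison is only assumed to have roughly this shape. It shows how adding a very large `post` to an end position wraps around in 32-bit `int` arithmetic, and how widening to `long` avoids the wrap.

```java
// Simplified illustration of the overflow; NOT the real SpanNotQuery code.
class SpanNotOverflowDemo {

    // Hypothetical check: "the exclude span starts at or after includeEnd + post",
    // i.e. the include span is clear of the exclude span and may be returned.
    // The int addition silently wraps when post is close to Integer.MAX_VALUE.
    static boolean naiveIsClear(int includeEnd, int post, int excludeStart) {
        return includeEnd + post <= excludeStart;
    }

    // Same check with the sum widened to long, so it cannot wrap.
    static boolean safeIsClear(int includeEnd, int post, int excludeStart) {
        return (long) includeEnd + post <= excludeStart;
    }

    public static void main(String[] args) {
        int includeEnd = 3;            // assumed end position of the included term
        int excludeStart = 10;         // assumed start position of the excluded term
        int post = Integer.MAX_VALUE;  // "unlimited" distance, as attempted in the issue

        // 3 + 2147483647 wraps to -2147483646 in int arithmetic.
        System.out.println("wrapped sum: " + (includeEnd + post));

        // true: the include span is wrongly treated as clear of the exclude span.
        System.out.println("naive: " + naiveIsClear(includeEnd, post, excludeStart));

        // false: the exclude span is correctly seen as within range.
        System.out.println("safe:  " + safeIsClear(includeEnd, post, excludeStart));
    }
}
```

Under these assumptions, a practical workaround on the query side would be to pick a dist that is large but leaves headroom below the int limit, so that adding any realistic token position cannot overflow.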

Labels: :Search/Search, >docs