Varying numbers of results from scan and scroll #16555

@ApproximateIdentity

Right off the bat, here's a little info:

$ uname -a
Linux jj-big-box 3.19.0-49-generic #55~14.04.1-Ubuntu SMP Fri Jan 22 11:24:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ curl -XGET 'localhost:9200'
{
  "status" : 200,
  "name" : "Bast",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.4",
    "build_hash" : "0d3159b9fc8bc8e367c5c40c09c2a57c0032b32e",
    "build_timestamp" : "2015-12-15T11:25:18Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

I've seen some similar posts, but I've had trouble squaring their results with mine. I've noticed that I do not receive a consistent number of documents when running scan and scroll in Elasticsearch. Here is Python code exhibiting the behavior (hopefully the use of raw sockets is not too confusing; at first I wanted to make sure the problem had nothing to do with elasticsearch-py, which is why I went the route of raw code):

import socket
import json
import re

HOST = 'localhost'
PORT = 9200

CRLF = "\r\n\r\n"

init_msg = """
GET /index/document/_search?search_type=scan&scroll=15m&timeout=30&size=10 HTTP/1.1
Host: localhost:9200
Accept-Encoding: identity
Content-Length: 94
connection: keep-alive

{"query": {"regexp": {"date_publ": "2001.*"}}, "_source": ["doc_id", "date_publ", "abstract"]}
"""

scroll_msg = """
GET /_search/scroll?scroll=15m HTTP/1.1
Host: localhost:9200
Accept-Encoding: identity
Content-Length: {sid_length}
connection: keep-alive

{sid}
"""

def get_stream(host, port, verbose=True):
    # Set up the socket.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.connect((host, port))
    s.send(init_msg)

    # Fetch scroll_id and total number of hits.
    data = s.recv(4096)
    payload = json.loads(data.split(CRLF)[-1])
    sid = payload['_scroll_id']
    total_hits = payload['hits']['total']

    if verbose:
        print "Total hits: {}".format(total_hits)

    # Iterate through results.
    while True:
        # Send data request.
        msg = scroll_msg.format(sid=sid, sid_length=len(sid))
        s.send(msg)

        # Fetch the response body.
        data = s.recv(1024)
        header, body = data.split(CRLF)
        content_length = int(re.findall(r'Content-Length: (\d*)', header)[0])
        while len(body) < content_length:
            body += s.recv(1024)

        # Extract results from response body.
        payload = json.loads(body)
        sid = payload['_scroll_id']
        hits = payload['hits']['hits']

        #print payload['_shards']

        if not hits:
            break

        for hit in hits:
            yield hit


for count, _ in enumerate(get_stream(HOST, PORT), 1): pass

print count

When I run that a few times, I get the following:

$ python new_test.py 
Total hits: 56366
11650
$ python new_test.py 
Total hits: 56366
24550
$ python new_test.py 
Total hits: 56366
8550

Now if I un-comment the line #print payload['_shards'], the output ended up being the following during one run:

Total hits: 56366
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}

...

{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
28110

and the following on the next run:

Total hits: 56366
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}

...

{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 3, u'failed': 0, u'total': 3}
{u'successful': 1, u'failed': 0, u'total': 1}
{u'successful': 0, u'failed': 0, u'total': 0}
56366

Note: The last run apparently returned all documents. This is the first time I've seen this during this experimentation.
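
A quick arithmetic check on the counts above, as a sketch only. It assumes that with search_type=scan the size parameter applies per shard, so a full page here would be 10 × 5 = 50 hits; the helper name page_remainder is made up for illustration. It shows which totals are whole multiples of a full page and does not explain why the scroll stopped.

```python
# Hypothetical helper: split a run's hit count into full pages of
# size * shards hits plus whatever is left over.
SIZE_PER_SHARD = 10   # size=10 from the initial request
SHARDS = 5            # from the _shards output above

def page_remainder(count, size=SIZE_PER_SHARD, shards=SHARDS):
    """Return (full_pages, leftover_hits) for a run that yielded `count` hits."""
    return divmod(count, size * shards)

# Counts printed by the runs shown above.
runs = [11650, 24550, 8550, 28110, 56366]
summary = {}
for count in runs:
    summary[count] = page_remainder(count)
    print("{0}: {1} full pages, {2} left over".format(count, *summary[count]))
```

Under that assumption, the three short runs (11650, 24550, 8550) are exact multiples of 50, while 28110 and 56366 are not.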

Does anyone have any idea what's going on here? As far as I can tell, I never run into these issues when the search does not include the regular expression, but other than that I'm at a loss.
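
One thing the script does not guard against is the initial response arriving in more than one recv call, since it reads it with a single s.recv(4096). Below is a minimal sketch of reading one complete response by Content-Length. The names read_response and FakeSocket are hypothetical and used only for illustration, and whether this has any bearing on the varying counts is not established.

```python
import re

HEADER_END = b"\r\n\r\n"

def read_response(sock, bufsize=1024):
    """Read one HTTP response from `sock` and return (header, body) as bytes.

    Assumes the server sends a Content-Length header (no chunked encoding).
    """
    data = b""
    # Keep reading until the blank line that ends the headers has arrived.
    while HEADER_END not in data:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise IOError("connection closed before headers were complete")
        data += chunk
    header, body = data.split(HEADER_END, 1)
    match = re.search(br"Content-Length: (\d+)", header, re.IGNORECASE)
    if match is None:
        raise ValueError("no Content-Length header in response")
    content_length = int(match.group(1))
    # Keep reading until the whole body has arrived.
    while len(body) < content_length:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise IOError("connection closed before body was complete")
        body += chunk
    return header, body[:content_length]


class FakeSocket(object):
    """Stand-in for a socket that hands data back in small pieces."""
    def __init__(self, payload, piece=7):
        self._pieces = [payload[i:i + piece] for i in range(0, len(payload), piece)]

    def recv(self, bufsize):
        # Ignores bufsize; returns the next piece, or b"" once exhausted.
        return self._pieces.pop(0) if self._pieces else b""


# Made-up response used only to exercise the helper.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 12\r\n\r\n{\"hits\": []}"
header, body = read_response(FakeSocket(raw))
```

In the script, the same helper could replace both the single recv after the initial request and the manual loop inside the scroll loop, so that both responses are read the same way.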

Thanks for any help!

Labels: :Search/Search, >bug, discuss
