Closed
Description
If _source is disabled or filtered in the mappings, we add a _recovery_source field to support shard recoveries and CCR. Once that field is no longer needed, later merges drop it to reclaim space.
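For reference, a minimal sketch of an index body that would put an index into this state, assuming a single dense_vector field. The field name "vector", the dims value, and the helper name build_index_body are hypothetical and are not taken from the rally track; only the "_source": {"enabled": false} mapping is the setting described above.

```python
import json


def build_index_body(dims: int = 3) -> dict:
    """Build a create-index request body with _source disabled.

    The field name "vector" and the default dims are illustrative assumptions.
    """
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            # Disabling _source is what makes Elasticsearch add _recovery_source.
            "_source": {"enabled": False},
            "properties": {
                "vector": {"type": "dense_vector", "dims": dims},
            },
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of PUT /<index>.
    print(json.dumps(build_index_body(), indent=2))
```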
In certain cases, it appears that _recovery_source can stick around even after a merge. I noticed this issue through the dense vector rally track. This command indexes 100,000 documents with _source disabled, then force merges to 1 segment:
esrally race --track=dense_vector --challenge=index-and-search --track-params="ingest_percentage:10" --on-error abort
At the end, the shard was larger than expected:
195M data/indices/gPefBjHjTCCxU_EnbSuGrQ/0/index
Using the disk usage API, we see this is due to recovery source:
"_recovery_source" : {
"total" : "149.9mb",
"total_in_bytes" : 157209753,
....
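A rough sketch of how the per-field breakdown above could be pulled and ranked, assuming a cluster reachable at a base URL you supply and the response layout of the analyze index disk usage API (a top-level key per index with a "fields" object). The helper names fetch_disk_usage and field_sizes are hypothetical.

```python
import json
import urllib.request


def fetch_disk_usage(base_url: str, index: str) -> dict:
    """Call POST /<index>/_disk_usage?run_expensive_tasks=true and return the JSON."""
    url = f"{base_url}/{index}/_disk_usage?run_expensive_tasks=true"
    req = urllib.request.Request(url, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def field_sizes(response: dict, index: str) -> list:
    """Return (field, total_in_bytes) pairs, largest first.

    Assumes the response has the shape {index: {"fields": {name: {"total_in_bytes": n}}}}.
    """
    fields = response[index]["fields"]
    pairs = [(name, stats["total_in_bytes"]) for name, stats in fields.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

With a live cluster this would be used as field_sizes(fetch_disk_usage(base_url, index), index); a leftover _recovery_source would then show up at or near the top of the list.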
There are no replicas, so the force merge should have removed recovery source. I can reproduce this with both 1 and 2 shards. I haven't found a small-scale reproduction yet.