Skip to content

Cortex return 5xx due a single ingester outage #4381

@alanprot

Description

@alanprot

Describe the bug
Cortex can return 5xx due a single ingester failure when a tenant is being throttled (4xx). In this case, distributor can return the error from the bad ingester (5xx) even though the other 2 returned 4xx. See this.

Looking at this code seems that if we have replication factor = 2, 1 ingester down and the other 2 returning 4xx we can have for example:

4xx + 5xx + 4xx = 5xx
or
5xx + 4xx + 4xx = 4xx
etc

To Reproduce
Steps to reproduce the behavior:
I could create a unit test that reproduce the behavior:
alanprot@fd36d97

  1. Start Cortex (SHA or version)
    a4bf103
  2. Perform Operations(Read/Write/Others)
    Write
    Expected behavior
    Cortex should return the error respecting the quorum of the response from ingesters.
    So, if 2 ingesters return 4xx and one 5xx, cortex should return 4xx. This means that if distributor receive one 4xx and one 5xx, it needs to wait the response of the third ingester.

Environment:

  • Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
    Kubernetes
  • Deployment tool: [e.g., helm, jsonnet]
    Helm
    Storage Engine
  • Blocks
  • Chunks

Additional Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    keepaliveSkipped by stale bot

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions