[SQS] Exceeding/Changing visibility timeout may lead to messages being processed twice in FIFO queues

**Type**: Bug

**Component**:
SQS

**Describe the bug**
Im not quite sure if this is a bug report, a feature request or just a misunderstanding on my side. So please bear with me 😇

We use a fifo queue to process "commands" that affect a 3rd party system. We use a message group id that groups commands that should be processed sequentially. One day we saw an increased latency in processing these commands and we saw exceptions like the following in our logs
```
com.amazonaws.services.sqs.model.AmazonSQSException: Value <receiptHandle> for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.
...
com.amazonaws.services.sqs.AmazonSQSClient.executeDeleteMessage(AmazonSQSClient.java:920)
...
```

The reason for this is simple, we have hit our visibility timeout. Every message (of the same receiveMessage call) after the first exception also printed the same exception.

That made me think about the implementation of the `MessageGroupExecutor` here. Theres https://github.com/awspring/spring-cloud-aws/pull/192 that fixes the case where we should stop processing messages whenever an exception was raised but I think theres another issue.

Imagine we call `receiveMessage` with the default message count of 10 and (worst case scenario) we receive 10 messages for the same message group. As soon as we receive them the default visibility timeout of the queue will start ticking. Lets say we didnt change the default, this means we have a default of 30s for each message. Really? I think no. In the scenario above we will have 30s for all 10 messages because if we spend 30s to process the first 5 messages, we wont be able to delete the sixth messages (see error above).

That basically means we have processed the message, we were not able to delete the message and it will be visible to the other consumers. **Actually we will process it twice**. 

I can only think of some ways to work around this right now (from a userland perspective):
- settings the maxMessageCount to 1 (which will affect throughput and increase cost)
- always use `Visibility` when using fifo queues and call `extend` at the very beginning of your handler to force an exception if the `receiptHandle` has already timed out

Option 2 is basically our current approach. We have a little helper class that runs "side effects" of SQS handlers which basically will call `extend` as the very first action and then starts a timer task that extends the timeout every X while executing the real action. At the end we introduced an eager check if the messages is still valid in the sense of calling an SQS API 🤔

Of course this will only work after https://github.com/awspring/spring-cloud-aws/pull/192 has been merged to master/released.

Would be happy if someone with deeper knowledge could enlighten me here? Am I totally wrong with assumptions about SQS, Fifo or the implementation here? 😇

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[SQS] Exceeding/Changing visibility timeout may lead to messages being processed twice in FIFO queues #196

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[SQS] Exceeding/Changing visibility timeout may lead to messages being processed twice in FIFO queues #196

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions