-
-
Notifications
You must be signed in to change notification settings - Fork 350
Description
Type: Bug
Component:
SQS
Describe the bug
Im not quite sure if this is a bug report, a feature request or just a misunderstanding on my side. So please bear with me 😇
We use a fifo queue to process "commands" that affect a 3rd party system. We use a message group id that groups commands that should be processed sequentially. One day we saw an increased latency in processing these commands and we saw exceptions like the following in our logs
com.amazonaws.services.sqs.model.AmazonSQSException: Value <receiptHandle> for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.
...
com.amazonaws.services.sqs.AmazonSQSClient.executeDeleteMessage(AmazonSQSClient.java:920)
...
The reason for this is simple, we have hit our visibility timeout. Every message (of the same receiveMessage call) after the first exception also printed the same exception.
That made me think about the implementation of the MessageGroupExecutor
here. Theres #192 that fixes the case where we should stop processing messages whenever an exception was raised but I think theres another issue.
Imagine we call receiveMessage
with the default message count of 10 and (worst case scenario) we receive 10 messages for the same message group. As soon as we receive them the default visibility timeout of the queue will start ticking. Lets say we didnt change the default, this means we have a default of 30s for each message. Really? I think no. In the scenario above we will have 30s for all 10 messages because if we spend 30s to process the first 5 messages, we wont be able to delete the sixth messages (see error above).
That basically means we have processed the message, we were not able to delete the message and it will be visible to the other consumers. Actually we will process it twice.
I can only think of some ways to work around this right now (from a userland perspective):
- settings the maxMessageCount to 1 (which will affect throughput and increase cost)
- always use
Visibility
when using fifo queues and callextend
at the very beginning of your handler to force an exception if thereceiptHandle
has already timed out
Option 2 is basically our current approach. We have a little helper class that runs "side effects" of SQS handlers which basically will call extend
as the very first action and then starts a timer task that extends the timeout every X while executing the real action. At the end we introduced an eager check if the messages is still valid in the sense of calling an SQS API 🤔
Of course this will only work after #192 has been merged to master/released.
Would be happy if someone with deeper knowledge could enlighten me here? Am I totally wrong with assumptions about SQS, Fifo or the implementation here? 😇