[SPARK-9216][Streaming] Define KinesisBackedBlockRDDs #7578
|
@zsxwing @koeninger Can you guys take a look?
|
Test build #37989 has finished for PR 7578 at commit
|
|
Test build #37994 has finished for PR 7578 at commit
|
|
Test build #38016 has finished for PR 7578 at commit
|
|
Test build #38022 has finished for PR 7578 at commit
|
Forgive my unfamiliarity with Kinesis, but are sequence numbers contiguous and increasing? That seems to be the assumption here. In the doc I find the scary quote:
Sequence numbers cannot be used as indexes to sets of data within the same stream. To logically separate sets of data, use partition keys or create a separate stream for each data set.
Here I also see:
Sequence numbers generally increase over time. To guarantee strictly increasing ordering, use the SequenceNumberForOrdering parameter
If they aren't, you might find yourself with:
- missing data, which may have a sequence number fromSeqNumber < x < lastSeqNumber but is not iterated on here, because you request here with ShardIteratorType.AT_SEQUENCE_NUMBER rather than ShardIteratorType.TRIM_HORIZON
- superfluous data, because you're not checking that fromSeqNumber < nextRecord.getSequenceNumber() < lastSeqNumber before returning nextRecord.
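To make the suggested bounds check concrete, here is a minimal sketch. It is not code from this PR: the class and method names (SeqRange, inRange) are hypothetical, and inclusive bounds are an assumption. Kinesis sequence numbers are decimal strings, so the sketch compares them numerically rather than as text.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration only; not part of the PR.
final class SeqRange {
    // Returns true when from <= seq <= to, comparing the decimal strings as numbers.
    // Inclusive bounds are an assumption made for this sketch.
    static boolean inRange(String seq, String from, String to) {
        BigInteger s = new BigInteger(seq);
        return s.compareTo(new BigInteger(from)) >= 0
            && s.compareTo(new BigInteger(to)) <= 0;
    }
}
```

A plain string comparison would get this wrong whenever the numbers differ in length ("100" sorts before "99" as text), which is why the sketch converts to BigInteger first.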
Good doubts!
Sequence numbers are increasing within each shard for the data being consumed. That's how the Kinesis Client Library keeps track of the data read (
What the document refers to as SequenceNumberForOrdering is for ensuring ordering guarantees at the time of pushing data into Kinesis. Once data has been pushed in and Kinesis internals have assigned sequence numbers to each record, the ordering is well-defined and guaranteed. On consumption, the ordering will always be the same.
However, sequence numbers are not contiguous (unlike Kafka offsets), hence they cannot be used as an index, with 0 being the first record, 1 the second, 2 the third. And therein lies the problem of why we cannot use the direct approach for Kinesis. See the linked design doc for more explanations and discussions.
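The difference between "ordered" and "contiguous" can be sketched with an in-memory stand-in for a shard. This is an illustration only, not the PR's implementation: ShardRangeSketch and readRange are hypothetical names, and the list of sequence numbers stands in for what a shard iterator would return.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one shard; not the Kinesis API.
final class ShardRangeSketch {
    // shardSeqNumbers holds sequence numbers in shard order: increasing, with gaps.
    static List<BigInteger> readRange(List<BigInteger> shardSeqNumbers,
                                      BigInteger from, BigInteger to) {
        List<BigInteger> out = new ArrayList<>();
        for (BigInteger seq : shardSeqNumbers) {
            if (seq.compareTo(from) < 0) {
                continue; // before the requested range
            }
            if (seq.compareTo(to) > 0) {
                break; // past the range; the fixed ordering lets us stop here
            }
            out.add(seq);
        }
        return out;
    }
}
```

For a shard holding sequence numbers 10, 17, 42, 99, 250, reading the range 17 to 99 yields three records, even though 99 - 17 + 1 is 83. The record count of a range is only known after reading it, so a range cannot be split or indexed by arithmetic on its endpoints the way a Kafka offset range can.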
|
Just some minor comments. Otherwise LGTM.
|
@zsxwing I made a few changes to add retry logic and timeouts, to make it more robust. Could you take a look, especially at the retry logic?
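For readers unfamiliar with the pattern, retry logic with a timeout might look roughly like the sketch below. It is not the code added in this PR: RetrySketch, retryOrTimeout, and all parameters are hypothetical, and the doubling backoff is an assumption chosen for illustration.

```java
import java.util.function.Supplier;

// Hypothetical retry helper for illustration only; not the PR's implementation.
final class RetrySketch {
    // Runs body until it succeeds, retries run out, or the time budget is spent.
    static <T> T retryOrTimeout(Supplier<T> body, int maxRetries,
                                long timeoutMillis, long initialWaitMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long waitMillis = initialWaitMillis;
        RuntimeException lastError = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return body.get();
            } catch (RuntimeException e) {
                lastError = e; // remember the failure and consider retrying
            }
            // Stop if retries are used up or the next wait would overrun the deadline.
            if (attempt == maxRetries
                    || System.currentTimeMillis() + waitMillis > deadline) {
                break;
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;
            }
            waitMillis *= 2; // assumed backoff policy: double the wait each time
        }
        throw new IllegalStateException("Gave up after retries or timeout", lastError);
    }
}
```

A caller would wrap each remote call, for example fetching a shard iterator or a batch of records, in such a helper so that transient failures are retried while a persistent failure still surfaces within a bounded time.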
|
Test build #38286 has finished for PR 7578 at commit
|
nit: a redundant space after nextBytes
|
Test build #38290 has finished for PR 7578 at commit
|
|
LGTM except the minor style issue.
|
Test build #1193 has finished for PR 7578 at commit
|
|
Test build #38291 has finished for PR 7578 at commit
|
|
Thanks @zsxwing for reviewing. I am going to merge it into master.
For more information see master JIRA: https://issues.apache.org/jira/browse/SPARK-9215
Design Doc: https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit