[SPARK-9216][Streaming] Define KinesisBackedBlockRDDs #7578

tdas · 2015-07-21T23:10:49Z

For more information see master JIRA: https://issues.apache.org/jira/browse/SPARK-9215
Design Doc: https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit

tdas · 2015-07-21T23:12:53Z

@zsxwing @koeninger Can you guys take a look?

SparkQA · 2015-07-21T23:16:12Z

Test build #37989 has finished for PR 7578 at commit 5da3995.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

SparkQA · 2015-07-21T23:28:37Z

Test build #37994 has finished for PR 7578 at commit 4a36096.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

SparkQA · 2015-07-22T03:07:46Z

Test build #38016 has finished for PR 7578 at commit 575bdbc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

SparkQA · 2015-07-22T03:44:33Z

Test build #38022 has finished for PR 7578 at commit 8874b70.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(
- case class Data(topic: Vector, index: Int)
- case class Data(globalTopicTotals: Vector)
- case class VertexData(id: Long, topicWeights: Vector)
- case class EdgeData(srcId: Long, dstId: Long, tokenCounts: Double)

huitseeker · 2015-07-22T16:19:02Z

...as/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

Forgive my unfamiliarity with Kinesis, but are sequence numbers contiguous and increasing? That seems to be the assumption here. In the doc I find the scary quote:

Sequence numbers cannot be used as indexes to sets of data within the same stream. To logically separate sets of data, use partition keys or create a separate stream for each data set.

Here I also see:

Sequence numbers generally increase over time. To guarantee strictly increasing ordering, use the SequenceNumberForOrdering parameter

If they aren't, you might find yourself with:

- missing data, which may have a sequence number fromSeqNumber < x < lastSeqNumber but is not iterated on here, because you request with ShardIteratorType.AT_SEQUENCE_NUMBER rather than ShardIteratorType.TRIM_HORIZON
- superfluous data, because you're not checking that fromSeqNumber < nextRecord.getSequenceNumber() < lastSeqNumber before returning nextRecord.

tdas · 2015-07-23

Good doubts!

Sequence numbers are increasing within each shard for the data being consumed. That's how the Kinesis Client Library keeps track of the data read (

What the document refers to as SequenceNumberForOrdering ensures ordering guarantees at the time of pushing data into Kinesis. Once data has been pushed in and Kinesis internals have assigned a sequence number to each record, the ordering is well-defined and guaranteed. On consumption, the ordering will always be the same.

However, sequence numbers are not contiguous (unlike Kafka offsets), so they cannot be used as an index, with 0 being the first record, 1 the second, 2 the third. And therein lies the problem of why we cannot use the direct approach for Kinesis. See the linked design doc for more explanations and discussion.
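To make the point about "increasing but not contiguous" concrete, here is a minimal sketch of reading an inclusive sequence-number range from a single shard. It is not the code in this PR: `Record`, `RangeReadSketch`, and `readRange` are hypothetical names, the shard is an in-memory `Seq` rather than a Kinesis client, and sequence numbers are modelled as `BigInt` values on the assumption that they can be compared numerically.

```scala
// Hypothetical in-memory stand-in for the records of one shard (not the PR's code).
final case class Record(seqNumber: BigInt, data: String)

object RangeReadSketch {
  // Read the inclusive range [fromSeq, toSeq] from a shard whose sequence
  // numbers are increasing but have gaps between them.
  def readRange(shard: Seq[Record], fromSeq: BigInt, toSeq: BigInt): Seq[Record] =
    shard
      .dropWhile(_.seqNumber < fromSeq) // start at fromSeq, as an AT_SEQUENCE_NUMBER iterator would
      .takeWhile(_.seqNumber <= toSeq)  // stop after the record carrying toSeq

  // Sample shard: ordered, but the numbers are far from contiguous.
  val sampleShard: Seq[Record] = Seq(
    Record(BigInt(100), "a"),
    Record(BigInt(107), "b"),
    Record(BigInt(250), "c"),
    Record(BigInt(251), "d"),
    Record(BigInt(900), "e")
  )
}
```

With the sample shard, `RangeReadSketch.readRange(RangeReadSketch.sampleShard, BigInt(107), BigInt(251))` yields the records `b`, `c`, `d`. The gap between 107 and 250 does not matter because the ordering within the shard is fixed; what the range cannot tell you is how many records it contains, which is why sequence numbers do not work as Kafka-style offsets.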

zsxwing · 2015-07-23T16:45:03Z

Just some minor comments. Otherwise LGTM.

tdas · 2015-07-24T01:01:18Z

@zsxwing I made a few changes to add retry logic and timeouts, to make it more robust. Could you take a look, especially at the retry logic?
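For readers unfamiliar with the pattern, retry logic combined with a timeout can be sketched roughly as follows. This is an illustration only, not the implementation added in this PR: `RetrySketch`, `withRetries`, and all parameter names are hypothetical, and the doubling backoff is an assumed policy.

```scala
import scala.util.control.NonFatal

object RetrySketch {
  // Run `body`, retrying on non-fatal errors until it succeeds, the retry
  // budget is spent, or the overall timeout has elapsed. All names and the
  // doubling backoff are assumptions made for this sketch.
  def withRetries[T](maxRetries: Int, timeoutMs: Long, initialWaitMs: Long)(body: => T): T = {
    val start = System.currentTimeMillis()
    var attempt = 0
    var waitMs = initialWaitMs
    var result: Option[T] = None
    var lastError: Throwable = null

    def timedOut: Boolean = System.currentTimeMillis() - start >= timeoutMs

    while (result.isEmpty && attempt <= maxRetries && !timedOut) {
      if (attempt > 0) {
        Thread.sleep(waitMs) // back off before every retry
        waitMs *= 2
      }
      try {
        result = Some(body)
      } catch {
        case NonFatal(e) => lastError = e // remember the failure and try again
      }
      attempt += 1
    }

    result.getOrElse {
      throw new RuntimeException(s"Gave up after $attempt attempt(s)", lastError)
    }
  }
}
```

A call such as `RetrySketch.withRetries(3, 10000L, 100L) { fetchRecords() }`, where `fetchRecords` stands for any call that may fail transiently, would make up to four attempts within ten seconds. Bounding both the number of retries and the total elapsed time keeps a task from hanging indefinitely on a service that never recovers.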

SparkQA · 2015-07-24T01:05:43Z

Test build #38286 has finished for PR 7578 at commit c4f25d2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

zsxwing · 2015-07-24T01:27:59Z

...as/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

nit: a redundant space after nextBytes

SparkQA · 2015-07-24T01:43:39Z

Test build #38290 has finished for PR 7578 at commit 5082a30.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

zsxwing · 2015-07-24T01:49:10Z

LGTM except the minor style issue.

SparkQA · 2015-07-24T02:03:36Z

Test build #1193 has finished for PR 7578 at commit 5082a30.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

SparkQA · 2015-07-24T02:09:02Z

Test build #38291 has finished for PR 7578 at commit 543d208.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SequenceNumberRange(
- case class SequenceNumberRanges(ranges: Array[SequenceNumberRange])
- class KinesisBackedBlockRDDPartition(
- class KinesisBackedBlockRDD(
- class KinesisSequenceRangeIterator(

tdas · 2015-07-24T03:06:19Z

Thanks @zsxwing for reviewing. I am going to merge it into master.

tdas added 3 commits July 19, 2015 18:50

Added KinesisBackedBlockRDD

3ae0814

Merge remote-tracking branch 'apache-github/master' into kinesis-rdd

528e206

Changed KinesisSuiteHelper to KinesisFunSuite

5da3995

Add license

4a36096

Fix scala style issues

575bdbc

Updated Kinesis RDD

8874b70

huitseeker reviewed Jul 22, 2015

tdas added 3 commits July 23, 2015 17:47

Added retry logic to make it more robust

f6e35c8

Minor update

d3d64d1

Addressed comment

c4f25d2

tdas added 2 commits July 23, 2015 18:23

Addressed comments

3f40c2d

Fixed scala style

5082a30

zsxwing reviewed Jul 24, 2015

Fixed scala style

543d208

asfgit closed this in d249636 Jul 24, 2015

4 participants