Conversation

@maropu (Member) commented Dec 2, 2016

What changes were proposed in this pull request?

This PR makes the input rates in the timeline flatter for Spark Streaming + Kinesis.
Since Kinesis workers fetch records and push them into block generators in bulk, the timeline in the web UI shows many spikes when maxRates is applied (see Figure 1 below). This fix splits the fetched input records into multiple addRecords calls.

Figure 1. Apply maxRates=500 in vanilla Spark
[screenshot: apply_limit_in_vanilla_spark]

Figure 2. Apply maxRates=500 in Spark with my patch
[screenshot: apply_limit_in_spark_with_my_patch]

How was this patch tested?

Added tests that check that input records are split into multiple addRecords calls.
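
For illustration, a minimal sketch of the splitting idea (the identifiers here are illustrative, not the exact ones in the patch; the reviewed diff excerpts below show the actual change):

// Hypothetical helper: push a fetched batch to the block generator in chunks no
// larger than the receiver's current rate limit, so the web UI timeline sees a
// steadier input rate instead of one large spike per fetch.
def storeWithLimit[T](fetched: java.util.List[T], currentLimit: Int)
                     (addRecords: java.util.List[T] => Unit): Unit = {
  require(currentLimit > 0, "rate limit must be positive")
  var start = 0
  while (start < fetched.size) {
    val end = math.min(start + currentLimit, fetched.size)
    addRecords(fetched.subList(start, end))  // each call stores at most currentLimit records
    start = end
  }
}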

@dav009 commented Dec 2, 2016

👍 Just had a play with it; it solves my original issue.

@SparkQA commented Dec 2, 2016

Test build #69546 has finished for PR 16114 at commit 4f17a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (batch.size() <= maxRecords) {
  addRecords(batch, checkpointer)
} else {
  val numIter = batch.size / maxRecords
Member:

Is this clause a bit simpler as ...

for (start <- 0 until batch.size by maxRecords) {
  addRecords(batch.subList(start, math.min(start + maxRecords, batch.size)), checkpointer)
}

Member Author:

Thanks! I'll fix


/** Return the current rate limit defined in [[BlockGenerator]]. */
private[kinesis] def getCurrentLimit: Int = {
  assert(blockGenerator != null)
Member:

This is pretty trivial, but do we use runtime assertions in general in the project? The next line already fails when it's null, whether assertions are on or not.

Member Author:

I just added this assertion to match the other parts, such as assert(kinesisCheckpointer != null, "Kinesis Checkpointer not initialized!"), because both are initialized in onStart. But I have no strong opinion on this, and it's fine with me to remove it.

Contributor:

I would be okay with keeping it if we add a useful error message in case this assertion doesn't hold, e.g.
assert(blockGenerator != null, "Expected blockGenerator to be set for the receiver before the processor received records")

or something like that

@SparkQA commented Dec 2, 2016

Test build #69558 has finished for PR 16114 at commit 9a516e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    batch: List[Record], checkpointer: IRecordProcessorCheckpointer): Unit = {
  val maxRecords = receiver.getCurrentLimit
  if (batch.size() <= maxRecords) {
    addRecords(batch, checkpointer)
Member:

I think the for loop even takes care of this case, but no big deal either way. It seems like a reasonable change.

Member Author:

Aha, I see. I'll fix, thanks!

@maropu (Member Author) commented Dec 3, 2016

@srowen Do you know of qualified maintainers for this component?

@SparkQA commented Dec 3, 2016

Test build #69620 has finished for PR 16114 at commit f381ac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def processRecordsWithLimit(
    batch: List[Record], checkpointer: IRecordProcessorCheckpointer): Unit = {
  val maxRecords = receiver.getCurrentLimit
  for (start <- 0 until batch.size by maxRecords) {
Member:

Hm, it just occurred to me that you would have a problem here if batch.size and maxRecords were both over Int.MaxValue / 2, and maxRecords were a bit smaller than batch.size. The addition below overflows.

It seems like a corner case but I note above you already defensively capped the maxRecords at Int.MaxValue so maybe it's less unlikely than it sounds.

You can fix it by letting the addition and min comparison take place over longs and then convert back to int.

Alternatively I think this is even simpler in Scala, though I imagine there's some extra overhead here:

batch.grouped(maxRecords).foreach(batch => addRecords(batch, checkpointer))

I don't know of a good reviewer for this component but I think I'm comfortable merging a straightforward change like this.
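
For illustration, the long-based arithmetic mentioned above could look roughly like this (a sketch, not the code that was merged):

for (start <- 0 until batch.size by maxRecords) {
  // Do the addition and min over Long so start + maxRecords cannot overflow Int,
  // then narrow back to Int (safe because the result is bounded by batch.size).
  val end = math.min(start.toLong + maxRecords, batch.size.toLong).toInt
  addRecords(batch.subList(start, end), checkpointer)
}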

Member Author:

Actually, since each Kinesis shard has strict read-throughput limits (http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html), batch.size hardly ever exceeds Int.MaxValue / 2. But since I like your idea in terms of code clarity, I've fixed it.

private def addRecords(batch: List[Record], checkpointer: IRecordProcessorCheckpointer): Unit = {
  receiver.addRecords(shardId, batch)
  logDebug(s"Stored: Worker $workerId stored ${batch.size} records for shardId $shardId")
  receiver.setCheckpointer(shardId, checkpointer)
Member:

BTW is this supposed to be called on every batch or once at the end? I don't know how it works.

Member Author:

Yeah, you're right: this code overwrites the checkpointer every time the callback function is called (maybe every second or so). I'm not sure what the original author intended, but it does seem wasteful. I'm also not sure it's worth fixing, and it's out of scope for this JIRA. If necessary, I'd be happy to fix it in a follow-up.

@SparkQA commented Dec 3, 2016

Test build #69624 has finished for PR 16114 at commit b625b8f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Dec 3, 2016

Jenkins, retest this please.

@SparkQA commented Dec 3, 2016

Test build #69626 has finished for PR 16114 at commit b625b8f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2016

Test build #69627 has finished for PR 16114 at commit 8cc24ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Dec 4, 2016

@brkyvz Maybe you can give this a look to make sure it makes sense? Especially the bit about the checkpointer.

// in `KinesisClientLibConfiguration`. For example, if we set the max number of
// records per worker to 10 and a producer aggregates two records into one message,
// the worker can possibly receive 20 records per callback.
batch.asScala.grouped(receiver.getCurrentLimit).foreach { batch =>
Member:

Sorry, one last comment -- batch is used for the overall data set and each subset. They should be named differently for clarity.

It's also my fault for not realizing the collections here were Java not Scala, and you have to convert to use the nice Scala idiom. I think it's OK as it's just going to wrap and not copy the class, but it does bear being careful about performance here.

Member Author:

Yeah, I also think that when maxRecords is small and batch is large, the many iterations cause a little overhead. So I restored the code to the previous Java-style version.

batch.asScala.grouped(receiver.getCurrentLimit).foreach { batch =>
  receiver.addRecords(shardId, batch.asJava)
  logDebug(s"Stored: Worker $workerId stored ${batch.size} records for shardId $shardId")
  receiver.setCheckpointer(shardId, checkpointer)
Contributor:

This should be outside, after the foreach.

Member:

Yeah, that's what I suspected at #16114 (comment) -- thanks for confirming

Member Author:

thanks, I'll fix

@SparkQA commented Dec 7, 2016

Test build #69762 has finished for PR 16114 at commit 934e29b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 7, 2016

Test build #69765 has finished for PR 16114 at commit 50b6681.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    val miniBatch = batch.subList(start, math.min(start + maxRecords, batch.size))
    receiver.addRecords(shardId, miniBatch)
  }
  logDebug(s"Stored: Worker $workerId stored ${batch.size} records for shardId $shardId")
Contributor:

I would leave this logDebug inside the for loop, because IIRC addRecords will be a blocking call, since the records need to be written to the WAL.

Member Author:

okay
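
For illustration, combining both review points, the method presumably ends up roughly like the sketch below (not necessarily the exact merged code): the per-chunk logDebug stays inside the loop because addRecords may block on the WAL write, and the checkpointer is registered once after the whole batch is stored.

val maxRecords = receiver.getCurrentLimit
for (start <- 0 until batch.size by maxRecords) {
  // Store at most maxRecords records per addRecords call; this may block until
  // the mini-batch has been written to the write-ahead log.
  val miniBatch = batch.subList(start, math.min(start + maxRecords, batch.size))
  receiver.addRecords(shardId, miniBatch)
  logDebug(s"Stored: Worker $workerId stored ${miniBatch.size} records for shardId $shardId")
}
// Register the checkpointer once, after all chunks have been stored.
receiver.setCheckpointer(shardId, checkpointer)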

@brkyvz (Contributor) commented Dec 7, 2016

I will need to take a deeper look at this to remember the code. I'm not sure but there may be some issues with the checkpointing happening to the WriteAheadLog and DynamoDB. Going to come back to this in a couple hours.

@SparkQA commented Dec 8, 2016

Test build #69833 has finished for PR 16114 at commit 4528c50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Dec 9, 2016

@brkyvz Could you also check PR #16213? Thanks!

@brkyvz (Contributor) commented Dec 9, 2016

I've taken a look at the code. This change seems safe to me. Even if we process extra data but fail to checkpoint to Kinesis, Spark Streaming will re-process the exact same batch on restart, providing at-least-once semantics with a stronger guarantee (the data will be processed with exactly the same batching).

LGTM! Thanks @maropu

@srowen (Member) commented Dec 9, 2016

Merged to master

@asfgit closed this in b08b500 on Dec 9, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request on Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017
@maropu deleted the SPARK-18620 branch on July 5, 2017