[SPARK-23541][SS] Allow Kafka source to read data with greater parallelism than the number of topic-partitions #20698
Conversation
| import org.apache.spark.sql.sources.v2.DataSourceOptions
| private[kafka010] class KafkaOffsetRangeCalculator(val minPartitions: Int) {
add docs.
Test build #87805 has finished for PR 20698 at commit

Test build #87806 has finished for PR 20698 at commit
| // Otherwise, interrupting a thread while running `KafkaConsumer.poll` may hang forever
| // (KAFKA-1894).
| assert(Thread.currentThread().isInstanceOf[UninterruptibleThread])
| require(Thread.currentThread().isInstanceOf[UninterruptibleThread])
What's the difference between assert and require here?
Not much, really. assert throws AssertionError and require throws IllegalArgumentException. Just a matter of preference. I can revert this change.
Assertions can be turned off
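For context, a small sketch of the distinction being discussed (illustrative only, not code from this PR): Scala's `assert` is elidable, so it disappears when compiled with `-Xdisable-assertions`, while `require` always runs and throws `IllegalArgumentException`.

```scala
// Illustrative sketch, not code from this PR.
def demo(n: Int): Unit = {
  // Internal invariant: throws AssertionError, and is removed entirely when
  // the code is compiled with -Xdisable-assertions.
  assert(n >= 0, "internal invariant violated")
  // Argument validation: always enforced, throws IllegalArgumentException.
  require(n >= 0, "n must be non-negative")
}
```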
| import KafkaOffsetRangeCalculator._
| /**
|  * Calculate the offset ranges that we are going to process this batch. If `numPartitions`
nit: s/numPartitions/minPartitions/
| private[kafka010] object KafkaOffsetRangeCalculator {
| private val DEFAULT_MIN_PARTITIONS = 0
super-nit: this isn't really a default, 0 isn't a valid number of min partitions
Ideally, we shouldn't be using default values like this. Rather, I want to use Options. However, DataSourceOptions does not give me a way to get back an Option[Int], thus forcing me to specify some default value. Let me see what I can do about it. I don't want to reason about 0 in the subsequent conditions and math calculations either.
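A sketch of the direction described above (assumed shape, not necessarily the code that was finally committed): carrying minPartitions as an Option[Int] avoids a sentinel value of 0 in later conditions and arithmetic.

```scala
// Sketch only: an Option[Int] makes "not set" explicit instead of encoding it as 0.
class KafkaOffsetRangeCalculator(val minPartitions: Option[Int]) {
  require(minPartitions.forall(_ > 0), "minPartitions must be positive if set")

  // Splitting is only needed when minPartitions is set and exceeds the number of
  // topic-partitions being read in this batch.
  def needsSplitting(numTopicPartitions: Int): Boolean =
    minPartitions.exists(_ > numTopicPartitions)
}
```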
| // If minPartitions not set or there are enough partitions to satisfy minPartitions
| if (minPartitions == DEFAULT_MIN_PARTITIONS || offsetRanges.size > minPartitions) {
| // Assign preferred executor locations to each range such that the same topic-partition is
| // always read from the same executor and the KafkaConsumer can be reused
I worry that "always" is misleading here. It's not guaranteed that the same executor will run the partition or that the KafkaConsumer can be reused.
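For illustration, a minimal sketch of how a stable preferred location could be derived per topic-partition (the helper name is hypothetical, not the PR's exact code): hashing the partition onto the executor list makes rescheduling on the same executor likely, so a cached KafkaConsumer can often be reused, but Spark treats this only as a preference, not a guarantee.

```scala
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: pick a stable executor per topic-partition by hashing.
// Spark may still schedule the task elsewhere, so consumer reuse is best-effort.
def preferredLocation(tp: TopicPartition, executorLocations: Seq[String]): Option[String] = {
  if (executorLocations.isEmpty) None
  else Some(executorLocations(Math.floorMod(tp.hashCode, executorLocations.size)))
}
```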
| val tp = offsetRange.topicPartition
| val size = offsetRange.untilOffset - offsetRange.fromOffset
| // number of partitions to divvy up this topic partition to
| val parts = math.max(math.round(size * 1.0 / totalSize * minPartitions), 1).toInt
It's hard to understand why this number is being calculated as it is. I think it's correct, but a comment explaining why this is the right number to divvy would help.
yeah, a comment about how this is calculating the weight of partitions to assign to this topic would help. In addition, the sum of parts after this calculation will be >= minPartitions
I rewrote this completely using the approach that sparkContext.parallelize uses to make splits.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L123
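A minimal sketch of that ParallelCollectionRDD-style split (assumed helper name, not the PR's exact code): divide [fromOffset, untilOffset) into n contiguous sub-ranges whose sizes differ by at most one record, dropping empty ones.

```scala
// Sketch: proportional split of one offset range into n near-equal sub-ranges.
def splitRange(fromOffset: Long, untilOffset: Long, n: Int): Seq[(Long, Long)] = {
  val size = untilOffset - fromOffset
  (0 until n).iterator
    .map { i =>
      val start = fromOffset + i * size / n
      val end = fromOffset + (i + 1) * size / n
      (start, end)
    }
    .filter { case (start, end) => end > start } // drop empty sub-ranges
    .toSeq
}
```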
cc @brkyvz
brkyvz left a comment
This is great! Left some clarification questions
| private val consumer = {
| if (!reuseKafkaConsumer) {
| // If we can't reuse CachedKafkaConsumers, creating a new CachedKafkaConsumer. As here we
nit: We use 'assign' here, hence don't need to ...
| }
| override def close(): Unit = {
| // Indicate that we're no longer using this consumer
maybe remove this?
| }
| // If minPartitions not set or there are enough partitions to satisfy minPartitions
| if (minPartitions == DEFAULT_MIN_PARTITIONS || offsetRanges.size > minPartitions) {
I don't think we need the first check. offsetRanges.size should be greater than 0 right? Otherwise we wouldn't have called into this.
Rewritten. I don't want to rely on this default value of 0, as @jose-torres expressed concern earlier. So I rewrote this to explicitly check whether minPartitions has been set or not.
| fromOffsets: PartitionOffsetMap,
| untilOffsets: PartitionOffsetMap,
| executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange] = {
| val partitionsToRead = untilOffsets.keySet.intersect(fromOffsets.keySet)
was this check here before? What if there are new topic partitions? Are we missing those, because they may not exist in fromOffsets?
fromOffsets here will contain the initial offsets of new partitions. See how fromOffsets is set with startOffsets + newPartitionInitialOffsets.
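In other words (hypothetical helper; names follow the discussion rather than the exact source): new partitions are not lost by the intersect because fromOffsets is built by merging the previous start offsets with the initial offsets of newly discovered partitions.

```scala
import org.apache.kafka.common.TopicPartition

// Sketch: entries for newly added partitions are merged in, so they appear in
// fromOffsets and survive the keySet intersect above.
def buildFromOffsets(
    startOffsets: Map[TopicPartition, Long],
    newPartitionInitialOffsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] = {
  startOffsets ++ newPartitionInitialOffsets
}
```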
Test build #87870 has finished for PR 20698 at commit
LGTM
brkyvz left a comment
A couple of minor nits and one additional test request (because I'm paranoid). Otherwise LGTM.
| KafkaOffsetRange(
| range.topicPartition, splitStart.toLong, splitEnd.toLong, preferredLoc = None)
| }
nit: extra line
| } else {
| // Splits offset ranges with relatively large amount of data to smaller ones.
| val totalSize = offsetRanges.map(o => o.untilOffset - o.fromOffset).sum
nit: map(_.size).sum
| offsetRanges.flatMap { range =>
| // Split the current range into subranges as close to the ideal range size
| val rangeSize = range.untilOffset - range.fromOffset
| val numSplitsInRange = math.round(rangeSize.toDouble / idealRangeSize).toInt
nit: range.size, you may remove rangeSize above
| private[kafka010] object KafkaOffsetRangeCalculator {
| def apply(options: DataSourceOptions): KafkaOffsetRangeCalculator = {
| val optionalValue = Option(options.get("minPartitions").orElse(null)).map(_.toInt)
nit: .orNull instead of .orElse(null). Why don't you actually do:
options.get("minPartitions").map(_.toInt)
Because it returns java Optional and not scala Option.
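A small illustration of that point (the bridging helper is hypothetical): DataSourceOptions.get returns java.util.Optional, which has to be converted to a Scala Option explicitly.

```scala
import java.util.Optional

// Hypothetical bridge from java.util.Optional to scala.Option.
def toScalaOption[T](opt: Optional[T]): Option[T] =
  if (opt.isPresent) Some(opt.get) else None

// e.g. toScalaOption(options.get("minPartitions")).map(_.toInt)
```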
| fromOffset: Long,
| untilOffset: Long,
| preferredLoc: Option[String]) {
| def size: Long = untilOffset - fromOffset
nit: maybe make this a lazy val so that it'll be calculated only once
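A sketch of that suggestion applied to the case class shape quoted in this diff (illustrative only): a lazy val computes the size once on first access instead of on every call.

```scala
import org.apache.kafka.common.TopicPartition

// Sketch: size is computed once, on first access, and then cached.
case class KafkaOffsetRange(
    topicPartition: TopicPartition,
    fromOffset: Long,
    untilOffset: Long,
    preferredLoc: Option[String]) {
  lazy val size: Long = untilOffset - fromOffset
}
```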
| KafkaOffsetRange(tp1, 4, 5, None))) // location pref not set when minPartition is set
| }
| testWithMinPartitions("N skewed TopicPartitions to M offset ranges", 3) { calc =>
can you also add a test:
fromOffsets = Map(tp1 -> 1),
untilOffsets = Map(tp1 -> 10)
minPartitions = 3
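A sketch of the requested test (the test name is made up, and getRanges is assumed from the parameter shapes quoted in this review): one topic-partition spanning offsets 1 to 10 with minPartitions = 3 should be split into three contiguous ranges covering the whole span.

```scala
// Sketch only; exact range boundaries depend on the split logic.
testWithMinPartitions("1 TopicPartition to 3 offset ranges", 3) { calc =>
  val ranges = calc.getRanges(fromOffsets = Map(tp1 -> 1), untilOffsets = Map(tp1 -> 10))
  assert(ranges.size === 3)
  assert(ranges.head.fromOffset === 1 && ranges.last.untilOffset === 10)
}
```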
Test build #87911 has finished for PR 20698 at commit

Test build #87913 has finished for PR 20698 at commit
LGTM pending tests.

Thank you. Merging to master only as this is a new feature touching production code paths.
[SPARK-23541][SS] Allow Kafka source to read data with greater parallelism than the number of topic-partitions

Currently, when the Kafka source reads from Kafka, it generates as many tasks as the number of partitions in the topic(s) to be read. In some cases, it may be beneficial to read the data with greater parallelism, that is, with more partitions/tasks. That means offset ranges must be divided into smaller ranges such that the number of records per partition ~= total records in batch / desired partitions. This also balances out any data skew between topic-partitions. In this patch, I have added a new option called `minPartitions`, which allows the user to specify the desired level of parallelism. New tests in KafkaMicroBatchV2SourceSuite.

Author: Tathagata Das <[email protected]>
Closes apache#20698 from tdas/SPARK-23541.
Ref: LIHADOOP-48531
What changes were proposed in this pull request?
Currently, when the Kafka source reads from Kafka, it generates as many tasks as the number of partitions in the topic(s) to be read. In some cases, it may be beneficial to read the data with greater parallelism, that is, with more partitions/tasks. That means offset ranges must be divided into smaller ranges such that the number of records per partition ~= total records in batch / desired partitions. This also balances out any data skew between topic-partitions.
In this patch, I have added a new option called `minPartitions`, which allows the user to specify the desired level of parallelism.

How was this patch tested?
New tests in KafkaMicroBatchV2SourceSuite.
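For reference, a usage sketch of the new option described in this PR (the broker address and topic name are placeholders; `spark` is the usual SparkSession):

```scala
// Ask the Kafka source to produce at least 64 Spark partitions per micro-batch.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("subscribe", "topic1")                   // placeholder topic
  .option("minPartitions", "64")
  .load()
```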