[SPARK-21869][SS] Revise Kafka producer pool to implement 'expire' correctly #26845

HeartSaVioR · 2019-12-11T04:47:58Z

What changes were proposed in this pull request?

This patch revises the Kafka producer pool (cache) to implement 'expire' correctly.

The current Kafka producer cache is built on Guava cache, which expires a cached producer instance when it has not been "accessed" from the cache within the timeout. That defines the expiration time as "last accessed time + timeout", which is incorrect because a task may use the instance for longer than the timeout. Guava cache also has no concept of "returning" an instance, so this cannot be fixed while staying on Guava cache.

This patch introduces a new pool implementation that tracks a "reference count" for each cached instance. The expiration time of an instance is "last returned time + timeout" once its reference count drops to 0, and Long.MaxValue (effectively never expiring) otherwise. Expired instances are evicted explicitly by a dedicated evict thread instead of as part of handling acquire. (This may add some overhead, but it ensures expired instances are cleared even when the pool is idle.)
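To make the difference between the two policies concrete, here is a minimal sketch. It is not code from this patch: the object name, function names, and the 10 minute timeout are hypothetical and chosen only for illustration.

```scala
// Hypothetical sketch: compares "last accessed + timeout" with "last returned + timeout".
object ExpiryPolicySketch {
  /** Access-based policy: the deadline only moves when the cache is accessed. */
  def accessBasedExpireAt(lastAccessedMs: Long, timeoutMs: Long): Long =
    lastAccessedMs + timeoutMs

  /** Return-based policy: no deadline while borrowed, otherwise last return + timeout. */
  def returnBasedExpireAt(refCount: Int, lastReturnedMs: Long, timeoutMs: Long): Long =
    if (refCount > 0) Long.MaxValue else lastReturnedMs + timeoutMs

  val timeoutMs: Long = 10L * 60 * 1000      // assumed 10 minute timeout
  val acquiredAtMs: Long = 0L                // the task acquires the producer at t = 0
  val stillInUseAtMs: Long = 15L * 60 * 1000 // the task is still writing after 15 minutes

  // Access-based: the deadline passed at minute 10 although the producer is in use.
  val expiredWhileInUse: Boolean =
    accessBasedExpireAt(acquiredAtMs, timeoutMs) < stillInUseAtMs

  // Return-based: a borrowed producer (refCount = 1) has no deadline yet.
  val safeWhileInUse: Boolean =
    returnBasedExpireAt(1, 0L, timeoutMs) >= stillInUseAtMs
}
```

With the return-based policy the timeout only starts counting once the last holder has released the producer, which is the behavior this description is after.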

This patch also creates a new package producer under kafka010 to hide the details from the kafka010 package. From the kafka010 package's point of view, only acquire()/release()/reset() are available on the pool, and the package cannot close a CachedKafkaProducer directly.

Why are the changes needed?

Explained above.

Does this PR introduce any user-facing change?

Yes, but only in how cached instances are expired. (The difference is described above.) Each executor leveraging spark-sql-kafka would have one eviction thread.

How was this patch tested?

New and existing UTs.

HeartSaVioR · 2019-12-11T04:51:28Z

cc. @zsxwing @vanzin @gaborgsomogyi (Gabor may not be available this month, but cc'ing so he can catch up sometime next month)

HeartSaVioR · 2019-12-11T05:11:18Z

Similarly, I feel it would be ideal to move the files related to the consumer pool to a consumer package and adjust visibility if necessary. Even without adjusting visibility, moving these files would make it easier to identify the files related to the consumer pool.

What do you think? If it makes sense I'll craft another PR for this. Thanks!

SparkQA · 2019-12-11T05:25:05Z

Test build #115144 has finished for PR 26845 at commit a73db22.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-12-11T06:23:25Z

Could you fix the failure?

sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError:
Could not initialize class org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPool$

HeartSaVioR · 2019-12-11T06:59:09Z

Ah yes I had been investigating the issue. Thanks for reminding! Will update soon.

SparkQA · 2019-12-11T08:05:02Z

Test build #115153 has finished for PR 26845 at commit 13fd4f0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-12-11T08:10:38Z

retest this, please

SparkQA · 2019-12-11T08:59:49Z

Test build #115161 has finished for PR 26845 at commit 13fd4f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-12-12T01:59:16Z

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataWriter.scala

+      producer.foreach(_.producer.flush())
+      checkForErrors()
+      KafkaDataWriterCommitMessage
+    } finally {


Resource cleanup can be simplified once SPARK-30227 (#26855) is introduced.

SparkQA · 2019-12-13T10:55:51Z

Test build #115297 has finished for PR 26845 at commit a7aac78.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-13T11:42:57Z

Test build #115301 has finished for PR 26845 at commit ebb9341.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

A few mostly minor things.

vanzin · 2019-12-16T18:53:47Z

...fka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/CachedKafkaProducer.scala

+import org.apache.spark.internal.Logging
+
+private[kafka010] class CachedKafkaProducer(
+  val cacheKey: Seq[(String, Object)],


nit: indent more

vanzin · 2019-12-16T18:59:25Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+  private def evictExpired(): Unit = {
+    val producers = new mutable.ArrayBuffer[CachedProducerEntry]()
+    synchronized {
+      cache.filter { case (_, v) => v.expired }.foreach { case (k, v) =>


You can use cache.retain instead of filter + foreach.
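A minimal sketch of what that suggestion could look like, assuming a plain scala.collection.mutable.Map. The Entry type and the evictExpired signature here are hypothetical stand-ins, not the code in this patch. Note that retain is the Scala 2.12 name; in 2.13 the same operation is called filterInPlace and retain is deprecated.

```scala
import scala.collection.mutable

// Hypothetical sketch: single in-place pass instead of filter + foreach + remove.
object RetainSketch {
  final case class Entry(name: String, expired: Boolean)

  /** Removes expired entries in place and returns them so the caller can close them. */
  def evictExpired(cache: mutable.Map[String, Entry]): Seq[Entry] = {
    val evicted = mutable.ArrayBuffer.empty[Entry]
    cache.retain { (_, v) =>
      val keep = !v.expired
      // Remember what was dropped; closing can then happen outside any lock.
      if (!keep) evicted += v
      keep
    }
    evicted.toSeq
  }
}
```

This avoids building the intermediate filtered map and does not remove entries from the map while a separate traversal of it is in progress.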

vanzin · 2019-12-16T19:05:21Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+      }
+    }
+
+    def expired: Boolean = _refCount <= 0 && _expireAt < clock.getTimeMillis()


Really should use nanoTime() when checking for expiration. Also, if you pass the current time to handleReturned() and expired(), you don't even need this class to know about clocks at all. (That also avoids calling clock.blah() for each entry in the cache when checking for expiration, which is a pretty small but cheap optimization.)

(The ref count check is also redundant, but that's ok for clarity.)

I feel a bit odd to pass current time to handleReturned() and expired(), but actually I also felt a bit odd to pass Clock as well, so good to change. And yes, the ref count check is intended to be placed for clarity.
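For reference, a rough sketch of the clock-free shape discussed here. The class, its fields, and countExpired are hypothetical; only the idea of passing the current nanoTime() value into handleReturned() and expired() comes from the comments above.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical sketch: the entry never reads a clock, the caller passes the time in.
object ClockFreeEntrySketch {
  final class Entry(timeoutMs: Long) {
    private val timeoutNs: Long = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
    private var refCount: Int = 0
    // Long.MaxValue stands for "no deadline".
    private var expireAtNs: Long = Long.MaxValue

    def handleBorrowed(): Unit = {
      refCount += 1
      expireAtNs = Long.MaxValue
    }

    def handleReturned(curTimeNs: Long): Unit = {
      refCount -= 1
      if (refCount <= 0) expireAtNs = curTimeNs + timeoutNs
    }

    // The ref count check is redundant with the deadline, kept for clarity.
    def expired(curTimeNs: Long): Boolean = refCount <= 0 && expireAtNs < curTimeNs
  }

  /** Reads the monotonic clock once and reuses the value for every entry. */
  def countExpired(entries: Seq[Entry]): Int = {
    val nowNs = System.nanoTime()
    entries.count(_.expired(nowNs))
  }
}
```

Besides saving one clock read per entry, this makes the entry trivial to test, since a test can pass any timestamp it likes.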

vanzin · 2019-12-16T19:10:19Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+
+    def handleReturned(): Unit = {
+      _refCount -= 1
+      if (_refCount <= 0) {


Hmm, shouldn't you assert that ref counts never go below 0? Or at least warn loudly with a stack trace so that if it happens, the stack trace helps with debugging.
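One possible shape for such a check, as a standalone sketch. RefCounter is a hypothetical class, not part of the patch, and throwing IllegalStateException is just one option; logging a warning with a stack trace would be the softer alternative mentioned above.

```scala
// Hypothetical sketch: validate the state before modifying it.
object RefCountGuardSketch {
  final class RefCounter {
    private var count: Int = 0

    def current: Int = count

    def acquire(): Unit = {
      count += 1
    }

    def release(): Unit = {
      // Check first, so a caller bug cannot leave the counter negative.
      if (count <= 0) {
        throw new IllegalStateException(s"release() called with reference count $count")
      }
      count -= 1
    }
  }
}
```

Because the check happens before the decrement, the counter is still consistent after the exception, and the stack trace points at the unbalanced release.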

vanzin · 2019-12-16T19:13:45Z

...l/src/test/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPoolSuite.scala

+    val kafkaParams = getTestKafkaParams()
+    val producer = pool.acquire(kafkaParams)
+    val producer2 = pool.acquire(kafkaParams)
+    assert(producer === producer2)


You want them to be the same instance, right? So probably eq is more correct than ===.
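A small illustration of the distinction in plain Scala. Params is a hypothetical case class used only for this example. As far as I know, ScalaTest's === falls back to == for non-array values under its default equality, so it compares content for types that define equals.

```scala
// Hypothetical sketch: structural equality (==) versus reference equality (eq).
object ReferenceEqualitySketch {
  final case class Params(bootstrapServers: String)

  val a = Params("localhost:9092")
  val b = Params("localhost:9092") // equal content, different instance
  val aliasOfA = a                 // second name for the same object

  val structurallyEqual: Boolean = a == b // true: case classes compare by content
  val sameInstance: Boolean = a eq b      // false: two separate objects
  val aliasIsSame: Boolean = a eq aliasOfA // true: both names refer to one object
}
```

For a class that does not override equals the two checks agree, but eq states the intent (the pool handed out the same instance) and stays correct if equals is overridden later.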

vanzin · 2019-12-16T19:19:48Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+    val paramsSeq: Seq[(String, Object)] = kafkaParams.asScala.toSeq.sortBy(x => x._1)
+    paramsSeq


single statement?

vanzin · 2019-12-16T19:22:02Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+    cache.foreach { case (k, v) =>
+      cache.remove(k)


Is foreach + remove safe in Scala? (In Java maps you'd get a "concurrent modification exception" at some point.)

I'd just use foreach + a separate call to clear.

It seems to work for Scala's mutable HashMap, but yes, that's clearer and avoids the wondering.
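A sketch of the "foreach + a separate call to clear" variant. FakeProducer is a hypothetical stand-in for the cached producer, and the reset signature is simplified for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch: close every cached value, then empty the map in one call.
object ResetSketch {
  final class FakeProducer {
    var closed: Boolean = false
    def close(): Unit = {
      closed = true
    }
  }

  def reset(cache: mutable.Map[String, FakeProducer]): Unit = {
    // No removal happens during the traversal, so there is nothing to wonder about.
    cache.values.foreach(_.close())
    cache.clear()
  }
}
```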

vanzin · 2019-12-16T19:22:40Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+      cache.remove(k)
+      v.producer.close()
+    }
+    scheduled = startEvictorThread()


Hmm, not sure why you need to restart this task?

Because reset is meant to bring the producer pool back to its initial state: cancel the evict task so it is not running, close and clear all cached producers, then restart the evict task.

Ideally we would simply recreate the pool, but in tests we want to reset the pool held by the object rather than a class instance. reset is only used in tests.

Well, if you want to be absolutely sure that everything is being reset, then you need to use cancel(false) and wait for any pending run of the task to finish. Otherwise you may interrupt the task in the middle of closing an expired producer. Or maybe tests don't run into that because [insert reason here].

It just seems safer to me to not bother messing with the task. Leaving it alone will not break anything, and will reset the pool just as well.

Agreed. Not touching the evict task while resetting seems to also work. Will change.

vanzin · 2019-12-16T19:23:19Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+  }
+
+  private[producer] def shutdown(): Unit = {
+    ThreadUtils.shutdown(executorService)


You need to cancel the scheduler task before calling shutdown.

SparkQA · 2019-12-16T21:25:55Z

Test build #115408 has finished for PR 26845 at commit 6ea2fc5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-16T22:50:43Z

Test build #115410 has finished for PR 26845 at commit 5cbd425.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

A few minor things.

vanzin · 2019-12-17T18:48:03Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+  /** exposed for testing */
+  private[producer] val cacheExpireTimeoutMillis: Long = conf.get(PRODUCER_CACHE_TIMEOUT)
+
+  private val evictorThreadRunIntervalMillis = conf.get(PRODUCER_CACHE_EVICTOR_THREAD_RUN_INTERVAL)


Only used in evictorThreadRunIntervalMillis, doesn't need to be a field.

vanzin · 2019-12-17T18:49:29Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+  }
+
+  private[producer] def release(producer: CachedKafkaProducer): Unit = {
+    def closeProducerNotInCache(producer: CachedKafkaProducer): Unit = {


Only called in one place, inline.

vanzin · 2019-12-17T18:51:27Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+  }
+
+  private[producer] def shutdown(): Unit = {
+    scheduled.foreach(_.cancel(true))


false is safer
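A sketch of the shutdown order being discussed, using only plain java.util.concurrent calls. The patch itself goes through ThreadUtils.shutdown, which is not reproduced here; the object name, the 10 ms interval, and the 5 second wait are assumptions for illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: cancel the periodic task without interrupting it, then shut down.
object EvictorShutdownSketch {
  /** Returns (task was cancelled, executor terminated within the wait). */
  def runAndShutdown(): (Boolean, Boolean) = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val runs = new AtomicInteger(0)
    val task: Runnable = new Runnable {
      override def run(): Unit = {
        runs.incrementAndGet() // stands in for one eviction pass
      }
    }
    val scheduled = executor.scheduleAtFixedRate(task, 0L, 10L, TimeUnit.MILLISECONDS)

    // cancel(false) prevents further runs but lets a run in progress finish,
    // so a producer is never interrupted halfway through close().
    scheduled.cancel(false)
    executor.shutdown()
    val terminated = executor.awaitTermination(5, TimeUnit.SECONDS)
    (scheduled.isCancelled, terminated)
  }
}
```

With cancel(true) the worker thread could be interrupted in the middle of an eviction pass, which is the risk the comment points at.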

vanzin · 2019-12-17T18:53:49Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+    }
+
+    def handleReturned(curTimeNs: Long): Unit = {
+      _refCount -= 1


To be pedantic, you should check the state before modifying it.

SparkQA · 2019-12-17T22:51:21Z

Test build #115473 has finished for PR 26845 at commit e9dc140.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-12-23T19:14:20Z

retest this please

SparkQA · 2019-12-23T21:14:35Z

Test build #115658 has finished for PR 26845 at commit e9dc140.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin

Looks good. You could avoid creating the thread pool if the config is set to not expire, but not worth the trouble.

Merging to master (I'll fix the comment during merge).

vanzin · 2019-12-23T22:08:52Z

...10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala

+    }
+  }
+
+  private var scheduled = startEvictorThread()


This can be a val now.

HeartSaVioR · 2019-12-23T23:08:26Z

Thanks for reviewing and merging!

… package

### What changes were proposed in this pull request?

Too many classes are placed in the single package "org.apache.spark.sql.kafka010", although they can be grouped by purpose. As part of the change in SPARK-21869 (#26845), we moved producer-related classes out to "org.apache.spark.sql.kafka010.producer" and exposed only the necessary classes/methods outside the package. This patch applies the same to consumer-related classes.

### Why are the changes needed?

Described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26991 from HeartSaVioR/SPARK-30336.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Sean Owen

gaborgsomogyi · 2020-01-08T10:02:01Z

Thanks, guys, for taking care of this better approach. I just had a quick look and it looks good.
One thing that would make it better is documentation in structured-streaming-kafka-integration.md, just like the Consumer Caching chapter. If you think it's a good idea, just pull me in to review.

I'm going to have a deeper look to catch up...

HeartSaVioR · 2020-01-09T08:07:28Z

Thanks for the suggestion! Submitted #27146 to address documentation. Actually, the producer pool is very simple from the end user's point of view, so the documentation mostly just introduces the available configurations.

…uration

### What changes were proposed in this pull request?

This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845)

### Why are the changes needed?

The explanation of new Kafka producer pool configuration is missing, whereas the doc has Kafka consumer pool configuration.

### Does this PR introduce any user-facing change?

Yes. This is a documentation change.

![Screen Shot 2020-01-12 at 11 16 19 PM](https://user-images.githubusercontent.com/9700541/72238148-c8959e00-3591-11ea-87fc-a8918792017e.png)

### How was this patch tested?

N/A

Closes #27146 from HeartSaVioR/SPARK-21869-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Dongjoon Hyun

…rrectly

### What changes were proposed in this pull request?

This patch revises Kafka producer pool (cache) to implement 'expire' correctly.

Current implementation of Kafka producer cache leverages Guava cache, which decides cached producer instance to be expired if the instance is not "accessed" from cache. The behavior defines expiration time as "last accessed time + timeout", which is incorrect because some task may use the instance longer than timeout. There's no concept of "returning" in Guava cache as well, so it cannot be fixed with Guava cache.

This patch introduces a new pool implementation which tracks "reference count" of cached instance, and defines expiration time for the instance as "last returned time + timeout" if the reference count goes 0, otherwise Long.MaxValue (effectively no expire). Expiring instances will be done with evict thread explicitly instead of evicting in part of handling acquire. (It might bring more overhead, but it ensures clearing expired instances even the pool is idle.)

This patch also creates a new package `producer` under `kafka010`, to hide the details from `kafka010` package. In point of `kafka010` package's view, only acquire()/release()/reset() are available in pool, and even for CachedKafkaProducer the package cannot close the producer directly.

### Why are the changes needed?

Explained above.

### Does this PR introduce any user-facing change?

Yes, but only for the way of expiring cached instances. (The difference is described above.) Each executor leveraging spark-sql-kafka would have one eviction thread.

### How was this patch tested?

New and existing UTs.

Closes apache#26845 from HeartSaVioR/SPARK-21869-revised.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Marcelo Vanzin

HeartSaVioR mentioned this pull request Dec 11, 2019

[SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer #25853

Closed

HeartSaVioR commented Dec 12, 2019

dongjoon-hyun added the STRUCTURED STREAMING label Dec 12, 2019

HeartSaVioR added 2 commits December 13, 2019 19:35

[SPARK-21869][SS] Revise Kafka producer pool to implement 'expire' co…

f04eb99

…rrectly

Fix UT failures - SparkEnv is not always available

a7aac78

HeartSaVioR force-pushed the SPARK-21869-revised branch from 13fd4f0 to a7aac78 Compare December 13, 2019 10:41

Fix build failure

ebb9341

vanzin reviewed Dec 16, 2019

Reflect review comments

6ea2fc5

Silly mistake

5cbd425

HeartSaVioR mentioned this pull request Dec 17, 2019

[SPARK-20568][SS] Provide option to clean up completed files in streaming query #22952

Closed

vanzin reviewed Dec 17, 2019

Reflect review comments

e9dc140

vanzin reviewed Dec 23, 2019

vanzin closed this in 7bff2db Dec 23, 2019

HeartSaVioR deleted the SPARK-21869-revised branch December 23, 2019 23:08

HeartSaVioR mentioned this pull request Dec 24, 2019

[SPARK-30336][SQL][SS] Move Kafka consumer-related classes to its own package #26991

Closed

HeartSaVioR mentioned this pull request Jan 9, 2020

[SPARK-21869][SS][DOCS][FOLLOWUP] Document Kafka producer pool configuration #27146

Closed
