Contributor

@HeartSaVioR HeartSaVioR commented Dec 11, 2019

What changes were proposed in this pull request?

This patch revises the Kafka producer pool (cache) to implement expiration correctly.

The current implementation of the Kafka producer cache leverages Guava cache, which expires a cached producer instance when the instance has not been "accessed" from the cache within the timeout. This defines the expiration time as "last accessed time + timeout", which is incorrect because a task may use the instance for longer than the timeout. Guava cache also has no concept of "returning" an instance, so the problem cannot be fixed while staying on Guava cache.

This patch introduces a new pool implementation which tracks the "reference count" of each cached instance, and defines the expiration time of an instance as "last returned time + timeout" once the reference count drops to 0, and Long.MaxValue (effectively never expires) otherwise. Expired instances are evicted explicitly by an evict thread instead of as part of handling acquire. (This might bring more overhead, but it ensures expired instances are cleared even when the pool is idle.)
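To make the expiration rule concrete, here is a minimal sketch of a reference-counted pool. It is not the code in this patch: `FakeProducer`, `Entry`, `Pool`, the string key, and the explicit `nowMs` parameters are all hypothetical names chosen for illustration, and the real pool manages actual Kafka producers.

```scala
import scala.collection.mutable

object RefCountedPoolSketch {
  // Hypothetical stand-in for a Kafka producer; it only records whether it was closed.
  final class FakeProducer {
    var closed = false
    def close(): Unit = { closed = true }
  }

  final class Entry(val producer: FakeProducer) {
    var refCount = 0
    var expireAt = Long.MaxValue // while in use: effectively never expires
  }

  final class Pool(timeoutMs: Long) {
    private val cache = mutable.HashMap.empty[String, Entry]

    def acquire(key: String): FakeProducer = synchronized {
      val entry = cache.getOrElseUpdate(key, new Entry(new FakeProducer))
      entry.refCount += 1
      entry.expireAt = Long.MaxValue
      entry.producer
    }

    def release(key: String, nowMs: Long): Unit = synchronized {
      cache.get(key).foreach { entry =>
        require(entry.refCount > 0, s"release without matching acquire: $key")
        entry.refCount -= 1
        // The timeout starts counting only when the last borrower returns the instance.
        if (entry.refCount == 0) entry.expireAt = nowMs + timeoutMs
      }
    }

    // What a periodic evictor would call; returns the number of evicted entries.
    def evictExpired(nowMs: Long): Int = synchronized {
      val expiredKeys = cache.iterator.collect {
        case (k, e) if e.refCount == 0 && e.expireAt < nowMs => k
      }.toList
      expiredKeys.foreach(k => cache.remove(k).foreach(_.producer.close()))
      expiredKeys.size
    }

    def size: Int = synchronized(cache.size)
  }
}
```

With this shape, an instance that is still borrowed is never evicted no matter how long the task holds it, which is the behavior "last accessed time + timeout" could not guarantee.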

This patch also creates a new package producer under kafka010 to hide the details from the kafka010 package. From the kafka010 package's point of view, only acquire()/release()/reset() are available on the pool, and even with a CachedKafkaProducer in hand the package cannot close the producer directly.

Why are the changes needed?

Explained above.

Does this PR introduce any user-facing change?

Yes, but only in how cached instances expire. (The difference is described above.) Each executor using spark-sql-kafka will have one eviction thread.

How was this patch tested?

New and existing UTs.

@HeartSaVioR
Contributor Author

cc. @zsxwing @vanzin @gaborgsomogyi (Gabor may not be available this month, but cc'ing so he can catch up next month)

@HeartSaVioR
Contributor Author

Similarly, I feel it would be ideal to move the files related to the consumer pool into a consumer package, and adjust visibility if necessary. Even without adjusting visibility, moving these files would make it easier to identify which files belong to the consumer pool.

What do you think? If it makes sense I'll craft another PR for this. Thanks!

SparkQA commented Dec 11, 2019

Test build #115144 has finished for PR 26845 at commit a73db22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Could you fix the failure?

sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError:
Could not initialize class org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPool$

@HeartSaVioR
Contributor Author

Ah yes, I have been investigating the issue. Thanks for the reminder! Will update soon.

SparkQA commented Dec 11, 2019

Test build #115153 has finished for PR 26845 at commit 13fd4f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

SparkQA commented Dec 11, 2019

Test build #115161 has finished for PR 26845 at commit 13fd4f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

producer.foreach(_.producer.flush())
checkForErrors()
KafkaDataWriterCommitMessage
} finally {
Contributor Author

Resource cleanup can be simplified once SPARK-30227 (#26855) is introduced.

SparkQA commented Dec 13, 2019

Test build #115297 has finished for PR 26845 at commit a7aac78.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 13, 2019

Test build #115301 has finished for PR 26845 at commit ebb9341.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

A few mostly minor things.

import org.apache.spark.internal.Logging

private[kafka010] class CachedKafkaProducer(
val cacheKey: Seq[(String, Object)],
Contributor

nit: indent more

private def evictExpired(): Unit = {
  val producers = new mutable.ArrayBuffer[CachedProducerEntry]()
  synchronized {
    cache.filter { case (_, v) => v.expired }.foreach { case (k, v) =>
Contributor

You can use cache.retain instead of filter + foreach.
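A rough sketch of what the `retain` suggestion could look like, assuming a Scala `mutable.HashMap` as the cache. `Entry` and `evictExpired` here are simplified, hypothetical stand-ins, not the classes in this patch. Note that on Scala 2.13 `retain` is deprecated in favor of `filterInPlace`, which takes the same predicate.

```scala
import scala.collection.mutable

object RetainSketch {
  // Hypothetical, simplified cache entry.
  final case class Entry(name: String, expired: Boolean)

  // Drops expired entries from `cache` in a single pass and returns them,
  // so the caller can close them outside of the synchronized block.
  def evictExpired(cache: mutable.HashMap[String, Entry]): List[Entry] = {
    val evicted = mutable.ArrayBuffer.empty[Entry]
    cache.retain { (_, v) =>
      if (v.expired) evicted += v
      !v.expired // keep only entries that have not expired
    }
    evicted.toList
  }
}
```

Compared with filter + foreach, this avoids building an intermediate map and removing keys one by one.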

}
}

def expired: Boolean = _refCount <= 0 && _expireAt < clock.getTimeMillis()
Contributor

Really should use nanoTime() when checking for expiration. Also, if you pass the current time to handleReturned() and expired(), you don't even need this class to know about clocks at all. (That also avoids calling clock.blah() for each entry in the cache when checking for expiration, which is a pretty small but cheap optimization.)

(The ref count check is also redundant, but that's ok for clarity.)
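As an illustration of the suggestion above, the entry could take the current time as an argument so that it does not need a clock at all. This is only a sketch: `Entry`, `handleBorrowed`, and `countExpired` are hypothetical names, and the plain `<` comparison mirrors the reviewed line while ignoring the wrap-around caveat described in the `System.nanoTime()` Javadoc.

```scala
object EntryTimeSketch {
  final class Entry(timeoutNs: Long) {
    private var refCount = 0
    private var expireAtNs = Long.MaxValue

    def handleBorrowed(): Unit = {
      refCount += 1
      expireAtNs = Long.MaxValue // in use: effectively never expires
    }

    def handleReturned(curTimeNs: Long): Unit = {
      refCount -= 1
      if (refCount <= 0) expireAtNs = curTimeNs + timeoutNs
    }

    // The ref count check is redundant with expireAtNs, but kept for clarity.
    def expired(curTimeNs: Long): Boolean = refCount <= 0 && expireAtNs < curTimeNs
  }

  // The caller reads the time once per sweep instead of once per entry.
  def countExpired(entries: Seq[Entry]): Int = {
    val now = System.nanoTime()
    entries.count(_.expired(now))
  }
}
```

Passing the time in also makes the entry trivial to test with fixed timestamps.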

Contributor Author

It feels a bit odd to pass the current time to handleReturned() and expired(), but passing a Clock also felt a bit odd, so this is a good change. And yes, the ref count check is intentionally kept for clarity.


def handleReturned(): Unit = {
  _refCount -= 1
  if (_refCount <= 0) {
Contributor

Hmm, shouldn't you assert that ref counts never go below 0? Or at least warn loudly with a stack trace so that if it happens, the stack trace helps with debugging.
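One possible way to fail loudly, sketched under the assumption that throwing is acceptable here (logging a warning with a stack trace would be the softer alternative). `Entry` and its methods are hypothetical names; the state is checked before it is modified, so a bad call leaves the count untouched.

```scala
object RefCountGuardSketch {
  final class Entry {
    private var refCount = 0

    def handleBorrowed(): Unit = { refCount += 1 }

    def handleReturned(): Unit = {
      // Check first, then modify: the exception's stack trace points at the
      // caller that returned an instance it had not borrowed.
      if (refCount <= 0) {
        throw new IllegalStateException(s"Returned more often than borrowed (refCount=$refCount)")
      }
      refCount -= 1
    }

    def count: Int = refCount
  }
}
```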

val kafkaParams = getTestKafkaParams()
val producer = pool.acquire(kafkaParams)
val producer2 = pool.acquire(kafkaParams)
assert(producer === producer2)
Contributor

You want them to be the same instance, right? So probably eq is more correct than ===.
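For reference, a small sketch of the difference between value equality and reference equality in Scala. `Key` is a hypothetical case class used only to make the two diverge; it says nothing about how the producer class in this patch defines equality.

```scala
object EqSketch {
  // Case classes get structural equals, so two distinct instances can be ==.
  final case class Key(id: Int)

  val a = Key(1)
  val b = Key(1)

  val valueEqual: Boolean = a == b    // compares contents via equals
  val sameInstance: Boolean = a eq b  // compares object identity
  val selfIsSame: Boolean = a eq a
}
```

When a test is meant to prove that the pool handed back the cached object itself, `eq` states that intent directly.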

Comment on lines 145 to 146
val paramsSeq: Seq[(String, Object)] = kafkaParams.asScala.toSeq.sortBy(x => x._1)
paramsSeq
Contributor

single statement?
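Collapsed into a single expression, the cache key construction might look like the sketch below. `toCacheKey` is a hypothetical name; sorting by key is what makes the result independent of the map's iteration order.

```scala
import java.{util => ju}
import scala.collection.JavaConverters._

object CacheKeySketch {
  // Two parameter maps with the same contents yield the same key,
  // regardless of insertion order.
  def toCacheKey(params: ju.Map[String, Object]): Seq[(String, Object)] =
    params.asScala.toSeq.sortBy(_._1)
}
```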

Comment on lines 114 to 115
cache.foreach { case (k, v) =>
  cache.remove(k)
Contributor

Is foreach + remove safe in Scala? (In Java maps you'd get a "concurrent modification exception" at some point.)

I'd just use foreach + a separate call to clear.
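A minimal sketch of the foreach + clear shape, using a hypothetical `FakeProducer` in place of the real producer wrapper. The traversal only reads the map, and the structural change happens once, after the traversal is done.

```scala
import scala.collection.mutable

object ResetSketch {
  // Hypothetical stand-in that only records whether it was closed.
  final class FakeProducer {
    var closed = false
    def close(): Unit = { closed = true }
  }

  def closeAll(cache: mutable.HashMap[String, FakeProducer]): Unit = {
    cache.foreach { case (_, producer) => producer.close() } // read-only traversal
    cache.clear()                                            // single structural change
  }
}
```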

Contributor Author

It seems to work for Scala's mutable HashMap, but yes, that's clearer and avoids the question.

cache.remove(k)
v.producer.close()
}
scheduled = startEvictorThread()
Contributor

Hmm, not sure why you need to restart this task?

Contributor Author

Because reset is meant to bring the producer pool back to its initial state: cancel the evict task to make sure it is not running, close and clear all cached producers, then restart the evict task.

Ideally we would simply recreate the pool, but in tests we want to reset the pool held in the object rather than a class instance. It's only used in tests.

Contributor

Well, if you want to be absolutely sure that everything is being reset then, you need to use cancel(false), and wait for any pending run of the task to finish. Otherwise you may interrupt the task in the middle of closing an expired producer. Or maybe tests don't run into that because [insert reason here].

It just seems safer to me to not bother messing with the task. Leaving it alone will not break anything, and will reset the pool just as well.
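The difference between `cancel(false)` and `cancel(true)` can be seen with a plain `ScheduledExecutorService`. The sketch below is illustrative only: `runAndCancel` is a hypothetical helper, and the sleeping task stands in for an eviction run that is in the middle of closing a producer.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object CancelSketch {
  // Returns true if the in-flight run was allowed to finish its work.
  def runAndCancel(mayInterruptIfRunning: Boolean): Boolean = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val started = new CountDownLatch(1)
    val finished = new AtomicBoolean(false)
    val task = new Runnable {
      override def run(): Unit = {
        started.countDown()
        try {
          Thread.sleep(200) // stands in for closing an expired producer
          finished.set(true)
        } catch {
          case _: InterruptedException => () // interrupted mid-run: work left unfinished
        }
      }
    }
    val future = executor.scheduleAtFixedRate(task, 0L, 1L, TimeUnit.HOURS)
    started.await()                      // make sure a run is in progress
    future.cancel(mayInterruptIfRunning) // cancel the periodic task first
    executor.shutdown()                  // then shut the executor down
    executor.awaitTermination(5, TimeUnit.SECONDS) // wait for the pending run
    finished.get()
  }
}
```

With `cancel(false)` the run in progress completes and only future runs are dropped; with `cancel(true)` the worker thread is interrupted in the middle of its work.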

Contributor Author

> It just seems safer to me to not bother messing with the task. Leaving it alone will not break anything, and will reset the pool just as well.

Agreed. Not touching the evict task while resetting seems to also work. Will change.

}

private[producer] def shutdown(): Unit = {
  ThreadUtils.shutdown(executorService)
Contributor

You need to cancel the scheduler task before calling shutdown.

SparkQA commented Dec 16, 2019

Test build #115408 has finished for PR 26845 at commit 6ea2fc5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 16, 2019

Test build #115410 has finished for PR 26845 at commit 5cbd425.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

A few minor things.

/** exposed for testing */
private[producer] val cacheExpireTimeoutMillis: Long = conf.get(PRODUCER_CACHE_TIMEOUT)

private val evictorThreadRunIntervalMillis = conf.get(PRODUCER_CACHE_EVICTOR_THREAD_RUN_INTERVAL)
Contributor

Only used in evictorThreadRunIntervalMillis, doesn't need to be a field.

}

private[producer] def release(producer: CachedKafkaProducer): Unit = {
  def closeProducerNotInCache(producer: CachedKafkaProducer): Unit = {
Contributor

Only called in one place, inline.

}

private[producer] def shutdown(): Unit = {
  scheduled.foreach(_.cancel(true))
Contributor

false is safer

}

def handleReturned(curTimeNs: Long): Unit = {
  _refCount -= 1
Contributor

To be pedantic, you should check the state before modifying it.

SparkQA commented Dec 17, 2019

Test build #115473 has finished for PR 26845 at commit e9dc140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Dec 23, 2019

retest this please

SparkQA commented Dec 23, 2019

Test build #115658 has finished for PR 26845 at commit e9dc140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment

Looks good. You could avoid creating the thread pool if the config is set to not expire, but not worth the trouble.

Merging to master (I'll fix the comment during merge).

}
}

private var scheduled = startEvictorThread()
Contributor

This can be a val now.

@vanzin vanzin closed this in 7bff2db Dec 23, 2019
@HeartSaVioR
Contributor Author

Thanks for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-21869-revised branch December 23, 2019 23:08
srowen pushed a commit that referenced this pull request Dec 31, 2019
… package

### What changes were proposed in this pull request?

There are too many classes in the single package "org.apache.spark.sql.kafka010"; these classes can be grouped by purpose.

As part of the change in SPARK-21869 (#26845), we moved producer-related classes out to "org.apache.spark.sql.kafka010.producer" and exposed only the necessary classes/methods outside the package. This patch applies the same to consumer-related classes.

### Why are the changes needed?

Described above.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26991 from HeartSaVioR/SPARK-30336.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Sean Owen
@gaborgsomogyi
Contributor

Thanks, guys, for taking care of this better approach. I just had a quick look and it looks good.
What could make it better is documentation in structured-streaming-kafka-integration.md, just like the Consumer Caching chapter. If you think it's a good idea, just pull me in to review.

I'm going to take a deeper look to catch up...

@HeartSaVioR
Contributor Author

Thanks for the suggestion! Submitted #27146 to address documentation. Actually the producer pool is very simple from the end user's point of view, so the documentation mostly just introduces the available configurations.

dongjoon-hyun pushed a commit that referenced this pull request Jan 13, 2020
…uration

### What changes were proposed in this pull request?

This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845).

### Why are the changes needed?

The explanation of the new Kafka producer pool configuration is missing, whereas the doc covers the Kafka consumer pool configuration.

### Does this PR introduce any user-facing change?

Yes. This is a documentation change.

![Screen Shot 2020-01-12 at 11 16 19 PM](https://user-images.githubusercontent.com/9700541/72238148-c8959e00-3591-11ea-87fc-a8918792017e.png)

### How was this patch tested?

N/A

Closes #27146 from HeartSaVioR/SPARK-21869-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Dongjoon Hyun
fqaiser94 pushed a commit to fqaiser94/spark that referenced this pull request Mar 30, 2020
…rrectly

Closes apache#26845 from HeartSaVioR/SPARK-21869-revised.

Authored-by: Jungtaek Lim (HeartSaVioR)
Signed-off-by: Marcelo Vanzin