
Conversation

@ScrapCodes (Member) commented Aug 31, 2017

What changes were proposed in this pull request?

We track each producer by maintaining an in-use count of the threads using it. If a producer is in use when Guava orders its eviction, we move it to a queue (closeQueue) and periodically (without spawning a separate thread) check its in-use status, closing it once it becomes idle. This way a producer is never closed while being used, and an evicted producer is never assigned to a new task.

We had to do this because Guava has a limitation: it does not allow a custom eviction strategy. google/guava#3013
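
Roughly, the idea is the following (an illustrative sketch only; names such as ProducerCache, acquire and release are placeholders, and the real code in this PR differs in detail — the 10-minute expiry mirrors the existing cache timeout):

import java.util.Properties
import java.util.concurrent.{ConcurrentLinkedQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

import scala.collection.JavaConverters._

import com.google.common.cache.{CacheBuilder, CacheLoader, RemovalListener, RemovalNotification}
import org.apache.kafka.clients.producer.KafkaProducer

case class CachedKafkaProducer(
    params: Seq[(String, Object)],
    producer: KafkaProducer[Array[Byte], Array[Byte]]) {
  val inUseCount = new AtomicInteger(0)
}

object ProducerCache {
  // Producers evicted by Guava while still in use are parked here instead of being closed.
  private val closeQueue = new ConcurrentLinkedQueue[CachedKafkaProducer]()

  private val removalListener = new RemovalListener[Seq[(String, Object)], CachedKafkaProducer] {
    override def onRemoval(
        n: RemovalNotification[Seq[(String, Object)], CachedKafkaProducer]): Unit = {
      val cached = n.getValue
      if (cached.inUseCount.get() > 0) closeQueue.add(cached) // defer the close
      else cached.producer.close()                            // idle, safe to close now
    }
  }

  private val guavaCache = CacheBuilder.newBuilder()
    .expireAfterAccess(10, TimeUnit.MINUTES) // same default expiry as the existing cache
    .removalListener(removalListener)
    .build(new CacheLoader[Seq[(String, Object)], CachedKafkaProducer] {
      override def load(params: Seq[(String, Object)]): CachedKafkaProducer = {
        val props = new Properties()
        params.foreach { case (k, v) => props.put(k, v) }
        CachedKafkaProducer(params, new KafkaProducer[Array[Byte], Array[Byte]](props))
      }
    })

  // A task acquires a producer and bumps its in-use count, so eviction cannot close it.
  def acquire(params: Seq[(String, Object)]): CachedKafkaProducer = this.synchronized {
    val cached = guavaCache.get(params)
    cached.inUseCount.incrementAndGet()
    cached
  }

  // Releasing decrements the count and closes any evicted producers that became idle.
  def release(cached: CachedKafkaProducer): Unit = this.synchronized {
    cached.inUseCount.decrementAndGet()
    closeQueue.asScala.filter(_.inUseCount.get() <= 0).foreach { p =>
      closeQueue.remove(p)
      p.producer.close()
    }
  }
}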

How was this patch tested?

Updated existing tests and added an appropriate test case.

SparkQA commented Aug 31, 2017

Test build #81281 has finished for PR 19096 at commit f21bf16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 31, 2017

Test build #81282 has finished for PR 19096 at commit 0afb48c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch 2 times, most recently from c7705a2 to 7013e8a on September 8, 2017 12:22
SparkQA commented Sep 8, 2017

Test build #81558 has finished for PR 19096 at commit 7013e8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes (Member, Author)

Hi @zsxwing, are you okay with the changes?

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from 7013e8a to ab4c1dd on December 17, 2017 14:44
SparkQA commented Dec 17, 2017

Test build #85022 has finished for PR 19096 at commit ab4c1dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from f609d7e to 488c70f on December 20, 2017 09:54
SparkQA commented Dec 20, 2017

Test build #85169 has finished for PR 19096 at commit 74bcdbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 20, 2017

Test build #85170 has finished for PR 19096 at commit f609d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from 488c70f to dbabbf9 on December 20, 2017 10:11
SparkQA commented Dec 20, 2017

Test build #85174 has finished for PR 19096 at commit 488c70f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from dbabbf9 to edd5bc3 on December 20, 2017 10:28
SparkQA commented Dec 20, 2017

Test build #85176 has finished for PR 19096 at commit dbabbf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 20, 2017

Test build #85180 has finished for PR 19096 at commit edd5bc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes changed the title [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it. [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking. Dec 20, 2017
@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from edd5bc3 to 024a407 on January 17, 2018 10:43
SparkQA commented Jan 17, 2018

Test build #86262 has finished for PR 19096 at commit 024a407.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes (Member, Author)

@zsxwing, please take another look.

@ScrapCodes ScrapCodes closed this Aug 23, 2018
@HeartSaVioR (Contributor)

This looks abandoned simply because it wasn't reviewed in time, but the patch still seems to be needed, so I'd like to bump it.
@ScrapCodes Would you still be OK with keeping this PR up to date? If so, I'll take a look; otherwise I'd like to take this up myself (and may or may not start from this PR). Thanks in advance!

@ScrapCodes ScrapCodes reopened this Feb 20, 2019
@ScrapCodes (Member, Author)

@HeartSaVioR Thanks for your interest; let me reopen this and take some time to update it.

SparkQA commented Feb 20, 2019

Test build #102535 has finished for PR 19096 at commit 024a407.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

@ScrapCodes just wondering, which scenario ends up in multi-threaded access to the same KafkaConsumer where the key is a TopicPartition? If there is no such case, I would say there is no benefit and it just increases complexity without adding value.

@gaborgsomogyi (Contributor)

As a general comment, +1 on solving this issue. If a task uses a producer for longer than the configured eviction time, it will be closed.

@ScrapCodes ScrapCodes force-pushed the SPARK-21869/long-running-kafka-producer branch from e3890a5 to a3443ea on April 12, 2019 06:17
SparkQA commented Apr 12, 2019

Test build #104539 has finished for PR 19096 at commit e3890a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 12, 2019

Test build #104540 has finished for PR 19096 at commit a3443ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor) left a comment

LGTM. @ScrapCodes thanks for your hard work and @HeartSaVioR for your review effort!

@ScrapCodes (Member, Author)

Hi @zsxwing, can you please take a look?

@ScrapCodes (Member, Author)

I am not sure why the "Changes requested" status did not clear up, in spite of my making changes to the PR. Do I need to force-push?

@gaborgsomogyi (Contributor)

@zsxwing requested the changes, so I think he has to be the one to clear it.

@ScrapCodes (Member, Author)

@zsxwing Ping!

@gaborgsomogyi (Contributor)

@ScrapCodes could you resolve the conflict?
@srowen could you help us out with a review? I would like to round out the producer side as well, just like in #24590. This PR would be an initial step.

SparkQA commented May 13, 2019

Test build #105355 has finished for PR 19096 at commit 5c9e9ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

@ScrapCodes may I ask you to resolve the conflicts?

@vanzin there hasn't been much movement here, but this blocks finishing delegation token support (and several users are suffering with long-running batch queries). #23956 is required to pick up the latest token, but this PR is needed to finish it. I would like to ask you to have a look, please.

@ScrapCodes (Member, Author)

@gaborgsomogyi Yes, I will do it soon.

SparkQA commented Jul 10, 2019

Test build #107440 has finished for PR 19096 at commit 972164d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor) left a comment

Had a look at the test results, and it seems the failure is relevant.

SparkQA commented Jul 11, 2019

Test build #107531 has finished for PR 19096 at commit 8a0906e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres (Contributor) left a comment

The JIRA says "Right now a cached Kafka producer may be closed if a large task uses it for more than 10 minutes.", and I'm not sure this change is the right way to handle that. If we think it should be longer than 10 minutes, let's increase it (and maybe we'll have to make some code changes to do so; I don't know where the 10 minutes comes from). To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system, and it looks like we'd have to make some risky changes to achieve it, too.

topic: Option[String]) extends KafkaRowWriter(inputSchema, topic) {
// used to synchronize with Kafka callbacks
private var producer: KafkaProducer[Array[Byte], Array[Byte]] = _
protected val producer: CachedKafkaProducer =
@jose-torres (Contributor) commented on the diff above:

This is a change in lifecycle for the producer. Are we sure that's safe?

val paramsSeq: Seq[(String, Object)] = paramsToSeq(updatedAuthConfigIfNeeded(kafkaParamsMap))
try {
guavaCache.get(paramsSeq)
val producer = this.synchronized {
@jose-torres (Contributor) commented on the diff above:

Is this required? It's risky to add new global locks to things.

/* Release a kafka producer back to the kafka cache. We simply decrement its inuse count. */
private[kafka010] def release(producer: CachedKafkaProducer, failing: Boolean): Unit = {
this.synchronized {
// It should be ok to call release multiple times on the same producer object.
@jose-torres (Contributor) commented on the diff above:

But it's not really okay, right? If task A calls release multiple times, the producer might have its inUseCount decremented to 0 even though task B is using it.
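
One hypothetical way to make repeated releases from the same task harmless (not what this PR does today; ProducerLease is an invented name) is to hand each task a small lease whose release is idempotent, so a double release from task A cannot decrement the count that task B still holds:

import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical per-task wrapper; CachedKafkaProducer.release is the method shown in the diff above.
class ProducerLease(cached: CachedKafkaProducer) {
  private val released = new AtomicBoolean(false)

  def producer: CachedKafkaProducer = cached

  // Only the first call decrements the shared in-use count; later calls are no-ops.
  def release(failing: Boolean = false): Unit = {
    if (released.compareAndSet(false, true)) {
      CachedKafkaProducer.release(cached, failing)
    }
  }
}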

@HeartSaVioR (Contributor) commented Aug 20, 2019

To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system

That's also how consumer instances are cached (that's why a missing close caused a file descriptor leak - #21997), though that logic is much simpler since we don't allow co-use of an instance (the Kafka consumer is not thread-safe, whereas the Kafka producer is thread-safe). The same behavior applies to Commons Pool as well: once you borrow an instance, the pool never forcibly destroys it until you return it.

@gaborgsomogyi (Contributor)

If we think it should be longer than 10 minutes, let's increase it

I don't think it's realistic to expect the user to measure the average time of a batch query and set the timeout accordingly.

I don't know where the 10 minutes comes from

There is a config named spark.kafka.producer.cache.timeout on master; previously it had a different name. As a workaround, users are setting this value to a high number.
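
For example, the workaround looks roughly like this (assuming the conf accepts a duration string; the exact value format may differ across Spark versions):

import org.apache.spark.SparkConf

// Workaround only: keep cached producers alive well beyond the longest expected task/batch.
val conf = new SparkConf()
  .set("spark.kafka.producer.cache.timeout", "30m") // assumed duration-string format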

To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system

True! I'm not saying it's the safest solution, but on the consumer side we already have a similar approach. Catching committer attention is super hard, and I thought it would be less risky from a committer's perspective to add ref counting than to introduce a completely new library (like Commons Pool). Thinking about alternatives, I see mainly two (pretty sure there are others):

The size of the cache is limited by spark.kafka.consumer.cache.capacity (default: 64). If this threshold is reached, it tries to remove the least-used entry that is currently not in use. If it cannot be removed, then the cache will keep growing. In the worst case, the cache will grow to the max number of concurrent tasks that can run in the executor (that is, the number of task slots), after which it will never shrink.

@gaborgsomogyi (Contributor)

#22138 has been merged, which changed my view on how to solve this issue here (until now I was not sure committers had enough confidence to merge that new technology).

Proposal: use Apache Commons Pool
My considerations:

  • That way we don't have to do reference counting manually (addressing @jose-torres's concerns).
  • Monitoring the Kafka consumer/producer cache has been on my table for a long time; Apache Commons Pool provides metrics by default.
  • TD suggested this when the consumer-side PR was filed/merged. Of course he made his suggestion on the consumer side, but his reasoning still applies here (no manual reference counting).
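
A rough sketch of what this could look like with commons-pool2 (assuming commons-pool2 2.6+; the names below are illustrative, not the eventual Spark implementation):

import java.util.Properties

import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool, GenericKeyedObjectPoolConfig}
import org.apache.kafka.clients.producer.KafkaProducer

object PooledKafkaProducers {
  type Params = Seq[(String, Object)]
  type Producer = KafkaProducer[Array[Byte], Array[Byte]]

  // Creates, wraps and destroys producers on behalf of the pool, keyed by the Kafka params.
  private class ProducerFactory extends BaseKeyedPooledObjectFactory[Params, Producer] {
    override def create(key: Params): Producer = {
      val props = new Properties()
      key.foreach { case (k, v) => props.put(k, v) }
      new KafkaProducer[Array[Byte], Array[Byte]](props)
    }
    override def wrap(value: Producer): PooledObject[Producer] = new DefaultPooledObject(value)
    override def destroyObject(key: Params, p: PooledObject[Producer]): Unit = p.getObject.close()
  }

  private val config = new GenericKeyedObjectPoolConfig[Producer]()
  config.setJmxEnabled(true) // borrow/return/idle counts are exposed as JMX metrics for free

  private val pool = new GenericKeyedObjectPool[Params, Producer](new ProducerFactory, config)

  // The pool never destroys a borrowed producer until it is returned, so no manual ref counting.
  def acquire(params: Params): Producer = pool.borrowObject(params)
  def release(params: Params, producer: Producer): Unit = pool.returnObject(params, producer)
}

With the borrow/return protocol the pool never destroys a producer while it is borrowed, so the manual in-use counting goes away and idle instances are evicted according to the pool configuration.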

If you agree, I'm happy to give a helping hand during review. @ScrapCodes, if you don't have time to invest, then I'm happy to do the coding part. My PR #23956 has been depending on this for a long time and I would like to push this forward (without it, delegation tokens will not work in all cases).

Guys, please share your thoughts.

@HeartSaVioR (Contributor)

I second this, as it's in line with what I proposed before. I can either help with reviewing or even work on this.

We still have some considerations regarding performance and the number of connections (as we currently allow multi-threaded access, whereas Apache Commons Pool will prevent this), but even the javadoc of the Kafka producer just says "The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.", which doesn't provide any performance numbers, so I'd rather say let's try this out and measure it.

@ScrapCodes (Member, Author)

@gaborgsomogyi Interestingly, I also wanted to make a fresh start on this. This is important for me as well; thanks for pointing in a new direction, I will start working on it.

@HeartSaVioR Thanks for your continued interest and encouragement.
