[SPARK-27042][SS] Invalidate cached Kafka producer in case of task retry #26470
Conversation
```scala
if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) {
  val cacheKey = new CacheKey(topicPartition, kafkaParams)

  logDebug(s"Task re-attempt detected, invalidating consumers with key $cacheKey")
```
This is not related, but I thought it would be good to have symmetric logs.
Test build #113593 has finished for PR 26470 at commit
cc @vanzin @zsxwing @HeartSaVioR since you have context.
HeartSaVioR left a comment
LGTM except for something very minor.
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala
Test build #113704 has finished for PR 26470 at commit
```scala
    .setAuthenticationConfigIfNeeded()
    .build()
  val key = toCacheKey(updatedKafkaParams)
  if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) {
```
I know this is just matching what the consumer code does, but I'm a little unsure that it's correct (in both cases).
Can't you re-use the producer in different jobs? If so, you can have:
- job 1 task 1 attempt 0 runs with producer 1 and fails
- job 1 task 1 attempt 1 runs with producer 2 and succeeds
- job 2 task 1 attempt 0 runs with producer 1 and re-uses the same instance that should have been closed?
So I'm wondering whether the task itself shouldn't instead be invalidating the producer when it detects an error.
Strictly speaking, we should discard the instance (not all instances) when the instance throws some error, but then we would need to wrap the code to catch the exception and discard the instance wherever the instance is used. A bit verbose, but we already have to return the instance to the pool even when it succeeds, so maybe not impossible.
I'm not aware whether the Kafka consumer or Kafka producer is self-healing - if they provided such functionality, we wouldn't need to discard them at all.
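To make the "wrap and discard" idea above concrete, here is a minimal sketch; `ProducerPool` and its `acquire`/`release`/`invalidate` methods are hypothetical placeholders, not the actual Spark pool API.

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical pool interface: release keeps the instance cached for re-use,
// invalidate closes it and drops it from the cache.
trait ProducerPool {
  def acquire(params: java.util.Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]]
  def release(producer: KafkaProducer[Array[Byte], Array[Byte]]): Unit
  def invalidate(producer: KafkaProducer[Array[Byte], Array[Byte]]): Unit
}

// Every use site would carry this boilerplate, which is the verbosity concern above.
def writeRecords(
    pool: ProducerPool,
    params: java.util.Map[String, Object],
    records: Iterator[ProducerRecord[Array[Byte], Array[Byte]]]): Unit = {
  val producer = pool.acquire(params)
  try {
    records.foreach(r => producer.send(r))
    producer.flush()
    pool.release(producer)      // succeeded: hand the instance back for re-use
  } catch {
    case e: Throwable =>
      pool.invalidate(producer) // failed: discard this instance only
      throw e
  }
}
```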
To answer question 1: yes, job 2 can use producers created by job 1 in the same JVM. In such a case another task retry is needed to recover, in the following way:
- job 2 task 1 attempt 0 runs with producer 1 and fails
- job 2 task 1 attempt 1 runs with producer 3 (this would invalidate producers 1 and 2 if they are not in use)
Considering the use-case, this would mean 2 jobs must write the same TopicPartition at the same time. Until now I haven't seen use-cases where the exact same topic is written by 2 different jobs (which of course doesn't mean it's not done).
To answer question 2: Kafka consumers and producers don't provide any kind of self-healing.
I think I have to take some time to check the other way, because I have a similar fear to what @HeartSaVioR mentioned (horribly complicated code with catch blocks everywhere the producer is touched). The general fear I have with invalidation from the exception side is this: if the code leaks and fails to catch even a single exception, the task may not heal itself.
> 2 jobs must write the same TopicPartition at the same time

Why at the same time? All you need to hit my case is two tasks in the first job using different producers, and can't that happen, e.g., if you have executors with more than one core?
It seems like Spark would eventually self-heal at the cost of some failed tasks, but can't we do better? As Jungtaek said, we already have to return the producer to the pool. It's just a matter of invalidating it instead of returning it when an error is detected.
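That suggestion could be folded into a single loan-style helper so individual use sites don't need their own catch blocks. A sketch under the same assumptions as above (it reuses the hypothetical `ProducerPool` trait from the previous sketch and is not the actual Spark code):

```scala
import org.apache.kafka.clients.producer.KafkaProducer

// Loan pattern: the "return vs. invalidate" decision lives in one place.
def withProducer[T](
    pool: ProducerPool,
    params: java.util.Map[String, Object])(
    body: KafkaProducer[Array[Byte], Array[Byte]] => T): T = {
  val producer = pool.acquire(params)
  try {
    val result = body(producer)
    pool.release(producer)      // no error observed: keep it cached
    result
  } catch {
    case e: Throwable =>
      pool.invalidate(producer) // error observed: close and evict this instance
      throw e
  }
}

// Use sites stay free of error-handling boilerplate:
// withProducer(pool, kafkaParams) { producer =>
//   records.foreach(r => producer.send(r))
//   producer.flush()
// }
```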
IIUC, this is trying to close a shared producer because of a random failure (including user code errors)? IMO, we should not do this for the producer. It can be shared by multiple Spark jobs as long as they use the same Kafka parameters. Kafka actually suggests sharing the producer in its docs:

> The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.

Hence I would assume it can self-heal. The Kafka consumer is a different story: it's not thread-safe and cannot be shared by multiple tasks at the same time. That's why we can close the old one, since we are pretty sure it's not being used by another task in the same JVM.
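As an illustration of why sharing is natural for producers, a single thread-safe instance per unique configuration can serve all tasks in an executor JVM. This is only a sketch; `SharedProducers` and `getOrCreate` are hypothetical names, not Spark's actual cache.

```scala
import java.{util => ju}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.producer.KafkaProducer

object SharedProducers {
  // Keyed by the Kafka parameters; the params are expected to include the
  // key/value serializer settings.
  private val cache =
    new ConcurrentHashMap[Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]()

  // All tasks (and jobs) using identical parameters get the same instance,
  // relying on the producer's documented thread safety.
  def getOrCreate(params: Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    cache.computeIfAbsent(params, p => {
      val juParams = new ju.HashMap[String, Object]()
      p.foreach { case (k, v) => juParams.put(k, v) }
      new KafkaProducer[Array[Byte], Array[Byte]](juParams)
    })
}
```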
> IIUC, this is trying to close a shared producer because of a random failure (including user code errors)? IMO, we should not do this for the producer. It can be shared by multiple Spark jobs as long as they use the same Kafka parameters.

We changed the producer pool to follow the same approach as the Kafka consumer pool - #25853 - to resolve the long-standing "producer close on idle" issue. #19096 shows that closing an idle producer while it can be shared by multiple tasks is very complicated - we would have to track "in-use" state and then deal with thread-safety. One task, one producer is much simpler, and we didn't observe any performance issue with it.
> Why at the same time?

I thought job 1 and job 2 were different queries. A single query + multiple cores could end up in the same situation.

> Hence I would assume it can self-heal.

I would assume it can't heal if the programming failure is in the producer code itself. On the other hand, I see your general concern about this. What is your point considering #25853?
> I would assume it can't heal if the programming failure is in the producer code itself.

That's true, software has bugs. But I have never heard of any Kafka producer issue reported when using Spark. A task failure like this most likely comes from a user error, a transient network error, or something unrelated, and it should be isolated and not impact other tasks in the same executor. This seems like overkill.
> What is your point considering #25853?
Thanks for pointing me to this PR. I posted my concerns there.
Since #25853 was rolled back, this PR no longer makes sense; closing...
#25853 was rolled back, so this doesn't make sense.
What changes were proposed in this pull request?
If a task fails due to a corrupt cached Kafka producer and is retried on the same executor, it may get the same producer over and over again. After several retries the query may stop.
In this PR I'm invalidating the old cached producers for a specific key and opening a new one. This reduces the possibility of re-using a faulty producer. It must be mentioned that if a producer under the key is being used by another task, it won't be closed. The functionality is similar to the consumer side.
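A condensed sketch of the proposed behaviour (function and parameter names here are illustrative, not the exact code in this PR): when a task re-attempt is detected via `TaskContext`, cached producers under the same key are evicted before a fresh one is created.

```scala
import org.apache.spark.TaskContext

// Generic over the cache key and producer type to keep the sketch self-contained.
def acquireForTask[K, P](
    cacheKey: K,
    evictIdle: K => Unit,   // hypothetical: closes cached producers for this key that are not in use
    create: K => P): P = {
  val ctx = TaskContext.get
  if (ctx != null && ctx.attemptNumber >= 1) {
    // A previous attempt of this task failed on this executor, so the cached
    // producer under this key may be the corrupt one; drop idle instances
    // before handing out a new producer.
    evictIdle(cacheKey)
  }
  create(cacheKey)
}
```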
Why are the changes needed?
Increase producer side availability.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing + additional unit tests.