Member

@zsxwing zsxwing commented Aug 13, 2018

What changes were proposed in this pull request?

I'm still seeing the Kafka tests fail randomly due to kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING. I checked the test output and saw that ZooKeeper picked up an IPv6 address. More details can be found in https://issues.apache.org/jira/browse/KAFKA-7193

This PR just uses 127.0.0.1 rather than localhost to make sure ZooKeeper never uses an IPv6 address.
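
To illustrate why the literal address is safer, here is a minimal JDK-only sketch. `LoopbackCheck` and `resolvesOnlyToIpv4` are hypothetical names used for illustration and are not part of this PR. What `localhost` resolves to depends on the machine's resolver configuration and may include `::1`, whereas the literal `127.0.0.1` is parsed without a lookup and is always IPv4. Whether ZooKeeper's client resolves hosts exactly this way is an assumption.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

class LoopbackCheck {
    // Returns true when every address the JDK resolver yields for `host` is IPv4.
    static boolean resolvesOnlyToIpv4(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (!(addr instanceof Inet4Address)) {
                    return false;
                }
            }
            return true;
        } catch (UnknownHostException e) {
            // Wrap the checked exception so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // The result for "localhost" varies by machine (hosts file, IPv6 settings).
        for (InetAddress addr : InetAddress.getAllByName("localhost")) {
            System.out.println("localhost -> " + addr.getHostAddress());
        }
        // A literal IPv4 address involves no name lookup.
        System.out.println("127.0.0.1 is IPv4 only: " + resolvesOnlyToIpv4("127.0.0.1"));
    }
}
```

Running `main` on a Jenkins worker would show whether `localhost` maps to an IPv6 address there.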

How was this patch tested?

Jenkins

Member Author

zsxwing commented Aug 13, 2018

cc @srowen

Contributor

@koeninger koeninger left a comment

LGTM

SparkQA commented Aug 13, 2018

Test build #94711 has finished for PR 22097 at commit 90f55dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

zsxwing commented Aug 13, 2018

Looks like there is a race condition while terminating the Kafka cluster:

18/08/13 15:34:44.148 kafka-log-cleaner-thread-0 ERROR LogCleaner: Failed to access checkpoint file cleaner-offset-checkpoint in dir /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-5ad98c9e-0d75-4f23-a948-9e29246651d2
org.apache.kafka.common.errors.KafkaStorageException: Error while reading checkpoint file /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-5ad98c9e-0d75-4f23-a948-9e29246651d2/cleaner-offset-checkpoint
Caused by: java.io.FileNotFoundException: /home/jenkins/workspace/SparkPullRequestBuilder@3/target/tmp/spark-5ad98c9e-0d75-4f23-a948-9e29246651d2/cleaner-offset-checkpoint (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at kafka.server.checkpoints.CheckpointFile.liftedTree2$1(CheckpointFile.scala:87)
	at kafka.server.checkpoints.CheckpointFile.read(CheckpointFile.scala:86)
	at kafka.server.checkpoints.OffsetCheckpointFile.read(OffsetCheckpointFile.scala:61)
	at kafka.log.LogCleanerManager$$anonfun$allCleanerCheckpoints$1$$anonfun$apply$1.apply(LogCleanerManager.scala:89)
	at kafka.log.LogCleanerManager$$anonfun$allCleanerCheckpoints$1$$anonfun$apply$1.apply(LogCleanerManager.scala:87)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at kafka.log.LogCleanerManager$$anonfun$allCleanerCheckpoints$1.apply(LogCleanerManager.scala:87)
	at kafka.log.LogCleanerManager$$anonfun$allCleanerCheckpoints$1.apply(LogCleanerManager.scala:95)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
	at kafka.log.LogCleanerManager.allCleanerCheckpoints(LogCleanerManager.scala:86)
	at kafka.log.LogCleanerManager$$anonfun$grabFilthiestCompactedLog$1.apply(LogCleanerManager.scala:126)
	at kafka.log.LogCleanerManager$$anonfun$grabFilthiestCompactedLog$1.apply(LogCleanerManager.scala:123)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
	at kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:123)
	at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:296)
	at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:289)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit code 1
[error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests unsuccessful

It will call org.apache.kafka.common.utils.Exit to exit the JVM, which causes the non-zero exit code.

Member Author

zsxwing commented Aug 13, 2018

I set a custom Exit.Procedure to prevent it from killing the JVM. Hopefully this will make the tests more stable.
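
For readers unfamiliar with the pattern, here is a rough JDK-only sketch of a swappable exit procedure. It does not call Kafka: `ExitGuard`, its nested `Procedure` interface, and `ExitGuardDemo` are hypothetical stand-ins that only mirror the idea of replacing the exit action during tests, and the real change in this PR may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a utility whose exit action can be replaced.
class ExitGuard {
    interface Procedure {
        void execute(int statusCode, String message);
    }

    // Default behaviour: really terminate the JVM.
    private static final Procedure DEFAULT = (code, msg) -> System.exit(code);
    private static volatile Procedure current = DEFAULT;

    static void setProcedure(Procedure p) { current = p; }

    static void reset() { current = DEFAULT; }

    // Library code calls this instead of System.exit directly.
    static void exit(int statusCode, String message) {
        current.execute(statusCode, message);
    }
}

class ExitGuardDemo {
    // Installs a recording procedure, triggers an "exit", then restores the default.
    static List<String> runWithRecordingProcedure() {
        List<String> recorded = new ArrayList<>();
        ExitGuard.setProcedure((code, msg) -> recorded.add(code + ": " + msg));
        try {
            ExitGuard.exit(1, "simulated storage failure");
        } finally {
            ExitGuard.reset();
        }
        return recorded;
    }
}
```

With the recording procedure installed, the simulated failure is captured in the list and the test JVM keeps running.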

SparkQA commented Aug 13, 2018

Test build #94713 has finished for PR 22097 at commit b5eb542.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

SparkQA commented Aug 14, 2018

Test build #94719 has finished for PR 22097 at commit b5eb542.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

zsxwing commented Aug 14, 2018

I'm going to merge this now since it does fix some issues. I will continue to investigate the exit code 1 issue.

@asfgit asfgit closed this in 80784a1 Aug 14, 2018
@zsxwing zsxwing deleted the fix-zookeeper-connect branch August 14, 2018 17:14
asfgit pushed a commit that referenced this pull request Aug 17, 2018
…roducers

## What changes were proposed in this pull request?

KafkaContinuousSinkSuite leaks a Kafka cluster because both KafkaSourceTest and KafkaContinuousSinkSuite create a Kafka cluster but `afterAll` only shuts down one of them. The leaked cluster causes some Kafka threads to crash and kill the JVM when SBT is trying to clean up tests.

This PR fixes the leak and also adds a shutdown hook to detect Kafka cluster leaks.
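
A minimal sketch of how such a leak check might work, using only the JDK. `ClusterLeakDetector` and its method names are hypothetical and are not the code added by this PR. The idea is that each test cluster registers itself on setup and unregisters on teardown, and a JVM shutdown hook reports anything still registered.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

class ClusterLeakDetector {
    // Names of test clusters that have been started but not yet shut down.
    private static final Set<String> live = ConcurrentHashMap.newKeySet();

    static void registerCluster(String name) { live.add(name); }

    static void unregisterCluster(String name) { live.remove(name); }

    // Sorted snapshot of clusters that are still running.
    static Set<String> leaked() { return new TreeSet<>(live); }

    // Call once per JVM; the hook runs when the JVM begins shutting down.
    static void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            Set<String> stillRunning = leaked();
            if (!stillRunning.isEmpty()) {
                System.err.println("Leaked test clusters: " + stillRunning);
            }
        }));
    }

    public static void main(String[] args) {
        installShutdownHook();
        // Two clusters are created, mirroring a base class and a subclass...
        registerCluster("base-suite-cluster");
        registerCluster("sink-suite-cluster");
        // ...but teardown only shuts down one, so the hook reports the other.
        unregisterCluster("sink-suite-cluster");
    }
}
```

The `main` method reproduces the situation described above: two clusters are created, one is shut down, and the hook reports the remaining one on exit.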

In addition, it also fixes an `AdminClient` leak and cleans up cached producers (when a record is written using a producer, the producer keeps refreshing the topic, and I couldn't find an API to clear it other than closing the producer) to eliminate the following annoying logs:
```
8/13 15:34:42.568 kafka-admin-client-thread | adminclient-4 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node 0 could not be established. Broker may not be available.
18/08/13 15:34:42.570 kafka-admin-client-thread | adminclient-6 WARN NetworkClient: [AdminClient clientId=adminclient-6] Connection to node 0 could not be established. Broker may not be available.
18/08/13 15:34:42.606 kafka-admin-client-thread | adminclient-8 WARN NetworkClient: [AdminClient clientId=adminclient-8] Connection to node -1 could not be established. Broker may not be available.
18/08/13 15:34:42.729 kafka-producer-network-thread | producer-797 WARN NetworkClient: [Producer clientId=producer-797] Connection to node -1 could not be established. Broker may not be available.
18/08/13 15:34:42.906 kafka-producer-network-thread | producer-1598 WARN NetworkClient: [Producer clientId=producer-1598] Connection to node 0 could not be established. Broker may not be available.
```

I also reverted b5eb542 introduced by #22097 since it doesn't help.

## How was this patch tested?

Jenkins

Closes #22106 from zsxwing/SPARK-25116.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>