@functioner
Contributor

Description of PR

HADOOP-18046

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@functioner
Contributor Author

@virajjasani @ayushtkn @iwasakims @aajisaka
This is a fix for the flaky test TestIPC#testIOEOnListenerAccept.
I think the root cause is that org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1922) may occasionally throw an IOException (Connection reset by peer). Previously I only considered EOFException, which is what happens in most cases.
The test is failing in the daily build, so we need to fix it soon. Thank you!
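
To make the distinction between the two failure kinds concrete, here is a minimal sketch in plain Java, independent of Hadoop. It relies only on two JDK facts: DataInputStream.readInt throws EOFException when the stream ends before four bytes are read, and EOFException is a subclass of IOException. The class ReadFailureKinds and its methods classify and readIntOutcome are hypothetical names introduced for illustration; they are not part of the patch or of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical illustration class; not part of Hadoop or of this patch.
class ReadFailureKinds {

  /** Names the kind of failure seen while reading a length-prefixed response. */
  static String classify(IOException e) {
    // EOFException extends IOException, so it must be checked first.
    if (e instanceof EOFException) {
      return "EOF";
    }
    String msg = e.getMessage();
    if (msg != null && msg.contains("Connection reset")) {
      return "RESET";
    }
    return "OTHER";
  }

  /** Tries to read one int from the given bytes and reports what happened. */
  static String readIntOutcome(byte[] bytes) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      in.readInt();
      return "OK";
    } catch (IOException e) {
      return classify(e);
    }
  }

  public static void main(String[] args) {
    // An empty stream ends before four bytes are available.
    System.out.println(readIntOutcome(new byte[0]));
    // Four bytes are enough for one int.
    System.out.println(readIntOutcome(new byte[] {0, 0, 0, 7}));
    // A reset surfaces as a plain IOException with a descriptive message.
    System.out.println(classify(new IOException("Connection reset by peer")));
  }
}
```

A test that expects only EOFException would treat the "RESET" case as a failure, which matches the symptom described above.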

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 14s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 35m 37s trunk passed
+1 💚 compile 24m 14s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 20m 50s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 1s trunk passed
+1 💚 mvnsite 1m 37s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 40s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 29s trunk passed
+1 💚 shadedclient 25m 33s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 2s the patch passed
+1 💚 compile 23m 33s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 23m 33s the patch passed
+1 💚 compile 20m 34s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 20m 34s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 59s the patch passed
+1 💚 mvnsite 1m 35s the patch passed
+1 💚 javadoc 1m 6s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 40s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 42s the patch passed
+1 💚 shadedclient 25m 33s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 17m 24s hadoop-common in the patch passed.
+1 💚 asflicense 0m 48s The patch does not generate ASF License warnings.
212m 6s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/1/artifact/out/Dockerfile
GITHUB PR #3872
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux a8cd9d078d17 4.15.0-163-generic #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6295d38
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/1/testReport/
Max. process+thread count 3143 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@iwasakims
Member

iwasakims commented Jan 8, 2022

@functioner I needed to activate the parallel-tests profile, as in mvn test -Dtest='TestIP*' -Pparallel-tests, to reproduce the error. I saw no failures in 100 runs of mvn test -Dtest=TestIPC#testIOEOnListenerAccept.

If the cause is a race between tests, shouldn't "java.io.IOException: Connection reset by peer" be treated as unexpected?
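
For readers who want to see how a client can observe either a clean end of stream or a reset, the following is a small sketch in plain Java, independent of Hadoop. A server thread accepts one connection and closes it without reading, while the client sends a few bytes and then reads. One commonly cited TCP behaviour, which this thread does not confirm as the cause here, is that closing a socket while unread data is still queued can make the peer see a reset rather than a clean end of stream, so the result may depend on timing and platform. AcceptThenCloseDemo, runOnce, and clientOutcome are hypothetical names, and the 5 second timeouts are arbitrary.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration class; not part of Hadoop or of this patch.
class AcceptThenCloseDemo {

  /** Runs one accept-then-close round and reports what the client observed. */
  static String runOnce() {
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      Thread acceptor = new Thread(() -> {
        try (Socket accepted = server.accept()) {
          // Intentionally empty: the socket is closed right away, unread.
        } catch (IOException ignored) {
          // The demo only cares about what the client observes.
        }
      });
      acceptor.start();
      String outcome = clientOutcome(server);
      acceptor.join(5000);
      return outcome;
    } catch (IOException e) {
      return "IOException: " + e.getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "INTERRUPTED";
    }
  }

  /** Connects, writes four bytes, then reads one byte. */
  static String clientOutcome(ServerSocket server) {
    try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
      client.setSoTimeout(5000);
      OutputStream out = client.getOutputStream();
      out.write(new byte[] {1, 2, 3, 4});
      out.flush();
      // read() returns -1 on a clean end of stream.
      return client.getInputStream().read() == -1 ? "EOF" : "DATA";
    } catch (IOException e) {
      // Includes resets, broken pipes, and read timeouts.
      return "IOException: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Repeat a few times, since the outcome may vary between runs.
    for (int i = 0; i < 5; i++) {
      System.out.println(runOnce());
    }
  }
}
```

If both outcomes show up across runs on a given machine, that would suggest the reset is a property of how the connection is torn down rather than of interference between tests, but the sketch alone cannot settle the question raised here.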

@ayushtkn
Member

@functioner This test doesn't fail when run on its own; it fails only when I run the entire class, so the failure isn't the default behaviour. It is caused by conflicting tests. We need to figure out which ones conflict and then fix that.
Adding an exception to the catch block doesn't look like the correct fix.

@steveloughran
Contributor

some socket conflict across different threads?

@functioner
Contributor Author

@iwasakims @ayushtkn @steveloughran Thank you for your suggestions.

I inspected this test more carefully and found that I had previously written testIOEOnListenerAccept based on doErrorTest, which is generally used in scenarios where the fault concerns the data type (e.g., IOEOnWriteWritable instead of LongWritable).

However, the scenario here is more similar to testRTEDuringConnectionSetup, testIpcTimeout, testIpcConnectTimeout, testIpcWithServiceClass, etc. I noticed that these tests share a style that differs slightly from doErrorTest. Specifically, they create the server with new TestServer(1, true) and make the client call with call(client, RANDOM.nextLong(), address, conf), where address is NetUtils.getConnectAddress(server).

I applied this style, and it works on my local machine when I run mvn test -Dtest=TestIPC -Pparallel-tests. In contrast, without this modification, mvn test -Dtest=TestIPC -Pparallel-tests may fail testIOEOnListenerAccept in the way you reported.

I have pushed this new commit to see whether it works on our CI servers and to gather further comments.
@iwasakims @ayushtkn PTAL. Thank you!

@iwasakims
Member

Thanks for digging into this, @functioner.

I can still reproduce the issue even with the latest patch, though the sleep before the call enabled by new TestServer(1, true) might affect the probability.
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java#L224-L229

Let me run the test for more iterations.

$ for i in `seq 100` ; do echo $i && mvn test -Dtest=TestIPC -Dmaven.test.failure.ignore=false || break ; done
$ less target/surefire-reports/org.apache.hadoop.ipc.TestIPC-output.txt
...
2022-02-01 07:27:41,499 INFO  ipc.CallQueueManager (CallQueueManager.java:<init>(93)) - Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 100, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, ipcBackoff: false.
2022-02-01 07:27:41,499 DEBUG ipc.Server (Server.java:getAuthMethods(3370)) - Server accepts auth methods:[SIMPLE]
2022-02-01 07:27:41,499 INFO  ipc.Server (Server.java:run(1405)) - Starting Socket Reader #1 for port 0
2022-02-01 07:27:41,500 INFO  ipc.Server (Server.java:run(1653)) - IPC Server Responder: starting
2022-02-01 07:27:41,500 INFO  ipc.Server (Server.java:run(1484)) - IPC Server listener on 0: starting
2022-02-01 07:27:41,501 DEBUG ipc.Server (Server.java:run(3075)) - IPC Server handler 0 on default port 45829: starting
2022-02-01 07:27:41,502 WARN  ipc.Server (Server.java:doAccept(1555)) - Error in an accepted SocketChannel
java.io.IOException: Injected fault
        at org.apache.hadoop.ipc.TestIPC.maybeThrowIOE(TestIPC.java:425)
        at org.apache.hadoop.ipc.TestIPC$1.configureSocketChannel(TestIPC.java:631)
        at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1553)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1498)
2022-02-01 07:27:41,502 WARN  ipc.TestIPC (TestIPC.java:testIOEOnListenerAccept(647)) - Got unexpected error
java.io.IOException: DestHost:destPort ip-172-31-197-233.ap-northeast-1.compute.internal:45829 , LocalHost:localPort ip-172-31-197-233.ap-northeast-1.compute.internal/172.31.197.233:0. Failed on local exception: java.io.IOException: Connection reset by peer
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1616)
        at org.apache.hadoop.ipc.Client.call(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1477)
        at org.apache.hadoop.ipc.TestIPC.call(TestIPC.java:167)
        at org.apache.hadoop.ipc.TestIPC.call(TestIPC.java:160)
        at org.apache.hadoop.ipc.TestIPC.testIOEOnListenerAccept(TestIPC.java:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:141)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:563)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1922)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:
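
In the trace above, the exception the test receives is a wrapper ("Failed on local exception") whose cause is the underlying "Connection reset by peer". As a minimal sketch of how a check could look through such a wrapper, the helper below walks the cause chain for a given exception type. How the actual test inspects the exception is not shown in this thread; CauseChain and findCause are hypothetical names, and the depth limit of 32 is an arbitrary guard against cyclic chains.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical illustration class; not part of Hadoop or of this patch.
class CauseChain {

  /** Returns the first throwable of the given type in the cause chain, or null. */
  static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
    int depth = 0;
    // The chain starts at t itself, then follows getCause().
    for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
      if (type.isInstance(cur)) {
        return type.cast(cur);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    IOException wrappedEof =
        new IOException("Failed on local exception", new EOFException("end of stream"));
    IOException wrappedReset =
        new IOException("Failed on local exception",
            new IOException("Connection reset by peer"));

    // The wrapped EOFException is found; the reset case has no EOFException.
    System.out.println(findCause(wrappedEof, EOFException.class) != null);
    System.out.println(findCause(wrappedReset, EOFException.class) != null);
  }
}
```

Note that searching for IOException.class would match the wrapper itself, so a check of this shape is only useful for the more specific type.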

@ayushtkn
Member

ayushtkn commented Feb 1, 2022

I just tried once and the test failed for me locally.

[ERROR] testIOEOnListenerAccept(org.apache.hadoop.ipc.TestIPC)  Time elapsed: 0.006 s  <<< FAILURE!
java.lang.AssertionError: Expected an EOFException to have been thrown
	at org.junit.Assert.fail(Assert.java:89)
	at org.apache.hadoop.ipc.TestIPC.testIOEOnListenerAccept(TestIPC.java:648)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 42s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 34m 4s trunk passed
+1 💚 compile 24m 57s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 21m 35s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 5s trunk passed
+1 💚 mvnsite 1m 41s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 37s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 31s trunk passed
+1 💚 shadedclient 23m 19s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 1s the patch passed
+1 💚 compile 23m 17s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 23m 17s the patch passed
+1 💚 compile 20m 21s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 20m 21s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 0s the patch passed
+1 💚 mvnsite 1m 39s the patch passed
+1 💚 javadoc 1m 8s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 41s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 37s the patch passed
+1 💚 shadedclient 22m 52s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 11s hadoop-common in the patch passed.
+1 💚 asflicense 0m 53s The patch does not generate ASF License warnings.
207m 19s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/2/artifact/out/Dockerfile
GITHUB PR #3872
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 4f8d9f5b7c25 4.15.0-161-generic #169-Ubuntu SMP Fri Oct 15 13:41:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 8bfd3a1
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/2/testReport/
Max. process+thread count 3150 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3872/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@aajisaka
Member

aajisaka commented Feb 8, 2022

HADOOP-18024 has been reverted. Closing this PR.

@aajisaka aajisaka closed this Feb 8, 2022