@ZanderXu
Contributor

Jira: HDFS-16933

We encountered a problem where the NameNode randomly ends up with wrong inode ownership after loading the same fsimage when parallel fsimage loading is enabled.

After tracing the problem, we found that there may be a race in SerialNumberMap.

public int get(T t) {
  if (t == null) {
    return 0;
  }
  Integer sn = t2i.get(t);
  if (sn == null) {
    // Assume two threads call get() with different keys, for example:
    //   T1 with hbase
    //   T2 with hdfs
    // T1 and T2 can end up holding the same sn, such as 10, for example after
    // a concurrent getAndDecrement() below has rolled the counter back.
    sn = current.getAndIncrement();
    if (sn > max) {
      current.getAndDecrement();
      throw new IllegalStateException(name + ": serial number map is full");
    }
    Integer old = t2i.putIfAbsent(t, sn);
    if (old != null) {
      current.getAndDecrement();
      return old;
    }
    // If T1 puts 10 -> hbase into i2t first, T2 then overwrites it with
    // 10 -> hdfs. Inodes that should be owned by hbase are then resolved to
    // the wrong owner, hdfs.
    i2t.put(sn, t);
  }
  return sn;
} 

There are two mappings in SerialNumberMap, t2i and i2t. They should be safely updated together.
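
One way to keep the two maps consistent is to leave the read path lock-free and allocate new serial numbers only inside a synchronized block, re-checking t2i after taking the lock. The following is a minimal sketch under that assumption and is not the patch itself; the class name SerialNumberMapSketch, its constructor, and the lookup method are hypothetical stand-ins for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for SerialNumberMap; an illustration, not the merged code.
class SerialNumberMapSketch<T> {
  private final String name;
  private final int max;
  private final AtomicInteger current = new AtomicInteger(1);
  private final ConcurrentMap<T, Integer> t2i = new ConcurrentHashMap<>();
  private final ConcurrentMap<Integer, T> i2t = new ConcurrentHashMap<>();

  SerialNumberMapSketch(String name, int max) {
    this.name = name;
    this.max = max;
  }

  public int get(T t) {
    if (t == null) {
      return 0;
    }
    // Fast path: keys that are already known never take the lock.
    Integer sn = t2i.get(t);
    if (sn == null) {
      synchronized (this) {
        // Re-check under the lock: another thread may have added t meanwhile.
        sn = t2i.get(t);
        if (sn == null) {
          if (current.get() > max) {
            throw new IllegalStateException(name + ": serial number map is full");
          }
          sn = current.getAndIncrement();
          // Publish the reverse mapping first, so any thread that later sees
          // t in t2i on the fast path can already resolve sn in i2t.
          i2t.put(sn, t);
          t2i.put(t, sn);
        }
      }
    }
    return sn;
  }

  // Hypothetical reverse lookup, used here only to check consistency.
  public T lookup(int sn) {
    return sn == 0 ? null : i2t.get(sn);
  }
}
```

Because allocation happens under one lock, the counter never needs to be rolled back, so no two keys can be handed the same serial number. Only the first lookup of each key pays for the lock.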

@hadoop-yetus

💔 -1 overall

Vote | Subsystem | Runtime | Logfile | Comment
+0 🆗 | reexec | 1m 14s | | Docker mode activated.
_ Prechecks _
+1 💚 | dupname | 0m 0s | | No case conflicting files found.
+0 🆗 | codespell | 0m 0s | | codespell was not available.
+0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available.
+1 💚 | @author | 0m 0s | | The patch does not contain any @author tags.
-1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 | mvninstall | 41m 56s | | trunk passed
+1 💚 | compile | 1m 28s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 | compile | 1m 22s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 | checkstyle | 1m 9s | | trunk passed
+1 💚 | mvnsite | 1m 31s | | trunk passed
+1 💚 | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 | javadoc | 1m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 | spotbugs | 3m 37s | | trunk passed
+1 💚 | shadedclient | 25m 48s | | branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 | mvninstall | 1m 22s | | the patch passed
+1 💚 | compile | 1m 24s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 | javac | 1m 24s | | the patch passed
+1 💚 | compile | 1m 17s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 | javac | 1m 17s | | the patch passed
+1 💚 | blanks | 0m 0s | | The patch has no blanks issues.
+1 💚 | checkstyle | 0m 54s | | the patch passed
+1 💚 | mvnsite | 1m 23s | | the patch passed
+1 💚 | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 | javadoc | 1m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 | spotbugs | 3m 30s | | the patch passed
+1 💚 | shadedclient | 25m 48s | | patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ | unit | 256m 37s | /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs in the patch passed.
+1 💚 | asflicense | 0m 44s | | The patch does not generate ASF License warnings.
| | 373m 22s | |

Reason | Tests
Failed junit tests | hadoop.hdfs.server.namenode.TestFsck

Subsystem | Report/Notes
Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5430/1/artifact/out/Dockerfile
GITHUB PR | #5430
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname | Linux 9d68b90b171f 4.15.0-197-generic #208-Ubuntu SMP Tue Nov 1 17:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool | maven
Personality | dev-support/bin/hadoop.sh
git revision | trunk / c81af4c
Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5430/1/testReport/
Max. process+thread count | 2185 (vs. ulimit of 5500)
modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5430/1/console
versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by | Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@Hexiaoqiao
Contributor

@ZanderXu Thanks for your work. Great catch! Did you benchmark whether this change causes any performance decrease during parallel fsimage loading?

@Hexiaoqiao Hexiaoqiao self-requested a review September 3, 2023 15:26
@ZanderXu
Contributor Author

ZanderXu commented Sep 4, 2023

@ZanderXu Thanks for your work. Great catch! Did you benchmark whether this change causes any performance decrease during parallel fsimage loading?

Almost no performance loss, because the synchronized block is only entered for names that are not yet in the map, such as new users.

[32 threads get names 1000000 times from 10000 different names]: 311 ms with sync, 268 ms without sync (difference: 43 ms)
[32 threads get names 1000000 times from 10000 different names]: 270 ms with sync, 269 ms without sync (difference: 1 ms)
[32 threads get names 1000000 times from 10000 different names]: 299 ms with sync, 268 ms without sync (difference: 31 ms)
[32 threads get names 1000000 times from 10000 different names]: 291 ms with sync, 268 ms without sync (difference: 23 ms)
[32 threads get names 1000000 times from 10000 different names]: 319 ms with sync, 270 ms without sync (difference: 49 ms)
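
A measurement of this kind could be structured roughly as in the sketch below. This is not the harness that produced the numbers above: the class SyncLookupBench, its simplified single-map get method, and the "user-" key names are all assumptions for illustration, and whether the lookup count is per thread or in total is a guess (the sketch treats it as per thread). It covers only the synchronized variant.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical micro-benchmark; not the code behind the reported timings.
class SyncLookupBench {
  private final ConcurrentHashMap<String, Integer> t2i = new ConcurrentHashMap<>();
  private final AtomicInteger current = new AtomicInteger(1);

  // Lock-free read first; the lock is taken only when the name is new.
  int get(String name) {
    Integer sn = t2i.get(name);
    if (sn == null) {
      synchronized (this) {
        sn = t2i.get(name);
        if (sn == null) {
          sn = current.getAndIncrement();
          t2i.put(name, sn);
        }
      }
    }
    return sn;
  }

  int size() {
    return t2i.size();
  }

  // Starts `threads` workers that each perform `lookups` get() calls over
  // `names` distinct keys, and returns the elapsed wall-clock milliseconds.
  static long run(SyncLookupBench map, int threads, int lookups, int names) {
    String[] keys = new String[names];
    for (int k = 0; k < names; k++) {
      keys[k] = "user-" + k;
    }
    CountDownLatch start = new CountDownLatch(1);
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      final int offset = i;
      workers[i] = new Thread(() -> {
        try {
          start.await(); // release all workers at the same moment
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        for (int j = 0; j < lookups; j++) {
          map.get(keys[(offset + j) % names]);
        }
      });
      workers[i].start();
    }
    long begin = System.nanoTime();
    start.countDown();
    for (Thread w : workers) {
      try {
        w.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return (System.nanoTime() - begin) / 1_000_000;
  }
}
```

A call such as SyncLookupBench.run(new SyncLookupBench(), 32, 1000000, 10000) would mirror the parameters quoted above. Timings from a single run like this are noisy, so repeated runs after JVM warm-up are needed before comparing variants.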

@Hexiaoqiao
Contributor

Great. LGTM. +1 from my side. I will check this in if there are no more comments within two working days.

@Hexiaoqiao Hexiaoqiao merged commit 7c941e0 into apache:trunk Sep 6, 2023
@Hexiaoqiao
Contributor

Committed to trunk. Thanks @ZanderXu for your contribution!

jiajunmao pushed a commit to jiajunmao/hadoop-MLEC that referenced this pull request Feb 6, 2024
…nd XATTR. (apache#5430). Contributed by ZanderXu.

Signed-off-by: He Xiaoqiao <[email protected]>