-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-14304: High lock contention on hdfsHashMutex in libhdfs #595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an idea that may simplify the diff and the code a bit:
Instead of defining a jclass and a const char* for each class, you did something like this in the header:
enum CachedJavaClass {
JC_CONFIGURATION,
JC_PATH,
...
JC_NUM_CACHED_CLASSES
};
extern jclass g_cached_classes[JC_NUM_CACHED_CLASSES];
and then in jclasses.c:
jclass g_cached_classes[JC_NUM_CACHED_CLASSES];
typedef struct idx_with_class {
int idx;
const char* clazz;
};
static idx_with_class IDS_WITH_CLASSES[] = {
{JC_CONFIGURATION, "org/apache/hadoop/conf/Configuration"},
{JC_PATH, "org/apache/hadoop/Path"},
{-1, NULL}
};
static void init_cached_class(int index, const char* class) {
... find java class and init g_cached_classes[index];
}
void init_cached_classes(JNIEnv* env) {
for (int i = 0; IDS_WITH_CLASSES[i].idx >= 0; i++) {
int idx = IDS_WITH_CLASSES[i].idx;
if (!init_cached_class(IDS_WITH_CLASSES[i].clazz, &g_cached_classes[idx])) {
FATAL("failed to init class %s', ...);
}
}
for (int i = 0; i < JC_NUM_CACHED_CLASSES; i++) {
if (!g_cached_classes[i]) {
// this would be a bug if we missed one in the IDS_WITH_CLASSES list, but cheap enough to just
// validate at runtime anyway
FATAL("missing initialization for class index %d", i);
}
}
}
Then in the various other call sites you can just pass the class 'ID' and look it up from this array? Might simplify the code a bit and make it a bit more "declarative" vs the repeated copy-pasting in the init sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I ended up implementing something similar to what you suggested, although some of the data structure layout is a bit different.
All of the logic is encapsulated in jclasses.h / jclasses.c. Essentially, there is a single array cachedJavaClasses that contains NUM_CACHED_CLASSES javaClassAndName structs (jclass and const char *className). jclasses.h exposes two methods: getJclass that returns a *jclass from a CachedJavaClass and getClassName that returns the class name of the CachedJavaClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace:tabs in line
|
💔 -1 overall
This message was automatically generated. |
|
@toddlipcon addressed your comments. Perhaps I should not be force pushing my changes, because you comments seem to have disappeared from the PR landing page, but you can still seem them (and my responses) here: 05a31ca |
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of exposing this externally from jclasses.h I think it'd be better to just unconditionally call initCachedClasses() and make it internally decide whether to no-op it. That's a cold code path (only once per thread) so avoiding the non-inlined function call isn't important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved jclassesInitialized to jclasses.c and changed it to static instead of extern. initCachedClasses acquires a lock, checks if jclassesInitialized is true or not, and conditionally loads the jclasses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per above, I think we should document that this is idempotent and threadsafe (safe to call multiple times)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this double-checked locking idiom isn't safe unless you insert a memory barrier before the store to jclassesInitialized. (it has to be a store with "release" semantics)
Otherwise, the compiler (and/or CPU) is free to decide to reorder the store to 'jclassesInitialized = 1' up earlier within the critical section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, this isn't a hot path so I don't think it's worth trying to be fancy, and I think we can just acquire the mutex unconditionally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Removed the first if statement, so initCachedClasses always acquires the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a corresponding unlock for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this return (and the one above) miss some cleanup that should happen in the 'goto fail' label? (ugh I hate programming in C...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fixed. I think the one above is okay; the cleanup is dependent on ThreadLocalState being created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you sort these includes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this 'goto done', you end up calling va_end without the va_start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, missing a space after the comma here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing the mutex unlock pair here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not your fault, but seems odd that 'ret' is assigned here but not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again not your bug, but jthr isn't checked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you use a typedef here to avoid having to use 'struct javaClassAndName' everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this needs a rebase since I committed PR #600
* Fixed double-locking issue * Internalize jclass initialization inside initCachedClasses
|
Rebased on trunk. Only conflict was: This patch changed |
|
💔 -1 overall
This message was automatically generated. |
|
Fixing compilation issues. |
|
💔 -1 overall
This message was automatically generated. |
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]> (cherry picked from commit 18c57cf) (cherry picked from commit 72d425c) Conflicts: hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs-tests/native_mini_dfs.c hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/exception.c hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.h Change-Id: I9d6fb410cae3311e384df2008a3f2a6645355574
Currently only the ClusterBasedJobCoordinator and ZkJobCoordinator are creating changelog streams. The Passthrough one should also do it. Author: xinyuiscool <[email protected]> Reviewers: Bharath K <[email protected]> Closes apache#595 from xinyuiscool/SAMZA-1796
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]>
This closes apache#595 Signed-off-by: Todd Lipcon <[email protected]> (cherry picked from commit 18c57cf) Conflicts: hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs-tests/native_mini_dfs.c hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/exception.c hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.h Change-Id: I9d6fb410cae3311e384df2008a3f2a6645355574
HDFS-14304 describes the issue. The TL;DR is that libhdfs caches JNI
jclassobjects (which is a recommended practice, see https://www.ibm.com/developerworks/library/j-jni/index.html#notc). However, the issue is that the current implementation caches all thejclassobjects in a global hash table. Since libhdfs is designed to be used by multiple threads concurrently, the hash table is protected by a single mutex to avoid concurrency issues. The issue is that for short lived Java method, acquiring the lock contributes to significant performance overhead.I first considered using a RW Lock (
pthread_rwlock_t) to fix this issue, but after some prototyping I found that it didn't fix the actual issue. Acquiring a read lock still requires going through lock acquisition, so the performance issue is still present.Instead, after some digging, I landed on an approach similar to https://stackoverflow.com/questions/10617735/in-jni-how-do-i-cache-the-class-methodid-and-fieldids-per-ibms-performance-r
Instead of using a hash table to lazily-load and store all the
jclassobjects. This patch allocates all necessaryjclassobjects upon JVM load (the classes are loaded ingetGlobalJNIEnv). This removes the need for the hash table and the associated locking completely.One could argue that lazily-loading the
jclasss (which is what the current code does) is preferable to eagerly-loading thejclasss (which is what this patch does). However, after doing some analysis I don't think that holds. Consider the following:jclasss that need to be used throughout libhdfsjclasss should not significantly contribute to startup costs)Another concern is the space required to store all the
jclassobjects compared to storing them in thehtable.Consider that the current
htableused byjni_helperis initially allocated with 4096htable_pairs where eachhtable_pairconsists of two pointers. So the total size of thehtableshould be about 64 kilobytes.On the other hand, a single
jclassobject should conservatively take up ~248 bytes:Classobject takes up to 104 bytes (it depends on various system settings / JVM settings)Classobject has to store the name of theClassitself, which is of course variable, but is conservatively estimated to be 100 bytesjclassobject to reference the JavaClassobject (I'm assuming there are some additional pointers somewhere, but I don't know enough about the JNI implementation to know what they are).With 19
jclassobjects that is about 4.5 kilobytes; significantly less that the size of an emptyhtable. So overall, this patch should actually decrease the memory requirement for libhdfs.Some more details about the actual implementation:
jclasses.hcontains all thejclassobjects that are used throughout libhdfsinvokeMethodinjni_helperto take in an existingjclassand deleted all the hash table lookup codefindClassAndInvokeMethodis equivalent to the old implementation ofinvokeMethod- it is currently just used in test codehtableand all associated code: test classes, mutexes, etc.While this patch looks long, most of the changes are refactoring, since every call to
invokeMethodhad to be changed. The main files to look at are:jclasses.h,jclasses.c,jni_helper.h, andjni_helper.c.