Conversation

@parthchandra (Contributor)

What changes were proposed in this pull request?

This adds support for the async-profiler to Spark.

Why are the changes needed?

Profiling JVM applications on a cluster is cumbersome, and saving the profiler's output can be complicated, especially if the cluster is on K8s, where the executor pods are removed and any files saved to the local file system become inaccessible. This feature makes it simple to turn profiling on and off, bundles the jar/binaries needed for profiling, and makes it simple to save output to an HDFS location.

Does this PR introduce any user-facing change?

This PR introduces three new configuration parameters. These are described in the documentation.

How was this patch tested?

Tested manually on EKS.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon (Member)

How do you use it? It would be great to include an example, how to run it, etc.

@parthchandra (Contributor Author)

> How do you use it? It would be great to include an example, how to run it, etc.

There's a README: connector/profiler/README.md. I can add more details if you think this is not enough.

@parthchandra (Contributor Author)

There's a whole slew of errors like the following while building documentation:

```
[error] /__w/spark/spark/Loading source file /__w/spark/spark/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java...
[error] Loading source file /__w/spark/spark/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java...
[error] Loading source file /__w/spark/spark/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java...
[error] Loading source file /__w/spark/spark/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java...
[error] Loading source file /__w/spark/spark/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java...
```

How does one fix this?

@dongjoon-hyun (Member) left a comment

Thank you for making a PR, @parthchandra.


To build:
```
./build/mvn clean package -P code-profiler
```
Member

Please remove the trailing spaces.

Member

In addition, please add -DskipTests in order to be more copy-and-paste friendly.

Contributor Author

Done

Contributor

In Spark, it is more customary to write this as `-Pcode-profiler`.

assembly/pom.xml Outdated
</dependencies>
</profile>
<profile>
<id>code-profiler</id>
Member

We need to avoid future confusion with the Python profiler, SPARK-40281 (spark.python.profile.memory).

Shall we rename this from code-profiler to jvm-profiler (or java-profiler)?

Contributor Author

Sure


## Executor Code Profiling

The Spark profiler module enables code profiling of executors in cluster mode based on the [async-profiler](https://github.com/async-profiler/async-profiler/blob/master/README.md), a low-overhead sampling profiler. This allows a Spark application to capture CPU and memory profiles for applications running on a cluster, which can later be analyzed for performance issues. The profiler captures [Java Flight Recorder (JFR)](https://developers.redhat.com/blog/2020/08/25/get-started-with-jdk-flight-recorder-in-openjdk-8u#) files for each executor; these can be read by many tools, including Java Mission Control and IntelliJ.
Member

Let's be specific about what we are depending on.

- https://github.com/async-profiler/async-profiler/blob/master/README.md
+ https://github.com/async-profiler/async-profiler/blob/v2.10/README.md

Member

The Java 8 reference link looks inappropriate because Apache Spark 4.0.0 dropped all Java versions below 17.
Do you think we can use a Java 17+ link, @parthchandra?

Contributor Author

Replaced with a link that references JDK 17.


The profiler writes the JFR files to the executor's working directory on the executor's local file system, and the files can grow large, so it is advisable that the executor machines have adequate storage. The profiler can be configured to copy the JFR files to an HDFS location before the executor shuts down.

Code profiling is currently only supported for
Member

Just a question. Why not Windows?

Contributor Author

Because async-profiler requires specific POSIX signal capabilities, which Windows implements differently, async-profiler doesn't support Windows. More here: async-profiler/async-profiler#188

To get maximum profiling information set the following jvm options for the executor -

```
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PreserveFramePointer
```
Member

Let's remove the trailing space.

Contributor Author

Done

* Linux (musl, x64)
* MacOS

To get maximum profiling information set the following jvm options for the executor -
Member

`-` -> `:`

<td><code>spark.executor.profiling.enabled</code></td>
<td>
<code>false</code>
</td>
Member

Let's unify the style. Like line 74, a one-liner (<td><code>false</code></td>) is better.

Contributor Author

Done

</tr>
<tr>
<td><code>spark.executor.profiling.outputDir</code></td>
<td></td>
Member

Please use (none).

<td>(none)</td>

Contributor Author

Done

<td><code>spark.executor.profiling.outputDir</code></td>
<td></td>
<td>
An hdfs compatible path to which the profiler's output files are copied. The output files will be written as <i>outputDir/application_id/profile-appname-exec-executor_id.jfr</i> <br/>
Member

hdfs -> HDFS

Contributor Author

Done

<td></td>
<td>
An hdfs compatible path to which the profiler's output files are copied. The output files will be written as <i>outputDir/application_id/profile-appname-exec-executor_id.jfr</i> <br/>
If no outputDir is specified then the files are not copied over.
Member

Please add a warning about Out-Of-Disk situation because K8s is very strict about the disk usage unlike YARN or Standalone clusters.

Contributor Author

Running out of space in the DFS will not affect the job; however, the JFR file may be corrupted. Added the warning.
Also added a warning for localDir, where running out of space in the local file system may cause the job to fail on K8s.

<td>4.0.0</td>
</tr>
<tr>
<td><code>spark.executor.profiling.outputDir</code></td>
Member

Since this is remote HDFS-compatible location, let's follow our convention like the following.

<td><code>spark.driver.log.dfsDir</code></td>

In short, please rename outputDir to dfsDir.

- spark.executor.profiling.outputDir
+ spark.executor.profiling.dfsDir

Contributor Author

Done

</tr>
</table>

### Kubernetes
Member

Thank you for adding this.

Contributor Author

🙏🏾

### Kubernetes
On Kubernetes, Spark will try to shut down the executor pods while the profiler files are still being saved. To prevent this, set
```
spark.kubernetes.executor.deleteOnTermination=false
```
Member

Remove the trailing spaces.

Contributor Author

Done


/**
* A class that enables the async code profiler
*
Member

nit. Remove the empty line.

Contributor Author

Done

* A class that enables the async code profiler
*
*/
private[spark] class ExecutorCodeProfiler(conf: SparkConf, executorId: String) extends Logging {
Member

Spark executors can have Java, Python, and R runtimes. Given that, "Code" is a little vague as a term.

  • I'd like to propose renaming ExecutorCodeProfiler to ExecutorJVMProfiler.
  • Otherwise, at least, please document clearly that this is a JVM-only feature.

Contributor Author

Renamed

private val resumecmd = s"resume,$profilerOptions,file=$profilerLocalDir/profile.jfr"

private val UPLOAD_SIZE = 8 * 1024 * 1024 // 8 MB
private val WRITE_INTERVAL = 30 // seconds
Member

Why is this a magic number instead of a configuration?

Contributor Author

I felt there were already too many configuration parameters, and I found this to be a good value for real use cases.
Making this configurable (see the sketch below).
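For reference, such a setting would follow Spark's standard ConfigBuilder pattern. A minimal sketch, assuming a config name like `spark.executor.profiling.writeInterval`; the name and default in the merged code may differ:

```
// Sketch only: assumes org.apache.spark.internal.config.ConfigBuilder and
// java.util.concurrent.TimeUnit are in scope in the module's config object.
private[profiler] val EXECUTOR_PROFILING_WRITE_INTERVAL =
  ConfigBuilder("spark.executor.profiling.writeInterval")
    .doc("Time interval, in seconds, after which the profiler output is " +
      "synced to the DFS location.")
    .version("4.0.0")
    .timeConf(TimeUnit.SECONDS)
    .checkValue(_ >= 0, "Write interval must be non-negative")
    .createWithDefaultString("30s")
```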

private var writing: Boolean = false

val profiler: AsyncProfiler = if (enableProfiler) {
if (AsyncProfilerLoader.isSupported) {
Member

Could you give me a link for this method, please?
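(For context: `AsyncProfilerLoader` comes from the ap-loader project, https://github.com/jvm-profiling-tools/ap-loader, which repackages the async-profiler binaries for supported platforms. A minimal sketch of the conditional-load pattern, assuming ap-loader's `isSupported()`/`load()` helpers; the merged code may differ:)

```
import one.profiler.{AsyncProfiler, AsyncProfilerLoader}

// Sketch: load the bundled async-profiler native library only when the
// current OS/architecture is supported; otherwise leave profiling off.
val profiler: Option[AsyncProfiler] =
  if (AsyncProfilerLoader.isSupported) {
    // Extracts the matching native library and returns the profiler handle.
    Some(AsyncProfilerLoader.load())
  } else {
    None // a real implementation would log a warning instead of failing
  }
```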


/**
* Spark plugin to do code profiling of executors
*
Member

nit. Remove this line.

Contributor Author

Done

}

override def shutdown(): Unit = {

Member

nit. Remove this line.

Contributor Author

Done

}
}


Member

nit. Remove the above two lines.

Contributor Author

Done

.checkValue(v => v >= 0.0 && v < 1.0,
"Fraction of executors to profile must be in [0,1)")
.createWithDefault(0.1)

@dongjoon-hyun (Member) commented on Nov 28, 2023

If you don't mind, shall we try to move this into a separate package, like the Kafka module under connector?

```
package object kafka010 { // scalastyle:ignore
  // ^^ scalastyle:ignore is for ignoring warnings about digits in package name
  type PartitionOffsetMap = Map[TopicPartition, Long]

  private[kafka010] val PRODUCER_CACHE_TIMEOUT =
    ConfigBuilder("spark.kafka.producer.cache.timeout")
      .doc("The expire time to remove the unused producers.")
      .version("2.2.1")
      .timeConf(TimeUnit.MILLISECONDS)
      .createWithDefaultString("10m")
```

Member

For YARN and K8s, we have Config.scala.

Contributor Author

Done. We don't need any YARN- or Kubernetes-specific configuration.
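(For reference, spark.executor.profiling.fraction, validated in the hunk above, lets each executor self-select for profiling so that only roughly that fraction of executors pays the overhead. A hypothetical sketch of the gating logic; the config entry names here are illustrative, not the exact merged code:)

```
import scala.util.Random

// Hypothetical sketch: an executor profiles itself only if profiling is
// enabled and it falls within the sampled fraction of executors.
val enabled  = conf.get(EXECUTOR_PROFILING_ENABLED)   // spark.executor.profiling.enabled
val fraction = conf.get(EXECUTOR_PROFILING_FRACTION)  // default 0.1
val enableProfiler = enabled && Random.nextDouble() < fraction
```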

@dongjoon-hyun (Member) left a comment

It's a great addition, @parthchandra. Thank you so much! I finished my first review with a few comments. The rest of the design looks good to me.

@mridulm (Contributor)

mridulm commented Dec 20, 2023

If the async profiler does not allow us to map a native thread to its Java thread (please validate this), we cannot map stack traces to the corresponding task threads, which limits the usability of this integration in Spark.
If this limitation exists, we should explore alternatives that support it and are easy to integrate (honest-profiler supports it but is not easy to integrate, IIRC).

Simply dumping per-executor flamegraphs or stack traces has limited utility (and can be done today) unless we have a path to integrating with SPARK-45151, IMO.

Thoughts, @dongjoon-hyun?

@parthchandra (Contributor Author)

parthchandra commented Dec 20, 2023

> If the async profiler does not allow us to map a native thread to its Java thread (please validate this), we cannot map stack traces to the corresponding task threads, which limits the usability of this integration in Spark.

AsyncGetCallTrace is used precisely to map calls in the native thread to calls in the Java thread.
Not sure exactly what you are looking for here. Are you looking to profile individual tasks? It certainly can be done, but it would require some changes similar to SPARK-45151, and some additional work if you want the profile available through the UI. Or are you looking to enhance SPARK-45151 and get a stack trace that includes native calls? This is a little harder via async-profiler, since there is no API to get a snapshot.
Note that a profile needs to be collected over a period of time, so it is different from getting a snapshot as SPARK-45151 is doing.

> Simply dumping per-executor flamegraphs or stack traces has limited utility (and can be done today).

I would suggest that this PR makes it trivially simple to profile, with no setup required. On K8s, with ephemeral storage, it is not a simple task to dump a profile to disk and get it off the pod before the pod is destroyed (that was, in fact, the original motivation behind doing this).

@mridulm (Contributor)

mridulm commented Dec 20, 2023

> AsyncGetCallTrace is used precisely to map calls in the native thread to calls in the Java thread. Not sure exactly what you are looking for here. Are you looking to profile individual tasks? It certainly can be done, but it would require some changes similar to SPARK-45151, and some additional work if you want the profile available through the UI. Or are you looking to enhance SPARK-45151 and get a stack trace that includes native calls? This is a little harder via async-profiler, since there is no API to get a snapshot. Note that a profile needs to be collected over a period of time, so it is different from getting a snapshot as SPARK-45151 is doing.

There is a difference between native thread IDs and Java thread IDs.
Given the async profiler output, can we map it to the corresponding task (given the task's Java thread ID)?
My understanding is currently no, but if I am missing something, do let me know.

Assuming no, this means the stack traces generated are for all threads in the executor JVM, and so do not allow us to get stack traces and/or flamegraphs for a particular task, tasks of a stage, etc.

If yes, this would be very useful - and will allow for future evolution as part of SPARK-44893 [1].

> Simply dumping per-executor flamegraphs or stack traces has limited utility (and can be done today).

> I would suggest that this PR makes it trivially simple to profile, with no setup required. On K8s, with ephemeral storage, it is not a simple task to dump a profile to disk and get it off the pod before the pod is destroyed (that was, in fact, the original motivation behind doing this).

I am not seeing a lot of value in including this in Apache Spark itself; the plugin API is public, and users can leverage it to do precisely what the PR is proposing.
On the other hand, if the PR integrates well with SPARK-44893 [1] and/or there is a path to leveraging it in that work, it would be more useful.

I am not -1 on this, @dongjoon-hyun, but I am not seeing a lot of value in it: will let you make the call (also because I am on vacation and don't have my desktop handy to investigate in detail :) ).

[1] This is the JIRA I was trying to paste, but GitHub mobile messed it up and it ended up referencing a subtask!

@parthchandra (Contributor Author)

> There is a difference between native thread IDs and Java thread IDs. Given the async profiler output, can we map it to the corresponding task (given the task's Java thread ID)? My understanding is currently no, but if I am missing something, do let me know.

Yes, we can map the stack traces to the Java thread. Here's how it looks (this is IntelliJ's profiler window):

[Screenshot: IntelliJ profiler window showing async-profiler output grouped by Java thread]

> Assuming no, this means the stack traces generated are for all threads in the executor JVM, and so do not allow us to get stack traces and/or flamegraphs for a particular task, tasks of a stage, etc.

We can get individual threads and even filter to profile a single thread. This PR specifically profiles every thread in the executor.

> If yes, this would be very useful - and will allow for future evolution as part of SPARK-44893 [1].

Ah, this JIRA makes it clearer. We can leverage async-profiler to provide the features not yet implemented in SPARK-45151. The current implementation uses a simple snapshot of the task stack traces, which can be enhanced by using async-profiler to get accurate profiling.

> I am not seeing a lot of value in including this in Apache Spark itself; the plugin API is public, and users can leverage it to do precisely what the PR is proposing. On the other hand, if the PR integrates well with SPARK-44893 [1] and/or there is a path to leveraging it in that work, it would be more useful.

I think we can certainly leverage this work. This PR by itself does not have the APIs needed to enhance SPARK-45151. It would probably need to be a separate PR, because it may need changes to the UI implementation. We can either get a flamegraph (covering a period of time for a task) or collapsed call traces from which a flamegraph can be produced, and the choice will affect the UI.

> I am not -1 on this, @dongjoon-hyun, but I am not seeing a lot of value in it: will let you make the call (also because I am on vacation and don't have my desktop handy to investigate in detail :) ).

🙏🏾

@mridulm (Contributor)

mridulm commented Dec 21, 2023

That sounds promising!
What is unclear to me is how we are going to do the mapping without something that ends up introducing safepoint bias (essentially, the cost of this operation)...
For example, if the native-to-Java thread mapping requires mxbean.getThreadInfo and/or similar approaches, it becomes fairly expensive.

Essentially, what I am trying to make sure of is: given (native-thread-id -> timestamp -> stack_dumps+)*, can we identify the native-thread -> java-thread-id mapping?
If yes, we can build the java-thread-id -> task-id mapping in Spark, and essentially get to (task-id -> stack_dumps+)* for all (most?) tasks.

When we built Safari, this is what ended up being extremely powerful for understanding application performance - per-task stack dumps, correlated across all tasks for a stage: allowing us to understand what the stack dump for a particular stage is, what the difference between 'expensive' tasks in a stage vs the average task is, etc., while ignoring most of the non-task thread dumps in an executor (unless explicitly required).
At that time at least, async-profiler did not provide a way to 'cheaply' do this, and so I ended up enhancing honest-profiler to support it (unfortunately, honest-profiler does not publish to Maven, so using it is currently not a viable option).
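(One cheap way to build the java-thread-id -> task-id side of that mapping, without mxbean.getThreadInfo, would be to record it in the task runner itself. A hypothetical sketch; TaskThreadRegistry is an illustrative name, not Spark API:)

```
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: record which JVM thread runs which task so that
// profiler samples keyed by thread id can later be re-keyed by task id.
object TaskThreadRegistry {
  private val threadToTask = TrieMap.empty[Long, Long]

  def onTaskStart(taskId: Long): Unit =
    threadToTask.put(Thread.currentThread().getId, taskId)

  def onTaskEnd(): Unit =
    threadToTask.remove(Thread.currentThread().getId)

  def taskFor(threadId: Long): Option[Long] = threadToTask.get(threadId)
}
```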

@parthchandra (Contributor Author)

parthchandra commented Dec 21, 2023

That was pretty cool stuff you did in Safari!
I think we may not have to do too much work ourselves.
The way I see it, async-profiler is already doing the mapping of Java threads and stack traces for us (and we know that both async-profiler and honest-profiler avoid the safepoint bias problem, so this is as good as it gets). In addition, there is a filter API to filter on one or more threads, so async-profiler collects events only for the given thread(s). The API for filtering takes a java.lang.Thread as input.
The way I see it potentially working is: when a user asks to profile a task, we start profiling only the task's thread, similar to the way a task stack trace is done today. Then we ship over the collected data and display it.
I'll have to play around with this, though. There might be some gotchas in profiling multiple threads simultaneously, and/or some APIs might be private.
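(If the filter API behaves as described, per-task profiling could look roughly like the sketch below. This is untested and hedged: it assumes async-profiler's `filter` start option and its filterThread(Thread, boolean) Java API, and the exact option string may differ:)

```
import one.profiler.AsyncProfiler

// Hedged sketch: restrict profiling to the current task's thread.
def profileTask(profiler: AsyncProfiler, outFile: String)(taskBody: => Unit): Unit = {
  // Start with thread filtering enabled; no thread is sampled until added.
  profiler.execute(s"start,event=cpu,filter,file=$outFile")
  profiler.filterThread(Thread.currentThread(), true) // sample this thread only
  try {
    taskBody
  } finally {
    profiler.filterThread(Thread.currentThread(), false)
    profiler.execute(s"stop,file=$outFile")
  }
}
```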

@dongjoon-hyun (Member)

Sorry for being away, @mridulm and @parthchandra. I've been traveling in South Korea since 14th December. I'll catch up on the discussion and revisit this PR in January. Thank you!

running = true
startWriting()
}
)
Member

nit. Let's merge line 71 into 70.

Contributor Author

Done

}
)
} catch {
case e@(_: IllegalArgumentException | _: IllegalStateException | _: IOException) =>
Member

nit. Add proper spaces?

- case e@(
+ case e @ (

Contributor Author

Done

threadpool.scheduleWithFixedDelay(new Runnable() {
override def run(): Unit = writeChunk(false)
}, writeInterval, writeInterval,
TimeUnit.SECONDS)
Member

The indentation of the above four lines looks weird to me. Could you revise?

Contributor Author

Done. Hopefully this is better now.

}
} catch {
case e: IOException => logError("Exception occurred while writing some profiler output: ", e)
case e@(_: IllegalArgumentException | _: IllegalStateException) =>
Member

ditto. e@( -> e @ (

Contributor Author

Done

case e@(_: IllegalArgumentException | _: IllegalStateException) =>
logError("Some profiler output not written." +
" Exception occurred in profiler native code: ", e)
case e: Exception => logError("Some profiler output not written. Unexpected exception: ", e)
Member

I'm wondering what we can get in this case because the flag writing is still true. Do we need to keep writing because we need to invoke finishWriting? However, it seems that we cannot invoke inputStream.close() and outputStream.close() eventually.

Contributor Author

If we get a failure, we can still keep writing, because only the portion (chunk) of the data being written may be lost. The output file would still be valid.
finishWriting is eventually called to shut down the thread and the streams cleanly.
We could potentially stop profiling if we get any of these exceptions, but I feel that would perhaps be too drastic.

Member

Got it. Since we have the error log, it might be okay for now.

outputStream.close()
} catch {
case _: InterruptedException => Thread.currentThread().interrupt()
case e: IOException =>
Member

Where does this come from when writeChunk swallows all exceptions with case e: Exception?

Contributor Author

The exceptions may come from threadpool.shutdown or awaitTermination, which may throw InterruptedException, or from {input,output}Stream.close, which may throw an IOException.
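(For context, the shutdown path being discussed might look roughly like the sketch below; the names mirror the snippets quoted in this review, but this is an illustration, not the exact merged code:)

```
// Hypothetical sketch of finishWriting(): flush the final chunk, stop the
// scheduled writer, and close both streams. The InterruptedException and
// IOException cases above come from shutdown/awaitTermination and close().
// Assumes java.io.IOException and java.util.concurrent.TimeUnit imports.
private def finishWriting(): Unit = {
  if (writing) {
    try {
      writeChunk(true) // flush whatever remains of the local jfr file
      threadpool.shutdown()
      threadpool.awaitTermination(30, TimeUnit.SECONDS)
      inputStream.close()
      outputStream.close()
    } catch {
      case _: InterruptedException => Thread.currentThread().interrupt()
      case e: IOException =>
        logError("Exception occurred while closing profiler streams: ", e)
    }
    writing = false
  }
}
```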

@dongjoon-hyun (Member)

dongjoon-hyun commented Jan 12, 2024

Thank you all. Especially, thanks to @mridulm for the many ideas and advice during this PR review.

  • Yes, as you mentioned, honest-profiler has unfortunately been abandoned for about two years. We should not depend on it.
  • Like Safari (2019), there have been many profiler approaches outside the community, because this capability is valuable. However, IIUC, they are not easily available to Apache Spark users these days.
  • I believe the value of this PR is that it provides Apache Spark 4.0.0 users a foundation for easy, pre-defined profiling, with additional controls like spark.executor.profiling.fraction.
    • To be clear, we don't recommend users enable this for all executors of all jobs.
  • Lastly, the plugin approach of this PR is non-intrusive and has no maintenance cost because it's based on the standard Spark plugin API. As for the proposed integrated and detailed analysis, I believe we can achieve it in the future, since we all agreed on the value of that integrated analysis.

I hope we can merge this as a part of Apache Spark 4.0.0.

@dongjoon-hyun (Member) left a comment

+1, LGTM. (Pending CIs)

@dongjoon-hyun (Member)

Merged to master for Apache Spark 4.0.0.
Thank you, @parthchandra and all!

@parthchandra (Contributor Author)

Thank you @dongjoon-hyun @mridulm

RexXiong pushed a commit to apache/celeborn that referenced this pull request Mar 25, 2024
…c-profiler

### What changes were proposed in this pull request?

Introduce JVM profiling (`JVMProfier`) in the Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn't rely on JVMTI. It avoids safepoint bias by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile Java code paths, and Linux's perf_events to profile native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It supports turning profiling on/off and includes the jar/binaries needed for profiling.

Backport [[SPARK-46094] Support Executor JVM Profiling](apache/spark#44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Shuang <[email protected]>
dongjoon-hyun added a commit that referenced this pull request May 4, 2024
… `jvm-profiler` modules

### What changes were proposed in this pull request?

This PR aims to fix `dev/scalastyle` to check the `hadoop-cloud` and `jvm-profiler` modules.
The detected scalastyle issues are also fixed.

### Why are the changes needed?

To prevent future scalastyle issues.

A Scala style violation was introduced here, but we missed it because we didn't check all optional modules.
- #46022

The `jvm-profiler` module was newly added in Apache Spark 4.0.0, but we missed adding it to `dev/scalastyle`. Note that there were no Scala style issues in that module at that time.
- #44021

The `hadoop-cloud` module was added in Apache Spark 2.3.0.
- #17834

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with newly revised `dev/scalastyle`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46376 from dongjoon-hyun/SPARK-48127.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sinaiamonkar-sai pushed a commit to sinaiamonkar-sai/spark that referenced this pull request May 5, 2024
… `jvm-profiler` modules

dongjoon-hyun pushed a commit that referenced this pull request Jan 14, 2025
…on DFS

### What changes were proposed in this pull request?

This PR canonicalizes the DFS result files of the JVM profiler added in SPARK-46094 to
```
dfsDir/{{APP_ID}}/profile-exec-{{EXECUTOR_ID}}.jfr
```
which largely follows the event log file name pattern and layout.

### Why are the changes needed?

According to #44021 (comment), we can integrate the profiling results with the Spark UI (both live and history) in the future, so it's good to follow the event log file name pattern and layout as much as possible.

### Does this PR introduce _any_ user-facing change?

No, it's an unreleased feature.

### How was this patch tested?

```
$ bin/spark-submit run-example \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.plugins=org.apache.spark.executor.profiler.ExecutorProfilerPlugin \
  --conf spark.executor.profiling.enabled=true \
  --conf spark.executor.profiling.dfsDir=hdfs:///spark-profiling \
  --conf spark.executor.profiling.fraction=1 \
  SparkPi 100000
```

```
hadoopspark-dev1:~/spark$ hadoop fs -ls /spark-profiling/
Found 1 items
drwxrwx---   - hadoop supergroup          0 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1
```
```
hadoopspark-dev1:~/spark$ hadoop fs -ls /spark-profiling/application_1736320707252_0023_1
Found 48 items
-rw-rw----   3 hadoop supergroup    5255028 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-1.jfr
-rw-rw----   3 hadoop supergroup    3840775 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-10.jfr
-rw-rw----   3 hadoop supergroup    3889002 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-11.jfr
-rw-rw----   3 hadoop supergroup    3570697 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-12.jfr
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49440 from pan3793/SPARK-50783.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
.createWithDefault(false)

private[profiler] val EXECUTOR_PROFILING_DFS_DIR =
ConfigBuilder("spark.executor.profiling.dfsDir")
Member

@dongjoon-hyun, I'm also investigating adding profiling support for the driver. Should I fork all the configurations, or simply treat the driver as a special executor and reuse these configurations?

Member

Well, @pan3793, could you use a new JIRA for your new goal, please?

Although I understand why you added a comment here, it doesn't look like good practice to me. Why add a driver-related suggestion to an unrelated executor-focused PR like Spark Executor JVM Profiling?
