[SPARK-46094] Support Executor JVM Profiling #44021
Conversation
This adds support for the async profiler to Spark.

Profiling of JVM applications on a cluster is cumbersome, and it can be complicated to save the profiler's output, especially if the cluster is on K8s, where the executor pods are removed and any files saved to the local file system become inaccessible. This feature makes it simple to turn profiling on/off, includes the jar/binaries needed for profiling, and makes it simple to save output to an HDFS location.

This PR introduces three new configuration parameters. These are described in the documentation.
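To illustrate how a user would turn this on, here is a minimal sketch (the plugin class and configuration keys are the ones referenced elsewhere in this PR and its follow-ups; the dfsDir value is a placeholder):

```scala
import org.apache.spark.SparkConf

// Sketch only: enable the executor JVM profiler via the plugin added by this PR.
// The dfsDir value is a placeholder; fraction=1 profiles every executor.
val conf = new SparkConf()
  .set("spark.plugins", "org.apache.spark.executor.profiler.ExecutorProfilerPlugin")
  .set("spark.executor.profiling.enabled", "true")
  .set("spark.executor.profiling.dfsDir", "hdfs:///spark-profiling")
  .set("spark.executor.profiling.fraction", "1")
```

The same keys can equally be passed as `--conf` options to `spark-submit`.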
how do you use it? would be great if it contained an example of how to run it, etc.

There's a README - connector/profiler/README.md. I can add more details if you think this is not enough.

There's a whole slew of errors like the following while building documentation - how does one fix this?
dongjoon-hyun
left a comment
Thank you for making a PR, @parthchandra .
connector/profiler/README.md
Outdated
| To build
| ```
| ./build/mvn clean package -P code-profiler
Please remove the trailing spaces.
In addition, please add -DskipTests in order to be more copy-and-paste friendly.
Done
In Spark, it is more customary to write as -Pcode-profiler
assembly/pom.xml
Outdated
| </dependencies>
| </profile>
| <profile>
| <id>code-profiler</id>
We need to avoid future confusion with the Python Profiler, SPARK-40281 (spark.python.profile.memory).
- https://pypi.org/project/memory-profiler/
- https://docs.google.com/document/d/e/2PACX-1vR2K4TdrM1eAjNDC1bsflCNRH67UWLoC-lCv6TSUVXD91Ruksm99pYTnCeIm7Ui3RgrrRNcQU_D8-oh/pub
Shall we rename this from code-profiler to jvm-profiler (or java-profiler)?
Sure
connector/profiler/README.md
Outdated
| ## Executor Code Profiling
|
| The spark profiler module enables code profiling of executors in cluster mode based on the [async profiler](https://github.com/async-profiler/async-profiler/blob/master/README.md), a low overhead sampling profiler. This allows a Spark application to capture CPU and memory profiles for applications running on a cluster which can later be analyzed for performance issues. The profiler captures [Java Flight Recorder (jfr)](https://developers.redhat.com/blog/2020/08/25/get-started-with-jdk-flight-recorder-in-openjdk-8u#) files for each executor; these can be read by many tools including Java Mission Control and Intellij.
Let's be specific about what we are depending on.
- https://github.com/async-profiler/async-profiler/blob/master/README.md
+ https://github.com/async-profiler/async-profiler/blob/v2.10/README.md
The Java 8 reference link looks inappropriate because Apache Spark 4.0.0 dropped all Java versions less than 16.
Do you think we can have Java 17+ link, @parthchandra ?
Replaced with a link that references jdk17.
| The profiler writes the jfr files to the executor's working directory in the executor's local file system and the files can grow to be large so it is advisable that the executor machines have adequate storage. The profiler can be configured to copy the jfr files to a hdfs location before the executor shuts down.
|
| Code profiling is currently only supported for
Just a question. Why not Windows?
Because async-profiler requires specific POSIX signal capabilities, which Windows implements differently, async-profiler doesn't support Windows. More here: async-profiler/async-profiler#188
| To get maximum profiling information set the following jvm options for the executor -
|
| ```
| -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PreserveFramePointer
Let's remove the trailing space.
Done
connector/profiler/README.md
Outdated
| * Linux (musl, x64)
| * MacOS
|
| To get maximum profiling information set the following jvm options for the executor -
`-` -> `:`
connector/profiler/README.md
Outdated
| <td><code>spark.executor.profiling.enabled</code></td>
| <td>
| <code>false</code>
| </td>
Let's unify the style. Like line 74, the one-liner (<td><code>false</code></td>) is better.
Done
connector/profiler/README.md
Outdated
| </tr>
| <tr>
| <td><code>spark.executor.profiling.outputDir</code></td>
| <td></td>
Please use (none).
Line 354 in ec71e22
| <td>(none)</td>
Done
connector/profiler/README.md
Outdated
| <td><code>spark.executor.profiling.outputDir</code></td>
| <td></td>
| <td>
| An hdfs compatible path to which the profiler's output files are copied. The output files will be written as <i>outputDir/application_id/profile-appname-exec-executor_id.jfr</i> <br/>
hdfs -> HDFS
Done
connector/profiler/README.md
Outdated
| <td></td>
| <td>
| An hdfs compatible path to which the profiler's output files are copied. The output files will be written as <i>outputDir/application_id/profile-appname-exec-executor_id.jfr</i> <br/>
| If no outputDir is specified then the files are not copied over.
Please add a warning about Out-Of-Disk situation because K8s is very strict about the disk usage unlike YARN or Standalone clusters.
Running out of space in the DFS will not affect the job. However, the jfr file may be corrupted. Added the warning.
Also added the warning for localDir, where running out of space on the local file system may cause the job to fail on K8s.
connector/profiler/README.md
Outdated
| <td>4.0.0</td>
| </tr>
| <tr>
| <td><code>spark.executor.profiling.outputDir</code></td>
Since this is a remote HDFS-compatible location, let's follow our convention like the following.
Line 442 in ec71e22
| <td><code>spark.driver.log.dfsDir</code></td>
In short, please rename outputDir to dfsDir.
- spark.executor.profiling.outputDir
+ spark.executor.profiling.dfsDir
Done
| </tr>
| </table>
|
| ### Kubernetes
Thank you for adding this.
🙏🏾
| ### Kubernetes
| On Kubernetes, spark will try to shut down the executor pods while the profiler files are still being saved. To prevent this set
| ```
| spark.kubernetes.executor.deleteOnTermination=false
Remove the trailing spaces.
Done
| /**
| * A class that enables the async code profiler
| *
nit. Remove the empty line.
Done
| * A class that enables the async code profiler
| *
| */
| private[spark] class ExecutorCodeProfiler(conf: SparkConf, executorId: String) extends Logging { |
Spark Executors can have Java, Python, and R runtimes. Given that, Code is a little vague as a term.
- I'd like to propose renaming ExecutorCodeProfiler to ExecutorJVMProfiler.
- Otherwise, at least, please document clearly that this is a JVM-only feature.
Renamed
| private val resumecmd = s"resume,$profilerOptions,file=$profilerLocalDir/profile.jfr"
|
| private val UPLOAD_SIZE = 8 * 1024 * 1024 // 8 MB
| private val WRITE_INTERVAL = 30 // seconds
Why is this a magic number instead of a configuration?
I felt there were already too many configuration parameters and I found this to be a good value for real use cases.
Making this configurable.
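For reference, making the write interval configurable could look roughly like the sketch below, modeled on the ConfigBuilder pattern quoted later in this review; the key name and default shown here are assumptions, not necessarily the final ones:

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Sketch only: the key name and default are assumptions.
private[profiler] val EXECUTOR_PROFILING_WRITE_INTERVAL =
  ConfigBuilder("spark.executor.profiling.writeInterval")
    .doc("Time interval after which the profiler output is synced to the DFS location.")
    .version("4.0.0")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefaultString("30s")
```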
| private var writing: Boolean = false
|
| val profiler: AsyncProfiler = if (enableProfiler) {
| if (AsyncProfilerLoader.isSupported) {
Could you give me a link for this method, please?
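(For context, `AsyncProfilerLoader` comes from the ap-loader library that this PR bundles; a minimal sketch of the guarded initialization, assuming that library, is:)

```scala
import one.profiler.{AsyncProfiler, AsyncProfilerLoader}

// Sketch only, assuming the ap-loader library bundled by this PR:
// isSupported() checks whether a native async-profiler binary is available
// for the current OS/architecture, and load() extracts and returns it.
val profiler: Option[AsyncProfiler] =
  if (AsyncProfilerLoader.isSupported) Some(AsyncProfilerLoader.load()) else None
```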
| /**
| * Spark plugin to do code profiling of executors
| *
nit. Remove this line.
Done
| }
|
| override def shutdown(): Unit = {
|
nit. Remove this line.
Done
| }
| }
|
|
nit. Remove the above two lines.
Done
| .checkValue(v => v >= 0.0 && v < 1.0,
| "Fraction of executors to profile must be in [0,1)")
| .createWithDefault(0.1)
|
If you don't mind, shall we try to move this into a separate package, like the Kafka module under connector.
spark/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/package.scala
Lines 25 to 34 in 7a0d041
| package object kafka010 { // scalastyle:ignore
| // ^^ scalastyle:ignore is for ignoring warnings about digits in package name
| type PartitionOffsetMap = Map[TopicPartition, Long]
| private[kafka010] val PRODUCER_CACHE_TIMEOUT =
| ConfigBuilder("spark.kafka.producer.cache.timeout")
| .doc("The expire time to remove the unused producers.")
| .version("2.2.1")
| .timeConf(TimeUnit.MILLISECONDS)
| .createWithDefaultString("10m")
For YARN and K8s, we have Config.scala.
spark/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
Line 26 in 7a0d041
| private[spark] object Config extends Logging {
spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
Line 27 in 7a0d041
| package object config extends Logging {
Done. We don't need any YARN- or Kubernetes-specific configuration.
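As a sketch of what the module-local config object can look like after the move (the names follow the config keys visible elsewhere in this PR; the exact file layout is an assumption):

```scala
package org.apache.spark.executor

import org.apache.spark.internal.config.ConfigBuilder

// Sketch only: profiler configs defined in the module's own package,
// mirroring the Kafka-module pattern quoted above.
package object profiler {

  private[profiler] val EXECUTOR_PROFILING_ENABLED =
    ConfigBuilder("spark.executor.profiling.enabled")
      .doc("Whether to enable JVM code profiling of executors.")
      .version("4.0.0")
      .booleanConf
      .createWithDefault(false)

  private[profiler] val EXECUTOR_PROFILING_DFS_DIR =
    ConfigBuilder("spark.executor.profiling.dfsDir")
      .doc("An HDFS-compatible path to which the profiler's output files are copied.")
      .version("4.0.0")
      .stringConf
      .createOptional
}
```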
dongjoon-hyun
left a comment
It's a great addition, @parthchandra . Thank you so much! I finished my first review with a few comments. Other designs look good to me.
If async profiler does not allow us to map the native thread to its Java thread (please validate this), we cannot map stack traces to the corresponding task threads, which limits the usability of this integration in Spark. Simply dumping per-executor flamegraphs or stack traces has limited utility (and can be done today) unless we have a path to integrating with SPARK-45151 IMO. Thoughts @dongjoon-hyun as well?

I would suggest that this PR makes it trivially simple to profile with no setup required. On K8s, with ephemeral storage, it is not a simple task to dump a profile to disk and get it off the pod before the pod is destroyed (it was in fact the original motivation behind doing this).

There is a difference between native thread ids and Java thread ids. Assuming no, this means the stack traces generated are for all threads in the executor JVM, and so does not allow us to get stack traces and/or flamegraphs for a particular task, tasks of a stage, etc. If yes, this would be very useful and will allow for future evolution as part of SPARK-44893 [1].

I am not seeing a lot of value in including this into Apache Spark itself - the plugin API is public, and users can leverage it to do precisely what the PR is proposing. I am not -1 on this @dongjoon-hyun, but I am not seeing a lot of value in it: will let you make the call (also because I am on vacation, don't have my desktop handy to investigate in detail :) ). [1] This is the JIRA I was trying to paste, but GitHub mobile messed it up and ended up referencing a subtask!
Yes, we can map the stack traces to the Java thread. Here's how it looks (this is in IntelliJ's profiler window).
We can get individual threads and even filter to profile a single thread. This PR specifically profiles every thread in the executor.
Ah, this JIRA makes it clearer. We can leverage the async-profiler to provide the features not yet implemented in SPARK-45209. The current implementation uses a simple snapshot of the task stack traces which can be enhanced by using the async-profiler to get accurate profiling.
I think we can certainly leverage this work. This PR by itself does not have the APIs needed to enhance SPARK-45209. It would probably need to be a separate PR because it may need changes to the UI implementation. We can either get a flamegraph (covering a period of time for a task) or collapsed call traces from which a flamegraph can be produced and the choice will affect the UI.
🙏🏾

That sounds promising! Essentially what I am trying to make sure is - given
When we built Safari, this is what ended up being extremely powerful for understanding application performance - per-task stack dumps, correlated across all tasks for a stage: allowing us to understand what the stack dump for a particular stage is, what the difference between 'expensive' tasks in a stage vs the average task is, etc - and ignoring most of the non-task thread dumps in an executor (unless explicitly required)

That was pretty cool stuff you did in Safari!

Sorry for being away, @mridulm and @parthchandra. I've been traveling in South Korea since 14th December. I'll catch up on the discussion and will revisit this PR in January. Thank you!
| running = true
| startWriting()
| }
| )
nit. Let's merge line 71 into 70.
Done
| }
| )
| } catch {
| case e@(_: IllegalArgumentException | _: IllegalStateException | _: IOException) => |
nit. Add proper spaces?
- case e@(
+ case e @ (
Done
| threadpool.scheduleWithFixedDelay(new Runnable() {
| override def run(): Unit = writeChunk(false)
| }, writeInterval, writeInterval,
| TimeUnit.SECONDS)
The indentation of the above four lines looks weird to me. Could you revise?
Done. Hopefully this is better now.
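(For illustration, one conventional way to indent that call; a sketch using the names from the snippet above, not necessarily the exact final code:)

```scala
// Schedule periodic chunk writes every `writeInterval` seconds.
threadpool.scheduleWithFixedDelay(
  new Runnable() {
    override def run(): Unit = writeChunk(false)
  },
  writeInterval, writeInterval, TimeUnit.SECONDS)
```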
| }
| } catch {
| case e: IOException => logError("Exception occurred while writing some profiler output: ", e)
| case e@(_: IllegalArgumentException | _: IllegalStateException) => |
ditto. e@( -> e @ (
Done
| case e@(_: IllegalArgumentException | _: IllegalStateException) =>
| logError("Some profiler output not written." +
| " Exception occurred in profiler native code: ", e)
| case e: Exception => logError("Some profiler output not written. Unexpected exception: ", e)
I'm wondering what we can get in this case because the writing flag is still true. Do we need to keep writing because we need to invoke finishWriting? However, it seems that we cannot invoke inputStream.close() and outputStream.close() eventually.
If we get a failure, we can still keep writing because only the portion (chunk) of the data being written may be lost. The output file would still be valid.
finishWriting is eventually called to shut down the thread and the streams cleanly.
We could potentially stop profiling if we get any of these exceptions, but I feel that it is perhaps too drastic.
Got it. Since we have the error log, it might be okay for now.
| outputStream.close()
| } catch {
| case _: InterruptedException => Thread.currentThread().interrupt()
| case e: IOException => |
Where does this come from when writeChunk swallows all exceptions with case e: Exception?
The exceptions may come from threadpool.shutdown or awaitTermination, which may throw an InterruptedException, or from {input,output}Stream.close, which may throw an IOException.
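To make that flow concrete, here is a self-contained sketch of such a shutdown path (the helper is written standalone for illustration; in the PR, the thread pool and streams are fields of the profiler class):

```scala
import java.io.{IOException, InputStream, OutputStream}
import java.util.concurrent.{ScheduledExecutorService, TimeUnit}

// Sketch only: stop the periodic writer, flush the final chunk, then close
// both streams, mapping the exception sources described above.
def finishWriting(
    threadpool: ScheduledExecutorService,
    inputStream: InputStream,
    outputStream: OutputStream,
    writeChunk: Boolean => Unit): Unit = {
  try {
    threadpool.shutdown()
    threadpool.awaitTermination(30, TimeUnit.SECONDS) // may throw InterruptedException
    writeChunk(true)    // write whatever is left
    inputStream.close() // close() may throw IOException
    outputStream.close()
  } catch {
    case _: InterruptedException => Thread.currentThread().interrupt()
    case e: IOException =>
      System.err.println(s"Exception while closing profiler output streams: $e")
  }
}
```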
Thank you all. Especially, @mridulm for many ideas and advice on this PR review.
I hope we can merge this as a part of Apache Spark 4.0.0.
+1, LGTM. (Pending CIs)
Merged to master for Apache Spark 4.0.0.

Thank you @dongjoon-hyun @mridulm
…c-profiler

### What changes were proposed in this pull request?
Introduce JVM profiling `JVMProfier` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?
[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn't rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile the Java code paths, and Linux's perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations. The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It supports turning profiling on/off and includes the jar/binaries needed for profiling. Backport [[SPARK-46094] Support Executor JVM Profiling](apache/spark#44021).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Shuang <[email protected]>
… `jvm-profiler` modules

### What changes were proposed in this pull request?
This PR aims to fix `dev/scalastyle` to check the `hadoop-cloud` and `jvm-profiler` modules. Also, the detected scalastyle issues are fixed.

### Why are the changes needed?
To prevent future scalastyle issues. A Scala style violation was introduced here, but we missed it because we didn't check all optional modules.
- #46022

The `jvm-profiler` module was added newly at Apache Spark 4.0.0, but we missed adding it to `dev/scalastyle`. Note that there were no scala style issues in that module at that time.
- #44021

The `hadoop-cloud` module was added at Apache Spark 2.3.0.
- #17834

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs with the newly revised `dev/scalastyle`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46376 from dongjoon-hyun/SPARK-48127.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…on DFS

### What changes were proposed in this pull request?
This PR canonicalizes the profiling result files that the JVM profiler added in SPARK-46094 writes to DFS as

```
dfsDir/{{APP_ID}}/profile-exec-{{EXECUTOR_ID}}.jfr
```

which largely follows the event logs file name pattern and layout.

### Why are the changes needed?
According to #44021 (comment), we can integrate the profiling results with the Spark UI (both live and history) in the future, so it's good to follow the event logs file name pattern and layout as much as possible.

### Does this PR introduce _any_ user-facing change?
No, it's an unreleased feature.

### How was this patch tested?

```
$ bin/spark-submit run-example \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.plugins=org.apache.spark.executor.profiler.ExecutorProfilerPlugin \
  --conf spark.executor.profiling.enabled=true \
  --conf spark.executor.profiling.dfsDir=hdfs:///spark-profiling \
  --conf spark.executor.profiling.fraction=1 \
  SparkPi 100000
```

```
hadoopspark-dev1:~/spark$ hadoop fs -ls /spark-profiling/
Found 1 items
drwxrwx--- - hadoop supergroup 0 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1
```

```
hadoopspark-dev1:~/spark$ hadoop fs -ls /spark-profiling/application_1736320707252_0023_1
Found 48 items
-rw-rw---- 3 hadoop supergroup 5255028 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-1.jfr
-rw-rw---- 3 hadoop supergroup 3840775 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-10.jfr
-rw-rw---- 3 hadoop supergroup 3889002 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-11.jfr
-rw-rw---- 3 hadoop supergroup 3570697 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-12.jfr
...
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49440 from pan3793/SPARK-50783.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
| .createWithDefault(false)
|
| private[profiler] val EXECUTOR_PROFILING_DFS_DIR =
| ConfigBuilder("spark.executor.profiling.dfsDir")
@dongjoon-hyun, I'm also investigating adding profiling support for the driver. Should I fork all the configurations, or simply treat the driver as a special executor and reuse those configurations?
Well, @pan3793, could you use a new JIRA for your new goal, please?
Although I understand why you added a comment here, it doesn't look like a good practice to me. Why would we add a comment about a driver-related suggestion to an unrelated, executor-focused PR like Spark Executor JVM Profiling?

What changes were proposed in this pull request?

This adds support for the async profiler to Spark.

Why are the changes needed?

Profiling of JVM applications on a cluster is cumbersome, and it can be complicated to save the profiler's output, especially if the cluster is on K8s, where the executor pods are removed and any files saved to the local file system become inaccessible. This feature makes it simple to turn profiling on/off, includes the jar/binaries needed for profiling, and makes it simple to save output to an HDFS location.

Does this PR introduce any user-facing change?

This PR introduces three new configuration parameters. These are described in the documentation.

How was this patch tested?

Tested manually on EKS.

Was this patch authored or co-authored using generative AI tooling?

No