17 changes: 17 additions & 0 deletions core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -17,6 +17,7 @@

package org.apache.spark.internal

import java.io.File
import java.util.Locale
import java.util.concurrent.TimeUnit

@@ -64,8 +65,16 @@ package object config {
.stringConf
.createOptional

private[spark] val DRIVER_DEFAULT_EXTRA_CLASS_PATH =
ConfigBuilder(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH)
.internal()
.version("4.0.0")
.stringConf
.createWithDefault(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE)

private[spark] val DRIVER_CLASS_PATH =
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_CLASSPATH)
.withPrepended(DRIVER_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
.version("1.0.0")
.stringConf
.createOptional
@@ -254,8 +263,16 @@ package object config {
private[spark] val EXECUTOR_ID =
ConfigBuilder("spark.executor.id").version("1.2.0").stringConf.createOptional

private[spark] val EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
ConfigBuilder(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH)
.internal()
.version("4.0.0")
.stringConf
.createWithDefault(SparkLauncher.EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE)

private[spark] val EXECUTOR_CLASS_PATH =
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
.withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
.version("1.0.0")
.stringConf
.createOptional
6 changes: 6 additions & 0 deletions dev/make-distribution.sh
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
# Copy jars
cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"

# Only create the hive-jackson directory if the Hive Jackson jars exist.
for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
mkdir -p "$DISTDIR"/hive-jackson
mv $f "$DISTDIR"/hive-jackson/
done
Comment on lines +192 to +196
Member:
Btw, what's the benefit of having a separate class path for the hive-jackson jars?

Member Author (@dongjoon-hyun, Feb 24, 2024):
There are 5 main benefits, similar to the yarn directory, @viirya.

  1. The following Apache Spark daemons (which use bin/spark-daemon.sh start) will ignore the hive-jackson directory:
    • Spark Master
    • Spark Worker
    • Spark History Server
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
  2. Recoverability: As-is, Spark 3 users can achieve the same goal by deleting those two files from Spark's jars directory manually. However, it's difficult to recover the deleted files when a production job fails due to a Hive UDF. This PR provides a more robust and safer way via a configuration.

  3. Communication: We (and the security team) can easily communicate that hive-jackson is not used, like the yarn directory, because it's physically split out of the distribution. Also, they can delete the directory easily (if they need to) without knowing the details of the dependency list inside it.

  4. Robustness: If Apache Spark keeps everything in jars, it's difficult to prevent those jars from loading. Of course, we could choose a tricky way to filter them out of the class-file lists via a naming pattern, but that is less robust in the long term.

  5. Compatibility with hive-jackson-provided: On top of the existing hive-jackson-provided profile, this PR provides a cleaner injection point for the provided dependencies. For example, custom-built Jackson dependencies can be placed in hive-jackson (after users create it) instead of jars. We are very reluctant to have someone put custom jar files directly into Apache Spark's jars directory; the hive-jackson directory is a more recommendable place than copying into Spark's jars directory.


# Only create the yarn directory if the yarn artifacts were built.
if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar ]; then
mkdir "$DISTDIR/yarn"
launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
@@ -271,6 +271,8 @@ Map<String, String> getEffectiveConfig() throws IOException {
Properties p = loadPropertiesFile();
p.stringPropertyNames().forEach(key ->
effectiveConfig.computeIfAbsent(key, p::getProperty));
effectiveConfig.putIfAbsent(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH,
SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE);
Comment on lines +274 to +275
Member:
Non-SparkLauncher-triggered classes (e.g., the Spark daemons you mentioned) are not affected, right?

Member:
And, same question: why is only the driver default extra class path specified here?

Member Author:
Yes, the following spark-daemon.sh start usages are not affected, @viirya.

$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \

This is the command built on the driver side. The executors are supposed to use spark.executor.extraClassPath and spark.executor.defaultExtraClassPath.

> And, same question: why is only the driver default extra class path specified here?

Member:
Hmm, why do we need to deal with SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH specially here? I think Spark configs should be handled together above:

p.stringPropertyNames().forEach(key ->
         effectiveConfig.computeIfAbsent(key, p::getProperty));

Member Author (@dongjoon-hyun, Feb 24, 2024):
This is required because, when the config is loaded from the file, p.stringPropertyNames() doesn't contain SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, so the .forEach is never invoked for that key.

Properties p = loadPropertiesFile();
      p.stringPropertyNames().
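To spell this out, here is a minimal, self-contained sketch (plain Java, not the launcher source; the property values are made up) of why the forEach loop alone can never supply the default:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class DefaultClassPathSketch {
  public static void main(String[] args) {
    // Pretend this came from loadPropertiesFile(); the user never set the default key.
    Properties p = new Properties();
    p.setProperty("spark.driver.extraClassPath", "/opt/udfs/*");

    Map<String, String> effectiveConfig = new HashMap<>();
    // Only keys present in the file are iterated, so the default key is never visited here.
    p.stringPropertyNames().forEach(key ->
        effectiveConfig.computeIfAbsent(key, p::getProperty));

    // Without this line, "spark.driver.defaultExtraClassPath" would be absent (null on get()).
    effectiveConfig.putIfAbsent("spark.driver.defaultExtraClassPath", "hive-jackson/*");

    System.out.println(effectiveConfig.get("spark.driver.defaultExtraClassPath")); // hive-jackson/*
  }
}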

Member Author:
Yes, here.

When getEffectiveConfig is used like this in SparkSubmitCommandBuilder.java, config is a Java Map.
So, String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH); becomes null if we don't put the default there.

    Map<String, String> config = getEffectiveConfig();
    boolean isClientMode = isClientMode(config);
    String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
    String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
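As a hedged aside (plain Java, not this PR's code; the path is hypothetical): if the default key were missing from the effective config, the concatenation added in buildSparkSubmitCommand would silently append the literal string "null":

import java.io.File;

public class NullClassPathSketch {
  public static void main(String[] args) {
    String extraClassPath = "/opt/udfs/*";   // hypothetical user-set spark.driver.extraClassPath
    String defaultExtraClassPath = null;     // what config.get(...) returns when the key is absent

    if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
      extraClassPath = defaultExtraClassPath;
    } else {
      extraClassPath += File.pathSeparator + defaultExtraClassPath;
    }
    // Java string concatenation turns null into the literal "null".
    System.out.println(extraClassPath);      // "/opt/udfs/*:null" on Unix-like systems
  }
}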

Member:
I see. However, this is only used by SparkSubmitCommandBuilder, not SparkClassCommandBuilder, so I wonder if we should put it directly in SparkSubmitCommandBuilder.

Member Author (@dongjoon-hyun, Feb 24, 2024):
getEffectiveConfig is shared by other classes, not only SparkSubmitCommandBuilder. That's why I decided to fix it in getEffectiveConfig.

$ git grep getEffectiveConfig
launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java:  Map<String, String> getEffectiveConfig() throws IOException {
launcher/src/main/java/org/apache/spark/launcher/InProcessLauncher.java:    if (builder.isClientMode(builder.getEffectiveConfig())) {
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java:    return builder.getEffectiveConfig().get(CHILD_PROCESS_LOGGER_NAME);
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:    Map<String, String> config = getEffectiveConfig();
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java:      getEffectiveConfig().get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH));

Member:
Hmm, but I see you only changed SparkSubmitCommandBuilder to explicitly pick it up. If SparkLauncher.DRIVER_EXTRA_CLASSPATH is not set at all, these other classes won't use SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH either.

Member Author:
Let me double-check InProcessLauncher and SparkLauncher once more. Last time I checked it was okay, but your concern rings a bell.

}
return effectiveConfig;
}
launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
@@ -54,6 +54,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {

/** Configuration key for the driver memory. */
public static final String DRIVER_MEMORY = "spark.driver.memory";
/** Configuration key for the driver default extra class path. */
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH =
"spark.driver.defaultExtraClassPath";
public static final String DRIVER_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
Comment on lines +57 to +60
Member:
So even users who use distributions without Hive will also have this default class path value in the extra class path, right?

Member Author:
Yes. In that case, the configuration simply points to a non-existent directory. As you can see in make-distribution.sh and the PR description, the hive-jackson directory is created only when jackson-*-asl-*.jar files exist.

/** Configuration key for the driver class path. */
public static final String DRIVER_EXTRA_CLASSPATH = "spark.driver.extraClassPath";
/** Configuration key for the default driver VM options. */
@@ -65,6 +69,10 @@ public class SparkLauncher extends AbstractLauncher<SparkLauncher> {

/** Configuration key for the executor memory. */
public static final String EXECUTOR_MEMORY = "spark.executor.memory";
/** Configuration key for the executor default extra class path. */
public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH =
"spark.executor.defaultExtraClassPath";
public static final String EXECUTOR_DEFAULT_EXTRA_CLASS_PATH_VALUE = "hive-jackson/*";
/** Configuration key for the executor class path. */
public static final String EXECUTOR_EXTRA_CLASSPATH = "spark.executor.extraClassPath";
/** Configuration key for the default executor VM options. */
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java
@@ -267,6 +267,12 @@ private List<String> buildSparkSubmitCommand(Map<String, String> env)
Map<String, String> config = getEffectiveConfig();
boolean isClientMode = isClientMode(config);
String extraClassPath = isClientMode ? config.get(SparkLauncher.DRIVER_EXTRA_CLASSPATH) : null;
String defaultExtraClassPath = config.get(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH);
if (extraClassPath == null || extraClassPath.trim().isEmpty()) {
extraClassPath = defaultExtraClassPath;
} else {
extraClassPath += File.pathSeparator + defaultExtraClassPath;
}

List<String> cmd = buildJavaCommand(extraClassPath);
// Take Thrift/Connect Server as daemon
@@ -498,6 +504,8 @@ protected boolean handle(String opt, String value) {
case DRIVER_MEMORY -> conf.put(SparkLauncher.DRIVER_MEMORY, value);
case DRIVER_JAVA_OPTIONS -> conf.put(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, value);
case DRIVER_LIBRARY_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH, value);
case DRIVER_DEFAULT_CLASS_PATH ->
conf.put(SparkLauncher.DRIVER_DEFAULT_EXTRA_CLASS_PATH, value);
Comment on lines +507 to +508
Member:
Hmm, why can spark-submit only specify the driver default extra class path? Can't it specify the executor default extra class path?

Member Author:
No~ This is --driver-class-path, which is a special case because the driver JVM has already started.

We already have the more general spark.driver.extraClassPath and spark.executor.extraClassPath for the driver and executors. This PR extends them with the following:

spark.driver.defaultExtraClassPath
spark.executor.defaultExtraClassPath
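For illustration only (a sketch, not part of this PR; the app resource and main class are hypothetical), applications that use the launcher API could opt out of the defaults with setConf, assuming the config names added here:

import org.apache.spark.launcher.SparkLauncher;

public class LauncherOptOutSketch {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
        .setAppResource("/path/to/app.jar")
        .setMainClass("com.example.MyApp")
        .setConf("spark.driver.defaultExtraClassPath", "")    // drop hive-jackson/* for the driver
        .setConf("spark.executor.defaultExtraClassPath", "")  // ...and for the executors
        .launch();
    spark.waitFor();
  }
}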

Member Author:
BTW, is your concern that the following is insufficient?

  private[spark] val EXECUTOR_CLASS_PATH =
    ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH)
      .withPrepended(EXECUTOR_DEFAULT_EXTRA_CLASS_PATH.key, File.pathSeparator)
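A rough sketch (plain Java, for illustration only; the user jar path is hypothetical) of the composition .withPrepended is expected to produce for the executor class path:

import java.io.File;

public class PrependSketch {
  public static void main(String[] args) {
    String defaultCp = "hive-jackson/*";      // spark.executor.defaultExtraClassPath (default value)
    String userCp = "/opt/libs/my-udf.jar";   // spark.executor.extraClassPath set by the user

    // The default is prepended to the user-provided value, joined by the platform path separator.
    String effective = (userCp == null || userCp.isEmpty())
        ? defaultCp
        : defaultCp + File.pathSeparator + userCp;

    System.out.println(effective);            // hive-jackson/*:/opt/libs/my-udf.jar on Unix-like systems
  }
}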

Member Author:
Let me double-check it.

Member Author:
To @viirya,

  • In general, bin/spark-submit ... -c spark.driver.defaultExtraClassPath="" -c spark.executor.defaultExtraClassPath="" is the way to launch without the hive-jackson dependency, because hive-jackson/* is provided by those configurations. This always works for drivers in cluster-mode submissions and for executors in both client- and cluster-mode submissions.
  • Apache Spark already provides --driver-class-path, which is used only for the cases where we cannot use spark.driver.extraClassPath. In those cases, spark.driver.defaultExtraClassPath is also not applicable. So, this PR adds --driver-default-class-path "" in the same way.

Member Author:
This is identical to how Apache Spark's existing spark.*.extraClassPath options work.

Member:
Yeah, -c spark.driver.defaultExtraClassPath="" should be the same as --driver-default-class-path "", so I wonder why we need it specifically? And also, in what case can we not specify it through -c spark.driver.defaultExtraClassPath?

It's probably okay since we already have --driver-class-path and this addition follows it. Just a question, in case you know the exact reason.

Member Author:
--driver-class-path is required because Spark's driver JVM is started before Spark's property files or config are loaded.

Member Author:
We cannot create the SparkConf class if we cannot load Spark's jar files. This is a chicken-and-egg situation. To solve it, we blindly load jars/*. So, if we have other jars, we need to use --driver-class-path.

case DRIVER_CLASS_PATH -> conf.put(SparkLauncher.DRIVER_EXTRA_CLASSPATH, value);
case CONF -> {
checkArgument(value != null, "Missing argument to %s", CONF);
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitOptionParser.java
@@ -40,6 +40,7 @@ class SparkSubmitOptionParser {
protected final String CONF = "--conf";
protected final String DEPLOY_MODE = "--deploy-mode";
protected final String DRIVER_CLASS_PATH = "--driver-class-path";
protected final String DRIVER_DEFAULT_CLASS_PATH = "--driver-default-class-path";
protected final String DRIVER_CORES = "--driver-cores";
protected final String DRIVER_JAVA_OPTIONS = "--driver-java-options";
protected final String DRIVER_LIBRARY_PATH = "--driver-library-path";
@@ -94,6 +95,7 @@ class SparkSubmitOptionParser {
{ DEPLOY_MODE },
{ DRIVER_CLASS_PATH },
{ DRIVER_CORES },
{ DRIVER_DEFAULT_CLASS_PATH },
{ DRIVER_JAVA_OPTIONS },
{ DRIVER_LIBRARY_PATH },
{ DRIVER_MEMORY },