Conversation

LuciferYang commented Sep 6, 2024

What changes were proposed in this pull request?

This pull request changes the default value of the ivySettings parameter in the IvyTestUtils#withRepository function. During construction of the IvySettings object, the DefaultIvyUserDir and DefaultCache of the instance are now adjusted through an additional call to the MavenUtils.processIvyPathArg function:

  1. The DefaultIvyUserDir is set to ${user.home}/.ivy2.5.2.
  2. The DefaultCache is set to the cache directory under the modified IvyUserDir, i.e. ${user.home}/.ivy2.5.2/cache (without this change, it defaults to ${user.home}/.ivy2/cache).

These changes address a bad case in the test process.

Additionally, to allow IvyTestUtils to invoke the MavenUtils.processIvyPathArg function, the visibility of the processIvyPathArg function has been adjusted from private to private[util].
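
For reference, here is a minimal sketch of the resulting behavior. The body of processIvyPathArg is inferred from this description rather than copied from the Spark source, and it assumes Apache Ivy 2.5.2 on the classpath:

```scala
import java.io.File
import org.apache.ivy.core.settings.IvySettings

object IvySettingsDefaultSketch {
  // Inferred from the description above (not the verbatim Spark source):
  // what MavenUtils.processIvyPathArg does when no explicit Ivy path is given.
  def processIvyPathArg(ivySettings: IvySettings, ivyPath: Option[String]): Unit = {
    val ivyUserDir = ivyPath.filterNot(_.trim.isEmpty).getOrElse(
      System.getProperty("user.home") + File.separator + ".ivy2.5.2")
    // DefaultIvyUserDir becomes ${user.home}/.ivy2.5.2 ...
    ivySettings.setDefaultIvyUserDir(new File(ivyUserDir))
    // ... and DefaultCache becomes the cache directory under it.
    ivySettings.setDefaultCache(new File(ivyUserDir, "cache"))
  }

  // The new default value of withRepository's ivySettings parameter is then
  // equivalent to:
  def defaultTestIvySettings(): IvySettings = {
    val settings = new IvySettings
    processIvyPathArg(settings, ivyPath = None)
    settings
  }
}
```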

Why are the changes needed?

This fixes a bad case in the tests; the reproduction steps are as follows:

  1. Clean up files and directories related to mylib-0.1.jar under ~/.ivy2.5.2
  2. Execute the following tests using Java 21:

```
java -version
openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```

```
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/ added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/apache/ivy/ivy/2.5.2/ivy-2.5.2.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/yangjie01/.ivy2.5.2/cache
The jars for the packages stored in: /Users/yangjie01/.ivy2.5.2/jars
my.great.lib#mylib added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4;1.0
	confs: [default]
	found my.great.lib#mylib;0.1 in repo-1
downloading file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/my/great/lib/mylib/0.1/mylib-0.1.jar ...
	[SUCCESSFUL ] my.great.lib#mylib;0.1!mylib.jar (1ms)
:: resolution report :: resolve 4325ms :: artifacts dl 2ms
	:: modules in use:
	my.great.lib#mylib;0.1 from repo-1 in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4
	confs: [default]
	1 artifacts copied, 0 already retrieved (0kB/6ms)
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
[info] - External JAR (6 seconds, 288 milliseconds)
...
[info] Run completed in 40 seconds, 441 milliseconds.
[info] Total number of tests run: 26
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

  3. Re-execute the above tests using Java 17:

```
java -version
openjdk version "17.0.12" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu17.52+17-CA (build 17.0.12+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.52+17-CA (build 17.0.12+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```

```
[info] - External JAR *** FAILED *** (1 second, 626 milliseconds)
[info]   isContain was false Ammonite output did not contain 'Array[Int] = Array(1, 2, 3, 4, 5)':
[info]   scala>  

[info]   scala> // this import will fail 

[info]   scala> import my.great.lib.MyLib 

[info]   scala>  

[info]   scala> // making library available in the REPL to compile UDF 

[info]   scala> import coursierapi.{Credentials, MavenRepository} 
import coursierapi.{Credentials, MavenRepository}
[info]   
[info]   scala> interp.repositories() ++= Seq(MavenRepository.of("file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/")) 

[info]   
[info]   scala> import $ivy.`my.great.lib:mylib:0.1` 
import $ivy.$
[info]   
[info]   scala>  

[info]   scala> val func = udf((a: Int) => {
[info]            import my.great.lib.MyLib
[info]            MyLib.myFunc(a)
[info]          }) 
func: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction(
[info]     f = ammonite.$sess.cmd28$Helper$$Lambda$3059/0x0000000801da4218@721b2487,
[info]     dataType = IntegerType,
[info]     inputEncoders = ArraySeq(Some(value = PrimitiveIntEncoder)),
[info]     outputEncoder = Some(value = BoxedIntEncoder),
[info]     givenName = None,
[info]     nullable = true,
[info]     deterministic = true
[info]   )
[info]   
[info]   scala>  

[info]   scala> // add library to the Executor 

[info]   scala> spark.addArtifact("ivy://my.great.lib:mylib:0.1?repos=file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/") 

[info]   
[info]   scala>  

[info]   scala> spark.range(5).select(func(col("id"))).as[Int].collect() 

[info]   scala>  

[info]   scala> semaphore.release() 

[info]   Error Output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info]   Compiling /Users/yangjie01/SourceCode/git/spark-sbt/connector/connect/client/jvm/(console)
[info]   cmd25.sc:1: not found: value my
[info]   import my.great.lib.MyLib
[info]          ^
[info]   Compilation Failed
[info]   org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] User defined function (` (cmd28$Helper$$Lambda$3054/0x0000007002189800)`: (int) => int) failed due to: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0. SQLSTATE: 39000
[info]     org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:195)
[info]     org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
[info]     org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:114)
[info]     org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]     org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info]     org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info]     scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info]     scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info]     scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info]     scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info]     scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info]     scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info]     scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info]     org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info]     org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info]     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]     org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info]     org.apache.spark.scheduler.Task.run(Task.scala:146)
[info]     org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]     org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info]     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info]     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]     java.lang.Thread.run(Thread.java:840)
[info]   org.apache.spark.SparkException: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0
[info]     java.lang.ClassLoader.defineClass1(Native Method)
[info]     java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
[info]     java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
[info]     java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
[info]     java.net.URLClassLoader$1.run(URLClassLoader.java:427)
[info]     java.net.URLClassLoader$1.run(URLClassLoader.java:421)
[info]     java.security.AccessController.doPrivileged(AccessController.java:712)
[info]     java.net.URLClassLoader.findClass(URLClassLoader.java:420)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info]     org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:55)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:579)
[info]     org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info]     org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info]     ammonite.$sess.cmd28$Helper.$anonfun$func$1(cmd28.sc:3)
[info]     ammonite.$sess.cmd28$Helper.$anonfun$func$1$adapted(cmd28.sc:1)
[info]     org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:112)
[info]     org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]     org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info]     org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info]     scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info]     scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info]     scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info]     scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info]     scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info]     scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info]     scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info]     org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info]     org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info]     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]     org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info]     org.apache.spark.scheduler.Task.run(Task.scala:146)
[info]     org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]     org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info]     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info]     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]     java.lang.Thread.run(Thread.java:840) (ReplE2ESuite.scala:117)
```

I suspect the causes of the aforementioned bad case are as follows:

  1. Following #45075 ([SPARK-44914][BUILD] Upgrade Apache Ivy to 2.5.2), Spark 4.0 adopted ~/.ivy2.5.2 as the default Ivy user directory to address compatibility issues. When the tests are executed with Java 21, the compiled mylib-0.1.jar is published to ~/.ivy2.5.2/cache/my.great.lib/mylib/jars.

  2. However, the getDefaultCache method of a freshly constructed IvySettings instance still returns ~/.ivy2/cache. Consequently, when the purgeLocalIvyCache function is called from withRepository, it cleans the artifact and dependency directories under ~/.ivy2/cache and therefore fails to remove the mylib-0.1.jar that Java 21 published to ~/.ivy2.5.2/cache/my.great.lib/mylib/jars. When the tests are then re-executed with Java 17 and attempt to load this Java 21-compiled mylib-0.1.jar, they fail. The relevant snippets from IvyTestUtils.scala are shown below.

```scala
private[spark] def withRepository(
    artifact: MavenCoordinate,
    dependencies: Option[String],
    rootDir: Option[File],
    useIvyLayout: Boolean = false,
    withPython: Boolean = false,
    withR: Boolean = false,
    ivySettings: IvySettings = new IvySettings)(f: String => Unit): Unit = {
  val deps = dependencies.map(MavenUtils.extractMavenCoordinates)
  purgeLocalIvyCache(artifact, deps, ivySettings)
  val repo = createLocalRepositoryForTests(artifact, dependencies, rootDir, useIvyLayout,
  // ... (truncated)
```

```scala
/** Deletes the test packages from the ivy cache */
private def purgeLocalIvyCache(
    artifact: MavenCoordinate,
    dependencies: Option[Seq[MavenCoordinate]],
    ivySettings: IvySettings): Unit = {
  // delete the artifact from the cache as well if it already exists
  FileUtils.deleteDirectory(new File(ivySettings.getDefaultCache, artifact.groupId))
  dependencies.foreach { _.foreach { dep =>
      FileUtils.deleteDirectory(new File(ivySettings.getDefaultCache, dep.groupId))
    }
  }
}
```
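
The mismatch can be seen with a minimal check (a sketch, assuming Apache Ivy 2.5.2 on the classpath; the printed path is Ivy's stock default, not Spark's):

```scala
import org.apache.ivy.core.settings.IvySettings

object DefaultCacheCheck {
  def main(args: Array[String]): Unit = {
    // A freshly constructed IvySettings knows nothing about Spark's
    // ~/.ivy2.5.2 convention, so it falls back to Ivy's own defaults.
    val settings = new IvySettings
    // Typically prints ${user.home}/.ivy2/cache, while the test artifact was
    // published under ${user.home}/.ivy2.5.2/cache, so the purge misses it.
    println(settings.getDefaultCache)
  }
}
```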

To address this issue, the pull request modifies the default configuration of the IvySettings instance so that purgeLocalIvyCache properly cleans up the corresponding cache files under ~/.ivy2.5.2/cache.

Does this PR introduce any user-facing change?

No, this is a test-only change.

How was this patch tested?

  1. Pass GitHub Actions
  2. Manually executing the tests described above now succeeds, and the ~/.ivy2.5.2/cache/my.great.lib directory is confirmed to be cleaned up (a quick manual check is sketched below).
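
A minimal sketch of that manual check (the path and group ID follow the reproduction above; the object name is illustrative):

```scala
import java.io.File

object CacheCleanupCheck {
  def main(args: Array[String]): Unit = {
    // After the suite runs, the purged group directory should be gone.
    val dir = new File(
      System.getProperty("user.home"), ".ivy2.5.2/cache/my.great.lib")
    println(s"$dir exists: ${dir.exists()}") // expected: false
  }
}
```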

Was this patch authored or co-authored using generative AI tooling?

NO


LuciferYang commented Sep 6, 2024

Testing first; will update the PR description later.

LuciferYang marked this pull request as draft September 6, 2024 02:45
dongjoon-hyun left a comment


Thank you. Looks valid to me. Shall we file a JIRA and convert to a normal PR, @LuciferYang ?

LuciferYang changed the title from "[CORE][TESTS] Change default ivySettings in the IvyTestUtis#withRepository function to use .ivy2.5.2 as the Default Ivy User Dir" to "[SPARK-49533][CORE][TESTS] Change default ivySettings in the IvyTestUtis#withRepository function to use .ivy2.5.2 as the Default Ivy User Dir" Sep 6, 2024
LuciferYang marked this pull request as ready for review September 6, 2024 08:21
@LuciferYang

> Thank you. Looks valid to me. Shall we file a JIRA and convert to a normal PR, @LuciferYang ?

done

dongjoon-hyun left a comment


+1, LGTM. Thank you, @LuciferYang .
Merged to master.

@LuciferYang

Thanks @dongjoon-hyun

IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
LuciferYang deleted the IvyTestUtils-withRepository branch May 2, 2025 05:25