
Conversation

@srowen (Member) commented Jul 1, 2020

### What changes were proposed in this pull request?

The purpose of this PR is to partly resolve SPARK-29292, and fully resolve SPARK-30010, which should allow Spark to compile against Scala 2.13 from Spark Core up through GraphX (not SQL, Streaming, etc.).

Note that we are not trying to determine here whether this makes Spark work on 2.13 yet, just whether it compiles, as a prerequisite for assessing test outcomes. However, we do of course need to ensure that the change does not break 2.12.

The changes are, in the main, adding .toSeq and .toMap calls where mutable collections / maps are returned as Seq / Map, which are immutable by default in Scala 2.13. These calls should be no-ops for Scala 2.12 (where they return the collection itself) but are required for 2.13.
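
As an illustration of the pattern (a hypothetical method, not taken from the diff):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical example of the pattern fixed throughout this PR:
def workerIds(): Seq[String] = {
  val buf = ArrayBuffer("w1", "w2")
  // In 2.12, scala.Seq aliases scala.collection.Seq, so a mutable buffer
  // conforms directly, and .toSeq returns the buffer itself (a no-op).
  // In 2.13, scala.Seq aliases scala.collection.immutable.Seq, so the
  // explicit conversion is required (and copies to an immutable Seq).
  buf.toSeq
}
```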

There are a few non-trivial changes, highlighted below.
In particular, to get Core to compile, we need to resolve SPARK-30010, which removes a deprecated SparkConf method.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

Yes: the deprecated SparkConf.setAll overload taking Traversable is removed, as it is no longer legal in Scala 2.13 (where Traversable is just a deprecated alias for Iterable, so the two overloads would collide).

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

@srowen srowen changed the title [SPARK-29292][SPARK-30010][CORE] Let core compile for Scala 2.13 [WIP][SPARK-29292][SPARK-30010][CORE] Let core compile for Scala 2.13 Jul 1, 2020

@srowen (Member, Author) commented Jul 1, 2020

Jenkins retest this please


@srowen (Member, Author) commented Jul 1, 2020

@shaneknapp I think we have the same corrupted .m2 repository issue again. Do I just keep retesting until it doesn't hit the worker?

@srowen (Member, Author) commented Jul 1, 2020

Jenkins retest this please


@srowen (Member, Author) commented Jul 1, 2020

Jenkins retest this please


@dongjoon-hyun (Member)

Thank you so much, @srowen !


```diff
 def stateValid(): Boolean = {
-  (workers.map(_.ip) -- liveWorkerIPs).isEmpty &&
+  workers.map(_.ip).forall(liveWorkerIPs.contains) &&
```
Contributor comment:

Nit: What about using diff here?
As far as I can see, diff is not deprecated: https://www.scala-lang.org/api/current/scala/collection/Seq.html#diff[B%3E:A](that:scala.collection.Seq[B]):C

Suggested change:
```diff
-  workers.map(_.ip).forall(liveWorkerIPs.contains) &&
+  workers.map(_.ip).diff(liveWorkerIPs).isEmpty &&
```

@srowen (Member, Author):

diff would work too, I think. It has multiset semantics, which I didn't think were necessary here. I went for what I thought was simpler, but I am not 100% sure.
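
For context, a quick illustration of the multiset point (assumed values, not from the PR):

```scala
// diff removes elements by multiplicity (multiset semantics):
Seq(1, 1, 2).diff(Seq(1, 2)).isEmpty     // false: the second 1 remains
// forall + contains only tests membership, so duplicates are irrelevant:
Seq(1, 1, 2).forall(Seq(1, 2).contains)  // true
// For a plain subset-of-IPs check, the two agree.
```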

```diff
 } else {
-  new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+  new Range.Inclusive(r.start + start * r.step, r.start + end * r.step - 1, r.step)
```

Contributor comment:

What about Range.Exclusive?

Suggested change:
```diff
-  new Range.Inclusive(r.start + start * r.step, r.start + end * r.step - 1, r.step)
+  new Range.Exclusive(r.start + start * r.step, r.start + end * r.step, r.step)
```

@srowen (Member, Author):

Range.Exclusive doesn't exist in 2.12, and Range() (exclusive in 2.12) doesn't exist in 2.13. :( I tried that initially. Because these are integers, I think we can get away with an Inclusive range that ends at end - 1 instead.

Contributor reply:

I see. In this case this is totally fine.

Contributor comment:

One more question: what about using until and by?

@srowen (Member, Author):

It's probably equivalent, yeah. I was shooting for minimal change here, but there may be several solutions.
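
For reference, the `until`/`by` spelling would look roughly like this (reusing `r`, `start`, and `end` from the diff above):

```scala
// Exclusive range via until/by; for a nonzero integer step this is
// equivalent to the Inclusive form that ends one r.step earlier:
(r.start + start * r.step) until (r.start + end * r.step) by r.step
```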

@srowen (Member, Author) commented Jul 2, 2020

Jenkins test this please


@SparkQA commented Jul 2, 2020

Test build #124915 has finished for PR 28971 at commit ab62b32.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member, Author) commented Jul 2, 2020

Jenkins retest this please

@SparkQA commented Jul 2, 2020

Test build #124916 has finished for PR 28971 at commit ab62b32.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Jul 9, 2020

Test build #125479 has started for PR 28971 at commit 89d19c6.

@srowen (Member, Author) commented Jul 9, 2020

Jenkins retest this please

@SparkQA commented Jul 9, 2020

Test build #125499 has started for PR 28971 at commit 89d19c6.

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Jul 11, 2020

Test build #125648 has finished for PR 28971 at commit 89d19c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member, Author) commented Jul 11, 2020

Jenkins retest this please

@SparkQA commented Jul 11, 2020

Test build #125682 has finished for PR 28971 at commit 89d19c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen changed the title [WIP][SPARK-29292][SPARK-30010][CORE] Let core compile for Scala 2.13 [SPARK-29292][SPARK-30010][CORE] Let core compile for Scala 2.13 Jul 11, 2020

```diff
 } else {
-  new Range(r.start + start * r.step, r.start + end * r.step, r.step)
+  new Range.Inclusive(r.start + start * r.step, r.start + (end - 1) * r.step, r.step)
```

@srowen (Member, Author):

For previous reviewers: I fixed a bug from my initial change here. The inclusive end is not 1 less than the exclusive end, but one r.step less.
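
A concrete check with assumed values (a negative step is where the two forms differ):

```scala
val r = 10 until 0 by -2   // 10, 8, 6, 4, 2
val (start, end) = (1, 4)  // want the slice at indices 1 until 4: 8, 6, 4
// Initial version: subtract 1 from the exclusive end *value*:
new Range.Inclusive(r.start + start * r.step, r.start + end * r.step - 1, r.step)
// = Range.Inclusive(8, 1, -2) = 8, 6, 4, 2   <- one element too many
// Fixed version: step back one whole r.step:
new Range.Inclusive(r.start + start * r.step, r.start + (end - 1) * r.step, r.step)
// = Range.Inclusive(8, 4, -2) = 8, 6, 4
```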

```diff
     resourceProfileManager: ResourceProfileManager)
   extends CoarseGrainedSchedulerBackend(scheduler, rpcEnv) {
 
+  def this() = this(null, null, null, null)
```

@srowen (Member, Author):

I still have no idea how this wasn't required in Scala 2.12: the class is used with a no-arg constructor, but none existed?!


test("top with predefined ordering") {
val nums = Array.range(1, 100000)
val nums = Seq.range(1, 100000)
@srowen (Member, Author):

Side comment: generally speaking, Seq types have fewer weird generic-type problems than Arrays. This is a good example.
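
For instance, building a generic Array needs a ClassTag while Seq does not (a sketch, not from the PR):

```scala
import scala.reflect.ClassTag

def fillSeq[T](n: Int, x: T): Seq[T] = Seq.fill(n)(x)          // fine
// def fillArr[T](n: Int, x: T): Array[T] = Array.fill(n)(x)   // won't compile: no ClassTag[T]
def fillArr[T: ClassTag](n: Int, x: T): Array[T] = Array.fill(n)(x)
```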

@dongjoon-hyun (Member)

Great, @srowen !

@dongjoon-hyun (Member) commented Jul 11, 2020

Could you fix the following four instances together in this PR, @srowen ?

```
[ERROR] [Error] /Users/dongjoon/PRS/SPARK-PR-28971/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:23: object parallel is not a member of package collection
[ERROR] [Error] /Users/dongjoon/PRS/SPARK-PR-28971/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:24: object parallel is not a member of package collection
[ERROR] [Error] /Users/dongjoon/PRS/SPARK-PR-28971/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:64: not found: type ForkJoinTaskSupport
[ERROR] [Error] /Users/dongjoon/PRS/SPARK-PR-28971/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:79: not found: type ParVector
```

@srowen (Member, Author) commented Jul 11, 2020

Oh, where do you see that? I can't find it in the test logs. That should be fine on 2.12, and it's also working in my local 2.13 compilation.

@dongjoon-hyun (Member)

I built with 2.13.3 like the following to verify this PR.

```
$ dev/change-scala-version.sh 2.13

$ ...
```
```diff
-    <scala.version>2.12.10</scala.version>
-    <scala.binary.version>2.12</scala.binary.version>
+    <scala.version>2.13.3</scala.version>
+    <scala.binary.version>2.13</scala.binary.version>
```
```
$ build/mvn package -DskipTests -pl core -am
```

@srowen (Member, Author) commented Jul 11, 2020

Oh, you have to build with -Pscala-2.13 too
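
That is, the full local verification would be roughly:

```
$ dev/change-scala-version.sh 2.13
$ build/mvn package -DskipTests -Pscala-2.13 -pl core -am
```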

@dongjoon-hyun (Member)

Oh. Got it. Thanks!

@srowen (Member, Author) commented Jul 11, 2020

Oh, I did miss one necessary change: the root pom has to be updated to Scala 2.13.3, yes. Let me get that part of the change in here.

@dongjoon-hyun (Member) left a review comment:

+1, LGTM. Everything works. Thank you so much, @srowen .
Merged to master for Apache Spark 3.1.0.

@srowen (Member, Author) commented Jul 11, 2020

Oh, OK. I wanted to make sure everyone was OK with the approach, but I think so, as it's been the plan for a long time AFAICT. I will start making other similar PRs (this one does not resolve SPARK-29292 by itself).

@dongjoon-hyun (Member)

Sorry for the rush. If needed, we can still switch approaches during the Apache Spark 3.1.0 timeline. I believe a healthy core module will unlock Scala 2.13 progress, serving as a baseline for the other modules and for the Scala 2.13 testing stage.

@SparkQA commented Jul 12, 2020

Test build #125693 has finished for PR 28971 at commit 8f5af5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jul 14, 2020
…piling for Scala 2.13

### What changes were proposed in this pull request?

Continuation of #28971 which lets streaming, catalyst and sql compile for 2.13. Same idea.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29078 from srowen/SPARK-29292.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

```scala
   * Set multiple parameters together
   */
  @deprecated("Use setAll(Iterable) instead", "3.0.0")
  def setAll(settings: Traversable[(String, String)]): SparkConf = {
```

Member comment:

@srowen, BTW, it might be best to file a JIRA as a reminder to bring this API back if we can't make Scala 2.13 in Spark 3.1.

I believe it is legitimate and inevitable to remove this because of Scala 2.13, but it would be problematic if we can't make it in Spark 3.1 and end up with a release that only supports Scala 2.12.

@srowen (Member, Author):

Yeah if the whole thing doesn't make it for 3.1, I'd leave this method in 3.1.
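
For users who do hit the removal, migration should be direct, since concrete collections like Seq and Map are already Iterable; a minimal sketch:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Removed overload: setAll(settings: Traversable[(String, String)])
// Its replacement (per the deprecation note, since 3.0) takes Iterable,
// which the same concrete collections satisfy:
conf.setAll(Seq("spark.app.name" -> "demo", "spark.master" -> "local[2]"))
```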

dongjoon-hyun pushed a commit that referenced this pull request Jul 15, 2020
… for Scala 2.13 compilation

### What changes were proposed in this pull request?

Same as #29078 and #28971. This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn`, etc.) compile under Scala 2.13. As a result, it does not close the JIRA. This also, of course, does not demonstrate that tests pass yet on 2.13.

Note, this does not fix the `repl` module; that's separate.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29111 from srowen/SPARK-29292.3.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jul 18, 2020
…ing modules

### What changes were proposed in this pull request?

See again the related PRs, like #28971.
This completes fixing compilation for 2.13 for all modules but `repl`, which is a separate task.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29147 from srowen/SPARK-29292.4.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@srowen srowen deleted the SPARK-29292.1 branch September 12, 2020 21:11