Commit 76a1ca5

xupefei authored and HyukjinKwon committed
[SPARK-49060][CONNECT] Clean up Mima rules for SQL-Connect binary compatibility checks
### What changes were proposed in this pull request?

This PR modifies some Mima rules which are used for checking the binary compatibility between the `sql` and `connect` modules. Major changes include:
- Removed unnecessary filters for specific `private[sql]` constructors; there is a wildcard rule which filters out all of them.
- Removed outdated filters for APIs that are already consistent.
- Added a warning about unused filters. Current output:

```bash
$ ./dev/connect-jvm-client-mima-check
Do connect-client-jvm module mima check ...
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.queryExecution") did not filter out any problems.
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.sqlContext") did not filter out any problems.
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.selectUntyped") did not filter out any problems.
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.rdd") did not filter out any problems.
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.toJavaRDD") did not filter out any problems.
Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.javaRDD") did not filter out any problems.
finish connect-client-jvm module mima check ...
connect-client-jvm module mima check passed.
```

I manually audited all rules defined in the list. One issue I found is that none of the APIs in `Dataset` are being checked at all, likely due to it having a `private[sql]` companion object in `spark-core`. Changing the object's visibility from `private[sql]` to `public` will resolve this issue. The exact reason is unknown and is to be investigated.

### Why are the changes needed?

To make sure Mima is really working.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not needed.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47487 from xupefei/mima-refactor.

Authored-by: Paddy Xu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
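For context: a Mima exclusion is just a `ProblemFilter` built from a name pattern, and a `*` wildcard in the pattern matches many members at once. A minimal sketch of the redundancy removed here, assuming Mima's documented wildcard matching (the rules shown are illustrative, not the checker's actual list):

```scala
import com.typesafe.tools.mima.core._

// One wildcard rule excludes every problem reported against a
// constructor ("this") anywhere under org.apache.spark.sql...
val wildcardRule: ProblemFilter =
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.*.this")

// ...so a narrower rule like this one can never filter anything the
// wildcard has not already filtered, and can be deleted.
val redundantRule: ProblemFilter =
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.this")
```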
1 parent 682eb1b · commit 76a1ca5

3 files changed: +31 -53 lines changed

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala

Lines changed: 3 additions & 3 deletions
```diff
@@ -824,7 +824,7 @@ class KeyValueGroupedDataset[K, V] private[sql] () extends Serializable {
    * @param outputMode
    *   The output mode of the stateful processor.
    */
-  def transformWithState[U: Encoder](
+  private[sql] def transformWithState[U: Encoder](
       statefulProcessor: StatefulProcessor[K, V, U],
       timeMode: TimeMode,
       outputMode: OutputMode): Dataset[U] = {
@@ -850,7 +850,7 @@ class KeyValueGroupedDataset[K, V] private[sql] () extends Serializable {
    * @param outputEncoder
    *   Encoder for the output type.
    */
-  def transformWithState[U: Encoder](
+  private[sql] def transformWithState[U: Encoder](
       statefulProcessor: StatefulProcessor[K, V, U],
       timeMode: TimeMode,
       outputMode: OutputMode,
@@ -879,7 +879,7 @@ class KeyValueGroupedDataset[K, V] private[sql] () extends Serializable {
    *
    * See [[Encoder]] for more details on what types are encodable to Spark SQL.
    */
-  def transformWithState[U: Encoder, S: Encoder](
+  private[sql] def transformWithState[U: Encoder, S: Encoder](
       statefulProcessor: StatefulProcessorWithInitialState[K, V, U, S],
       timeMode: TimeMode,
       outputMode: OutputMode,
```
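The effect of the `private[sql]` markers above, in a minimal sketch (the class below is illustrative, not part of Spark): a member restricted to the `org.apache.spark.sql` package leaves the public API surface, so the sql/connect compatibility check no longer needs a dedicated exclusion for it.

```scala
package org.apache.spark.sql

// Illustrative example, not a real Spark class.
class VisibilityExample {
  // Public: part of the surface compared between sql and connect.
  def publicApi(): Unit = ()

  // Visible only inside org.apache.spark.sql: not part of the public
  // surface, so the checker's wildcard rules already ignore it.
  private[sql] def internalApi(): Unit = ()
}
```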

connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala

Lines changed: 27 additions & 49 deletions
```diff
@@ -21,6 +21,8 @@ import java.nio.charset.StandardCharsets
 import java.nio.file.{Files, Paths}
 import java.util.regex.Pattern
 
+import scala.collection.mutable.{Set => MutableSet}
+
 import com.typesafe.tools.mima.core._
 import com.typesafe.tools.mima.lib.MiMaLib
 
```

```diff
@@ -207,13 +209,13 @@ object CheckConnectJvmClientCompatibility {
     ProblemFilters.exclude[MissingClassProblem](
       "org.apache.spark.sql.Dataset$" // private[sql]
     ),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.ofRows"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.DATASET_ID_TAG"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.COL_POS_KEY"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.DATASET_ID_KEY"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.curId"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.ObservationListener"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.ObservationListener$"),
+    // TODO (SPARK-49096):
+    // Mima check might complain that the following Dataset rules do not filter any problem.
+    // This is due to a potential bug in Mima where all methods in `class Dataset` are not
+    // being checked for problems, due to the presence of a private[sql] companion object.
+    // Further investigation is needed.
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.queryExecution"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.sqlContext"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.Dataset.selectUntyped"), // protected
```
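The shape of the code the TODO above refers to is roughly the following, reconstructed from the commit message as a simplified, hypothetical sketch:

```scala
// Simplified sketch of the suspected Mima issue: a public class whose
// companion object is package-private. Per the commit message, Mima
// appears to skip problem checks for all of the class's methods in
// this configuration; making the companion public restores the checks.
class Dataset[T] private[sql] () {
  def queryExecution: AnyRef = ??? // reportedly never checked by Mima
}

private[sql] object Dataset // suspected trigger of the skipped checks
```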
```diff
@@ -232,7 +234,6 @@ object CheckConnectJvmClientCompatibility {
     ProblemFilters.exclude[MissingClassProblem](
       "org.apache.spark.sql.RelationalGroupedDataset$*" // private[sql]
     ),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.RelationalGroupedDataset.apply"),
 
     // SparkSession
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.sparkContext"),
@@ -241,18 +242,14 @@ object CheckConnectJvmClientCompatibility {
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.sqlContext"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.listenerManager"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.experimental"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.udtf"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.dataSource"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.createDataFrame"),
     ProblemFilters.exclude[Problem](
       "org.apache.spark.sql.SparkSession.baseRelationToDataFrame"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.createDataset"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.executeCommand"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.SparkSession.this"),
 
     // SparkSession#implicits
-    ProblemFilters.exclude[DirectMissingMethodProblem](
-      "org.apache.spark.sql.SparkSession#implicits._sqlContext"),
     ProblemFilters.exclude[DirectMissingMethodProblem](
       "org.apache.spark.sql.SparkSession#implicits.session"),
 
@@ -285,26 +282,9 @@ object CheckConnectJvmClientCompatibility {
       "org.apache.spark.sql.streaming.PythonStreamingQueryListenerWrapper"),
     ProblemFilters.exclude[MissingTypesProblem](
       "org.apache.spark.sql.streaming.StreamingQueryListener$Event"),
-    ProblemFilters.exclude[MissingTypesProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener$QueryIdleEvent"),
-    ProblemFilters.exclude[DirectMissingMethodProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener#QueryIdleEvent.logEvent"),
-    ProblemFilters.exclude[MissingTypesProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener$QueryProgressEvent"),
-    ProblemFilters.exclude[DirectMissingMethodProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener#QueryProgressEvent.logEvent"),
-    ProblemFilters.exclude[MissingTypesProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener$QueryStartedEvent"),
-    ProblemFilters.exclude[DirectMissingMethodProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener#QueryStartedEvent.logEvent"),
-    ProblemFilters.exclude[MissingTypesProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent"),
-    ProblemFilters.exclude[DirectMissingMethodProblem](
-      "org.apache.spark.sql.streaming.StreamingQueryListener#QueryTerminatedEvent.logEvent"),
 
     // SQLImplicits
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SQLImplicits.rddToDatasetHolder"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.SQLImplicits._sqlContext"),
     ProblemFilters.exclude[Problem]("org.apache.spark.sql.SQLImplicits.session"),
 
     // Artifact Manager
@@ -341,24 +321,8 @@ object CheckConnectJvmClientCompatibility {
       "org.apache.spark.sql.KeyValueGroupedDatasetImpl"),
     ProblemFilters.exclude[MissingClassProblem](
       "org.apache.spark.sql.KeyValueGroupedDatasetImpl$"),
-    ProblemFilters.exclude[ReversedMissingMethodProblem](
-      "org.apache.spark.sql.SQLImplicits._sqlContext" // protected
-    ),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.internal.SessionCleaner"),
 
-    // private
-    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.internal.CleanupTask"),
-    ProblemFilters.exclude[MissingClassProblem](
-      "org.apache.spark.sql.internal.CleanupTaskWeakReference"),
-    ProblemFilters.exclude[MissingClassProblem](
-      "org.apache.spark.sql.internal.CleanupCachedRemoteRelation"),
-    ProblemFilters.exclude[MissingClassProblem](
-      "org.apache.spark.sql.internal.CleanupCachedRemoteRelation$"),
-
-    // Catalyst Refactoring
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.catalyst.util.SparkCollectionUtils"),
-    ProblemFilters.exclude[Problem]("org.apache.spark.sql.catalyst.util.SparkCollectionUtils$"),
-
     // New public APIs added in the client
     // ScalaUserDefinedFunction
     ProblemFilters
```
```diff
@@ -490,13 +454,24 @@ object CheckConnectJvmClientCompatibility {
       excludeRules: Seq[ProblemFilter]): List[Problem] = {
     val mima = new MiMaLib(Seq(newJar, oldJar))
     val allProblems = mima.collectProblems(oldJar, newJar, List.empty)
+
+    val effectiveExcludeRules = MutableSet.empty[ProblemFilter]
     val problems = allProblems
       .filter { p =>
         includedRules.exists(rule => rule(p))
       }
       .filter { p =>
-        excludeRules.forall(rule => rule(p))
+        excludeRules.forall { rule =>
+          val passedRule = rule(p)
+          if (!passedRule) {
+            effectiveExcludeRules += rule
+          }
+          passedRule
+        }
       }
+    excludeRules.filterNot(effectiveExcludeRules.contains).foreach { rule =>
+      println(s"Warning: $rule did not filter out any problems.")
+    }
     problems
   }
```
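The hunk above is the heart of the new warning: a rule "fires" when `rule(p)` returns `false` (that is, it excludes problem `p`), and any rule that never fires is reported as unused. The same pattern in a self-contained sketch with generic names (not Spark's actual API):

```scala
import scala.collection.mutable

// Filters keep an item when they return true; returning false means
// the filter excluded the item, i.e. the filter "fired".
def filterAndWarnUnused[A](items: Seq[A], filters: Seq[A => Boolean]): Seq[A] = {
  val fired = mutable.Set.empty[A => Boolean]
  val kept = items.filter { item =>
    filters.forall { f =>
      val keep = f(item)
      if (!keep) fired += f // this filter excluded at least one item
      keep
    }
  }
  // A filter that never excluded anything is likely stale.
  filters.filterNot(fired.contains)
    .foreach(f => println(s"Warning: $f did not filter out any problems."))
  kept
}
```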

```diff
@@ -511,11 +486,14 @@ object CheckConnectJvmClientCompatibility {
       resultWriter.write(
         s"ERROR: Comparing Client jar: $clientModule and $targetName jar: $targetModule \n")
       resultWriter.write(s"problems with $targetName module: \n")
-      resultWriter.write(s"${problems.map(p => p.description(description)).mkString("\n")}")
-      resultWriter.write("\n")
-      resultWriter.write(
-        "Exceptions to binary compatibility can be added in " +
-          s"'CheckConnectJvmClientCompatibility#checkMiMaCompatibilityWith${targetName}Module'\n")
+      val problemDescriptions =
+        problems.map(p => s"${p.getClass.getSimpleName}: ${p.description(description)}")
+      resultWriter.write(problemDescriptions.mkString("\n"))
+      resultWriter.write("\n\n")
+      resultWriter.write("Exceptions to binary compatibility can be added in " +
+        s"'CheckConnectJvmClientCompatibility#checkMiMaCompatibilityWith${targetName}Module':\n")
+      resultWriter.write(problems.flatMap(_.howToFilter).distinct.mkString(",\n"))
+      resultWriter.write("\n\n")
     }
   }
```
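With `howToFilter` appended to the report, a failing check now ends with ready-to-paste exclusion rules. Roughly what the new output looks like (illustrative; the exact problem text and rendering depend on Mima):

```
DirectMissingMethodProblem: method example()Unit in class org.apache.spark.sql.SparkSession does not have a correspondent in client version

Exceptions to binary compatibility can be added in 'CheckConnectJvmClientCompatibility#checkMiMaCompatibilityWithSqlModule':
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.SparkSession.example"),
```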

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -229,7 +229,7 @@ class SparkSession private(
    */
   def udf: UDFRegistration = sessionState.udfRegistration
 
-  def udtf: UDTFRegistration = sessionState.udtfRegistration
+  private[sql] def udtf: UDTFRegistration = sessionState.udtfRegistration
 
   /**
    * A collection of methods for registering user-defined data sources.
```
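Note the knock-on effect of this last hunk: with `udtf` now `private[sql]` on the sql side, the `SparkSession.udtf` exclusion deleted earlier in this commit is no longer needed, because the member has left the public surface entirely. From client code the difference looks like this (hypothetical snippet):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.udf    // still public API: compiles outside org.apache.spark.sql
// spark.udtf // no longer compiles here after this change
```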
