[SPARK-25159][SQL] json schema inference should only trigger one job #22152
```diff
@@ -26,6 +26,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.analysis.TypeCoercion
 import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil
 import org.apache.spark.sql.catalyst.util.{DropMalformedMode, FailFastMode, ParseMode, PermissiveMode}
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
       }.reduceOption(typeMerger).toIterator
     }

-    // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
-    // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
-    // active SparkSession and `SQLConf.get` may point to the wrong configs.
-    val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+    // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+    // the fold functions in the scheduler event loop thread.
+    val existingConf = SQLConf.get
+    var rootType: DataType = StructType(Nil)
+    val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+    val mergeResult = (index: Int, taskResult: DataType) => {
+      rootType = SQLConf.withExistingConf(existingConf) {
```
Contributor: Just a question: wouldn't … do the same without requiring these changes?

Contributor (Author): That can work, but the problem is that we would have to keep a large result array, which can cause GC problems.

Contributor: It would contain one result per partition; do you think that is enough to cause GC problems?

Contributor (Author): The schema can be very complex (e.g. a very wide and deep schema).

Contributor: Yes, makes sense, thanks.

Member: Same question was on my mind; thanks for the clarification.
```diff
+        typeMerger(rootType, taskResult)
+      }
+    }
+    json.sparkContext.runJob(mergedTypesFromPartitions, foldPartition, mergeResult)

     canonicalizeType(rootType, configOptions) match {
       case Some(st: StructType) => st
```
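The added `foldPartition`/`mergeResult` pair mirrors how a fold decomposes into a per-partition fold on the executors plus an incremental merge on the driver, submitted as a single job. Below is a driver-only sketch in plain Scala (no Spark): the `Set[String]`-based `typeMerger`, the fake `partitions`, and the `FoldJobSketch` name are illustrative stand-ins for the real `DataType` merging, not the actual Spark code.

```scala
object FoldJobSketch {
  // Stand-in for the per-partition partial schemas an RDD would produce;
  // a "schema" here is just a set of field names.
  val partitions: Seq[Seq[Set[String]]] = Seq(
    Seq(Set("a"), Set("b")),
    Seq(Set("b", "c")),
    Seq.empty // an empty partition folds to the zero value
  )

  // Stand-in for typeMerger: merging two "schemas" is a set union.
  val typeMerger: (Set[String], Set[String]) => Set[String] = _ union _

  def run(): String = {
    // foldPartition runs on the executors: fold each partition to one value.
    val foldPartition = (iter: Iterator[Set[String]]) =>
      iter.fold(Set.empty[String])(typeMerger)

    // mergeResult runs on the driver as each task finishes, merging
    // incrementally instead of keeping one result per partition alive.
    var rootType: Set[String] = Set.empty
    val mergeResult = (index: Int, taskResult: Set[String]) =>
      rootType = typeMerger(rootType, taskResult)

    // sc.runJob(rdd, foldPartition, mergeResult) would do this across a
    // cluster in one job; here we just loop over the fake partitions.
    partitions.zipWithIndex.foreach { case (p, i) =>
      mergeResult(i, foldPartition(p.iterator))
    }
    rootType.toSeq.sorted.mkString(",")
  }

  def main(args: Array[String]): Unit = println(run()) // a,b,c
}
```

This also illustrates the GC concern from the review thread: a `runJob` variant that collects every per-partition result before folding would hold one (possibly very large) schema per partition at once, while the incremental `mergeResult` keeps only the running `rootType`.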
Need to do `sc.clean(typeMerger)` manually here?

This closure is defined by us, and I don't think we leak an outer reference here. If we do, it's a bug and we should fix it.

Yeah, agreed.
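The outer-reference concern above is the classic Spark closure-capture issue: a lambda that reads a field of an enclosing object captures the whole object, which must then be serializable to ship with the task. A minimal sketch, assuming plain Java serialization as a proxy for Spark's task serializer; the `Outer`, `leaky`, and `clean` names are hypothetical and this is not Spark's actual `ClosureCleaner`:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {
  // True if the value survives Java serialization, which is roughly
  // what Spark demands of a task closure.
  def isSerializable(x: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(x)
      true
    } catch { case _: NotSerializableException => false }

  class Outer { // deliberately NOT Serializable
    val suffix = "!"
    // Reading the field means the lambda captures the whole Outer instance.
    def leaky: String => String = s => s + suffix
    // Copying the field into a local val first captures only the String.
    def clean: String => String = {
      val local = suffix
      s => s + local
    }
  }

  def results(): (Boolean, Boolean) = {
    val o = new Outer
    (isSerializable(o.leaky), isSerializable(o.clean))
  }

  def main(args: Array[String]): Unit = println(results())
}
```

Because `foldPartition` and `mergeResult` in the patch are local function values capturing only local vals (`typeMerger`, `existingConf`, `rootType`), they fall on the safe side of this distinction, which is the author's point above.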