[SPARK-9743] [SQL] Fixes JSONRelation refreshing #8035
@@ -22,20 +22,22 @@ import java.io.CharArrayWriter
 import com.fasterxml.jackson.core.JsonFactory
 import com.google.common.base.Objects
 import org.apache.hadoop.fs.{FileStatus, Path}
-import org.apache.hadoop.io.{Text, LongWritable, NullWritable}
+import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
 import org.apache.hadoop.mapred.{JobConf, TextInputFormat}
-import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext, Job}
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
+import org.apache.hadoop.mapreduce.{Job, RecordWriter, TaskAttemptContext}

 import org.apache.spark.Logging
+import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.mapred.SparkHadoopMapRedUtil
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.execution.datasources.PartitionSpec
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.{AnalysisException, Row, SQLContext}
+import org.apache.spark.util.SerializableConfiguration

 private[sql] class DefaultSource extends HadoopFsRelationProvider {
   override def createRelation(
@@ -105,6 +107,15 @@ private[sql] class JSONRelation(
     jsonSchema
   }

+  override private[sql] def buildScan(
+      requiredColumns: Array[String],
+      filters: Array[Filter],
+      inputPaths: Array[String],
+      broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
+    refresh()
Review thread on the refresh() call above:

Contributor: Is that the same with the …? Actually, I've updated #8023 by simply removing all of the …

Contributor (author): The refresh added in …

Contributor (author): I haven't finished reviewing #8023, let's have an offline discussion about it tomorrow :)

Contributor: Why does calling refresh() here work? Is …

Contributor (author): …

Contributor: What issue? Cached metadata can be out of date when users add data directly? Because refreshing ORC and Parquet tables' metadata is expensive, I think that is an expected behavior (users need to call refresh explicitly).

Contributor: OK, but it seems only … The problem here is that once the table is cached, even if people call refresh() implicitly, we will not get the refreshed data either.

Contributor: For JSON, we do not have any metadata to cache other than the file status, right?

Contributor: I don't mean we need to refresh the schema / metadata, but rather fetch the latest files under the table path.

Contributor: Since the files under the table path may have been changed by other applications, I think we have to refresh the file status before scanning. Probably we can move the Parquet metadata refreshing out of the …
+    super.buildScan(requiredColumns, filters, inputPaths, broadcastedConf)
+  }
+
   override def buildScan(
       requiredColumns: Array[String],
       filters: Array[Filter],
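For context, here is a minimal usage sketch of the staleness this override addresses. The path, the data, and the "other application" are illustrative assumptions, not taken from the PR; it assumes a non-partitioned JSON directory read through SQLContext.read.json on a Spark 1.5-era API.

```scala
// Hypothetical walkthrough; the path and scenario are made up for illustration.
import org.apache.spark.sql.SQLContext

def staleListingDemo(sqlContext: SQLContext): Unit = {
  // Creating the relation lists /tmp/events once and caches its FileStatus entries.
  val events = sqlContext.read.json("/tmp/events")
  println(events.count())   // scans only the files present when the relation was created

  // ... another application writes additional JSON part files under /tmp/events ...

  // With the refresh() added in this PR, buildScan re-lists the directory before
  // scanning, so the new files should be picked up without recreating the DataFrame.
  println(events.count())
}
```

As the comments above note, this implicit re-listing is cheap for JSON because only file statuses are cached; sources with heavier metadata (Parquet, ORC) still expect an explicit refresh from the user.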
Review thread on the broadcastedConf parameter:

Contributor: When we create RDDs for a partitioned JSON table, are we using this broadcastedConf?

Contributor: No, 'cause we're not using SqlNewHadoopRDD here. This method was originally marked as final in HadoopFsRelation and wasn't supposed to be overridden in concrete data source implementations.

Contributor: Then, if we read a table with lots of partitions, the time spent on planning the query will be pretty long, right?

Contributor: Let's file a follow-up JIRA to re-use this broadcastedConf, so we will not broadcast the Hadoop conf for every RDD created for every partition.

Contributor: By the way, do we have a JIRA for investigating using SerializableConfiguration instead of this shared broadcast conf for 1.6?
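To illustrate the re-use the reviewers are asking for, below is a self-contained sketch of broadcasting a Hadoop configuration once and sharing the same broadcast handle across several RDDs. MySerializableConf is a simplified stand-in for Spark's package-private org.apache.spark.util.SerializableConfiguration; the class name and the example job are assumptions for illustration, not code from this PR.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hadoop's Configuration is not java.io.Serializable, so wrap it and
// (de)serialize it manually via its Writable interface.
class MySerializableConf(@transient var value: Configuration) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

object BroadcastConfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "broadcast-conf-sketch")

    // Broadcast the Hadoop conf a single time...
    val confBroadcast: Broadcast[MySerializableConf] =
      sc.broadcast(new MySerializableConf(sc.hadoopConfiguration))

    // ...and reuse the same broadcast handle from every RDD built afterwards,
    // instead of re-broadcasting the conf for each per-partition RDD.
    val rdd1 = sc.parallelize(1 to 4, 2).mapPartitions { iter =>
      val conf = confBroadcast.value.value
      Iterator(conf.get("fs.defaultFS", "file:///") + " saw " + iter.size + " rows")
    }
    val rdd2 = sc.parallelize(5 to 8, 2).mapPartitions { iter =>
      val conf = confBroadcast.value.value   // same broadcast variable, no extra broadcast
      Iterator(conf.get("fs.defaultFS", "file:///") + " saw " + iter.size + " rows")
    }

    (rdd1 ++ rdd2).collect().foreach(println)
    sc.stop()
  }
}
```

The point of the pattern is that the configuration is serialized and shipped once per executor via the broadcast, rather than once per RDD (or per task) when it is captured in each closure.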