[SPARK-20703][SQL][WIP] Add an operator for writing data out #17998
What changes were proposed in this pull request?
Right now, there is no way to tell from the explain plan or the UI whether a query is writing data out, and there is no way to associate metrics with data writes. We should add an operator for writing data out. This operator can be used to track the write and its related metrics.
The Approach
We have several paths for writing data out through some `RunnableCommand` classes.

File-based relations: `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`

Those commands use `FileFormatWriter` to write out data files. We can record some metrics in `FileFormatWriter` and update them later. `FileFormatWriter` accepts a `QueryExecution`, so we can track its execution plan.

This patch adds a new operator, `WriteDataFileOutExec`. It is simply used to track the metrics of writing data files out for file-based relations. Currently we track a few metrics.
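A minimal sketch of what such a metrics-tracking physical operator could look like. The class name `WriteDataFileOutExec` comes from this patch, but the specific metric names and the pass-through `doExecute` body below are illustrative assumptions, not the patch's actual code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.execution.metric.SQLMetrics

// Sketch only: a unary physical operator that wraps the plan whose output
// is being written, so write-related SQLMetrics show up in the UI/explain.
case class WriteDataFileOutExec(child: SparkPlan) extends UnaryExecNode {

  // Illustrative metrics; the actual set tracked by the patch may differ.
  override lazy val metrics = Map(
    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
    "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of written files"))

  override def output: Seq[Attribute] = child.output

  override protected def doExecute(): RDD[InternalRow] = {
    val numOutputRows = longMetric("numOutputRows")
    // Pass rows through unchanged while counting them; FileFormatWriter
    // performs the actual write and can update the file-level metrics.
    child.execute().mapPartitionsInternal { iter =>
      iter.map { row => numOutputRows += 1; row }
    }
  }
}
```

Because the operator sits in the physical plan, the accumulated metric values are reported through the normal `SQLMetrics` machinery and appear on the node in the SQL UI.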
Other datasources: `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`

For other datasource relations, the logic of writing data out is delegated to the datasource implementations, e.g., `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. These APIs take a `DataFrame` holding the data to write, and they may create a new `DataFrame` based on the given one, so we can't easily track its execution. Moreover, we don't know the details of the implementations inside those datasources, so we can't obtain enough metrics.
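For reference, these are the (simplified) shapes of the write APIs in question, as defined in `org.apache.spark.sql.sources`; the traits are redeclared here only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

// The implementation receives the DataFrame itself and may derive new
// DataFrames from it before writing, so the caller cannot observe which
// physical plan is ultimately executed, nor attach metrics to it.
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}

trait CreatableRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```

Since the `DataFrame` crosses the API boundary as an opaque value, any metrics would have to be reported voluntarily by each datasource implementation.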
Note: `SaveIntoDataSourceCommand` can end up invoking `InsertIntoHadoopFsRelationCommand` for file-based data sources. In that case, the metrics are tracked as for `InsertIntoHadoopFsRelationCommand`.

Note: `CreateDataSourceTableAsSelectCommand` works similarly to `SaveIntoDataSourceCommand`.

Note: `CreateHiveTableAsSelectCommand` inserts data by invoking `InsertIntoHiveTable`.

How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.