@viirya
Member

@viirya viirya commented May 16, 2017

What changes were proposed in this pull request?

Right now in the explain plan / UI, we have no way to tell whether a query is writing data out, and also there is no way to associate metrics with data writes. We should add an operator for writing data out. This operator can be used to track writing data out and related metrics.

The Approach

We have several paths for writing data out through some RunnableCommand classes.

File-based relation: InsertIntoHadoopFsRelationCommand, InsertIntoHiveTable

Those commands use FileFormatWriter to write out data files. We can record some metrics in FileFormatWriter and update them later. FileFormatWriter accepts a QueryExecution, so we can track the execution plan of that QueryExecution.

This patch adds a new operator, WriteDataFileOutExec. It is used only to track the metrics of writing data files out for file-based relations. Currently we track the following metrics:

  • number of written files
  • number of dynamic partitions
  • bytes of written files
  • number of output rows
  • writing data out time (ms)
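As a rough illustration of how the five metrics above could be rolled up from per-task results, here is a minimal Python sketch. It is not Spark code: `TaskWriteSummary`, `WriteMetrics`, and `aggregate` are hypothetical names, and the choices to deduplicate dynamic partitions across tasks and to sum write time over tasks are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class TaskWriteSummary:
    """Hypothetical per-task write result (not a Spark class)."""
    files: int                  # data files written by this task
    partitions: FrozenSet[str]  # dynamic partition paths touched by this task
    bytes_written: int          # total size of the files written
    rows: int                   # rows written by this task
    write_time_ms: int          # time this task spent writing


@dataclass
class WriteMetrics:
    """Hypothetical holder for the five metrics listed in the description."""
    num_files: int = 0
    num_dynamic_parts: int = 0
    num_bytes: int = 0
    num_output_rows: int = 0
    write_time_ms: int = 0


def aggregate(summaries: Iterable[TaskWriteSummary]) -> WriteMetrics:
    """Combine per-task summaries into job-level write metrics."""
    metrics = WriteMetrics()
    seen_partitions = set()
    for s in summaries:
        metrics.num_files += s.files
        metrics.num_bytes += s.bytes_written
        metrics.num_output_rows += s.rows
        # Assumption: write time is summed across tasks.
        metrics.write_time_ms += s.write_time_ms
        # Assumption: a partition written by several tasks counts once.
        seen_partitions |= s.partitions
    metrics.num_dynamic_parts = len(seen_partitions)
    return metrics
```

In this sketch, two tasks that both write into the same partition contribute to the file, byte, and row counts separately but to the dynamic-partition count only once.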

Other datasources: InsertIntoDataSourceCommand, SaveIntoDataSourceCommand

For other datasource relations, the logic of writing data out is delegated to the datasource implementations, e.g., InsertableRelation.insert and CreatableRelationProvider.createRelation. These APIs take a DataFrame holding the data to write, and an implementation may create a new DataFrame based on the given one. So we can't easily track its execution, and in general we don't know the details of the API implementation in those datasources. As a result, we can't obtain enough metrics.

Note: SaveIntoDataSourceCommand may invoke InsertIntoHadoopFsRelationCommand for file-based data sources. In this case, the metrics should be tracked as for InsertIntoHadoopFsRelationCommand.

Note: CreateDataSourceTableAsSelectCommand works similarly to SaveIntoDataSourceCommand.

Note: CreateHiveTableAsSelectCommand inserts data by invoking InsertIntoHiveTable.
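The routing described in the paragraphs and notes above can be summarized in a small lookup. This is only a sketch in plain Python: the command and API names are taken from the description, while `resolve_write_path` and its `file_based` flag are hypothetical and exist only to illustrate which paths end up in FileFormatWriter, where metrics can be recorded.

```python
from typing import Tuple

FILE_WRITER = "FileFormatWriter"


def resolve_write_path(command: str, file_based: bool = False) -> Tuple[str, bool]:
    """Return (final write path, whether write metrics can be tracked).

    Hypothetical helper; it only encodes the routing described in the text.
    """
    if command in ("InsertIntoHadoopFsRelationCommand", "InsertIntoHiveTable"):
        # File-based relations write through FileFormatWriter.
        return FILE_WRITER, True
    if command == "CreateHiveTableAsSelectCommand":
        # Inserts data by invoking InsertIntoHiveTable.
        return resolve_write_path("InsertIntoHiveTable")
    if command in ("SaveIntoDataSourceCommand",
                   "CreateDataSourceTableAsSelectCommand"):
        if file_based:
            # File-based data sources end up in InsertIntoHadoopFsRelationCommand.
            return resolve_write_path("InsertIntoHadoopFsRelationCommand")
        # Otherwise the write is delegated to the datasource implementation.
        return "CreatableRelationProvider.createRelation", False
    if command == "InsertIntoDataSourceCommand":
        return "InsertableRelation.insert", False
    raise ValueError(f"unknown command: {command}")
```

For example, `resolve_write_path("SaveIntoDataSourceCommand", file_based=True)` resolves to the FileFormatWriter path with metrics available, while the same command for a non-file-based datasource resolves to the delegated API with no metrics.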

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@hvanhovell
Contributor

@shaneknapp is the AMPLab Jenkins down?

@shaneknapp
Contributor

shaneknapp commented May 16, 2017 via email

@SparkQA

SparkQA commented May 16, 2017

Test build #76960 has finished for PR 17998 at commit 6e50181.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

How about CreateDataSourceTableAsSelectCommand?

@viirya
Member Author

viirya commented May 17, 2017

CreateDataSourceTableAsSelectCommand works similarly to SaveIntoDataSourceCommand. Depending on the type of datasource, it either calls CreatableRelationProvider.createRelation or invokes InsertIntoHadoopFsRelationCommand to write data.

@viirya
Member Author

viirya commented May 17, 2017

Btw, CreateHiveTableAsSelectCommand inserts data by invoking InsertIntoHiveTable.

@viirya
Member Author

viirya commented May 17, 2017

cc @rxin Does the current approach make sense to you? Thanks.

@viirya
Member Author

viirya commented May 31, 2017

#18064 has been merged. It changes the related classes and code paths a lot, and an alternative approach for showing the metrics of writing data out now seems better. I'll close this and create a new PR for it.

@viirya viirya closed this May 31, 2017
@viirya viirya deleted the SPARK-20703 branch December 27, 2023 18:20