[SPARK-20703][SQL][WIP] Add an operator for writing data out #17998
What changes were proposed in this pull request?
Right now, there is no way to tell from the explain plan or the UI whether a query is writing data out, and there is no way to associate metrics with data writes. We should add an operator for writing data out. This operator can be used to track the write and its related metrics.
The Approach
We have several paths for writing data out through some `RunnableCommand` classes.

File-based relations: `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`

Those commands use `FileFormatWriter` to write out data files. We can record some metrics in `FileFormatWriter` and update them later. `FileFormatWriter` accepts a `QueryExecution`, so we can track its execution plan.

This patch adds a new operator, `WriteDataFileOutExec`. It is simply used to track the metrics of writing data files out for file-based relations. Currently we track a few metrics.
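A minimal sketch of what such a metrics-tracking physical operator could look like. The class name `WriteDataFileOutExec` comes from this patch, but the specific metric names and the pass-through `doExecute` body below are illustrative assumptions, not the patch's actual code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.execution.metric.SQLMetrics

// Sketch only: a unary physical operator that wraps the plan whose output
// is being written, so write-related SQLMetrics show up in the UI/explain.
case class WriteDataFileOutExec(child: SparkPlan) extends UnaryExecNode {

  // Illustrative metrics; the actual set tracked by the patch may differ.
  override lazy val metrics = Map(
    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
    "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of written files"))

  override def output: Seq[Attribute] = child.output

  override protected def doExecute(): RDD[InternalRow] = {
    val numOutputRows = longMetric("numOutputRows")
    // Pass rows through unchanged while counting them; FileFormatWriter
    // performs the actual write and can update the file-level metrics.
    child.execute().mapPartitionsInternal { iter =>
      iter.map { row => numOutputRows += 1; row }
    }
  }
}
```

Because the operator sits in the physical plan, the accumulated metric values are reported through the normal `SQLMetrics` machinery and appear on the node in the SQL UI.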
Other datasources: `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`

For other datasource relations, the logic of writing data out is delegated to the datasource implementations, e.g., `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. These APIs take a `DataFrame` holding the data to write, and they may create a new `DataFrame` based on the given one, so we can't easily track its execution. Moreover, we don't know the details of the implementations inside those datasources, so we can't obtain enough metrics.
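For reference, these are the (simplified) shapes of the write APIs in question, as defined in `org.apache.spark.sql.sources`; the traits are redeclared here only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

// The implementation receives the DataFrame itself and may derive new
// DataFrames from it before writing, so the caller cannot observe which
// physical plan is ultimately executed, nor attach metrics to it.
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}

trait CreatableRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```

Since the `DataFrame` crosses the API boundary as an opaque value, any metrics would have to be reported voluntarily by each datasource implementation.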
Note: `SaveIntoDataSourceCommand` can end up invoking `InsertIntoHadoopFsRelationCommand` for file-based data sources. In that case, the metrics are tracked as for `InsertIntoHadoopFsRelationCommand`.

Note: `CreateDataSourceTableAsSelectCommand` works similarly to `SaveIntoDataSourceCommand`.

Note: `CreateHiveTableAsSelectCommand` inserts data by invoking `InsertIntoHiveTable`.

How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.