Skip to content

Conversation

@liancheng
Copy link
Contributor

JIRA issue: SPARK-1852

This issue is related to SPARK-1021, but this PR doesn't try to solve that one. Worked around by only forcing query planning when running DDL and other native commands.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15376/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15380/

@marmbrus
Copy link
Contributor

marmbrus commented Jun 3, 2014

Thanks for looking into this! Just a thought: what do you think about instead putting a method called process() or something into QueryExecution. That way we wouldn't have to duplicate the logic in two places, and subclasses of QueryExecution could also override it if there were special cases?

@marmbrus
Copy link
Contributor

marmbrus commented Jun 3, 2014

/cc @concretevitamin who is looking at similar parts of the code at the moment.

@concretevitamin
Copy link
Contributor

Removing duplication is definitely a good thing. The other benefit is that it seems more natural to me to push such DDL/command processing logic into QueryExecution, instead of putting them into two thin entry methods (sql and hql).

I am super new to Spark SQL so bear with me if this is silly -- what is the reason we don't do this as well for InsertIntoHiveTable and InsertIntoParquetTable?

@liancheng
Copy link
Contributor Author

@marmbrus Agree. I'll try to remove the lazy qualifier of SchemaRDD.queryExecution since the field object itself is rather cheap (all of its fields are lazy), and then call the process() method in QueryExecution constructor to evaluate native commands eagerly.

@concretevitamin InsertIntoHiveTable and InsertIntoParquetTable are physical operations, which are parts of physical plans. However, we only need to take care of logical plan here. InsertIntoTable and InsertIntoCreatedTable are translated into those physical operators after logical plan optimization.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build triggered.

@liancheng
Copy link
Contributor Author

@marmbrus @concretevitamin While working on the cache table SQL command and reviewing PR #956, I think this PR may make both changes cleaner, so I tried to finish this one first. For example, we won't need processCmd and eagerlyProcess anymore since all commands and insertions are eagerly executed in a unified way.

Instead of QueryExecution, I factored the duplicated code into SchemaRDDLike to make the change cleaner and more straightforward (after some experiments, it turned out that moving code to QueryExecution complicates things a lot).

@concretevitamin
Copy link
Contributor

Hey @liancheng - I think this refactoring will solve this particular ticket (i.e. queries w/ sorts will not be eagerly executed anymore). However, I don't see why we don't need processCmd and eagerlyProcess anymore; for instance, for set commands this PR's changes will eagerly call toRdd, but we still need some logic in the two toRdd to handle the set. Are you proposing to just take that piece of logic and put them into SchemaRDDLike? That could work. And what about processCmd then?

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15596/

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15598/

@liancheng
Copy link
Contributor Author

Just a memo: please refer to this comment and SPARK-2094 for more details about this PR and PR #956.

@liancheng
Copy link
Contributor Author

Closing this PR because corresponding change is merged into another up coming PR that aims to solve SPARK-2094.

@liancheng liancheng closed this Jun 11, 2014
@liancheng liancheng deleted the spark-1852 branch September 24, 2014 00:12
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
…message from mapr stream which was produced before application start (apache#948)
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…message from mapr stream which was produced before application start (apache#948)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants