Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Apr 13, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14593

We now disable whole stage codegen if the input column number are too large. If we can make CodeGenContext.currentVars work with CodeGenContext.splitExpressions. We can enable whole stage codegen for large input columns.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Apr 13, 2016

Test build #55695 has finished for PR 12351 at commit ccc72e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 13, 2016

Test build #55698 has finished for PR 12351 at commit 33280bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 14, 2016

cc @cloud-fan @davies Please take a look. Thanks.

@cloud-fan
Copy link
Contributor

A high level question: given a lot of expression code, we will group them into several blocks, and each block will be put into a method. For the method, previously we only use a single InternalRow parameter, as we assume all expressions only need the input row. To support currentVars, a simple idea is just adding currentVars as parameters, but this may introduce a very long parameter list and reach java limit, how do you fix it?

@viirya
Copy link
Member Author

viirya commented Apr 14, 2016

@cloud-fan I noticed that problem during testing this pr. For each block, I only pass the parameters needed by the expressions in the block. That said, I also group the parameters into the parts for blocks (methods). Although by doing this, the parameter number for each method can still reach java limit. So I set a limit for splitting expressions with currentVars.

@cloud-fan
Copy link
Contributor

How do you group the parameters? For the worst case, every expression may reference to all parameters.

@viirya
Copy link
Member Author

viirya commented Apr 14, 2016

For the current usage of splitExpressions, the expressions to be split are corresponding to currentVars if they are used. That said, each expression only uses one variable in currentVars. What they did is to process one variable in currentVars and assign it to another variable. So we don't face the case you worry about.

@cloud-fan
Copy link
Contributor

I don't think it's the semantic of splitExpressions, cc @davies

@viirya
Copy link
Member Author

viirya commented Apr 14, 2016

maybe. actually splitExpressions is now in limited usage.

@viirya
Copy link
Member Author

viirya commented Apr 14, 2016

Current usage of splitExpressions is to group the sequential access of elements in an input row to many blocks (methods).

As during whole stage pipeline, we mostly process currentVars, instead of input row. It is not expected to use splitExpressions widely across this pipeline. That is why I say it is in limited usage.

@viirya
Copy link
Member Author

viirya commented Apr 15, 2016

cc @davies have time to look at this?

@davies
Copy link
Contributor

davies commented Apr 15, 2016

Since we can easily fallback, I'd like not to make it even complicated to support all the corner. Each column require two arguments, java method only support 255 arguments, to make this really works (not hit this limit), it need to be more complicated than current shape.

Without a very good reason, I'd not to do this for now.

@viirya
Copy link
Member Author

viirya commented Apr 15, 2016

ok. then let me close this now.

@viirya viirya closed this Apr 15, 2016
@viirya viirya deleted the split-expres-current-vars branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants