Conversation

@mgaido91
Contributor

What changes were proposed in this pull request?

Code generation is disabled for CaseWhen when the number of branches is higher than spark.sql.codegen.maxCaseBranches (which defaults to 20). This was done to prevent the well-known 64KB method limit exception.
This PR proposes to support code generation in those cases as well (without causing exceptions, of course). As a side effect, we can get rid of the spark.sql.codegen.maxCaseBranches configuration.
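The splitting approach can be sketched roughly as follows (a hypothetical Scala sketch, not the actual Spark implementation; all names are assumed): each branch is emitted as a snippet that sets a flag and `continue`s out of a `do { ... } while (false);` block once its condition is met, so the snippets can later be distributed across small helper methods without any single method exceeding the 64KB bytecode limit.

```scala
// Hypothetical sketch of the codegen idea (names assumed, not Spark's API).
object CaseWhenCodegenSketch {
  // Generate the body for a single branch: evaluate the condition and, if it
  // holds, record the value, set the flag, and skip the remaining branches.
  def genBranch(cond: String, value: String): String =
    s"""if (!conditionMet && ($cond)) {
       |  result = $value;
       |  conditionMet = true;
       |  continue;
       |}""".stripMargin

  // Wrap all branch snippets in a do { ... } while (false); each snippet can
  // later be moved into its own helper method if the total code grows too big,
  // because each one is independent and only communicates via `conditionMet`.
  def genCaseWhen(branches: Seq[(String, String)]): String =
    branches.map { case (c, v) => genBranch(c, v) }
      .mkString("do {\n", "\n", "\n} while (false);")

  def main(args: Array[String]): Unit =
    println(genCaseWhen(Seq(("a == 1", "10"), ("a == 2", "20"))))
}
```

The generated Java stays correct however the snippets are grouped into methods, which is what removes the need for the maxCaseBranches cap.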

How was this patch tested?

existing UTs

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83866 has finished for PR 19752 at commit 98eaae9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CaseWhen(

@mgaido91
Contributor Author

@kiszk may I kindly ask you to review this please? Thanks.

@kiszk
Member

kiszk commented Nov 19, 2017

Sure, I will review this.
It may have some overlap with #18641; I will review it after #18641 to avoid conflicts.

@mgaido91
Contributor Author

@kiszk actually, checking your PR, I think that the same issue addressed there would also be handled here by default. What do you think?

val casesCode = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
cases.mkString("\n")
} else {
ctx.splitExpressions(cases, "caseWhen",
Member

In almost all cases, we should not need to call splitExpressions after merging PR #19767, right?

WDYT?

Contributor Author

I think that we need to call it, indeed, as explained in this comment: #19767 (comment)

Contributor Author

But I think this implicitly covers #18641 as well, even though its main goal is a different one.

Member

Then, could you show us a test case? It can be a performance test if it is hard to make the function hit the 64KB limit.

Contributor Author

mgaido91 commented Nov 23, 2017

I can reuse the same UT added in #18641 for the 64KB limit, if it is ok for you.

As far as performance is concerned, I'd need to create a test with many rows; otherwise the overhead is higher than the execution time, and such a case would not finish in a few seconds. Do you want me to post some code and the execution times before and after the PR here, without adding it as a test?

@mgaido91
Contributor Author

@gatorsmile I added a test case to check that the execution plan is WholeStageCodegenExec as expected. I also ran some performance tests using almost the same code, i.e.:

val N = 30
val nRows = 1000000
var expr1 = when($"id" === lit(0), 0)
var expr2 = when($"id" === lit(0), 10)
(1 to N).foreach { i =>
  expr1 = expr1.when($"id" === lit(i), -i)
  expr2 = expr2.when($"id" === lit(i + 10), i)
}
time { spark.range(nRows).select(expr1.as("c1"), expr2.otherwise(0).as("c2")).sort("c1").show }
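The `time` helper used above is not shown in the thread; a minimal sketch of what it might look like (hypothetical, the actual helper may differ) is:

```scala
// Hypothetical timing helper (not from the PR): runs the body once and
// prints the elapsed wall-clock time in milliseconds, returning the result.
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body // force evaluation of the by-name argument
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"elapsed: $elapsedMs%.3f ms")
  result
}
```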

Before this PR, it takes 1091.690996 ms on average. After the PR, it takes 106.894443 ms on average.

Actually, there is a problem which is fixed in #18641 and not fixed here: when the code contains deeply nested expressions, the 64KB limit exception can still happen. But this should be handled in a more generic way in #19813.

@kiszk What do you think?

@SparkQA

SparkQA commented Nov 24, 2017

Test build #84168 has finished for PR 19752 at commit 6225c8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
},
foldFunctions = { funcCalls =>
funcCalls.map(funcCall => s"$conditionMet = $funcCall;").mkString("\n")
Member

When caseWhenConditionMet is true, we do not need to call funcCall.

Contributor Author

You are right, but we already have checks for it inside the functions. Do we also need to check it outside?

Member

We want to avoid the extra function calls here; they are not cheap when the number of rows is large. We now split the functions quite aggressively, and I saw that many new functions are generated.

Contributor Author

Ok, I'll do that. Then I'd suggest doing the same in other places as well: I can check where an analogous pattern is used and create a PR, if that is ok.

// }
// }
// }
val conditionMet = ctx.freshName("caseWhenConditionMet")
Member

Add a comment to explain what it is.

// }
// }
val conditionMet = ctx.freshName("caseWhenConditionMet")
ctx.addMutableState("boolean", ev.isNull, "")
Member

ctx.JAVA_BOOLEAN

allConditions.mkString("\n")
} else {
ctx.splitExpressions(allConditions, "caseWhen",
("InternalRow", ctx.INPUT_ROW) :: ("boolean", conditionMet) :: Nil, returnType = "boolean",
Member

ctx.JAVA_BOOLEAN

val code = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
allConditions.mkString("\n")
} else {
ctx.splitExpressions(allConditions, "caseWhen",
Member

Style issue. Indent

> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
> SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;
Member

SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;
->
SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;

Member

Could you double check Hive returns NULL in the following case?

SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;

Contributor Author

I can follow your first suggestion, and I can test this on Hive; note, though, that I haven't actually changed this part of the code. I will post the Hive result ASAP.

Contributor Author

I confirm that Hive returns NULL. I am updating the description as requested.

expr2 = expr2.when($"id" === lit(i + 10), i)
}
val df = spark.range(1).select(expr1, expr2.otherwise(0))
df.show
Member

compare the results?

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84198 has finished for PR 19752 at commit f9c20be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84199 has finished for PR 19752 at commit 9063583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM cc @cloud-fan @kiszk

// evaluates to `true` is met, and therefore there is no need to go on with the evaluation
// of the following conditions.
val conditionMet = ctx.freshName("caseWhenConditionMet")
ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull, "")
Contributor

nit: ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull), as empty string is the default value of the 3rd parameter.

Contributor Author

Thanks, I branched from a version where there was no default value. I merged and fixed it.

if ($conditionMet) {
continue;
}"""
}.mkString("do {", "", "\n} while (false);")
Contributor

nit: do we need a \n after do {?

Contributor Author

no, since there is a newline at the beginning of each expression.
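The point can be checked with a tiny snippet: because each per-branch string already begins with `\n`, `mkString` with an empty separator still produces one statement per line (the snippet below uses made-up condition names for illustration).

```scala
// Each generated case snippet starts with a newline, so no extra "\n" is
// needed after the "do {" prefix.
val cases = Seq("\nif (c1) { continue; }", "\nif (c2) { continue; }")
val code = cases.mkString("do {", "", "\n} while (false);")
// code ==
// do {
// if (c1) { continue; }
// if (c2) { continue; }
// } while (false);
```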

extends CaseWhenBase(branches, elseValue) with Serializable {

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
// Generate code that looks like:
Contributor

shall we keep this comment and update it?

Contributor Author

I don't think it is necessary, since the generated code is now much simpler and more standard, and no comment like this is provided anywhere else. Anyway, if you feel it is needed, I can add it.

@cloud-fan
Contributor

LGTM except a few minor comments

allConditions.mkString("\n")
} else {
ctx.splitExpressions(allConditions, "caseWhen",
("InternalRow", ctx.INPUT_ROW) :: (ctx.JAVA_BOOLEAN, conditionMet) :: Nil,
Member

Do we need to pass conditionMet as an argument? I think conditionMet is always false when a function is called.
If so, we can declare conditionMet as a local variable.

Member

gatorsmile commented Nov 27, 2017

After the latest changes, conditionMet can be changed to a local variable.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84206 has finished for PR 19752 at commit f4c7896.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"""
${ctx.JAVA_BOOLEAN} $conditionMet = false;
$func
return $conditionMet;
Contributor

shall we apply the same do while optimization here? i.e.

do {
  conditionMet = caseWhen_1(i);
  if(conditionMet) {
    continue;
  }
} while (false)

boolean caseWhen_1(InternalRow i) {
  do {
    if (!conditionMet) {
      code...
      set value and isnull...
      $conditionMet = true;
      continue;
    }
  }
}

Contributor Author

I think this would complicate the code and I don't think it is worth it: if the code is not split, it means we don't have many conditions, so we would save only a few if (conditionMet) evaluations. What do you think?

Contributor

I think in most cases we just split the code into a few methods, which means it's more important to apply the do-while optimization inside a method (a method may have a lot of condition checks) than between the methods.

Contributor Author

Yes, but having this optimization outside means skipping whole methods. Anyway, if you think this optimization is needed, I can do it. I only think that code readability would get a bit worse, but I'll try to address that with comments.

// continue;
// }
// ...
// } while (false);
Contributor

Can this be simplified to

do {
  if (caseWhen_1(i)) {
    continue;
  }
  if (caseWhen_2(i)) {
    continue;
  }
}

Contributor Author

No, because that way we would not set conditionMet. This can cause an error when we have so many functions that their invocations go beyond the 64KB limit: that problem is fixed by #19480 by creating a method which calls the methods of that class. In that case, if we don't set conditionMet properly, we would have a bug in the implementation.

Contributor

I don't get it, are you saying

do {
  if (caseWhen_1(i)) {
    continue;
  }
  if (caseWhen_2(i)) {
    continue;
  }
}

is worse than

do {
  condition = caseWhen_1(i);
  if (condition) {
    continue;
  }
  condition = caseWhen_2(i);
  if (condition) {
    continue;
  }
}

?

Contributor Author

I am saying that the first one is wrong since it doesn't set condition (which, as of #19480, can be returned at the end of the method), so it may return wrong results.

Contributor Author

I'll try to explain with an example. If we have a lot of methods, this can exceed 64KB:

do {
  condition = caseWhen_1(i);
  if (condition) {
    continue;
  }
  condition = caseWhen_2(i);
  if (condition) {
    continue;
  }
  // a lot of other methods here
}

Thus, #19480 can split them into:

do {
  condition = bunchOf_caseWhen_1(i);
  if (condition) {
    continue;
  }
  condition = bunchOf_caseWhen_2(i);
  if (condition) {
    continue;
  }
  // maybe some other here
} while (false);
...
InnerClass1 {
  boolean bunchOf_caseWhen_1(InternalRow i) {
    boolean condition = false;
    do {
      condition = caseWhen_1(i);
      if (condition) {
        continue;
      }
      condition = caseWhen_2(i);
      if (condition) {
        continue;
      }
      // a lot of other methods here
    } while (false);
    return condition;
  }
  ...
}

In this case, the implementation you are suggesting could return a wrong value from bunchOf_caseWhen_1, and this would affect the correctness of the code.

Contributor

cloud-fan commented Nov 27, 2017

Ah, I see, we have to set the condition if the code is inside a method.

do {
  if (bunchOf_caseWhen_1(i)) {
    continue;
  }
  if (bunchOf_caseWhen_2(i)) {
    continue;
  }
  // maybe some other here
} while (false);
...
InnerClass1 {
  boolean bunchOf_caseWhen_1(InternalRow i) {
    do {
      if (caseWhen_1(i)) {
        return true;
      }
      if (caseWhen_2(i)) {
        return true;
      }
      // a lot of other methods here
    } while (false);
    return false;
  }
  ...
}

This would be the optimal solution, but it's too much complexity for only a little gain.

Contributor Author

This would require a refactoring of the splitExpressions method, which is really not worth it IMHO. I will leave this as it is for now, while I address your other comment. Thanks.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84212 has finished for PR 19752 at commit 5adb513.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84224 has finished for PR 19752 at commit c7347b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84229 has finished for PR 19752 at commit dd5f455.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in 087879a Nov 27, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 28, 2017
## What changes were proposed in this pull request?

a minor cleanup for apache#19752. Remove the outer `if`, as the code is inside a `do while` block.

## How was this patch tested?

existing tests

Author: Wenchen Fan <[email protected]>

Closes apache#19830 from cloud-fan/minor.
ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 3, 2017
## What changes were proposed in this pull request?

In many parts of the code generation codebase, we split the code to avoid exceptions due to the 64KB method size limit. This generates a lot of methods which are called every time, even though sometimes this is not needed. As pointed out in apache#19752 (comment), this is a non-negligible overhead which can be avoided.

The PR applies the same approach used in apache#19752 to the other places where this was feasible.

## How was this patch tested?

existing UTs.

Author: Marco Gaido <[email protected]>

Closes apache#19860 from mgaido91/SPARK-22669.