Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

What changes were proposed in this pull request?

Current, SparkSQL initCap is using toTitleCase function. However, UTF8String.toTitleCase implementation changes only the first letter and just copy the other letters: e.g. sParK --> SParK. This is the correct implementation toTitleCase.

hive> select initcap('sParK');
Spark
scala> sql("select initcap('sParK')").head
res0: org.apache.spark.sql.Row = [SParK]

This PR updates the implementation of initcap using toLowerCase and toTitleCase.

How was this patch tested?

Pass the Jenkins tests (including new testcase).

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #54976 has finished for PR 12175 at commit d165bad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14402][CORE] UTF8String.toTitleCase should return non-first letters in lowercase [WIP][SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016
@dongjoon-hyun
Copy link
Member Author

Hi, @srowen . I minimized the change on master.

  • Undo the changes on common module.
  • Implement initCap by the following changes in stringExpressions.scala of catalyst module.
   override def nullSafeEval(string: Any): Any = {
-    string.asInstanceOf[UTF8String].toTitleCase
+    string.asInstanceOf[UTF8String].toLowerCase.toTitleCase
   }
   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
-    defineCodeGen(ctx, ev, str => s"$str.toTitleCase()")
+    defineCodeGen(ctx, ev, str => s"$str.toLowerCase().toTitleCase()")
   }

I think it's enough for initCap function as a small fix for now. How do you think about this?

@dongjoon-hyun dongjoon-hyun changed the title [WIP][SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string [SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14402][CORE] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string Apr 5, 2016
@srowen
Copy link
Member

srowen commented Apr 5, 2016

I think that's pretty reasonable as a minimally invasive fix. CC @marmbrus for visibility as it's technically a behavior change

@dongjoon-hyun
Copy link
Member Author

Thank you, @srowen !

@marmbrus
Copy link
Contributor

marmbrus commented Apr 5, 2016

It does seem reasonable to match hive since that was probably the original intention. I've tagged the JIRA for inclusion in the release notes. A few comments:

  • Update the scala doc to correctly describe the new behavior.
  • (existing) it would be great to add an expression description annotation to InitCap
  • (minor) We are now double allocating the byte array. It might be nice to actually implement this in UTF8String, but we don't have to do this to merge this PR.
  • Is there no way to do title case anymore?

@dongjoon-hyun
Copy link
Member Author

Thank you, @marmbrus . I will update the scala docand add description annotation for InitCap.
And the for last comment, if you agree, I hope to implement toTitleCase as a new function (also in another PR.)

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #55000 has finished for PR 12175 at commit 69c6e1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #55006 has finished for PR 12175 at commit e6258aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #55009 has finished for PR 12175 at commit 003ab98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Finally, I passed the Jenkins with the updated code.
Please let me know if there is something I missed.
Thank you, @marmbrus and @srowen .

@marmbrus
Copy link
Contributor

marmbrus commented Apr 5, 2016

Thanks, merging to master.

@asfgit asfgit closed this in c59abad Apr 5, 2016
@dongjoon-hyun dongjoon-hyun deleted the SPARK-14402 branch May 12, 2016 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants