[SPARK-18653][SQL] Fix incorrect space padding for unicode character at Dataset.show #16086

kiszk · 2016-11-30T19:14:54Z

What changes were proposed in this pull request?

This PR applies correct space padding for Unicode characters in Dataset.show().
The padding was incorrect because string width was computed with String.length. This PR computes string width using the Unicode East Asian Width property.
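To illustrate the difference between String.length and display width, here is a minimal sketch. It is not the code from this PR, which looks up the East Asian Width property through ICU. The `DisplayWidth` object and its `isWide` heuristic are hypothetical and use only the JDK, so they only approximate the real property.

```scala
import java.lang.Character.UnicodeScript

// Hypothetical helper (not from the PR): a rough, JDK-only approximation of
// East Asian Width. The PR itself queries ICU for the exact property value.
object DisplayWidth {
  private val cjkScripts = Set(
    UnicodeScript.HAN, UnicodeScript.HIRAGANA,
    UnicodeScript.KATAKANA, UnicodeScript.HANGUL)

  // True for code points that are assumed to occupy two terminal columns.
  def isWide(codePoint: Int): Boolean = {
    val fullwidthForm = codePoint >= 0xFF01 && codePoint <= 0xFF60
    val halfwidthForm = codePoint >= 0xFF61 && codePoint <= 0xFFDC
    fullwidthForm || (cjkScripts.contains(UnicodeScript.of(codePoint)) && !halfwidthForm)
  }

  // Sums the column count per code point instead of counting UTF-16 units.
  def width(s: String): Int = {
    var i = 0
    var total = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      total += (if (isWide(cp)) 2 else 1)
      i += Character.charCount(cp)
    }
    total
  }
}
```

Under this approximation, "文字列1" has length 4 but width 7, which is why padding based on String.length comes up short.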

Example program

case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS.show

Output without this PR

+---+---+----+
| 整数| 実数|   s|
+---+---+----+
|  1|1.1|文字列1|
+---+---+----+

Output with this PR

+----+----+-------+
|整数|実数|      s|
+----+----+-------+
|   1| 1.1|文字列1|
+----+----+-------+

How was this patch tested?

Added a test suite.

SparkQA · 2016-11-30T19:21:02Z

Test build #69424 has finished for PR 16086 at commit 93379b6.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-30T21:55:49Z

Test build #69425 has finished for PR 16086 at commit e4555f7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-01T04:27:02Z

Test build #69443 has finished for PR 16086 at commit 4e71dc6.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-12-01T05:03:13Z

Test build #69446 has finished for PR 16086 at commit 350e1ae.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2016-12-01T05:15:51Z

@gatorsmile Would it be possible for you to review this? I assume you are familiar with Kanji.

srowen · 2016-12-01T08:40:57Z

sql/core/pom.xml

I'm not against it, but I'm a little hesitant to bring in all this weight to fix a basically cosmetic problem. This may already be included transitively though. Worth checking a) the license of this library and b) whether it's already in use in the transitive dependencies.

Sure.
a) I think there is no limitation in the license.
b) I cannot find this jar in the current transitive dependencies.

OK, yes that's a cat-A license so it's OK. With any dependency I'd also want to check whether it brings in anything else under a different license or whether it's particularly large, etc.

Disregard my other comment. I think I was thinking of the fact that Lucene already uses this.

ICU is widely used, as shown in http://site.icu-project.org . A very useful package. Before, we used it for codepage conversion.

viirya · 2016-12-01T13:46:44Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

good catch, done

viirya · 2016-12-01T13:47:55Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

There is a ";" at the end of this line.

thanks, done

viirya · 2016-12-01T13:49:02Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Indentation looks weird.

oh, updated

viirya · 2016-12-01T14:00:38Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Any reason we replace StringUtils.leftPad/rightPad with repeatPadding?

StringUtils.leftPad/rightPad uses String.length. Since this usage causes the same problem, the new code does not use these methods.

oh. Got it.

For this purpose, the current repeatPadding looks verbose. If you just want to create an exact number of spaces, you can use " " * n.
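The suggestion above could look roughly like the following sketch. The `Padding` object and the `displayWidth` parameter are hypothetical names for illustration. The width function is passed in by the caller, so the padding logic does not depend on how width is measured.

```scala
// Hypothetical sketch (not the PR's code): pad by display width rather than
// String.length, building the run of spaces with " " * n.
object Padding {
  // `displayWidth` is a caller-supplied function returning the column count.
  def leftPad(cell: String, target: Int, displayWidth: String => Int): String = {
    val missing = math.max(0, target - displayWidth(cell))
    " " * missing + cell
  }

  def rightPad(cell: String, target: Int, displayWidth: String => Int): String = {
    val missing = math.max(0, target - displayWidth(cell))
    cell + " " * missing
  }
}
```

With a width function that counts "文字列1" as 7 columns, left-padding the header "s" to 7 yields six spaces followed by "s", matching the "Output with this PR" table in the description.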

SparkQA · 2016-12-01T17:59:34Z

Test build #69483 has finished for PR 16086 at commit 60aa2cb.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-12-01T18:20:17Z

Test build #69487 has finished for PR 16086 at commit e828ad9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-01T18:26:54Z

Test build #69486 has finished for PR 16086 at commit 7049809.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-12-02T00:31:30Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    if (locale == null) {
+      throw new NullPointerException("locale is null")
+    }
+    val ambiguousLen = if (EAST_ASIAN_LANGS.contains(locale.getLanguage())) 2 else 1


How about creating a separate helper function for the default width?

I can create a separate helper for the default width. The challenge is deciding, given a string, when the helper should be applied.
I have been thinking about these conditions, but I do not have an answer yet.

gatorsmile · 2016-12-02T00:34:20Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

      }
  }

+  val EAST_ASIAN_LANGS = Seq("ja", "vi", "kr", "zh")


Only these four?
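For reference, a separate helper for the default width of ambiguous characters, as suggested earlier in this review, might look like the sketch below. The `AmbiguousWidth` object and its method name are hypothetical. One detail worth noting: Locale.getLanguage returns "ko" for Korean, so the sketch uses "ko" where the diff has "kr". The rest of the language list is copied from the diff and is not claimed to be complete.

```scala
import java.util.{Locale, Objects}

// Hypothetical helper (not the PR's code): choose the column width assigned
// to "ambiguous" East Asian Width characters based on the locale's language.
object AmbiguousWidth {
  // List taken from the diff, with "ko" substituted for "kr" (assumption:
  // Korean was intended). Completeness of the list is an open question.
  val EastAsianLangs: Set[String] = Set("ja", "vi", "ko", "zh")

  def ambiguousWidth(locale: Locale): Int = {
    // Throws NullPointerException with this message, like the check in the diff.
    Objects.requireNonNull(locale, "locale is null")
    if (EastAsianLangs.contains(locale.getLanguage)) 2 else 1
  }
}
```

A caller would typically pass Locale.getDefault, though whether the JVM default locale reflects the terminal's rendering is a separate question that this sketch does not address.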

gatorsmile · 2016-12-02T00:34:50Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+      val value = UCharacter.getIntPropertyValue(codePoint, UProperty.EAST_ASIAN_WIDTH)
+      len = len + (value match {
+        case UCharacter.EastAsianWidth.NARROW | UCharacter.EastAsianWidth.NEUTRAL |
+             UCharacter.EastAsianWidth.HALFWIDTH => 1


An indent issue.

gatorsmile · 2016-12-02T00:37:25Z

Now, showString becomes more complex in order to handle Unicode width. I am neutral on this fix.

viirya · 2016-12-02T01:42:32Z

yeah, I would think this is a relatively rare use case. Need to consider if it is worth extra complexity.

rxin · 2016-12-02T06:48:16Z

I agree - don't think this is worth the complexity.

srowen · 2016-12-11T09:38:07Z

Unless someone vigorously objects, yes let's close this.

kiszk · 2016-12-12T02:51:02Z

I am thinking about a simpler approach. However, it is fine to close this for now.

kiszk force-pushed the SPARK-18653 branch from 4e71dc6 to 350e1ae Compare December 1, 2016 02:28

srowen reviewed Dec 1, 2016

viirya reviewed Dec 1, 2016

kiszk force-pushed the SPARK-18653 branch from 350e1ae to 60aa2cb Compare December 1, 2016 15:26

kiszk added 3 commits December 2, 2016 01:01

initial commit

14deeed

add jar for icu

a42547d

fix test failure

7049809

kiszk force-pushed the SPARK-18653 branch from 60aa2cb to 7049809 Compare December 1, 2016 16:02

kiszk added 2 commits December 2, 2016 01:22

addressed review comments

8e1b434

add one more line for the testsuite

e828ad9

gatorsmile reviewed Dec 2, 2016

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017

Close stale PRs

ecaca37

Closes apache#12968 Closes apache#16215 Closes apache#16212 Closes apache#16086 Closes apache#15713 Closes apache#16413 Closes apache#16396

srowen mentioned this pull request Jan 1, 2017

[BUILD] Close stale PRs #16447

Closed

asfgit closed this in ba48812 Jan 2, 2017

kiszk mentioned this pull request Aug 9, 2018

[SPARK-25108][SQL] Fix the show method to display the wide character alignment problem #22048

Closed
