Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Nov 30, 2016

What changes were proposed in this pull request?

This PR put correct space padding for unicode character at Dataset.show().
The reason of putting incorrect padding is to count string width by string.length. This PR counds string width by using East Asian Width.

Example program

case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS.show

Output without this PR

+---+---+----+
| 整数| 実数|   s|
+---+---+----+
|  1|1.1|文字列1|
+---+---+----+

Output with this PR

+----+----+-------+
|整数|実数|      s|
+----+----+-------+
|   1| 1.1|文字列1|
+----+----+-------+

How was this patch tested?

Add a test suite

@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69424 has finished for PR 16086 at commit 93379b6.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69425 has finished for PR 16086 at commit e4555f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69443 has finished for PR 16086 at commit 4e71dc6.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69446 has finished for PR 16086 at commit 350e1ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Dec 1, 2016

@gatorsmile would it be possible to review this? You would be familiar with Kanji?

sql/core/pom.xml Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against it, but I'm a little hesitant to bring in all this weight to fix a basically cosmetic problem. This may already be included transitively though. WOrth checking the a) license of this library and b) whether it's already in use in the transitive dependencies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
a) I think there is no limitation in the licence
b) I cannot find this jar in the current transitive dependency

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yes that's a cat-A license so it's OK. With any dependency I'd also want to check whether it brings in anything else under a different license or whether it's particularly large, etc.

Disregard my other comment. I think I was thinking of the fact that Lucene already uses this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ICU is widely used, as shown in http://site.icu-project.org . A very useful package. Before, we used it for codepage conversion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why var?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a ";" at the end of this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ident looks weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we replace StringUtils.leftPad/rightPad with repeatPadding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringUtils.leftPad/rightPad uses String.length. Since this usage causes the same problem, the new code does not use these methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. Got it.

For this purpose, current repeatPadding looks verbose. If you just want to create exact number of spaces, you can use " " * n.

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69483 has finished for PR 16086 at commit 60aa2cb.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69487 has finished for PR 16086 at commit e828ad9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69486 has finished for PR 16086 at commit 7049809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (locale == null) {
throw new NullPointerException("locale is null")
}
val ambiguousLen = if (EAST_ASIAN_LANGS.contains(locale.getLanguage())) 2 else 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating a separate helper function for the default width?

Copy link
Member Author

@kiszk kiszk Dec 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create the separate helper for the default width. A challenge is how we can decide the helper can be applied when we have got a string.
While I have been thinking about these conditions, I have not answers yet.

}
}

val EAST_ASIAN_LANGS = Seq("ja", "vi", "kr", "zh")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only these four?

val value = UCharacter.getIntPropertyValue(codePoint, UProperty.EAST_ASIAN_WIDTH)
len = len + (value match {
case UCharacter.EastAsianWidth.NARROW | UCharacter.EastAsianWidth.NEUTRAL |
UCharacter.EastAsianWidth.HALFWIDTH => 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An indent issue.

@gatorsmile
Copy link
Member

Now, showString become more complex for handling the Unicode width. I am neutral on this fix.

@viirya
Copy link
Member

viirya commented Dec 2, 2016

yeah, I would think this is a relatively rare use case. Need to consider if it is worth extra complexity.

@rxin
Copy link
Contributor

rxin commented Dec 2, 2016

I agree - don't think this is worth the complexity.

@srowen
Copy link
Member

srowen commented Dec 11, 2016

Unless someone vigorously objects, yes let's close this.

@kiszk
Copy link
Member Author

kiszk commented Dec 12, 2016

I am thinking about an simpler approach. However, it is fine to close for now.

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants