[SPARK-18653][SQL] Fix incorrect space padding for unicode character at Dataset.show #16086
Conversation
Test build #69424 has finished for PR 16086 at commit
Test build #69425 has finished for PR 16086 at commit
Test build #69443 has finished for PR 16086 at commit
Test build #69446 has finished for PR 16086 at commit
@gatorsmile would it be possible to review this? Would you be familiar with Kanji?
sql/core/pom.xml
Outdated
I'm not against it, but I'm a little hesitant to bring in all this weight to fix a basically cosmetic problem. This may already be included transitively, though. Worth checking a) the license of this library and b) whether it's already in use among the transitive dependencies?
Sure.
a) I think the license imposes no limitation.
b) I cannot find this jar among the current transitive dependencies.
OK, yes that's a cat-A license so it's OK. With any dependency I'd also want to check whether it brings in anything else under a different license or whether it's particularly large, etc.
Disregard my other comment. I think I was thinking of the fact that Lucene already uses this.
ICU is widely used, as shown at http://site.icu-project.org. It is a very useful package. Previously, we used it for codepage conversion.
why var?
good catch, done
There is a ";" at the end of this line.
thanks, done
The indent looks weird.
oh, updated
Any reason we replace StringUtils.leftPad/rightPad with repeatPadding?
StringUtils.leftPad/rightPad measure the string with String.length, which causes the same problem, so the new code does not use these methods.
Oh, got it.
For this purpose, the current repeatPadding looks verbose. If you just want to create an exact number of spaces, you can use " " * n.
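The suggestion above can be sketched as follows. This is only an illustration, not the code in the PR: `PadSketch`, `rightPad`, `leftPad`, and the `displayWidth` parameter are hypothetical names. The idea is that padding is computed from a caller-supplied width function instead of `String.length`, and the spaces come from `" " * n`.

```scala
object PadSketch {
  // Pads on the right so that the result occupies `width` display columns.
  // `displayWidth` is a hypothetical, caller-supplied function that returns
  // how many columns a string occupies; the result is never shorter than `s`.
  def rightPad(s: String, width: Int, displayWidth: String => Int): String = {
    val missing = math.max(0, width - displayWidth(s))
    s + " " * missing
  }

  // Same as above, but the spaces go on the left.
  def leftPad(s: String, width: Int, displayWidth: String => Int): String = {
    val missing = math.max(0, width - displayWidth(s))
    " " * missing + s
  }
}
```

With `_.length` as the width function this behaves like ordinary padding; passing a width function that counts wide characters as two columns changes only the number of spaces added.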
Test build #69483 has finished for PR 16086 at commit
Test build #69487 has finished for PR 16086 at commit
Test build #69486 has finished for PR 16086 at commit
if (locale == null) {
  throw new NullPointerException("locale is null")
}
val ambiguousLen = if (EAST_ASIAN_LANGS.contains(locale.getLanguage())) 2 else 1
How about creating a separate helper function for the default width?
I can create a separate helper for the default width. The challenge is deciding when the helper can be applied, given only a string.
I have been thinking about these conditions, but I do not have an answer yet.
  }
}

val EAST_ASIAN_LANGS = Seq("ja", "vi", "kr", "zh")
Only these four?
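One way the separate helper for the default width might look is sketched below. The names `AmbiguousWidthSketch` and `ambiguousWidth` are hypothetical and not part of the PR. The language set follows the hunk above, with one change that is my own assumption about the intent: `Locale.getLanguage` returns ISO 639 language codes, and the code for Korean is "ko" ("KR" is the country code), so the sketch uses "ko" in place of "kr".

```scala
import java.util.{Locale, Objects}

object AmbiguousWidthSketch {
  // Language codes as returned by Locale.getLanguage.
  // Assumption: "ko" replaces the "kr" entry from the hunk above.
  val EastAsianLangs: Set[String] = Set("ja", "vi", "ko", "zh")

  // Width, in columns, to assume for characters whose East Asian Width
  // is ambiguous: 2 for the listed languages, 1 otherwise.
  def ambiguousWidth(locale: Locale): Int = {
    // Throws NullPointerException with this message when locale is null,
    // matching the explicit check in the hunk above.
    Objects.requireNonNull(locale, "locale is null")
    if (EastAsianLangs.contains(locale.getLanguage)) 2 else 1
  }
}
```

Keeping the language list in one helper at least makes the reviewer's question about completeness a one-line change.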
val value = UCharacter.getIntPropertyValue(codePoint, UProperty.EAST_ASIAN_WIDTH)
len = len + (value match {
  case UCharacter.EastAsianWidth.NARROW | UCharacter.EastAsianWidth.NEUTRAL |
    UCharacter.EastAsianWidth.HALFWIDTH => 1
An indent issue.
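The hunk above looks up the East Asian Width property per code point through ICU4J. To show the overall shape of such a width computation without the extra dependency, here is a rough sketch that swaps in a different technique: it approximates "wide" using `java.lang.Character.UnicodeScript` plus the fullwidth-forms ranges. It is not equivalent to the real property (for example, it counts wide punctuation in the Common script and all ambiguous-width characters as one column), and `DisplayWidthSketch` and its members are hypothetical names.

```scala
import java.lang.Character.UnicodeScript

object DisplayWidthSketch {
  // Scripts treated as two columns wide. This is an approximation,
  // not the Unicode East Asian Width property itself.
  private val wideScripts: Set[UnicodeScript] = Set(
    UnicodeScript.HAN, UnicodeScript.HIRAGANA,
    UnicodeScript.KATAKANA, UnicodeScript.HANGUL)

  // Halfwidth Katakana and Hangul variants (U+FF61..U+FFDC) stay narrow
  // even though their script is Katakana or Hangul.
  private def isHalfwidthForm(cp: Int): Boolean = cp >= 0xFF61 && cp <= 0xFFDC

  // Fullwidth ASCII variants and fullwidth signs.
  private def isFullwidthForm(cp: Int): Boolean =
    (cp >= 0xFF01 && cp <= 0xFF60) || (cp >= 0xFFE0 && cp <= 0xFFE6)

  private def isWide(cp: Int): Boolean =
    !isHalfwidthForm(cp) &&
      (isFullwidthForm(cp) || wideScripts.contains(UnicodeScript.of(cp)))

  // Sums per-code-point widths, so a surrogate pair counts as one character.
  def displayWidth(s: String): Int =
    s.codePoints().toArray.map(cp => if (isWide(cp)) 2 else 1).sum
}
```

For example, `"漢字".length` is 2, while `DisplayWidthSketch.displayWidth("漢字")` is 4, which is the difference that leads to misaligned columns when padding is based on `String.length`.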
Now,
Yeah, I would think this is a relatively rare use case. We need to consider whether it is worth the extra complexity.
I agree - don't think this is worth the complexity.
Unless someone vigorously objects, yes let's close this.
I am thinking about a simpler approach. However, it is fine to close for now.
Closes apache#12968 Closes apache#16215 Closes apache#16212 Closes apache#16086 Closes apache#15713 Closes apache#16413 Closes apache#16396
What changes were proposed in this pull request?
This PR applies correct space padding for Unicode characters in Dataset.show(). The padding was incorrect because the string width was computed with String.length. This PR computes the string width using the Unicode East Asian Width property.
Example program
Output without this PR
Output with this PR
How was this patch tested?
Added a test suite.