Skip to content

Conversation

@xuejianbest
Copy link
Contributor

@xuejianbest xuejianbest commented Aug 9, 2018

This is not a perfect solution. It is designed to minimize complexity on the basis of solving problems.

It is effective for English, Chinese characters, Japanese, Korean and so on.

before:
+---+---------------------------+-------------+
|id |中国                         |s2           |
+---+---------------------------+-------------+
|1  |ab                         |[a]          |
|2  |null                       |[中国, abc]    |
|3  |ab1                        |[hello world]|
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]        |
|5  |中国(你好)a                    |[“中(国), 312] |
|6  |中国山(东)服务区                  |[“中(国)]      |
|7  |中国山东服务区                    |[中(国)]       |
|8  |                           |[中国]         |
+---+---------------------------+-------------+

after:
+---+-----------------------------------+----------------+
|id |中国                               |s2              |
+---+-----------------------------------+----------------+
|1  |ab                                 |[a]             |
|2  |null                               |[中国, abc]     |
|3  |ab1                                |[hello world]   |
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]         |
|5  |中国(你好)a                      |[“中(国), 312]|
|6  |中国山(东)服务区                   |[“中(国)]      |
|7  |中国山东服务区                     |[中(国)]        |
|8  |                                   |[中国]          |
+---+-----------------------------------+----------------+

What changes were proposed in this pull request?

When there are wide characters such as Chinese characters or Japanese characters in the data, the show method has a alignment problem.
Try to fix this problem.

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

image

Please review http://spark.apache.org/contributing.html before opening a pull request.

before:
+---+---------------------------+-------------+
|id |中国                         |s2           |
+---+---------------------------+-------------+
|1  |ab                         |[a]          |
|2  |null                       |[中国, abc]    |
|3  |ab1                        |[hello world]|
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]        |
|5  |中国(你好)a                    |[“中(国), 312] |
|6  |中国山(东)服务区                  |[“中(国)]      |
|7  |中国山东服务区                    |[中(国)]       |
|8  |                           |[中国]         |
+---+---------------------------+-------------+

after:
+---+--------------------------+----------------+
| id|                      中国|              s2|
+---+--------------------------+----------------+
|  1|                        ab|             [a]|
|  2|                      null|     [中国, abc]|
|  3|                       ab1|   [hello world]|
|  4|か行 きゃ(kya) きゅ(kyu...|         [“中国]|
|  5|             中国(你好)a|[“中(国), 312]|
|  6|          中国山(东)服务区|      [“中(国)]|
|  7|            中国山东服务区|        [中(国)]|
|  8|                          |          [中国]|
+---+--------------------------+----------------+
@kiszk
Copy link
Member

kiszk commented Aug 9, 2018

Since this change does not look minor, could you please create a JIRA entry?

@kiszk
Copy link
Member

kiszk commented Aug 9, 2018

I created a PR to solve the same problem.
Can this PR handle East Asian Width correctly?

@xuejianbest
Copy link
Contributor Author

xuejianbest commented Aug 10, 2018

After testing, it is found that regular expressions are changed to the following.
val regex = """[^\x00-\u2e39]""".r

@kiszk
I solve this problem by regular expressions. The complexity and display effect are weighed.

@kiszk
Copy link
Member

kiszk commented Aug 13, 2018

@xuejianbest Could you please create a JIRA entry and add test cases to this PR?

@xuejianbest xuejianbest changed the title Fix the show method to display the wide character alignment problem [SPARK-25108]Fix the show method to display the wide character alignment problem Aug 14, 2018
@xuejianbest xuejianbest changed the title [SPARK-25108]Fix the show method to display the wide character alignment problem [SPARK-25108][SQL] Fix the show method to display the wide character alignment problem Aug 14, 2018
@xuejianbest
Copy link
Contributor Author

Below is the display effect I tested on Xshell5.
https://issues.apache.org/jira/secure/attachment/12935462/show.bmp

@xuejianbest
Copy link
Contributor Author

df.show

@kiszk
Copy link
Member

kiszk commented Aug 17, 2018

Thank you for creating a JIRA entry and for putting the result. The test case is not available yet.

// We set a minimum column width at '3'
val minimumColWidth = 3

val regex = """[^\x00-\u2e39]""".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a comment and slightly better name for the variable? I also wonder if a regex is a little bit slow for scanning every character.

However it's not clear this definition is accurate enough. According to things like https://stackoverflow.com/questions/13505075/analyzing-full-width-or-half-width-character-in-java we're really looking for the concept of "fullwidth" East Asian characters. The answer there provides a somewhat more precise definition, though others say you need something like icu4j for a proper definition.

Maybe at least adopt the alternative proposed in the SO answer? It's not necessary to be perfect here, as it's a cosmetic issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @srowen's comment. To make it clear, could you please add tests for more characters?

@srowen
Copy link
Member

srowen commented Aug 26, 2018

Ping @xuejianbest

Modify regular expressions to make them more precise.
Modify variable names and add comment.
@xuejianbest
Copy link
Contributor Author

I looked at all the 0x0000-0xFFFF characters (unicode) and showed them under Xshell, then found all the full width characters. Get the latest regular expression.
A new version has been submitted, and the variable name has been changed and comments added, please see below.
@srowen @kiszk

A new version is submitted, and the variable name is changed and an annotation is added. See below.

@kiszk
Copy link
Member

kiszk commented Aug 28, 2018

@xuejianbest thank you for your update. Would it be possible to commit test cases, too?

// We set a minimum column width at '3'
val minimumColWidth = 3

//Regular expression matching full width characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very small nit that might fail scalastyle -- space after comment slashes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks.

val minimumColWidth = 3

//Regular expression matching full width characters
val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if detecting this using a regex for each character is slow? probably not meaningfully enough to care about, but I wonder.

Copy link
Contributor Author

@xuejianbest xuejianbest Aug 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generated 1000 strings, each consisting of 1000 characters with a random unicode of 0x0000-0xFFFF. (a total of 1 million characters.)
Then use this regular expression to find the full width character of these strings.
I tested 100 rounds and then averaged.
It takes 49 milliseconds to complete matching all 1000 strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, not worth optimizing now.

for (row <- rows) {
for ((cell, i) <- row.zipWithIndex) {
colWidths(i) = math.max(colWidths(i), cell.length)
colWidths(i) = math.max(colWidths(i), cell.length + fullWidthRegex.findAllIn(cell).size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a utility method for the x.length + fullWidthRegex.findAllIn(x).size expression that recurs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I committed a new version. See if this is appropriate please ?
@srowen

Putting `fullWidthRegex` outside `stringHalfWidth` is to reduce the cost of compiling regular expressions.
Since compilation is expensive, frequently used Regexes should be constructed once.
// The number of half width of a string
def stringHalfWidth = (str: String) => {
str.length + fullWidthRegex.findAllIn(str).size
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to move this method into util.Utils or something as a helper function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add tests for the helper function, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it into util.Utils.

val minimumColWidth = 3

// Regular expression matching full width characters
val fullWidthRegex = """[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE10-\uFE19\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6]""".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line goes over the limit, 100.

Move the stringHalfWidth method into util.Utils and add tests for it.
@xuejianbest
Copy link
Contributor Author

I have moved the method of calculating the half width of a string into util.Utils object.
@maropu @srowen

* A full width character occupies two half widths
*/
def stringHalfWidth(str: String): Int = {
if(str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: space before if

/**
* Regular expression matching full width characters
*/
private lazy val fullWidthRegex = ("""[""" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't need to be lazy

@maropu
Copy link
Member

maropu commented Aug 29, 2018

Also, can you add tests in DatasetSuite?

@SparkQA
Copy link

SparkQA commented Aug 30, 2018

Test build #4305 has finished for PR 22048 at commit 3649de5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuejianbest
Copy link
Contributor Author

Why did it fail?
What else should I do?

@xuejianbest
Copy link
Contributor Author

Please help me @srowen
I don't know where I need to make changes.

@srowen
Copy link
Member

srowen commented Aug 31, 2018

Have a look at the test results, which show the style problem it flags:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4305/console

[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2803:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2804:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2805:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2806:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2807:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2808:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2809:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2810:4: nonascii.message

So it doesn't like non-ASCII chars after all! I wonder how they exist in the other parts of the source. I checked and other parts just disable the check for sections of code where it's appropriate to use unicode chars.

Just wrap these parts of the code in:

// scalastyle:off nonascii
...
// scalastyle:on nonascii

Sorry, didn't know this was needed.

@maropu
Copy link
Member

maropu commented Aug 31, 2018

style errors, we cannot include non-ascii characters in files;

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2803:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2804:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2805:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2806:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2807:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2808:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2809:4: nonascii.message
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/util/Utils.scala:2810:4: nonascii.message
[error] Total time: 16 s, completed Aug 30, 2018 6:48:19 AM
[error] running /home/jenkins/workspace/NewSparkPullRequestBuilder@2/dev/lint-scala ; received return code 1
Attempting to post to Github...
 > Post successful.

@maropu
Copy link
Member

maropu commented Aug 31, 2018

+1 for adding scalastyle:off nonascii in that part.

To solve style errors because of include non-ascii characters in files.
Disable the check for sections of code where it's appropriate to use unicode chars.
@xuejianbest
Copy link
Contributor Author

I see. A new commit has been done.
Thinks.

@kiszk
Copy link
Member

kiszk commented Aug 31, 2018

retest this please

@srowen
Copy link
Member

srowen commented Aug 31, 2018

Hm! so the warning triggers on unicode escape sequences? That seems like a bug in the style checker. Oh well, yes keep them. I'll retest ...

@SparkQA
Copy link

SparkQA commented Aug 31, 2018

Test build #4329 has finished for PR 22048 at commit 45ac272.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Aug 31, 2018

LGTM, then let me ask @gatorsmile

"""\uFE10-\uFE19""" +
"""\uFE30-\uFE6F""" +
"""\uFF00-\uFF60""" +
"""\uFFE0-\uFFE6""" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general question.

  • How to get this Regex list? Any reference? It sounds like this should be a general problem
  • What is the performance impact?

Can you answer them and post them in the PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • How to get this Regex list? Any reference? It sounds like this should be a general problem

I looked at all the 0x0000-0xFFFF characters (unicode) and showed them under Xshell, then found all the full width characters. Get the regular expression.

  • What is the performance impact?

I generated 1000 strings, each consisting of 1000 characters with a random unicode of 0x0000-0xFFFF. (a total of 1 million characters.)
Then use this regular expression to find the full width character of these strings.
I tested 100 rounds and then averaged.
It takes 49 milliseconds to complete matching all 1000 strings.

@gatorsmile

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at all the 0x0000-0xFFFF characters (unicode) and showed them under Xshell, then found all the full width characters. Get the regular expression.

Can you describe them there and put a references to a public unicode document?
See the comment in UTF8String;
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L65

It takes 49 milliseconds to complete matching all 1000 strings.

How about some additional overheads when calling showString as compared to showString w/o this patch?

Copy link
Contributor Author

@xuejianbest xuejianbest Sep 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe them there and put a references to a public unicode document?

This is a regular expression match using unicode, regardless of the specific encoding.
For example, the following string is encoded using gbk instead of utf8, and the match still works:

    val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
    val s1 = new String(bytes, "gbk")    
    println(s1) //中国    
    val fullWidthRegex = ("""[""" +
    // scalastyle:off nonascii
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    // scalastyle:on nonascii
    """]""").r
    println(fullWidthRegex.findAllIn(s1).size) //2

This regular expression is obtained experimentally under a specific font.
I don't understand what you are going to do.

How about some additional overheads when calling showString as compared to showString w/o this patch?

I tested a Dataset consisting of 100 rows, each row has two columns, one column is the index (0-99), and the other column is a random string of length 100 characters, and then the showString display is called separately.
The original showString method (w/o this patch) took about 42ms, and the improved time took about 46ms, and the performance was about 10% worse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. Just copy a summary of your comments here into the comments in the code. Yes this has nothing to do with UTF8 encoding directly. You are matching UCS2 really, 16bit char values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I need to merge the above commited into one commit,
Or add another new commit?
Or change the last commit comments ?
@srowen

@srowen
Copy link
Member

srowen commented Sep 4, 2018

Just add a new commit with a little info about what this does and how you generated it. The commits are squashed anyway.

…blem

Some characters (called full-width characters) occupy two
ordinary characters (called half-width characters) width
when displayed in a terminal such as Xshell.
Therefore, when the Dataset.show() method is called,
if there are full-width characters and half-width characters
at the same time, alignment problems will occur.

This commit adds a method to calculate the number of half
widths in a given string, so that it can get the correct fill
when calling show method, which gives the display a more
perfect alignment.

Performance impact:
Tested a Dataset consisting of 100 rows, each row has two columns,
one column is the index (0-99), and the other column is a random
string of length 100 characters, and then the showString display
is called separately.
The original show method (w/o this patch) took about 42ms, and the
improved time took about 46ms, the performance was about 10% worse.
@xuejianbest
Copy link
Contributor Author

I see, Thinks.

@asfgit asfgit closed this in f5817d8 Sep 6, 2018
@srowen
Copy link
Member

srowen commented Sep 6, 2018

Merging to master and 2.4, though for some reason I can't back-port with the merge script right now. Will try later.

@srowen
Copy link
Member

srowen commented Sep 6, 2018

@dongjoon-hyun could I ask a favor; can you try merging this? the merge script should say it's merged and offer to back-port to 2.4. However I keep getting errors about it not being mergeable, although all upstream branches and my clone all are up to date.

@dongjoon-hyun
Copy link
Member

@srowen . Sure. I'll try to backport this.

@dongjoon-hyun
Copy link
Member

Ur, it seems that I'm facing the same situation. In this case, should we just cherry-pick to branch-2.4?

asfgit pushed a commit that referenced this pull request Sep 6, 2018
…alignment problem

This is not a perfect solution. It is designed to minimize complexity on the basis of solving problems.

It is effective for English, Chinese characters, Japanese, Korean and so on.

```scala
before:
+---+---------------------------+-------------+
|id |中国                         |s2           |
+---+---------------------------+-------------+
|1  |ab                         |[a]          |
|2  |null                       |[中国, abc]    |
|3  |ab1                        |[hello world]|
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]        |
|5  |中国(你好)a                    |[“中(国), 312] |
|6  |中国山(东)服务区                  |[“中(国)]      |
|7  |中国山东服务区                    |[中(国)]       |
|8  |                           |[中国]         |
+---+---------------------------+-------------+

after:
+---+-----------------------------------+----------------+
|id |中国                               |s2              |
+---+-----------------------------------+----------------+
|1  |ab                                 |[a]             |
|2  |null                               |[中国, abc]     |
|3  |ab1                                |[hello world]   |
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]         |
|5  |中国(你好)a                      |[“中(国), 312]|
|6  |中国山(东)服务区                   |[“中(国)]      |
|7  |中国山东服务区                     |[中(国)]        |
|8  |                                   |[中国]          |
+---+-----------------------------------+----------------+
```

## What changes were proposed in this pull request?

When there are wide characters such as Chinese characters or Japanese characters in the data, the show method has a alignment problem.
Try to fix this problem.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

![image](https://user-images.githubusercontent.com/13044869/44250564-69f6b400-a227-11e8-88b2-6cf6960377ff.png)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #22048 from xuejianbest/master.

Authored-by: xuejianbest <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Sep 6, 2018

OK, it isn't just me. We'll see if it's a broader problem.

I manually cherry-picked to branch-2.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants