30 changes: 30 additions & 0 deletions core/src/main/scala/org/apache/spark/util/Utils.scala
@@ -2794,6 +2794,36 @@ private[spark] object Utils extends Logging {
}
}
}

/**
 * Regular expression matching full-width characters.
 *
 * Derived by displaying every character in the Unicode range 0x0000-0xFFFF
 * under Xshell and identifying those rendered at full width.
 */
private val fullWidthRegex = ("""[""" +
// scalastyle:off nonascii
"""\u1100-\u115F""" +
"""\u2E80-\uA4CF""" +
"""\uAC00-\uD7A3""" +
"""\uF900-\uFAFF""" +
"""\uFE10-\uFE19""" +
"""\uFE30-\uFE6F""" +
"""\uFF00-\uFF60""" +
"""\uFFE0-\uFFE6""" +
Member:
A couple of general questions:

  • How did you get this regex list? Any reference? This sounds like it should be a general problem.
  • What is the performance impact?

Can you answer these and post the answers in the PR description?

Contributor Author:

> • How did you get this regex list? Any reference? This sounds like it should be a general problem.

I displayed every character in the Unicode range 0x0000-0xFFFF under Xshell, identified the full-width ones, and derived the regular expression from them.

> • What is the performance impact?

I generated 1000 strings, each consisting of 1000 characters drawn at random from the Unicode range 0x0000-0xFFFF (1 million characters in total), then used this regular expression to find the full-width characters in each string. Averaged over 100 rounds, matching all 1000 strings takes about 49 milliseconds.
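A minimal sketch of such a micro-benchmark (a reconstruction, not the actual test code; it assumes the fullWidthRegex defined above is in scope):

    import scala.util.Random

    // Build 1000 strings of 1000 random BMP characters each.
    val strings = Seq.fill(1000)(
      new String(Array.fill(1000)(Random.nextInt(0x10000).toChar)))

    val rounds = 100
    val start = System.nanoTime()
    for (_ <- 1 to rounds; s <- strings) {
      fullWidthRegex.findAllIn(s).size // count full-width matches per string
    }
    val avgMs = (System.nanoTime() - start) / 1e6 / rounds
    println(f"average per round: $avgMs%.1f ms") // ~49 ms reported above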

@gatorsmile

Member:

> I displayed every character in the Unicode range 0x0000-0xFFFF under Xshell, identified the full-width ones, and derived the regular expression from them.

Can you describe that in the code and put a reference to a public Unicode document? See the comment in UTF8String:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L65

> Averaged over 100 rounds, matching all 1000 strings takes about 49 milliseconds.

What about the additional overhead when calling showString, compared to showString without this patch?

Contributor Author (@xuejianbest, Sep 4, 2018):

> Can you describe that in the code and put a reference to a public Unicode document?

This is a regular-expression match over Unicode characters, independent of any particular byte encoding. For example, the following string is decoded from GBK bytes rather than UTF-8, and the match still works:

    val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
    val s1 = new String(bytes, "gbk")
    println(s1) // 中国
    val fullWidthRegex = ("""[""" +
    // scalastyle:off nonascii
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    // scalastyle:on nonascii
    """]""").r
    println(fullWidthRegex.findAllIn(s1).size) //2

This regular expression was obtained experimentally under a specific font; I'm not sure I understand what you're asking for here.

> What about the additional overhead when calling showString, compared to showString without this patch?

I tested a Dataset of 100 rows with two columns: an index (0-99) and a random string of 100 characters, then called showString on it. The original showString (without this patch) took about 42 ms; with this patch it took about 46 ms, roughly 10% slower.
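A sketch of the comparison described (illustrative only; assumes a SparkSession named spark):

    import scala.util.Random
    import spark.implicits._

    val df = (0 until 100).map { i =>
      (i, new String(Array.fill(100)(Random.nextInt(0x10000).toChar)))
    }.toDF("index", "value")

    val start = System.nanoTime()
    df.show(100, truncate = false) // exercises the showString path
    println(s"show() took ${(System.nanoTime() - start) / 1e6} ms")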

Member:

I think this is fine. Just copy a summary of your comments here into the comments in the code. Yes, this has nothing to do with UTF-8 encoding directly; you are really matching UCS-2, i.e. 16-bit char values.
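A small illustration of the UCS-2 point (a sketch; the character chosen is arbitrary):

    // U+20000 lies outside the Basic Multilingual Plane; in a JVM String
    // it is stored as a surrogate pair of two 16-bit char values.
    val s = new String(Character.toChars(0x20000))
    println(s.length)                      // 2 -- two UTF-16 code units
    println(s.codePointCount(0, s.length)) // 1 -- a single code point
    // The regex ranges only cover \u0000-\uFFFF, so supplementary
    // characters like this one are never counted as full width.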

Contributor Author:

Do I need to squash the commits above into one commit, add another new commit, or just amend the last commit message?
@srowen

// scalastyle:on nonascii
"""]""").r

/**
 * Return the number of half widths in a given string. Note that a full-width
 * character occupies two half widths.
 *
 * For a string of 1 million characters, this method takes about 50 ms.
 */
def stringHalfWidth(str: String): Int = {
if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size
}
}

private[util] object CallerContext extends Logging {
21 changes: 21 additions & 0 deletions core/src/test/scala/org/apache/spark/util/UtilsSuite.scala
@@ -1184,6 +1184,27 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
assert(Utils.getSimpleName(classOf[MalformedClassObject.MalformedClass]) ===
"UtilsSuite$MalformedClassObject$MalformedClass")
}

test("stringHalfWidth") {
// scalastyle:off nonascii
assert(Utils.stringHalfWidth(null) == 0)
assert(Utils.stringHalfWidth("") == 0)
assert(Utils.stringHalfWidth("ab c") == 4)
assert(Utils.stringHalfWidth("1098") == 4)
assert(Utils.stringHalfWidth("mø") == 2)
assert(Utils.stringHalfWidth("γύρ") == 3)
assert(Utils.stringHalfWidth("pê") == 2)
assert(Utils.stringHalfWidth("ー") == 2)
assert(Utils.stringHalfWidth("测") == 2)
assert(Utils.stringHalfWidth("か") == 2)
assert(Utils.stringHalfWidth("걸") == 2)
assert(Utils.stringHalfWidth("à") == 1)
assert(Utils.stringHalfWidth("焼") == 2)
assert(Utils.stringHalfWidth("羍む") == 4)
assert(Utils.stringHalfWidth("뺭ᾘ") == 3)
assert(Utils.stringHalfWidth("\u0967\u0968\u0969") == 3)
// scalastyle:on nonascii
}
}

private class SimpleExtension
18 changes: 9 additions & 9 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -301,16 +301,16 @@ class Dataset[T] private[sql](
// Compute the width of each column
for (row <- rows) {
for ((cell, i) <- row.zipWithIndex) {
- colWidths(i) = math.max(colWidths(i), cell.length)
+ colWidths(i) = math.max(colWidths(i), Utils.stringHalfWidth(cell))
}
}

val paddedRows = rows.map { row =>
row.zipWithIndex.map { case (cell, i) =>
if (truncate > 0) {
- StringUtils.leftPad(cell, colWidths(i))
+ StringUtils.leftPad(cell, colWidths(i) - Utils.stringHalfWidth(cell) + cell.length)
} else {
- StringUtils.rightPad(cell, colWidths(i))
+ StringUtils.rightPad(cell, colWidths(i) - Utils.stringHalfWidth(cell) + cell.length)
}
}
}
@@ -332,12 +332,10 @@

// Compute the width of field name and data columns
val fieldNameColWidth = fieldNames.foldLeft(minimumColWidth) { case (curMax, fieldName) =>
- math.max(curMax, fieldName.length)
+ math.max(curMax, Utils.stringHalfWidth(fieldName))
}
val dataColWidth = dataRows.foldLeft(minimumColWidth) { case (curMax, row) =>
- math.max(curMax, row.map(_.length).reduceLeftOption[Int] { case (cellMax, cell) =>
-   math.max(cellMax, cell)
- }.getOrElse(0))
+ math.max(curMax, row.map(cell => Utils.stringHalfWidth(cell)).max)
}

dataRows.zipWithIndex.foreach { case (row, i) =>
@@ -346,8 +344,10 @@
s"-RECORD $i", fieldNameColWidth + dataColWidth + 5, "-")
sb.append(rowHeader).append("\n")
row.zipWithIndex.map { case (cell, j) =>
- val fieldName = StringUtils.rightPad(fieldNames(j), fieldNameColWidth)
- val data = StringUtils.rightPad(cell, dataColWidth)
+ val fieldName = StringUtils.rightPad(fieldNames(j),
+   fieldNameColWidth - Utils.stringHalfWidth(fieldNames(j)) + fieldNames(j).length)
+ val data = StringUtils.rightPad(cell,
+   dataColWidth - Utils.stringHalfWidth(cell) + cell.length)
s" $fieldName | $data "
}.addString(sb, "", "\n", "\n")
}
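For intuition on the new padding arithmetic: StringUtils.leftPad and rightPad count characters, not display columns, so the pad target is the column width minus the cell's extra display width. A worked sketch (illustrative values only):

    import org.apache.commons.lang3.StringUtils

    val cell = "测"   // one char, but two display columns wide
    val halfWidth = 2 // Utils.stringHalfWidth(cell)
    val colWidth = 4  // desired column width in half widths

    // Pad target in characters: 4 - 2 + 1 = 3, i.e. two spaces plus the
    // character, which renders as exactly 4 half-width columns.
    val padded = StringUtils.leftPad(cell, colWidth - halfWidth + cell.length)
    println(padded) // "  测"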
49 changes: 49 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -969,6 +969,55 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
checkShowString(ds, expected)
}

test("SPARK-25108 Fix the show method to display the full width character alignment problem") {
// scalastyle:off nonascii
val df = Seq(
(0, null, 1),
(0, "", 1),
(0, "ab c", 1),
(0, "1098", 1),
(0, "mø", 1),
(0, "γύρ", 1),
(0, "pê", 1),
(0, "ー", 1),
(0, "测", 1),
(0, "か", 1),
(0, "걸", 1),
(0, "à", 1),
(0, "焼", 1),
(0, "羍む", 1),
(0, "뺭ᾘ", 1),
(0, "\u0967\u0968\u0969", 1)
).toDF("b", "a", "c")
// scalastyle:on nonascii
val ds = df.as[ClassData]
val expected =
// scalastyle:off nonascii
"""+---+----+---+
|| b| a| c|
|+---+----+---+
|| 0|null| 1|
|| 0| | 1|
|| 0|ab c| 1|
|| 0|1098| 1|
|| 0| mø| 1|
|| 0| γύρ| 1|
|| 0| pê| 1|
|| 0| ー| 1|
|| 0| 测| 1|
|| 0| か| 1|
|| 0| 걸| 1|
|| 0| à| 1|
|| 0| 焼| 1|
|| 0|羍む| 1|
|| 0| 뺭ᾘ| 1|
|| 0| १२३| 1|
|+---+----+---+
|""".stripMargin
// scalastyle:on nonascii
checkShowString(ds, expected)
}

test(
"SPARK-15112: EmbedDeserializerInFilter should not optimize plan fragment that changes schema"
) {