[SPARK-43105][CONNECT] Abbreviate Bytes and Strings in proto message #40750

zhengruifeng · 2023-04-12T02:35:48Z

What changes were proposed in this pull request?

This PR abbreviates the BYTES and STRING fields in proto messages.

Note that repeated and map<...> fields are always skipped for now.
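For illustration only, the traversal described above can be sketched on a toy message model. The `Value`, `Str`, `Bytes`, `Repeated`, and `Msg` types and the `MaxSize` threshold are hypothetical stand-ins rather than the protobuf API or the PR's code, and the marker text follows the `[truncated(size=XXX)]` format suggested later in the review.

```scala
// Toy model of a message tree. These types are hypothetical stand-ins,
// not the protobuf API used by the PR.
sealed trait Value
final case class Str(value: String) extends Value
final case class Bytes(value: Array[Byte]) extends Value
final case class Repeated(values: Seq[Value]) extends Value
final case class Msg(fields: Map[String, Value]) extends Value

object AbbreviateSketch {
  // Assumed threshold; the PR's actual limit is not stated in this thread.
  val MaxSize = 8

  def abbreviate(v: Value): Value = v match {
    // Long strings keep a prefix and gain a size marker.
    case Str(s) if s.length > MaxSize =>
      Str(s.take(MaxSize) + s"[truncated(size=${s.length})]")
    // Long byte arrays are handled the same way.
    case Bytes(b) if b.length > MaxSize =>
      Bytes(b.take(MaxSize) ++ s"[truncated(size=${b.length})]".getBytes("UTF-8"))
    // Nested messages are visited recursively.
    case Msg(fields) =>
      Msg(fields.map { case (name, child) => name -> abbreviate(child) })
    // Short values and repeated fields are returned unchanged.
    case other => other
  }
}
```

In this sketch a `Repeated` value falls through to the last case, which mirrors the note that repeated fields are skipped for now.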

Why are the changes needed?

1. For abbreviation:

In [6]: spark.createDataFrame(range(0, 1000)).show()

In [7]: query = "SELECT /* " + "bla" * 8192 + " */ 1"

In [8]: spark.sql(query).show()

before:

after:

2. Message.toString may cause an OOM when the message is large.
This PR tries to abbreviate the bytes and strings, which are the main parts of LocalRelation and PythonUDF.

Does this PR introduce any user-facing change?

Yes. When BYTES and STRING fields are too long, they are abbreviated and their size is shown.

How was this patch tested?

Manually checked.

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

HyukjinKwon · 2023-04-12T06:39:34Z

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

I think the method should be named something like abbreviateBytes.

We could also handle other fields in this method via pattern matching, but right now it only abbreviates the bytes fields, so abbreviateBytes LGTM.

zhengruifeng · 2023-04-12T06:59:43Z

...t/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala

I removed this try-catch for testing purposes (a PyTorch UT failed due to OOM before). I will add it back to be more conservative.

I think it's fine to remove it (since you added it to address this specific case before?). I don't mind removing it

Yes, I added it for that PyTorch test case, in which the size of the UDF is 47 MB and causes an OOM.

But I am not sure whether there are other unknown edge cases that can also cause failures, so I personally prefer adding the try-catch back before merging.
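As a rough sketch of the conservative approach discussed here, debug-string rendering can be wrapped so that any failure falls back to a placeholder. The `safeDebugString` helper, its parameters, and the `"UNKNOWN"` placeholder are assumptions made for illustration, not the code in SparkConnectStreamHandler.

```scala
object DebugStringSketch {
  // Hypothetical helper: build a debug string, falling back to a
  // placeholder if rendering fails for any reason.
  def safeDebugString(render: () => String, fallback: String = "UNKNOWN"): String =
    try render()
    catch {
      // Catching Throwable is deliberate in this sketch so that an
      // OutOfMemoryError raised while rendering is also covered.
      case _: Throwable => fallback
    }
}
```

Catching `Throwable` is broader than usual Scala practice (`NonFatal` would not match `OutOfMemoryError`), so whether that trade-off is acceptable depends on the call site.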

zhengruifeng · 2023-04-12T09:06:00Z

also cc @grundprinzip

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

grundprinzip · 2023-04-12T09:21:14Z

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

This creates at least one additional copy of the string that we might be able to reduce by passing the bytestring directly into the createByteString method?

ByteString doesn't provide a slicing or view method, so I think we have to copy.
But we only copy a few bytes (8 here), so it should be fine.

grundprinzip · 2023-04-12T09:25:06Z

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

I'm confused about this logic. Assume a short string with size < NUM_FIRST_BYTES: it goes into the else branch, and createByteString will then return ********(redacted, size=23). Is this really expected? Shouldn't this just show the short string instead?

It is to avoid showing all the raw data in LocalRelation.

OK, I will just show the original short string in this case.

grundprinzip

I really appreciate the change, but I think there might be a bug in the logic.

zhengruifeng · 2023-04-12T11:04:18Z

OK, on second thought, I think we should narrow this PR to abbreviation only.

I think we can support redaction as follows in the future:

{
...
      case (field: FieldDescriptor, relation: proto.LocalRelation)
          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && relation != null =>
        builder.setField(field, redactLocalRelation(relation))

      case (field: FieldDescriptor, msg: Message)
          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && msg != null =>
      ...
...
}

private def redactLocalRelation(relation: proto.LocalRelation): proto.LocalRelation = {

....
}

grundprinzip · 2023-04-12T12:48:49Z

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

Since this is no longer "redacted", what about making the format string something like:

"[truncated(size=XXX)]"

Yeah, let's just focus on abbreviating instead of redacting. This code path would likely have to change before 4.0 for better UI in any event.

grundprinzip · 2023-04-12T12:49:08Z

connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala

Similar here

"$prefix[truncated(size=XXX)]"
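A minimal sketch of the suggested format, assuming hypothetical helper names and a caller-supplied `maxLen` (neither is taken from the PR). Values at or under the limit are returned unchanged, as agreed earlier in the review. Keeping a byte prefix in front of the marker is also an assumption of this sketch.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object TruncateFormatSketch {
  // Hypothetical helper: keep the first maxLen characters, then append the size marker.
  def abbreviateString(s: String, maxLen: Int): String =
    if (s.length <= maxLen) s
    else s"${s.take(maxLen)}[truncated(size=${s.length})]"

  // Hypothetical helper: same idea for raw bytes; only the prefix is copied.
  def abbreviateBytes(b: Array[Byte], maxLen: Int): Array[Byte] =
    if (b.length <= maxLen) b
    else b.take(maxLen) ++ s"[truncated(size=${b.length})]".getBytes(UTF_8)
}
```

For example, `TruncateFormatSketch.abbreviateString("a" * 23, 8)` yields `aaaaaaaa[truncated(size=23)]`, while a 5-character input comes back as-is.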

HyukjinKwon · 2023-04-14T00:59:15Z

Merged to master.

github-actions bot added CONNECT SQL labels Apr 12, 2023

zhengruifeng commented Apr 12, 2023

HyukjinKwon reviewed Apr 12, 2023

zhengruifeng commented Apr 12, 2023

zhengruifeng changed the title ~~[WIP][CONNECT] Redact the proto message~~ [WIP][CONNECT] Abbreviate Bytes in proto message's debug string Apr 12, 2023

zhengruifeng changed the title ~~[WIP][CONNECT] Abbreviate Bytes in proto message's debug string~~ [SPARK-43105][CONNECT] Abbreviate Bytes in proto message's debug string Apr 12, 2023

zhengruifeng marked this pull request as ready for review April 12, 2023 08:58

zhengruifeng force-pushed the connect_redact branch from d92180a to 4bc2aa7 Compare April 12, 2023 08:58

grundprinzip reviewed Apr 12, 2023

grundprinzip reviewed Apr 12, 2023

zhengruifeng changed the title ~~[SPARK-43105][CONNECT] Abbreviate Bytes in proto message's debug string~~ [SPARK-43105][CONNECT] Abbreviate Bytes and Strings in proto message Apr 12, 2023

grundprinzip reviewed Apr 12, 2023

zhengruifeng force-pushed the connect_redact branch 2 times, most recently from 6012351 to 0af622d Compare April 13, 2023 03:08

HyukjinKwon approved these changes Apr 13, 2023

HyukjinKwon reviewed Apr 13, 2023

zhengruifeng force-pushed the connect_redact branch 2 times, most recently from 071e7fc to fc7be14 Compare April 13, 2023 09:56

zhengruifeng added 5 commits April 13, 2023 19:25

init

8bb3bf8

init init init init

address comments

326d783

address comments

ef70885

address comments

ff40fbf

add a ticket

1798614

zhengruifeng force-pushed the connect_redact branch from fc7be14 to 1798614 Compare April 13, 2023 11:26

HyukjinKwon closed this in e330c48 Apr 14, 2023

zhengruifeng deleted the connect_redact branch April 14, 2023 01:37


4 participants
