-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-43105][CONNECT] Abbreviate Bytes and Strings in proto message #40750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the method name should be like abbreviateBytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also change other fields in this method via pattern matching, but right now it only abbreviate the bytes fields, so abbreviateBytes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this try-catch for test purpose (A PyTorch UT failed due to OOM before), will add it back to be more conservative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to remove it (since you added it to address this specific case before?). I don't mind removing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I added it for that PyTorch test case, in which the size of UDF is 47mb and cause OOM
but I am not very sure whether there are some other unknown edge cases that can also cause failure, so I personally prefer adding the try-catch back before merge it.
d92180a to
4bc2aa7
Compare
|
also cc @grundprinzip |
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates at least one additional copy of the string that we might be able to reduce by passing the bytestring directly into the createByteString method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ByteString doesn't provide a slicing or view method, so I think we have to copy.
But we just copy a few (8 here) bytes, so should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this logic, assume a short string with size < NUM_FIRST_BYTES this goes into the else branch and now createByteString will return ********(redacted, size=23) is this really expected? Shouldn't this just show the short string instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is to avoid showing all the raw data in LocalRelation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will just show the original short string in this case.
grundprinzip
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate the change, but I think there might be a bug in the logic.
|
ok, on second thought, I think we should narrow this PR to abbreviation only. I think we can support redaction as followings in the future: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this no longer "redacted", what about making the format string something like:
"[truncated(size=XXX)]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's just focus on abbreviating instead of redacting. This code path would likely have to change before 4.0 for better UI in any event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here
"$prefix[truncated(size=XXX)]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
6012351 to
0af622d
Compare
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala
Outdated
Show resolved
Hide resolved
071e7fc to
fc7be14
Compare
fc7be14 to
1798614
Compare
|
Merged to master. |
What changes were proposed in this pull request?
To abbreviate the
BYTESandSTRINGfields in proto message.Note that the
repeatedandmap<...>fields are always skipped for nowWhy are the changes needed?
1, for abbreviation:
before:

after:

2,
Message.toStringmay cause OOM when the message is largeThis PR try to abbreviate the bytes and string, which are the main parts of
LocalRelationandPythonUDFDoes this PR introduce any user-facing change?
yes, when
BYTESandSTRINGfields are too long, abbreviate them and show the sizeHow was this patch tested?
manually check