Contributor

@zhengruifeng zhengruifeng commented Apr 12, 2023

What changes were proposed in this pull request?

Abbreviate the BYTES and STRING fields in proto messages.

Note that repeated and map<...> fields are always skipped for now.
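The traversal described above can be sketched roughly as follows. This is a minimal Python sketch, not the patch itself: `Msg`, `abbreviate`, and `MAX_SIZE` are hypothetical names, a `Msg` stands in for a nested proto message, a plain `list` stands in for a repeated field, and a plain `dict` stands in for a map field. The marker format is borrowed from the suggestion made later in this review.

```python
from typing import Any

MAX_SIZE = 8  # hypothetical threshold, chosen only for illustration


class Msg(dict):
    """Stand-in for a proto message: maps field names to values."""


def abbreviate(msg: Msg, max_size: int = MAX_SIZE) -> Msg:
    """Return a copy of msg with long str/bytes fields abbreviated."""
    out = Msg()
    for name, value in msg.items():
        if isinstance(value, Msg):
            # Nested message: recurse into it.
            out[name] = abbreviate(value, max_size)
        elif isinstance(value, str) and len(value) > max_size:
            out[name] = value[:max_size] + f"[truncated(size={len(value)})]"
        elif isinstance(value, bytes) and len(value) > max_size:
            marker = f"[truncated(size={len(value)})]".encode("utf-8")
            out[name] = value[:max_size] + marker
        else:
            # Repeated (list) and map (plain dict) fields, numbers, and
            # short values are passed through untouched.
            out[name] = value
    return out
```

In this sketch, long values hidden inside a repeated or map field are still printed in full, which matches the "skipped for now" caveat.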

Why are the changes needed?

1. For abbreviation:

In [6]: spark.createDataFrame(range(0, 1000)).show()

In [7]: query = "SELECT /* " + "bla" * 8192 + " */ 1"

In [8]: spark.sql(query).show()

before:
[screenshot not preserved]

after:
[screenshot not preserved]

2. Message.toString may cause an OOM when the message is large.
This PR tries to abbreviate the bytes and strings, which make up the main parts of LocalRelation and PythonUDF.

Does this PR introduce any user-facing change?

Yes. When BYTES and STRING fields are too long, they are abbreviated and their size is shown.

How was this patch tested?

Manually checked.

Member

I think the method name should be something like abbreviateBytes.

Contributor Author

We can also change other fields in this method via pattern matching, but right now it only abbreviates the bytes fields, so abbreviateBytes LGTM.

Contributor Author

@zhengruifeng zhengruifeng Apr 12, 2023

I removed this try-catch for testing purposes (a PyTorch UT failed due to an OOM before); I will add it back to be more conservative.

Member

I think it's fine to remove it (since you added it to address this specific case before?). I don't mind removing it

Contributor Author

@zhengruifeng zhengruifeng Apr 12, 2023

Yes, I added it for that PyTorch test case, in which the UDF is 47 MB and causes an OOM.

But I am not sure whether there are other unknown edge cases that can also cause a failure, so I personally prefer adding the try-catch back before merging.
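The conservative guard discussed here could look roughly like the sketch below. It is written in Python for illustration only; `safe_debug_string`, its `render` parameter, and the fallback text are all hypothetical, and the thread does not say what the real try-catch returns on failure.

```python
from typing import Any, Callable


def safe_debug_string(
    message: Any,
    render: Callable[[Any], str] = str,
    fallback: str = "<debug string unavailable>",  # hypothetical placeholder
) -> str:
    """Render a debug string, falling back to a placeholder on failure."""
    try:
        return render(message)
    except Exception:
        # MemoryError is a subclass of Exception in Python, so an
        # out-of-memory failure while rendering is covered as well.
        return fallback
```

The trade-off is that an unexpected failure in rendering degrades the debug output instead of failing the whole request.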

@zhengruifeng zhengruifeng changed the title [WIP][CONNECT] Redact the proto message [WIP][CONNECT] Abbreviate Bytes in proto message's debug string Apr 12, 2023
@zhengruifeng zhengruifeng changed the title [WIP][CONNECT] Abbreviate Bytes in proto message's debug string [SPARK-43105][CONNECT] Abbreviate Bytes in proto message's debug string Apr 12, 2023
@zhengruifeng zhengruifeng marked this pull request as ready for review April 12, 2023 08:58
@zhengruifeng
Contributor Author

also cc @grundprinzip

Comment on lines 35 to 43
Contributor

This creates at least one additional copy of the string. Could we avoid it by passing the ByteString directly into the createByteString method?

Contributor Author

The ByteString doesn't provide a slicing or view method, so I think we have to copy.
But we only copy a few (8 here) bytes, so it should be fine.

Contributor

I'm confused about this logic. Assume a short string with size < NUM_FIRST_BYTES: it goes into the else branch, and createByteString will return ********(redacted, size=23). Is this really expected? Shouldn't it just show the short string instead?

Contributor Author

it is to avoid showing all the raw data in LocalRelation

Contributor Author

OK, I will just show the original short string in this case.

Contributor

@grundprinzip grundprinzip left a comment

I really appreciate the change, but I think there might be a bug in the logic.

@zhengruifeng
Contributor Author

zhengruifeng commented Apr 12, 2023

OK, on second thought, I think we should narrow this PR to abbreviation only.

I think we can support redaction as follows in the future:

{
...
      case (field: FieldDescriptor, relation: proto.LocalRelation)
          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && relation != null =>
        builder.setField(field, redactLocalRelation(relation))

      case (field: FieldDescriptor, msg: Message)
          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && msg != null =>
      ...
...
}

private def redactLocalRelation(relation: proto.LocalRelation): proto.LocalRelation = {

....
}

@zhengruifeng zhengruifeng changed the title [SPARK-43105][CONNECT] Abbreviate Bytes in proto message's debug string [SPARK-43105][CONNECT] Abbreviate Bytes and Strings in proto message Apr 12, 2023
Contributor

Since this is no longer "redacted", what about making the format string something like:

"[truncated(size=XXX)]"

Member

Yeah, let's just focus on abbreviating instead of redacting. This code path would likely have to change before 4.0 for better UI in any event.

Contributor Author

done

Contributor

Similar here

"$prefix[truncated(size=XXX)]"

Contributor Author

done
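For readers following the format discussion, the agreed string behavior (short values shown as-is, long values shown as a prefix followed by a size marker) might be sketched as below. This is an illustrative Python sketch rather than the merged Scala code; `abbreviate_string` and `MAX_STRING_SIZE` are hypothetical names, and the threshold value is an assumption.

```python
MAX_STRING_SIZE = 1024  # hypothetical threshold, not taken from the patch


def abbreviate_string(value: str, max_size: int = MAX_STRING_SIZE) -> str:
    """Keep short strings intact; cut long ones to a prefix plus a size marker."""
    size = len(value)
    if size <= max_size:
        # Short strings are shown unchanged, as agreed earlier in the review.
        return value
    prefix = value[:max_size]
    return f"{prefix}[truncated(size={size})]"
```

Applied to the query from the PR description with a threshold of 16, `abbreviate_string("SELECT /* " + "bla" * 8192 + " */ 1", 16)` gives `SELECT /* blabla[truncated(size=24591)]`.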

@zhengruifeng zhengruifeng force-pushed the connect_redact branch 2 times, most recently from 6012351 to 0af622d Compare April 13, 2023 03:08
@zhengruifeng zhengruifeng force-pushed the connect_redact branch 2 times, most recently from 071e7fc to fc7be14 Compare April 13, 2023 09:56
@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_redact branch April 14, 2023 01:37