[SPARK-53152][CORE][K8S][YARN] Use Java Files.readString instead of Files.asCharSource
#51881
The fully qualified `java.nio.file.Files.readString` will be replaced with `Files.readString` when we finish the migration in this file.
Are they the same in effect? Which migration is this?
- For the read side, the performance is the same.
- For the write side, Java is faster than the Google library. My very next PR will provide a simple perf comparison.
This is a subtask of the following migration towards modern and faster Java APIs.
- SPARK-53047 Modernize Spark to use the latest Java features
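For readers following the thread, here is a minimal sketch of the read and write calls under discussion. It is an illustration only, not code from this PR: the class name, method name, and temp-file name are hypothetical. It relies on the documented behavior that `Files.readString(Path)` and `Files.writeString(Path, CharSequence)` (Java 11+) default to UTF-8.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name, for illustration only.
final class ReadStringSketch {
    // Writes text to a file and reads it back using only java.nio.file.Files.
    // Both calls use UTF-8 when no charset is passed.
    static String roundTrip(Path file, String text) throws IOException {
        Files.writeString(file, text);
        // The Guava call this PR migrates away from would look like this
        // (shown as a comment only, Guava is not needed to run this sketch):
        //   com.google.common.io.Files.asCharSource(file.toFile(), StandardCharsets.UTF_8).read()
        return Files.readString(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temp file standing in for a small config file.
        Path tmp = Files.createTempFile("conf-sketch", ".properties");
        try {
            String content = "spark.app.name=demo\n";
            System.out.println(roundTrip(tmp, content).equals(content));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The sketch makes no performance claim; it only shows the shape of the built-in calls.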
FYI, here are the perf improvement examples.
- [SPARK-53075][CORE][TESTS] Use Java `Files.readAllLines/write` instead of `FileUtils.(read|write)Lines` #51787 (3rd Party library to Java API example)
- [SPARK-53043][CORE][SQL][K8S] Use Java `transferTo` instead of `IOUtils.copy` #51751 (3rd Party library to Java API example)
- [SPARK-53035][CORE][SQL][K8S][MLLIB] Use `String.repeat` instead of Scala string multiplication #51740 (Scala API to Java API example)
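The three Java APIs named in the list above can be sketched as follows. This is a hedged illustration with hypothetical class and method names, not code taken from the referenced PRs, and it makes no claim about their measured results.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical class name, for illustration only.
final class JavaApiMigrationSketch {
    // Files.write / Files.readAllLines in place of a third-party line writer/reader.
    static List<String> writeAndReadLines(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // InputStream.transferTo (Java 9+) in place of a third-party stream copy helper.
    static byte[] copy(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        in.transferTo(out);
        return out.toByteArray();
    }

    // String.repeat (Java 11+) in place of Scala's string multiplication ("=" * n).
    static String separator(int n) {
        return "=".repeat(n);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines-sketch", ".txt");
        try {
            System.out.println(writeAndReadLines(tmp, List.of("a", "b")));
            System.out.println(copy(new ByteArrayInputStream(new byte[] {1, 2, 3})).length);
            System.out.println(separator(5));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```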
Thank you @dongjoon-hyun
Could you review this PR when you have some time, @viirya?

Thank you so much, @viirya!

    val entry = is.getNextEntry
    assert(entry != null)
    val actual = new String(ByteStreams.toByteArray(is), StandardCharsets.UTF_8)
Technically, these test suites are testing the compressed event logs (LZ4, Snappy, ZSTD), whose bytes are not valid UTF_8. This is a kind of bug fix.
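To make the point above concrete, here is a small sketch of why binary data should not be compared as UTF-8 text. It is not the fix applied in this PR; the class name, method names, and the sample bytes (standing in for compressed event-log content) are assumptions for illustration. It relies on two documented behaviors: `new String(bytes, UTF_8)` replaces malformed sequences with U+FFFD, and `Files.readString` throws an `IOException` on malformed input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
final class BinaryVsUtf8Sketch {
    // Bytes that are not valid UTF-8, standing in for compressed data.
    static final byte[] BINARY = {(byte) 0xFF, (byte) 0xFE, 0x00, (byte) 0x80};

    // Decoding then re-encoding substitutes U+FFFD for malformed bytes,
    // so the original bytes are not recovered.
    static boolean survivesUtf8RoundTrip(byte[] data) {
        byte[] back = new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    // Files.readString refuses malformed input instead of silently substituting.
    static boolean readStringFails(Path file) {
        try {
            Files.readString(file);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    // Comparing raw bytes avoids any charset assumption.
    static boolean bytesMatch(Path file, byte[] expected) throws IOException {
        return Arrays.equals(Files.readAllBytes(file), expected);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("binary-sketch", ".bin");
        try {
            Files.write(tmp, BINARY);
            System.out.println(survivesUtf8RoundTrip(BINARY)); // false
            System.out.println(readStringFails(tmp));          // true
            System.out.println(bytesMatch(tmp, BINARY));       // true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```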
    val fileName = entry.getName.stripPrefix(logPath.getName + "/")
    assert(allFileNames.contains(fileName))

    val actual = new String(ByteStreams.toByteArray(is), StandardCharsets.UTF_8)
ditto.
The first commit passed all relevant tests except the Spark History Event logs due to the existing test code bug. It's fixed in the second commit and I verified like the following.

Merged to master for Apache Spark 4.1.0.
What changes were proposed in this pull request?
This PR aims to use Java `Files.readString` instead of `com.google.common.io.Files.asCharSource`.
Why are the changes needed?
To use a simpler built-in Java API.
- `core` and `yarn` modules use this in testing.
- `kubernetes` module uses this to read small config files or YAML template files in addition to testing.
Does this PR introduce any user-facing change?
No behavior changes.
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.