Added a FastByteArrayOutputStream that exposes the underlying array to avoid unnecessary mem copy. #397
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14073/
This still entails a copy, right? I don't see that this improves the situation by itself.

I actually meant something like this: please take a look at needCompact and toByteBuffer (replace direct creation of a ByteBuffer from a byte array with this method).
So I think I agree with part of @mridulm's direction here, but want to make a few comments to clarify why. Apologies if I'm stating the obvious.

The management of the byte array here is not what's solving the immediate problem. Where this stuff is used in the code, the calling code wants to write N bytes in pieces, and then get a ByteBuffer. What's needed is an abstraction that can take a […]. Although this is not […]. For that reason, and because calling code in one case already wants a […]. However, I'd also say that this doesn't need reimplementing […].

I understand the idea of this code is to implement a compaction mechanism, but that's a separate issue really. (When does this help? I understand it can free up some heap, at the cost of a new second allocation and copy, and could be helpful if this object were sticking around a long time. But it's not, right?)
Your summarization is fairly accurate @srowen, except for two cases: […]

To add, my initial approach was to subclass ByteArrayOutputStream to minimize code :-) Before going into details, please note that the main purpose of this stream is to actually get a ByteBuffer out of the data which is subsequently used. That ByteBuffer need not span the entire byte[]; it can work within a region of it. And what is not mentioned above is that the actual use of this class is within another class which multiplexes over these baos instances, so that we are not limited by the 2GB limit: we have a Sequence of these streams, which will be read in order to get to the actual data (the wrapper OutputStream moves from the first to the next as required, and returns a Seq[ByteBuffer] when we are done writing to it).
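The "ByteBuffer over a region of the byte[]" idea can be sketched as wrapping a slice of the stream's internal array directly, with no copy. This is a hypothetical illustration (the class and method names are not Spark's actual code), assuming a subclass of the JDK's ByteArrayOutputStream, whose buf and count fields are protected:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch: expose a ByteBuffer over a region of the internal
// buffer without copying. Names are illustrative, not Spark's actual class.
public class RegionByteStream extends ByteArrayOutputStream {
    /** ByteBuffer over bytes [start, start + len) of the written data; shares the array. */
    public synchronized ByteBuffer toByteBuffer(int start, int len) {
        if (start < 0 || len < 0 || start + len > count) {
            throw new IndexOutOfBoundsException("region outside written data");
        }
        // ByteBuffer.wrap sets position = start and limit = start + len; no copy.
        return ByteBuffer.wrap(buf, start, len);
    }
}
```

Because the returned buffer shares the backing array, later writes to the stream can still be visible through it, which is part of the safety trade-off debated below.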
You could deprecate and override […]. What's the compaction for? If you've got a series of ~2GB containers, I'd assume you'd fill them each pretty completely and transparently split a big write across the existing and next buffer. It saves a huge allocation, which could fail. (In the grow() method, you would have to check that the new doubled size hasn't overflowed!) I agree with the use of […]. A simpler step in your direction could be the basis for the change that this PR is trying for. That's why I wonder if this piece could have a simpler, stand-alone purpose.
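The overflow check mentioned here matters because doubling an int capacity wraps to a negative value once it passes 2^30. A minimal sketch of a safe capacity computation (names and the clamping policy are illustrative, not Spark's actual grow() code):

```java
// Sketch of an overflow-safe capacity doubling step.
public class GrowCheck {
    // Some JVMs reject array sizes within a few words of Integer.MAX_VALUE.
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    public static int newCapacity(int oldCapacity, int minCapacity) {
        if (minCapacity > MAX_ARRAY_SIZE) {
            throw new OutOfMemoryError("requested capacity exceeds max array size");
        }
        int doubled = oldCapacity << 1;      // may wrap around to a negative value
        if (doubled < 0 || doubled > MAX_ARRAY_SIZE) {
            doubled = MAX_ARRAY_SIZE;        // clamp instead of going negative
        }
        return Math.max(doubled, minCapacity);
    }
}
```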
There are two issues here: a) […]

b) More importantly, we do not get to control how array growth happens, what the thresholds are, etc. For example, relying on Integer.MAX_VALUE always is not the best option; we would need to have that configurable, to control what the maximum block size in Spark can be per ByteBuffer. This has perf implications in terms of mmap'ing files, reading/writing to other ByteBuffers/sockets, etc.
Sure, I myself was not suggesting that we should make them throw exceptions. If one really wanted to prohibit their use, that would be a way to do so even when subclassing, but I don't suggest they must be prohibited. To me, the methods aren't broken or anything, and the resulting class can be reused for the purpose @rxin has in mind in this PR, if these methods are available.

Agree that the max size of such a buffer could be configurable. However, the limit can't be more than Integer.MAX_VALUE. (There's another tangent here about whether the buffer class should even enforce a limit -- what does it do, fail with […]?)

The gist of my strawman suggestion is: what if we had a simple subclass of ByteArrayOutputStream […]?
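The strawman can be sketched as a small subclass that exposes the protected buf/count fields of ByteArrayOutputStream as a ByteBuffer without any copy (the class name here is hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Minimal sketch of the strawman: subclass ByteArrayOutputStream and
// expose its protected buf/count as a ByteBuffer view, with no copy.
public class ExposingByteArrayOutputStream extends ByteArrayOutputStream {
    public ExposingByteArrayOutputStream() { super(); }
    public ExposingByteArrayOutputStream(int size) { super(size); }

    /** View over the bytes written so far; shares the backing array. */
    public synchronized ByteBuffer toByteBuffer() {
        return ByteBuffer.wrap(buf, 0, count);
    }
}
```

Note the resulting buffer's capacity() is the full backing array, not just the written bytes, which is exactly the limit-vs-capacity hazard raised later in this thread.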
We can avoid using trim() (and its array copy) here by using RateLimitedOutputStream#write(bytes, offset, length).
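For the stock JDK class, this copy-free handoff pattern already exists as writeTo(), which passes the internal buffer directly as write(buf, 0, count). A generic sketch (not the RateLimitedOutputStream call site itself):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Generic sketch: hand the buffered bytes to another stream without the
// intermediate array that toByteArray()/trim() would allocate.
public class CopyFreeFlush {
    public static void flushTo(ByteArrayOutputStream bos, OutputStream out) {
        try {
            // writeTo() invokes out.write(buf, 0, count) on the internal buffer.
            bos.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```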
Having a toByteBuffer method definitely seems reasonable to me; the only issue is that ByteBuffer does not provide a good stream-compatible API. So it would either still have to use write(bytes, offset, length), by getting these parameters from the ByteBuffer (and being careful that .array() returns the entire underlying array), or by using some API external to both ByteBuffer and our fast output stream. What if FastBufferedOutputStream has two methods: […] where the latter returns the offset and index to be passed in to any stream APIs. In order to provide a lower level API, we often have to sacrifice some code niceties, but this at least provides pretty good safety for the user to not misuse it.
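A hypothetical sketch of this two-method lower-level API: one method exposes the raw backing array, the other the count of valid bytes to pass as (offset = 0, length = size()) into stream-style write() calls. All names and the growth policy here are illustrative, not the PR's actual code:

```java
import java.util.Arrays;

// Hypothetical two-method API: raw array plus valid-byte count.
public class FastByteArrayOutputStreamSketch {
    private byte[] buf = new byte[32];
    private int count = 0;

    public void write(byte[] b, int off, int len) {
        if (count + len > buf.length) {
            // Grow by doubling, or to exactly what is needed if larger.
            buf = Arrays.copyOf(buf, Math.max(buf.length * 2, count + len));
        }
        System.arraycopy(b, off, buf, count, len);
        count += len;
    }

    /** Raw backing array; only bytes [0, size()) are valid data. */
    public byte[] backingArray() { return buf; }

    /** Number of valid bytes, for use as the length argument to stream APIs. */
    public int size() { return count; }
}
```

A caller would then do out.write(s.backingArray(), 0, s.size()) with no intermediate copy.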
@aarondav I personally like your second method. That alone is probably just what is needed. Callers who actually want a […]
OK, pushed a new version that avoids the extra trim.
Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14082/
@rxin hm, it looks like this RAT exclude isn't working. I can take another crack at it later tonight. https://github.com/apache/spark/blob/master/.rat-excludes#L43
Jenkins, retest this please.

Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.

All automated tests passed.
Yeah, this looks a lot like what I personally have in mind. I think you could simply subclass ByteArrayOutputStream […]
Eh, the other reason fastutil implements FastByteArrayOutputStream without subclassing ByteArrayOutputStream was to get rid of the synchronized writes. To do this properly, we should probably benchmark every corner case to see how well the jvm can do away with the lock that is always running in a single threaded mode. However, we are not adding much code anyway, so I think this is good as is.

Yes, even if the lock is not removed (and it should be) its overhead is trivial compared to other operations here. Up to you.

Apparently the JVM implements "biased locking" (http://www.oracle.com/technetwork/java/6-performance-137236.html#2.1.1) when an object's monitor is uncontended, which eliminates all synchronization overhead (at the cost of an extremely expensive "bias revocation" operation if any contention appears). So we would expect no relevant performance loss from this synchronization in this case, except perhaps for very small streams which are completed before being biased. However, the API of […]
We probably don't need this method, as it eliminates the whole purpose of this stream. You might as well use ByteArrayOutputStream if you need trim().
Yes, I knew about biased locking. That's why I said "how well the jvm can do away with the lock that is always running in a single threaded mode". However, it is hard to test how well it works with all the random cases we have for this (large streams, small streams; even with biased locking it is not free) and quantify the impact. And given how small and simple this code is, I think it is fine as is. I removed the trim method.
Build triggered.

Build started.

Build finished. All automated tests passed.

All automated tests passed.
Looking at our usages of ByteBuffer#array in Spark, most of them do not handle ByteBuffers where capacity != limit correctly. This change could turn these oversights into actual bugs, if the ByteBuffers returned here are used later as inputs. See TachyonStore for an example, where all the logging and accounting looks at limit(), but the actual bytes used will be capacity().

This change also has the behavior of actually storing up to 2x the amount of bytes as actual data, since before we could throw away the backing array and just keep the data. After all, the original solution using fastutil just called trim(), which means it wasn't really faster than ByteArrayOutputStream, but it did preserve the safety and memory characteristics that this implementation does not have.

This change makes the most sense for RawTextSender, where we immediately turn around and write the bytes to a stream, so we don't need to store them and hope no one misuses them later. It makes less sense when we actually want ByteBuffers later -- I think a copy is really the only reasonable solution there.

The ideal solution in terms of API might actually just be @srowen's suggestion: […] The only downside is that this still has the synchronized behavior, which may or may not be impactful, but it saves 100 lines of code and would allow RawTextSender to go without a copy. As far as I can tell, neither Task nor Serializer really benefited from the fastutil FastByteArrayOutputStream, as both copied the data anyway.
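The capacity-vs-limit pitfall described here can be shown in a few lines: wrapping a partially filled array yields limit() equal to the valid bytes, while capacity() and array() still cover the whole backing buffer, so any accounting based on array().length or capacity() overcounts. A self-contained demonstration (the helper name is hypothetical):

```java
import java.nio.ByteBuffer;

// Demonstrates capacity != limit: only limit() reflects the valid data.
public class LimitVsCapacity {
    public static ByteBuffer partiallyFilled() {
        byte[] backing = new byte[8];   // only the first 3 bytes are meaningful
        backing[0] = 1; backing[1] = 2; backing[2] = 3;
        // position = 0, limit = 3, but capacity = 8 and array() has 8 bytes.
        return ByteBuffer.wrap(backing, 0, 3);
    }
}
```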
Alright you guys convinced me. Let's close this one for now. When we clean up the downstream consumers of the byte buffers, we can revisit this (and maybe do crazier things like reusing byte arrays instead of constantly re-allocating new ones). |
This should fix the extra memory copy introduced by #266.
@mridulm @pwendell @mateiz
Note that because I created the util.io package, I also moved ByteBufferInputStream into the package.