Skip to content

Conversation

@oyvindln
Copy link
Contributor

@oyvindln oyvindln commented Aug 1, 2017

This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec>(),
as LLVM is able to change the loop condition to use equality instead of less than and should help with #43124. (See also my last comment there.) This PR makes collect on ranges of u16, i16, i8, and u8 significantly faster (at least on x86-64 and i686), and pretty close, though not quite equivalent to a manual unsafe implementation. 32 ( and 64-bit values on x86-64) bit values were already vectorized without this change, and they still are. This PR doesn't seem to help with 64-bit values on i686, as they still don't vectorize well compared to doing a manual loop.

I'm a bit unsure if this was the best way of implementing this, I tried to do it with as little changes as possible and avoided changing the step trait and the behavior in RangeFrom (I'll leave that for others like #43127 to discuss wider changes to the trait). I tried simply changing the comparison to self.start != self.end though that made the compiler segfault when compiling stage0, so I went with this method instead for now.

As for next_back(), reverse ranges seem to optimise properly already.

This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec<u16>>(),
 as LLVM is able to change the loop condition to use equality instead of less than
@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @BurntSushi (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@oyvindln oyvindln changed the title Add an overflow check in the Iter::next() impl for Range<_> Add an overflow check in the Iter::next() impl for Range<_> to help with vectorization. Aug 1, 2017
@oyvindln
Copy link
Contributor Author

oyvindln commented Aug 1, 2017

A slight downside to this change is that it makes the function slightly slower in non-optimized builds due to the extra check.

@scottmcm
Copy link
Member

scottmcm commented Aug 3, 2017

If the problem here is that external iteration is slow, as it needs to recheck things, should vec's SpecExtend switch to using the new Iterator::for_each? Then the ranges could override that to have one reachability check, then the != loop...

@aidanhs aidanhs added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 3, 2017
@oyvindln
Copy link
Contributor Author

oyvindln commented Aug 3, 2017

@scottmcm I wasn't aware of Iterator::for:each but that sounds like a good idea in any case, maybe it will make it possible to add some check or something outside the loop that would help llvm to allow vectorization, or maybe doing a != loop would work in this case. I'm not entirely certain that's what causes trouble for llvm, but that was the main difference between u32 and u16 in the llvm output with vectorization disabled at least. (The segmentation fault that I got when trying that in .next is something else that may be worth investigating...).

I suppose using for_each may also be helpful for performance in non-optimized builds, depending on how closures work before optimizations. The change suggested in your pr (#43351) may also help a bit on this front, given how swap does a bunch of stuff that's not needed for the small types used in ranges, but I haven't done any tests on that yet. If we want somewhat fast iteration over ranges in non-optimized builds I think having some specializations on ranges primitive integers that avoids all the indirection would make sense. Anyhow, all this may be better suited for a later PR.

As for this PR, I still think it makes sense to have this change in .next for now though as it may help with vectorization in other cases as well, and I think a 20(+)x speedup in release mode is worth a 10-20% slowdown in non-optimized builds (that has a fair bit of potential for improvement later anyhow.) I might have a look at the other suggested improvements to vector extends in particular later, maybe we can also find out what is the cause of the remaining difference between this and the unsafe implementation. It's going to be a few days until I get home and can put in some larger amount of work on this anyhow.

@oyvindln
Copy link
Contributor Author

oyvindln commented Aug 3, 2017

I did a small test by creating a new Iterator type with the relevant parts of the Range struct, and using != as the condition does indeed do the same job as an overflow check but avoids having an extra check in non-optimized mode. This was provided the struct implemented TrustedLen like Range so that the specialization for those types is used. (Interestingly, when using <, not implementing TrustedLen made the collect function twice as fast.) So that could be used for Range::next() for a for_each or other specialization where one can be sure start hasn't been set higher than end (which is possible as the fields are public).

@arielb1
Copy link
Contributor

arielb1 commented Aug 8, 2017

r? @aturon

@arielb1 arielb1 assigned aturon and unassigned BurntSushi Aug 8, 2017
@aturon
Copy link
Contributor

aturon commented Aug 8, 2017

@bors: r+

Thanks for this PR!

@bors
Copy link
Collaborator

bors commented Aug 8, 2017

📌 Commit 4bb9a8b has been approved by aturon

@bors
Copy link
Collaborator

bors commented Aug 9, 2017

⌛ Testing commit 4bb9a8b with merge 0f9317d...

bors added a commit that referenced this pull request Aug 9, 2017
Add an overflow check in the Iter::next() impl for Range<_> to help with vectorization.

This helps with vectorization in some cases, such as (0..u16::MAX).collect::<Vec<u16>>(),
 as LLVM is able to change the loop condition to use equality instead of less than and should help with #43124. (See also my [last comment](#43124 (comment)) there.) This PR makes collect on ranges of u16, i16, i8, and u8 **significantly** faster (at least on x86-64 and i686), and pretty close, though not quite equivalent to a [manual unsafe implementation](https://is.gd/nkoecB). 32 ( and 64-bit values on x86-64) bit values were already vectorized without this change, and they still are. This PR doesn't seem to help with 64-bit values on i686, as they still don't vectorize well compared to doing a manual loop.

I'm a bit unsure if this was the best way of implementing this, I tried to do it with as little changes as possible and avoided changing the step trait and the behavior in RangeFrom (I'll leave that for others like #43127 to discuss wider changes to the trait). I tried simply changing the comparison to `self.start != self.end` though that made the compiler segfault when compiling stage0, so I went with this method instead for now.

As for `next_back()`, reverse ranges seem to optimise properly already.
@bors
Copy link
Collaborator

bors commented Aug 9, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: aturon
Pushing 0f9317d to master...

@bors bors merged commit 4bb9a8b into rust-lang:master Aug 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants