Investigate perf difference between multiple `SetAllVector128` and a `Load + Permute` sequence

As per the comment here: https://github.com/dotnet/corefx/pull/31779#discussion_r210330631

I would expect an explicit `Load` followed by four `Permute` operations (covering four sequential memory addresses) to be at least as fast as four `SetAllVector128` calls, which should be equivalent to four loads and four permutes.

Compare the codegen of the two sequences to see whether a bug is blocking this optimization.
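
For reference, the two sequences being compared can be sketched with the equivalent C SSE intrinsics. This is only an illustration of the shape of the two variants, not the code from the linked PR: the function names (`broadcast_each`, `load_then_shuffle`, `variants_agree`) are hypothetical, `_mm_set1_ps` is assumed to stand in for `SetAllVector128`, and `_mm_shuffle_ps` with both operands equal is assumed to stand in for `Permute`.

```c
#include <string.h>
#include <xmmintrin.h>

/* Variant A: four independent broadcasts, one per source element.
   Assumed analogue of four SetAllVector128 calls. */
static void broadcast_each(const float *src, __m128 out[4]) {
    out[0] = _mm_set1_ps(src[0]);
    out[1] = _mm_set1_ps(src[1]);
    out[2] = _mm_set1_ps(src[2]);
    out[3] = _mm_set1_ps(src[3]);
}

/* Variant B: one unaligned load of four sequential floats, then four
   in-register shuffles that replicate each lane.
   Assumed analogue of Load + four Permute. */
static void load_then_shuffle(const float *src, __m128 out[4]) {
    __m128 v = _mm_loadu_ps(src);
    out[0] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    out[1] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    out[2] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    out[3] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
}

/* Returns 1 when both variants produce bit-identical vectors. */
static int variants_agree(const float *src) {
    __m128 a[4], b[4];
    broadcast_each(src, a);
    load_then_shuffle(src, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both variants should produce the same four vectors, so any measured difference would come from the instructions the compiler or JIT selects, which is what the codegen comparison needs to establish.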

Investigate perf difference between multiple `SetAllVector128` and a `Load + Permute` sequence #27166