Investigate perf difference between multiple SetAllVector128 and a Load + Permute sequence #27166

@tannergooding

Description

As per the comment on dotnet/corefx#31779:

I would expect an explicit Load followed by four Permute operations (for four sequential memory addresses) to be at least as fast as four SetAllVector128 calls, which together should be equivalent to four loads and four permutes.

Compare the codegen for the two sequences to see whether some bug is blocking this optimization.
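
For reference, the two sequences can be sketched with the C intrinsics that I believe back the .NET APIs. This is only an illustration under assumptions: `_mm_set1_ps` stands in for `SetAllVector128`, and `_mm_shuffle_ps` stands in for `Permute` so the sketch needs SSE only (the AVX `_mm_permute_ps` form would take a single source operand). The function names are hypothetical and are not taken from the benchmark in the linked comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h> /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_shuffle_ps */

/* Variant A: one broadcast per element.
   Conceptually four scalar loads plus four shuffles. */
void broadcast_set1(const float *src, __m128 out[4])
{
    out[0] = _mm_set1_ps(src[0]);
    out[1] = _mm_set1_ps(src[1]);
    out[2] = _mm_set1_ps(src[2]);
    out[3] = _mm_set1_ps(src[3]);
}

/* Variant B: one unaligned 128-bit load, then four in-register shuffles.
   The shuffle control must be a compile-time constant, hence four calls. */
void broadcast_load_permute(const float *src, __m128 out[4])
{
    __m128 v = _mm_loadu_ps(src);
    out[0] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    out[1] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    out[2] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    out[3] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
}

/* Returns 1 when both variants produce bit-identical vectors. */
int variants_agree(const float *src)
{
    __m128 a[4], b[4];
    assert(src != NULL);
    broadcast_set1(src, a);
    broadcast_load_permute(src, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both functions should produce the same four vectors, so any measured difference would come from the generated code rather than from the results.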
