As per the comment here: dotnet/corefx#31779 (comment)
I would expect one explicit Load followed by four Permute operations (one per element, for four sequential memory addresses) to be at least as fast as four SetAllVector128 calls, which should be equivalent to four loads and four permutes.
We should compare the codegen of the two patterns to see whether some bug is blocking this optimization.
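
For reference, here is a rough sketch of the two patterns using C intrinsics that approximately correspond to the .NET ones. It is an illustration only, not the code from the linked comment. It assumes `float` elements, uses `_mm_shuffle_ps` as an SSE-only stand-in for Permute and `_mm_set1_ps` as a stand-in for SetAllVector128, and all function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_shuffle_ps, _mm_set1_ps */

/* Pattern A (assumed shape): one explicit load, then four in-register
   broadcasts, one per lane of the loaded vector. */
static inline void broadcast_load_shuffle(const float *p, __m128 out[4])
{
    __m128 v = _mm_loadu_ps(p);  /* v = { p[0], p[1], p[2], p[3] } */
    out[0] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    out[1] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    out[2] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    out[3] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
}

/* Pattern B (assumed shape): four independent scalar broadcasts, each of
   which reads its own element from memory. */
static inline void broadcast_set1(const float *p, __m128 out[4])
{
    out[0] = _mm_set1_ps(p[0]);
    out[1] = _mm_set1_ps(p[1]);
    out[2] = _mm_set1_ps(p[2]);
    out[3] = _mm_set1_ps(p[3]);
}

/* Returns 1 if both patterns produce identical vectors for p[0..3].
   Uses an ordered float compare, so inputs are assumed not to be NaN. */
static inline int patterns_agree(const float *p)
{
    __m128 a[4], b[4];
    broadcast_load_shuffle(p, a);
    broadcast_set1(p, b);
    for (int i = 0; i < 4; i++) {
        /* movemask yields 0xF only when all four lanes compare equal */
        if (_mm_movemask_ps(_mm_cmpeq_ps(a[i], b[i])) != 0xF)
            return 0;
    }
    return 1;
}
```

Both functions compute the same four vectors, so any performance difference comes down to how each pattern is lowered to machine code, which is what the disassembly comparison should show.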