Explicit vectorized loads/stores

In CUDA C you can explicitly request vectorized loads/stores by using the built-in vector types (e.g. `float2`, `float4`). I have sometimes found these useful for squeezing out the last bit of performance. This definitely isn't high priority, but I was wondering how hard it would be to add something similar to `CUDAnative`.
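For reference, here is a minimal CUDA C sketch of the pattern I mean. The names `copy_float4` and `vectorized_copy_ok` are made up for illustration, and it assumes the buffers come from `cudaMalloc` (so they are sufficiently aligned for `float4` accesses) and that `n > 0`. The compiler would typically turn the `float4` access into a single 128-bit load/store, though I haven't checked the generated PTX for this exact snippet.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread copies four floats through one float4 access instead of
// four separate scalar accesses. Requires 16-byte aligned pointers.
__global__ void copy_float4(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;  // number of complete float4 chunks
    if (i < n4) {
        reinterpret_cast<float4*>(out)[i] =
            reinterpret_cast<const float4*>(in)[i];
    }
    // The first (n % 4) threads copy the leftover scalar elements.
    int tail = n4 * 4 + i;
    if (tail < n) {
        out[tail] = in[tail];
    }
}

// Hypothetical host-side helper: runs the kernel on n floats (n > 0)
// and reports whether the output matches the input.
bool vectorized_copy_ok(int n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (n / 4 + threads - 1) / threads;
    if (blocks < 1) blocks = 1;  // still need threads for the scalar tail
    copy_float4<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}
```

Calling `vectorized_copy_ok(1 << 20)` from `main` exercises the pure `float4` path, and a size such as `(1 << 20) + 3` also exercises the scalar tail. The equivalent in `CUDAnative` would need some way to express the same wide access from Julia.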

JuliaGPU/CUDAnative.jl#174 is related, but maybe some of the problems discussed there have since been solved?