Description
In https://discourse.julialang.org/t/rfc-ann-restacker-jl-a-workaround-for-the-heap-allocated-immutable-problem/35037 @tkf reported that manually copying a view at the start of a function causes a sizable performance improvement. This seems distinct from the stack allocation of immutables that contain heap references (x-ref: #34126). From my brief investigation, the performance difference appears to be due to a vectorization failure. The example I looked at is:
@noinline function f!(ys, xs)
    @inbounds for i in eachindex(ys, xs)
        x = xs[i]
        if -0.5 < x < 0.5
            ys[i] = 2x
        end
    end
end
"Restacking" ys causes this function to be vectorizable.
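For reference, a minimal sketch of the restacking workaround (the restack helper below is hypothetical, modeled on what Restacker.jl does; it assumes the SubArray(parent, indices) constructor from Base):

```julia
using InteractiveUtils  # for @code_llvm

@noinline function f!(ys, xs)
    @inbounds for i in eachindex(ys, xs)
        x = xs[i]
        if -0.5 < x < 0.5
            ys[i] = 2x
        end
    end
end

# Hypothetical helper: rebuild the view so that the SubArray wrapper is a
# freshly constructed local immutable instead of one loaded from the heap.
restack(x) = x
restack(x::SubArray) = SubArray(parent(x), parentindices(x))

xs = randn(1000)
ys = view(zeros(1001), 2:1001)

f!(ys, xs)           # base pointer of ys is reloaded on every iteration
f!(restack(ys), xs)  # vectorizes after restacking

# Compare the optimized IR of the two calls:
# @code_llvm f!(ys, xs)
# @code_llvm f!(restack(ys), xs)
```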
Looking at the optimized LLVM IR I noticed
   %40 = add i64 %39, %28, !dbg !67
   %41 = getelementptr inbounds double, double addrspace(13)* %31, i64 %40, !dbg !67
   %42 = load double, double addrspace(13)* %41, align 8, !dbg !67, !tbaa !70
...
   %47 = add i64 %39, %36, !dbg !81
   %48 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %38, align 8, !dbg !81, !tbaa !59, !nonnull !4
   %49 = getelementptr inbounds double, double addrspace(13)* %48, i64 %47, !dbg !81
   store double %46, double addrspace(13)* %49, align 8, !dbg !81, !tbaa !70
Here %42 = load is the getindex from xs, and the store double is the setindex! into ys. Notice that in %48 we reload the base pointer of the ys array on every iteration.
So how do we derive %38 and %48, and why is the base pointer of xs, which we load from, not reloaded in the same way?
The series of operations used to obtain the base pointer for the load (the getindex from xs) is:
  %24 = bitcast %jl_value_t addrspace(11)* %11 to %jl_value_t addrspace(10)* addrspace(11)*
  %25 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %24, align 8, !tbaa !41, !nonnull !4, !dereferenceable !57, !align !58
  %26 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 16
  %27 = bitcast i8 addrspace(11)* %26 to i64 addrspace(11)*
  %28 = load i64, i64 addrspace(11)* %27, align 8, !tbaa !41
  %29 = addrspacecast %jl_value_t addrspace(10)* %25 to %jl_value_t addrspace(11)*
  %30 = bitcast %jl_value_t addrspace(11)* %29 to double addrspace(13)* addrspace(11)*
  %31 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %30, align 8, !tbaa !59, !nonnull !4
And these are the operations used to obtain the pointer from which the base pointer for the store is reloaded:
  %32 = bitcast %jl_value_t addrspace(11)* %8 to %jl_value_t addrspace(10)* addrspace(11)*
  %33 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %32, align 8
  %34 = getelementptr inbounds i8, i8 addrspace(11)* %9, i64 16
  %35 = bitcast i8 addrspace(11)* %34 to i64 addrspace(11)*
  %36 = load i64, i64 addrspace(11)* %35, align 8
  %37 = addrspacecast %jl_value_t addrspace(10)* %33 to %jl_value_t addrspace(11)*
  %38 = bitcast %jl_value_t addrspace(11)* %37 to double addrspace(13)* addrspace(11)*
Notice how in the latter case all TBAA information is missing. The two relevant TBAA trees are shown below.
!41 = !{!42, !42, i64 0}
!42 = !{!"jtbaa_immut", !43, i64 0}
!43 = !{!"jtbaa_value", !44, i64 0}
!44 = !{!"jtbaa_data", !45, i64 0}
!45 = !{!"jtbaa", !46, i64 0}
!46 = !{!"jtbaa"}
!59 = !{!60, !60, i64 0}
!60 = !{!"jtbaa_arrayptr", !61, i64 0}
!61 = !{!"jtbaa_array", !45, i64 0}
I don't understand our TBAA annotations well enough to conclude that this is a bug, but shouldn't the view be marked immutable either way?
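For what it's worth, the SubArray wrapper itself is an immutable struct, so loads of its fields (such as the parent-array reference) ought to be loop-invariant. A small sketch to illustrate (using ismutable, available on Julia >= 1.5):

```julia
v = view(zeros(4), 2:3)

# The SubArray wrapper is an immutable struct holding a reference to the
# parent array plus index information; mutating elements goes through the
# parent, not through the wrapper itself.
@show typeof(v)             # a SubArray{Float64, 1, ...}
@show ismutable(v)          # false: the wrapper is immutable
@show ismutable(parent(v))  # true: the underlying Array is mutable
```

That immutability is what the jtbaa_immut annotation on the xs side expresses, and it is exactly what appears to be missing on the ys side.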