- 
                Notifications
    
You must be signed in to change notification settings  - Fork 5.2k
 
Enable FEATURE_MULTICASTSTUB_AS_IL for Windows x86 #104192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           @EgorBot -intel -amd -arm64 using BenchmarkDotNet.Attributes;
using System;
namespace BenchmarkGround
{
    public struct GCStruct
    {
        public object a, b, c, d, e, f, h, g;
    }
    public class Bench
    {
        private readonly object obj = new object();
        private readonly Guid guid = Guid.NewGuid();
        private readonly GCStruct gcStruct = new GCStruct
        {
            a = new object(),
            g = new object()
        };
        private readonly Action<int> singleArg = (Action<int>)delegate { } + delegate { };
        private readonly Action<int> manyCast = (Action<int>)delegate { }
        +
            delegate { }
        + delegate { }
        + delegate { }
        + delegate { }
        + delegate { };
        private readonly Func<int, double, Guid, GCStruct, object, double> manyArg_RetFPU = (Func<int, double, Guid, GCStruct, object, double>)
            delegate { return 123.456; }
        + delegate { return 654.321; };
        [Benchmark]
        public void SingleArg() => singleArg(42);
        [Benchmark]
        public void ManyCast() => manyCast(42);
        [Benchmark]
        public double ManyArg_RetFPU() => manyArg_RetFPU(42, 123.0, guid, gcStruct, obj);
    }
} | 
    
| 
           Tagging subscribers to this area: @mangod9  | 
    
          Benchmark results on Intel
  | 
    
          Benchmark results on Arm64
  | 
    
          Benchmark results on Amd
  | 
    
        
          
                src/coreclr/vm/comdelegate.cpp
              
                Outdated
          
        
      | pCode->EmitBRTRUE(invokeTraceHelper); | ||
| pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | ||
| 
               | 
          ||
| pCode->EmitLabel(invokeTraceHelper); | ||
| 
               | 
          ||
| pCode->EmitLoadThis(); | ||
| pCode->EmitLDLOC(dwLoopCounterNum); | ||
| pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pCode->EmitBRTRUE(invokeTraceHelper); | |
| pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | |
| pCode->EmitLabel(invokeTraceHelper); | |
| pCode->EmitLoadThis(); | |
| pCode->EmitLDLOC(dwLoopCounterNum); | |
| pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); | |
| pCode->EmitBRTRUE(invokeTraceHelper); | 
And move the debugging path to be after RET:
        pCode->EmitRET();
#ifdef DEBUGGING_SUPPORTED
        // Emit debugging support at the end of the method for better perf
        pCode->EmitLabel(invokeTraceHelper);
        pCode->EmitLoadThis();
        pCode->EmitLDLOC(dwLoopCounterNum);
        pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);
        pCode->EmitBR(debuggerCheckEnd);
#endif
This should be even better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was exactly what I did locally, and resulted in exact same codegen with current, at least for x86. The compiled asm code block order did follow IL block order exactly.
@dotnet/jit-contrib Do you have any suggestion on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).
It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well normalizing the loop closer to a for loop results in about 5% improvement for ManyArg case, but 5% regression for ManyCast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codegen for x64 now:
; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx
G_M000_IG02:                ;; offset=0x0009
       xor      esi, esi
       cmp      qword ptr [rbx+0x30], 0
       jle      SHORT G_M000_IG05
G_M000_IG03:                ;; offset=0x0012
       test     dword ptr [(reloc 0x7ff824c55210)], 512
       jne      SHORT G_M000_IG06
G_M000_IG04:                ;; offset=0x001E
       mov      rcx, gword ptr [rbx+0x28]
       cmp      esi, dword ptr [rcx+0x08]
       jae      SHORT G_M000_IG07
       mov      rax, gword ptr [rcx+8*rsi+0x10]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Action:Invoke():this
       inc      esi
       movsxd   rcx, esi
       cmp      rcx, qword ptr [rbx+0x30]
       jl       SHORT G_M000_IG03
G_M000_IG05:                ;; offset=0x003E
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret
G_M000_IG06:                ;; offset=0x0045
       mov      rcx, rbx
       mov      edx, esi
       call     System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
       jmp      SHORT G_M000_IG04
G_M000_IG07:                ;; offset=0x0051
       call     CORINFO_HELP_RNGCHKFAIL
       int3
; Total bytes of code 87There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise. Nice simplification!
        
          
                src/coreclr/vm/comdelegate.cpp
              
                Outdated
          
        
      | pCode->EmitBRTRUE(invokeTraceHelper); | ||
| pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path | ||
| 
               | 
          ||
| pCode->EmitLabel(invokeTraceHelper); | ||
| 
               | 
          ||
| pCode->EmitLoadThis(); | ||
| pCode->EmitLDLOC(dwLoopCounterNum); | ||
| pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).
It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.
| pCode->EmitLDC(0); | ||
| pCode->EmitSTLOC(dwLoopCounterNum); | ||
| 
               | 
          ||
| // Make the shape of the loop similar to what C# compiler emits | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new shape seems to be missing the invocation of the trace helper at the end. E.g. if the invocation count is 2, the trace helper should be called 3 times. It is only called 2 times if I am reading the code correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Result codegen:
; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data
G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx
G_M000_IG02:                ;; offset=0x0009
       xor      esi, esi
       jmp      SHORT G_M000_IG04
G_M000_IG03:                ;; offset=0x000D
       mov      rcx, gword ptr [rbx+0x28]
       cmp      esi, dword ptr [rcx+0x08]
       jae      SHORT G_M000_IG08
       mov      rax, gword ptr [rcx+8*rsi+0x10]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Action:Invoke():this
       inc      esi
G_M000_IG04:                ;; offset=0x0024
       test     dword ptr [(reloc 0x7ff84ea05210)], 512
       jne      SHORT G_M000_IG06
G_M000_IG05:                ;; offset=0x0030
       movsxd   rcx, esi
       cmp      rcx, qword ptr [rbx+0x30]
       jl       SHORT G_M000_IG03
       jmp      SHORT G_M000_IG07
G_M000_IG06:                ;; offset=0x003B
       mov      rcx, rbx
       mov      edx, esi
       call     System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
       jmp      SHORT G_M000_IG05
G_M000_IG07:                ;; offset=0x0047
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret
G_M000_IG08:                ;; offset=0x004E
       call     CORINFO_HELP_RNGCHKFAIL
       int3
; Total bytes of code 84JIT reorders the debugging block in front of the ret. Benchmark shows no regression. It's just a bit strange with the unconditional jumps.
Closes #103958 .
Local benchmark result for
Action<int>on x86: