The documentation for GpuKernel_sched states that it may return LS*GS > N and that calling code should be able to handle that (which is fine), but I found that it can also return values where LS*GS < N.
Is this intended? For a specific example, calling GpuKernel_sched() with N=273280 on a Titan X (target_g=768, target_l=512) returns LS=352 and GS=768, so LS*GS = 270336, which is less than N.
It looks like the function tries to make sure LS*GS >= N here:
    *ls = ((n / min_l) / *gs) * min_l;
but the code doesn't do that in this case.
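
To make the arithmetic concrete, here is a minimal sketch that reproduces the numbers above. It assumes min_l = 32 (the usual NVIDIA warp size; I have not confirmed what the library actually uses on this card), which happens to give exactly the LS=352 I'm seeing. The function names are hypothetical, and ls_round_up is only an illustration of rounding up instead of down, not a proposed patch: it ignores target_l and any other limits the real function applies.

```c
#include <assert.h>
#include <stddef.h>

/* Rounding-down form, mirroring the quoted line.
   Both integer divisions truncate, so ls * gs can end up below n. */
size_t ls_round_down(size_t n, size_t min_l, size_t gs) {
    return ((n / min_l) / gs) * min_l;
}

/* Hypothetical alternative: round both divisions up so that ls * gs >= n.
   Not the library's code; it does not clamp ls to target_l. */
size_t ls_round_up(size_t n, size_t min_l, size_t gs) {
    size_t chunks = (n + min_l - 1) / min_l;    /* ceil(n / min_l) */
    size_t per_group = (chunks + gs - 1) / gs;  /* ceil(chunks / gs) */
    return per_group * min_l;
}

/* Checks the reported case: n = 273280, gs = 768, min_l = 32 (assumed). */
void check_reported_case(void) {
    const size_t n = 273280, min_l = 32, gs = 768;
    size_t down = ls_round_down(n, min_l, gs);
    size_t up = ls_round_up(n, min_l, gs);

    /* 273280 / 32 = 8540, 8540 / 768 truncates to 11, 11 * 32 = 352 */
    assert(down == 352);
    assert(down * gs == 270336 && down * gs < n);

    /* ceil(8540 / 768) = 12, 12 * 32 = 384, and 384 * 768 = 294912 >= n */
    assert(up == 384);
    assert(up * gs >= n);
}
```

If the min_l = 32 assumption holds, the shortfall comes from the truncating division 8540 / 768 = 11, and rounding up would give LS=384, which is still within target_l=512.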