v2.5.3
What's changed
- Eliminate volatile writes in `ConcurrentLru` internal bookkeeping code for pure reads, improving concurrent read throughput by 175%.
- Vectorize the hot methods in `CmSketch` using Neon intrinsics for ARM CPUs. This results in slightly better `ConcurrentLfu` cache throughput, measured on Apple M series and Azure Cobalt 100 CPUs.
- Unroll loops in the hot methods in `CmSketch`. This results in slightly better `ConcurrentLfu` throughput on CPUs without vector support (i.e. neither x86 AVX2 nor ARM Neon).
- On vectorized code paths (AVX2 and Neon), `CmSketch` allocates its internal buffer on the pinned object heap on .NET 6 or newer. Use of the `fixed` statement is removed, eliminating a very small overhead. Sketch block pointers are then aligned to 64 bytes, guaranteeing each block always resides on a single CPU cache line. This provides a small speedup for the `ConcurrentLfu` maintenance thread by reducing CPU cache misses.
- Minor improvements to the AVX2 JITted code via `MethodImpl(MethodImplOptions.AggressiveInlining)` and removal of local variables, improving performance on .NET 8/9 with dynamic PGO.
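The pinned-heap allocation and cache-line alignment described above can be sketched roughly as follows. This is an illustrative sketch only: `SketchBuffer` and its layout are assumptions for demonstration, not the library's actual internals.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical sketch: allocate a sketch table on the pinned object heap
// (POH, .NET 5+) so the array never moves, making a raw pointer stable
// without a fixed statement, then round the block pointer up to a
// 64-byte boundary so each 64-byte block sits on a single cache line.
unsafe class SketchBuffer
{
    public readonly long[] Table;   // POH-allocated backing array
    public readonly long* Blocks;   // 64-byte-aligned start of the blocks

    public SketchBuffer(int blockCount)
    {
        // Over-allocate by one 64-byte block (8 longs) so the start
        // pointer can always be rounded up to a cache-line boundary.
        Table = GC.AllocateUninitializedArray<long>(blockCount * 8 + 8, pinned: true);

        long addr = (long)Unsafe.AsPointer(ref Table[0]);
        Blocks = (long*)((addr + 63) & ~63L); // round up to 64 bytes
    }
}
```

Because the array lives on the pinned object heap, the `fixed` statement (and its small per-use bookkeeping cost) is unnecessary; the aligned pointer can be computed once and reused.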
Full changelog: v2.5.2...v2.5.3