@Nicoshev

Summary:
Adds a NEON implementation of FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf, which is used by Ads.
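
For readers unfamiliar with the operation, here is a minimal scalar sketch of rowwise N-bit quantization for a single row. It is not the FBGEMM code and not the NEON kernel from this PR. `QuantizedRow` and `quantize_row_nbit` are hypothetical names, the low-bits-first packing order is an assumption, and scale and bias are kept as `float` for simplicity where the real format ("SBHalf") stores them as fp16.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one quantized row; not an FBGEMM type.
struct QuantizedRow {
  std::vector<std::uint8_t> packed; // bit_rate-bit codes, lowest bits first (assumed)
  float scale;                      // kept as float here; fp16 in the real format
  float bias;                       // row minimum; fp16 in the real format
};

// Scalar sketch: quantize one row of floats to bit_rate-bit codes.
QuantizedRow quantize_row_nbit(const std::vector<float>& row, int bit_rate) {
  assert(bit_rate == 2 || bit_rate == 4 || bit_rate == 8);
  assert(!row.empty());
  const std::size_t elems_per_byte = static_cast<std::size_t>(8 / bit_rate);
  const long max_code = (1L << bit_rate) - 1;

  // Row minimum becomes the bias; the range is spread over max_code steps.
  const auto mm = std::minmax_element(row.begin(), row.end());
  const float bias = *mm.first;
  const float range = *mm.second - bias;
  // A constant row has zero range; use scale 1 to avoid dividing by zero.
  const float scale =
      range == 0.0f ? 1.0f : range / static_cast<float>(max_code);

  QuantizedRow out;
  out.packed.assign((row.size() + elems_per_byte - 1) / elems_per_byte, 0);
  out.scale = scale;
  out.bias = bias;
  for (std::size_t col = 0; col < row.size(); ++col) {
    // Round to nearest code, then clamp into [0, max_code].
    long code = std::lrintf((row[col] - bias) / scale);
    code = std::max(0L, std::min(code, max_code));
    const int shift = static_cast<int>(col % elems_per_byte) * bit_rate;
    out.packed[col / elems_per_byte] |=
        static_cast<std::uint8_t>(code << shift);
  }
  return out;
}
```

Calling `quantize_row_nbit({0.0f, 1.0f, 2.0f, 3.0f}, 2)` packs the four 2-bit codes 0, 1, 2, 3 into a single byte. A vectorized version would typically compute the min/max reduction and the scale-and-round step several lanes at a time.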

Throughput improves by roughly 5x to 26x, depending on bit rate and row width:

Before:

bit_rate, rows, cols, elems_per_usec, GB/Sec
2, 100, 16, 211.26, 0.85
2, 100, 64, 210.96, 0.84
2, 100, 128, 204.26, 0.82
2, 100, 256, 200.47, 0.80
2, 100, 512, 194.19, 0.78
2, 100, 1024, 190.98, 0.76
2, 100, 2048, 186.85, 0.75
2, 120, 16, 206.88, 0.83
2, 120, 64, 211.64, 0.85
2, 120, 128, 203.97, 0.82
2, 120, 256, 200.22, 0.80
2, 120, 512, 194.97, 0.78
2, 120, 1024, 191.76, 0.77
2, 120, 2048, 187.45, 0.75
2, 1000, 16, 205.10, 0.82
2, 1000, 64, 214.15, 0.86
2, 1000, 128, 205.43, 0.82
2, 1000, 256, 200.34, 0.80
2, 1000, 512, 196.62, 0.79
2, 1000, 1024, 194.64, 0.78
2, 1000, 2048, 187.54, 0.75
4, 100, 16, 197.97, 0.79
4, 100, 64, 200.02, 0.80
4, 100, 128, 191.06, 0.76
4, 100, 256, 186.58, 0.75
4, 100, 512, 180.76, 0.72
4, 100, 1024, 176.65, 0.71
4, 100, 2048, 175.00, 0.70
4, 120, 16, 198.93, 0.80
4, 120, 64, 201.74, 0.81
4, 120, 128, 190.95, 0.76
4, 120, 256, 186.79, 0.75
4, 120, 512, 181.32, 0.73
4, 120, 1024, 177.54, 0.71
4, 120, 2048, 174.69, 0.70
4, 1000, 16, 194.63, 0.78
4, 1000, 64, 201.64, 0.81
4, 1000, 128, 191.78, 0.77
4, 1000, 256, 186.87, 0.75
4, 1000, 512, 182.91, 0.73
4, 1000, 1024, 180.66, 0.72
4, 1000, 2048, 175.04, 0.70
8, 100, 16, 171.01, 0.68
8, 100, 64, 177.53, 0.71
8, 100, 128, 168.92, 0.68
8, 100, 256, 165.23, 0.66
8, 100, 512, 162.25, 0.65
8, 100, 1024, 158.87, 0.64
8, 100, 2048, 155.39, 0.62
8, 120, 16, 173.77, 0.70
8, 120, 64, 178.34, 0.71
8, 120, 128, 168.66, 0.67
8, 120, 256, 165.60, 0.66
8, 120, 512, 162.30, 0.65
8, 120, 1024, 159.38, 0.64
8, 120, 2048, 156.17, 0.62
8, 1000, 16, 171.34, 0.69
8, 1000, 64, 178.96, 0.72
8, 1000, 128, 169.71, 0.68
8, 1000, 256, 165.62, 0.66
8, 1000, 512, 162.98, 0.65
8, 1000, 1024, 161.59, 0.65
8, 1000, 2048, 157.16, 0.63

After:

bit_rate, rows, cols, elems_per_usec, GB/Sec
2, 100, 16, 1006.83, 4.03
2, 100, 64, 1542.11, 6.17
2, 100, 128, 1882.99, 7.53
2, 100, 256, 2063.71, 8.25
2, 100, 512, 2232.29, 8.93
2, 100, 1024, 2298.69, 9.19
2, 100, 2048, 2333.73, 9.33
2, 120, 16, 1016.40, 4.07
2, 120, 64, 1524.36, 6.10
2, 120, 128, 1853.40, 7.41
2, 120, 256, 2158.92, 8.64
2, 120, 512, 2321.61, 9.29
2, 120, 1024, 2353.80, 9.42
2, 120, 2048, 2332.84, 9.33
2, 1000, 16, 1129.08, 4.52
2, 1000, 64, 1606.46, 6.43
2, 1000, 128, 2095.33, 8.38
2, 1000, 256, 2470.88, 9.88
2, 1000, 512, 2746.67, 10.99
2, 1000, 1024, 2882.32, 11.53
2, 1000, 2048, 2447.96, 9.79
4, 100, 16, 999.05, 4.00
4, 100, 64, 1666.00, 6.66
4, 100, 128, 2062.08, 8.25
4, 100, 256, 2226.33, 8.91
4, 100, 512, 2481.11, 9.92
4, 100, 1024, 2717.50, 10.87
4, 100, 2048, 2656.00, 10.62
4, 120, 16, 1056.31, 4.23
4, 120, 64, 1651.95, 6.61
4, 120, 128, 2058.65, 8.23
4, 120, 256, 2339.64, 9.36
4, 120, 512, 2570.03, 10.28
4, 120, 1024, 2788.24, 11.15
4, 120, 2048, 2701.20, 10.80
4, 1000, 16, 1184.28, 4.74
4, 1000, 64, 1765.47, 7.06
4, 1000, 128, 2348.17, 9.39
4, 1000, 256, 2852.72, 11.41
4, 1000, 512, 3249.46, 13.00
4, 1000, 1024, 3418.46, 13.67
4, 1000, 2048, 2841.77, 11.37
8, 100, 16, 1176.35, 4.71
8, 100, 64, 1902.76, 7.61
8, 100, 128, 2196.23, 8.78
8, 100, 256, 2596.55, 10.39
8, 100, 512, 2814.30, 11.26
8, 100, 1024, 3175.49, 12.70
8, 100, 2048, 3334.41, 13.34
8, 120, 16, 1213.55, 4.85
8, 120, 64, 1806.19, 7.22
8, 120, 128, 2390.64, 9.56
8, 120, 256, 2736.11, 10.94
8, 120, 512, 3015.86, 12.06
8, 120, 1024, 3332.53, 13.33
8, 120, 2048, 3319.50, 13.28
8, 1000, 16, 1362.12, 5.45
8, 1000, 64, 2029.25, 8.12
8, 1000, 128, 2759.50, 11.04
8, 1000, 256, 3532.71, 14.13
8, 1000, 512, 4014.48, 16.06
8, 1000, 1024, 4240.49, 16.96
8, 1000, 2048, 3440.59, 13.76
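
The two metric columns appear to be related by a fixed factor: the GB/Sec values match elems_per_usec multiplied by 4 bytes and divided by 1000. That suggests the benchmark counts float32 input bytes, which is an inference from the numbers rather than something the PR states. The helpers below (hypothetical names) sketch that conversion and the per-row speedup.

```cpp
#include <cassert>
#include <cmath>

// Assumed bytes per input element (float32); inferred from the tables above.
constexpr double kBytesPerElem = 4.0;

// elements/us * bytes/element = bytes/us = MB/s; divide by 1000 for GB/s.
double gb_per_sec(double elems_per_usec) {
  return elems_per_usec * kBytesPerElem / 1000.0;
}

// Ratio of the "After" throughput to the "Before" throughput for one row.
double speedup(double before, double after) {
  assert(before > 0.0);
  return after / before;
}
```

For example, `speedup(161.59, 4240.49)` for the 8-bit, 1000 x 1024 case is a little over 26, while `speedup(211.26, 1006.83)` for the 2-bit, 100 x 16 case is just under 5.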

Differential Revision: D86774172


@meta-codesync

meta-codesync bot commented Nov 11, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86774172.

@meta-cla meta-cla bot added the cla signed label Nov 11, 2025
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 13, 2025
…lf (pytorch#5115)

Summary:
X-link: facebookresearch/FBGEMM#2121


Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 13, 2025
…lf (pytorch#5115)

Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 14, 2025
…lf (pytorch#5115)


Reviewed By: mcfi

Differential Revision: D86774172
@Nicoshev Nicoshev force-pushed the export-D86774172 branch 2 times, most recently from 5341dfd to 2a4a5f5 Compare November 17, 2025 14:01
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 17, 2025
…lf (pytorch#5115)

Summary:
X-link: facebookresearch/FBGEMM#2121


Adding NEON translation of FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf, used by Ads

Performance improves by an order of magnitude:

Before:

  bit_rate rows,   cols,    elems_per_usec,    GB/Sec
       2,   100,     16,          211.26,       0.85
       2,   100,     64,          210.96,       0.84
       2,   100,    128,          204.26,       0.82
       2,   100,    256,          200.47,       0.80
       2,   100,    512,          194.19,       0.78
       2,   100,   1024,          190.98,       0.76
       2,   100,   2048,          186.85,       0.75
       2,   120,     16,          206.88,       0.83
       2,   120,     64,          211.64,       0.85
       2,   120,    128,          203.97,       0.82
       2,   120,    256,          200.22,       0.80
       2,   120,    512,          194.97,       0.78
       2,   120,   1024,          191.76,       0.77
       2,   120,   2048,          187.45,       0.75
       2,  1000,     16,          205.10,       0.82
       2,  1000,     64,          214.15,       0.86
       2,  1000,    128,          205.43,       0.82
       2,  1000,    256,          200.34,       0.80
       2,  1000,    512,          196.62,       0.79
       2,  1000,   1024,          194.64,       0.78
       2,  1000,   2048,          187.54,       0.75
       4,   100,     16,          197.97,       0.79
       4,   100,     64,          200.02,       0.80
       4,   100,    128,          191.06,       0.76
       4,   100,    256,          186.58,       0.75
       4,   100,    512,          180.76,       0.72
       4,   100,   1024,          176.65,       0.71
       4,   100,   2048,          175.00,       0.70
       4,   120,     16,          198.93,       0.80
       4,   120,     64,          201.74,       0.81
       4,   120,    128,          190.95,       0.76
       4,   120,    256,          186.79,       0.75
       4,   120,    512,          181.32,       0.73
       4,   120,   1024,          177.54,       0.71
       4,   120,   2048,          174.69,       0.70
       4,  1000,     16,          194.63,       0.78
       4,  1000,     64,          201.64,       0.81
       4,  1000,    128,          191.78,       0.77
       4,  1000,    256,          186.87,       0.75
       4,  1000,    512,          182.91,       0.73
       4,  1000,   1024,          180.66,       0.72
       4,  1000,   2048,          175.04,       0.70
       8,   100,     16,          171.01,       0.68
       8,   100,     64,          177.53,       0.71
       8,   100,    128,          168.92,       0.68
       8,   100,    256,          165.23,       0.66
       8,   100,    512,          162.25,       0.65
       8,   100,   1024,          158.87,       0.64
       8,   100,   2048,          155.39,       0.62
       8,   120,     16,          173.77,       0.70
       8,   120,     64,          178.34,       0.71
       8,   120,    128,          168.66,       0.67
       8,   120,    256,          165.60,       0.66
       8,   120,    512,          162.30,       0.65
       8,   120,   1024,          159.38,       0.64
       8,   120,   2048,          156.17,       0.62
       8,  1000,     16,          171.34,       0.69
       8,  1000,     64,          178.96,       0.72
       8,  1000,    128,          169.71,       0.68
       8,  1000,    256,          165.62,       0.66
       8,  1000,    512,          162.98,       0.65
       8,  1000,   1024,          161.59,       0.65
       8,  1000,   2048,          157.16,       0.63

After:

  bit_rate rows,   cols,    elems_per_usec,    GB/Sec
       2,   100,     16,         1006.83,       4.03
       2,   100,     64,         1542.11,       6.17
       2,   100,    128,         1882.99,       7.53
       2,   100,    256,         2063.71,       8.25
       2,   100,    512,         2232.29,       8.93
       2,   100,   1024,         2298.69,       9.19
       2,   100,   2048,         2333.73,       9.33
       2,   120,     16,         1016.40,       4.07
       2,   120,     64,         1524.36,       6.10
       2,   120,    128,         1853.40,       7.41
       2,   120,    256,         2158.92,       8.64
       2,   120,    512,         2321.61,       9.29
       2,   120,   1024,         2353.80,       9.42
       2,   120,   2048,         2332.84,       9.33
       2,  1000,     16,         1129.08,       4.52
       2,  1000,     64,         1606.46,       6.43
       2,  1000,    128,         2095.33,       8.38
       2,  1000,    256,         2470.88,       9.88
       2,  1000,    512,         2746.67,      10.99
       2,  1000,   1024,         2882.32,      11.53
       2,  1000,   2048,         2447.96,       9.79
       4,   100,     16,          999.05,       4.00
       4,   100,     64,         1666.00,       6.66
       4,   100,    128,         2062.08,       8.25
       4,   100,    256,         2226.33,       8.91
       4,   100,    512,         2481.11,       9.92
       4,   100,   1024,         2717.50,      10.87
       4,   100,   2048,         2656.00,      10.62
       4,   120,     16,         1056.31,       4.23
       4,   120,     64,         1651.95,       6.61
       4,   120,    128,         2058.65,       8.23
       4,   120,    256,         2339.64,       9.36
       4,   120,    512,         2570.03,      10.28
       4,   120,   1024,         2788.24,      11.15
       4,   120,   2048,         2701.20,      10.80
       4,  1000,     16,         1184.28,       4.74
       4,  1000,     64,         1765.47,       7.06
       4,  1000,    128,         2348.17,       9.39
       4,  1000,    256,         2852.72,      11.41
       4,  1000,    512,         3249.46,      13.00
       4,  1000,   1024,         3418.46,      13.67
       4,  1000,   2048,         2841.77,      11.37
       8,   100,     16,         1176.35,       4.71
       8,   100,     64,         1902.76,       7.61
       8,   100,    128,         2196.23,       8.78
       8,   100,    256,         2596.55,      10.39
       8,   100,    512,         2814.30,      11.26
       8,   100,   1024,         3175.49,      12.70
       8,   100,   2048,         3334.41,      13.34
       8,   120,     16,         1213.55,       4.85
       8,   120,     64,         1806.19,       7.22
       8,   120,    128,         2390.64,       9.56
       8,   120,    256,         2736.11,      10.94
       8,   120,    512,         3015.86,      12.06
       8,   120,   1024,         3332.53,      13.33
       8,   120,   2048,         3319.50,      13.28
       8,  1000,     16,         1362.12,       5.45
       8,  1000,     64,         2029.25,       8.12
       8,  1000,    128,         2759.50,      11.04
       8,  1000,    256,         3532.71,      14.13
       8,  1000,    512,         4014.48,      16.06
       8,  1000,   1024,         4240.49,      16.96
       8,  1000,   2048,         3440.59,      13.76
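
For readers comparing the two tables, here is a minimal sketch of how the columns appear to relate. The 4-bytes-per-element factor is an assumption inferred from the numbers (it matches fp32 input); the PR does not state it. The sample rows are copied from the Before and After tables, and the helper names are hypothetical. For these rows the speedup ranges from roughly 4.8x (smallest shape) to roughly 26x (8-bit, 1000 rows, 1024 cols).

```python
# Sample rows copied from the Before/After tables in this PR:
# (bit_rate, rows, cols) -> elems_per_usec
before = {
    (2, 100, 16): 211.26,
    (4, 1000, 1024): 180.66,
    (8, 1000, 1024): 161.59,
}
after = {
    (2, 100, 16): 1006.83,
    (4, 1000, 1024): 3418.46,
    (8, 1000, 1024): 4240.49,
}

# Assumption: each input element is 4 bytes (fp32). This is inferred from the
# table values, not stated anywhere in the PR.
BYTES_PER_ELEM = 4


def gb_per_sec(elems_per_usec: float) -> float:
    """Convert elements/usec to GB/s: elems/usec * bytes/elem = MB/s, then /1000."""
    return elems_per_usec * BYTES_PER_ELEM / 1000.0


def speedup(key: tuple) -> float:
    """Ratio of After throughput to Before throughput for one table row."""
    return after[key] / before[key]


if __name__ == "__main__":
    for key in before:
        print(key, f"{gb_per_sec(after[key]):.2f} GB/s", f"{speedup(key):.1f}x")
```

Under that assumption the GB/Sec column is redundant with elems_per_usec, so the speedup can be read from either column.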

Differential Revision: D86774172
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 17, 2025
…lf (pytorch#5115)

Summary:
X-link: facebookresearch/FBGEMM#2121


@Nicoshev Nicoshev force-pushed the export-D86774172 branch 2 times, most recently from 1f61f25 to 16a2f5a Compare November 18, 2025 17:41
Nicoshev added a commit to Nicoshev/FBGEMM that referenced this pull request Nov 18, 2025
…lf (pytorch#5115)

meta-codesync bot commented Nov 19, 2025

This pull request has been merged in 23d7a88.
