Enable Intel AMX acceleration while in CPU/GPU hybrid with new "--amx" toggle. #16310
Let me know if you have any questions.
Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in
I have played with that a little, but I couldn't get it to work as expected. I think it is due to how the extra bufts were implemented in the original AMX PR. Not all the CPU weights go into the CPU_REPACK / AMX bufts, so I think we need to keep the CPU_Mapped model buffer in addition to the extra bufts (CPU_REPACK and AMX). Is that what you meant?
What I mean is adding an option to skip adding the host buffer types here: Lines 328 to 340 in bd0af02
The extra buffer types don't get used when there is a GPU because the host buffer types have higher priority. Alternatively, the option could give the repack buffers higher priority while still keeping the host buffer types.
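To illustrate the priority argument above, here is a minimal sketch of a priority-ordered buffer type list. It is not the actual llama.cpp code: the function names, the string labels, and the `skip_host_bufts` parameter are all hypothetical, and the ordering (host before extra before plain CPU) is an assumption taken from the comment above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: build the candidate buffer types for CPU-resident
// weights, highest priority first. The labels are illustrative only.
std::vector<std::string> make_buft_priority_list(bool gpu_present, bool skip_host_bufts) {
    std::vector<std::string> list;
    // Assumption: a host (pinned) buffer type is only offered when a GPU exists,
    // and it is placed ahead of the extra buffer types.
    if (gpu_present && !skip_host_bufts) {
        list.push_back("host");
    }
    list.push_back("extra"); // stands in for AMX / CPU_REPACK
    list.push_back("cpu");   // plain CPU buffer, always the last fallback
    return list;
}

// Pick the first buffer type in the list that can hold the weight.
// Assumption: only the "extra" type may reject a weight (not every tensor
// can be repacked); "host" and "cpu" accept everything.
std::string pick_buft(const std::vector<std::string>& list, bool weight_fits_extra) {
    assert(!list.empty());
    for (const std::string& name : list) {
        if (name == "extra" && !weight_fits_extra) {
            continue; // fall through to the next candidate
        }
        return name;
    }
    return "";
}
```

With a GPU present and the host buffer types kept, `pick_buft` returns `"host"` for every weight, so the extra buffer types are never reached. Skipping the host buffer types lets repackable weights land in `"extra"`, while the rest fall back to `"cpu"`.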
I will make the change and update the PR.
@slaren any feedback on what the "opt-in" switch should be called? I can keep it "--amx" or I can make it more generic, something like "--xbuffers" in case there are any other extra buffers added in the future?
I am not sure what would be the opt-in switch. What I am proposing is a flag to disable host buffer types, and it should be called something like
@slaren all changes have been made. All feedback welcomed, and thank you for all your help.
This change adds a new toggle, "--amx", that allows the extra buffer types to remain functional when a GPU is present, enabling AMX operations in CPU/GPU hybrid mode. If the "--amx" toggle is not present, current behavior is maintained.
This change allows significant performance increases on the CPU-offloaded layers / MoE in hybrid operation, especially in prompt evaluation, where uplifts of 100%-150% or more are common:
Examples:
Base command:
No AMX (Current behavior):
W/ "--amx":
Results:
Metric | No AMX | W/ "--amx" | Change (tps) | Change (%)
--- | --- | --- | --- | ---
Prompt Evaluation | 119.54 tps | 255.18 tps | +135.64 | +113.47%
Token Evaluation | 34.96 tps | 40.13 tps | +5.17 | +14.79%
Overall Inference | 37.90 tps | 43.93 tps | +6.02 | +15.90%
Sampling | 10123.18 tps | 10185.00 tps | +61.82 | +0.61%
With "--cpu-moe":
No AMX (Current behavior):
W/ "--amx":
Results:
Metric | No AMX | W/ "--amx" | Change (tps) | Change (%)
--- | --- | --- | --- | ---
Prompt Evaluation | 101.78 tps | 230.11 tps | +128.33 | +126.06%
Token Evaluation | 32.86 tps | 37.65 tps | +4.79 | +14.58%
Overall Inference | 35.18 tps | 41.24 tps | +6.06 | +17.23%
Sampling | 9961.96 tps | 10232.84 tps | +270.88 | +2.72%
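For reference, the last two columns of the results appear to be the absolute difference in tokens per second and that difference relative to the "No AMX" baseline. A minimal sketch of that arithmetic follows; the `Uplift` struct and `compute_uplift` function are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: absolute and relative change between two throughputs.
struct Uplift {
    double delta_tps; // new minus baseline, in tokens per second
    double percent;   // delta as a percentage of the baseline
};

Uplift compute_uplift(double baseline_tps, double new_tps) {
    assert(baseline_tps > 0.0); // relative change is undefined for a zero baseline
    const double delta = new_tps - baseline_tps;
    return Uplift{delta, 100.0 * delta / baseline_tps};
}
```

For the first prompt evaluation row, `compute_uplift(119.54, 255.18)` gives a difference of about 135.64 tps and a relative change of about 113.47%, matching the reported figures.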