Expose method name as part of backend init context (#6622)
Summary:
Provide the method name to the backend so it can load the corresponding method.
The most immediate need is that the QNN context binary can include two methods, one for prefill and one for decode. Since we don't allow a backend to access multiple methods at the moment, we work around it in a hacky way as follows.
## AOT:
```
class LLama_transformer():
    def prefill(self): ...
    def decode(self): ...
```
We then get two custom ops from the two to_backend calls, and each one carries a context binary that contains both graphs
```
QAT (prefill) -> to_backend(...) => prefill.qcir flatbuffers
QAT (decode) -> to_backend(...) => decode.qcir flatbuffers
=>
graph prefill(
  custom_op_prefill() -> context_binary (two graphs)
)
graph decode(
  custom_op_decode() -> context_binary (two graphs)
)
```
Since the context binaries from these two custom ops are exactly the same, they can be deduplicated during emit via this line https://github.com/pytorch/executorch/blob/d4a9ca01eb5bb786ecbfbcd8302253eb7797e8bb/exir/emit/_emitter.py#L136 and these lines https://github.com/pytorch/executorch/blob/d4a9ca01eb5bb786ecbfbcd8302253eb7797e8bb/exir/emit/_emitter.py#L1065-L1066
```
.pte instructions
[
"prefill": [instructions: call_delegate(prefill_input)]
"decode": [instructions: call_delegate(decode_input)]
"delegate_payload": Dict[bytes, index]
]
```
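The deduplication described above can be illustrated with a minimal sketch. This is not the emitter's actual code: `dedup_payloads` and the sample bytes are hypothetical names chosen for illustration, and the only idea taken from the summary is a `Dict[bytes, index]` that maps identical payloads to one stored copy.

```python
from typing import Dict, List, Tuple


def dedup_payloads(payloads: List[bytes]) -> Tuple[List[bytes], List[int]]:
    """Store each distinct payload once and return one index per input.

    Hypothetical helper, not part of ExecuTorch.
    """
    index_of: Dict[bytes, int] = {}  # payload bytes -> slot in `unique`
    unique: List[bytes] = []
    indices: List[int] = []
    for blob in payloads:
        if blob not in index_of:
            # First time these bytes are seen: store them once.
            index_of[blob] = len(unique)
            unique.append(blob)
        indices.append(index_of[blob])
    return unique, indices


# Hypothetical stand-in for a context binary that holds both graphs.
context_binary = b"qnn-context:prefill+decode"

# The prefill and decode delegates carry byte-identical payloads.
unique, indices = dedup_payloads([context_binary, context_binary])
print(len(unique), indices)  # 1 [0, 0]
```

Both delegate calls end up pointing at the same stored payload, which is why carrying the full context binary in each custom op does not double the size of the .pte file.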
## Runtime
With this change, the backend can read the method name from the init context and load the graph that matches the top-level method
```
Result<DelegateHandle*> QNNBackend::init(
    BackendInitContext& context,
    FreeableBuffer* processed,
    ArrayRef<CompileSpec> compile_specs) {
  const char* method_name = context.get_method_name();  // for example, "prefill"
  // Pseudocode: load the graph that matches the top-level method.
  handle = qnn_backend.load(method_name);
  return handle;
}
```
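As a toy model of the runtime flow above, the following Python sketch shows one shared payload serving both methods, with the graph selected by the method name from the init context. `InitContext`, `ToyBackend`, and the payload layout are all hypothetical stand-ins; the real backend is C++ and its context binary is not a Python dict.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class InitContext:
    """Toy stand-in for BackendInitContext; it only carries the method name."""
    method_name: str


class ToyBackend:
    """Toy backend whose payload holds several named graphs."""

    def init(self, context: InitContext, processed: Dict[str, str]) -> str:
        name = context.method_name
        if name not in processed:
            raise KeyError(f"no graph named {name!r} in payload")
        # Return the graph that matches the top-level method.
        return processed[name]


# One shared payload (hypothetical layout) containing both graphs.
shared_payload = {"prefill": "prefill_graph", "decode": "decode_graph"}
backend = ToyBackend()

print(backend.init(InitContext("prefill"), shared_payload))  # prefill_graph
print(backend.init(InitContext("decode"), shared_payload))   # decode_graph
```

Without the method name in the context, `init` would have no way to tell which of the two graphs in the identical payload it is being asked to load.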
This unblocks weight sharing between prefill and decode on the HTP backend.
Reviewed By: dbort
Differential Revision: D65386597