Description
There are a number of things we can do to speed up Python-to-Python calls without changing the stack layout.
Faster creation of interpreter frames
Our fastest Python-to-Python call is `_INIT_CALL_PY_EXACT_ARGS`, which is reasonably efficient but could definitely be made faster.
There are a few issues with it:
- It contains a variable-length loop.
- The inlined call to `_PyFrame_PushUnchecked` also contains a variable-length loop.
We can make the loops fixed length by:
- Unconditionally copying `self_or_null`.
- Only adjusting the pointer, not the count, if `self` is not `NULL`. If `self_or_null` is `NULL`, it will then be overwritten.
- Breaking `_INIT_CALL_PY_EXACT_ARGS` into two parts, one to initialize the arguments and one to `NULL` out the remaining locals. Both can be marked `replicate` to avoid the loop.
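
To make the pointer-adjustment idea concrete, here is a minimal Python model of the two schemes. It is only a sketch: the real code is C operating on the frame's locals array, and the names `init_locals_variable`, `init_locals_fixed`, `NULL`, and `UNINITIALIZED` are made up for illustration. It also assumes an exact-args call (the locals array is large enough for `self` plus all arguments) and that the first slot is always writable.

```python
NULL = object()           # stand-in for a C NULL pointer
UNINITIALIZED = object()  # stand-in for whatever frame memory held before

def init_locals_variable(self_or_null, args, nlocals):
    """Model of the current scheme: the copy count depends on self_or_null."""
    args = list(args)
    if self_or_null is not NULL:
        args.insert(0, self_or_null)      # the count changes at run time
    localsplus = [UNINITIALIZED] * nlocals
    for i in range(len(args)):            # variable-length copy loop
        localsplus[i] = args[i]
    for i in range(len(args), nlocals):   # NULL out the remaining locals
        localsplus[i] = NULL
    return localsplus

def init_locals_fixed(self_or_null, args, nlocals):
    """Model of the proposed scheme: the copy count is always len(args)."""
    # Assumption: slot 0 can always be written, even when nlocals == 0.
    localsplus = [UNINITIALIZED] * max(nlocals, 1)
    # Part 1: initialize the arguments.
    localsplus[0] = self_or_null          # unconditional copy
    # Adjust only the destination pointer, never the count. If self_or_null
    # is NULL, dest stays at 0 and the first argument overwrites it.
    dest = 1 if self_or_null is not NULL else 0
    for i in range(len(args)):            # always exactly len(args) copies
        localsplus[dest + i] = args[i]
    # Part 2: NULL out the remaining locals.
    for i in range(dest + len(args), nlocals):
        localsplus[i] = NULL
    return localsplus[:nlocals]
```

In this model the argument copy in `init_locals_fixed` always runs `len(args)` times regardless of whether `self` is present, so a version specialized for each argument count could be fully unrolled.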
Better optimization of other Py-to-Py calls in tier 2
We currently specialize the remaining Py-to-Py calls only for the "with defaults" case, and do not specialize for "code complex parameters".
We should treat both the same in tier 1 as "CALL_PY", and expand the call sequence in tier 2 to
produce an optimal sequence of instructions.
This will probably make no difference to T1 performance: the "with defaults" case will get a tiny bit slower and the other cases might be a bit faster.
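
As a rough illustration of what "expanding the call sequence" could mean, the sketch below turns a positional-only call into a short list of steps once the argument count and the callee's shape are known. Everything here is hypothetical: the function `expand_call_py` and the step names (`COPY_ARGS`, `COPY_DEFAULTS`, `NULL_LOCALS`, `PUSH_FRAME`, `FALLBACK_GENERIC_CALL`) are invented for this example and are not existing tier 2 instructions. Keyword arguments, `*args`, and `**kwargs` are ignored.

```python
def expand_call_py(nargs, argcount, ndefaults, nlocals):
    """Return a hypothetical step list for calling a Python function.

    nargs     -- positional arguments supplied at the call site
    argcount  -- positional parameters the callee declares
    ndefaults -- how many trailing parameters have default values
    nlocals   -- total local variable slots in the callee's frame
    """
    missing = argcount - nargs
    if missing < 0 or missing > ndefaults:
        # Too many or too few arguments: leave it to the general path.
        return [("FALLBACK_GENERIC_CALL",)]
    steps = [("COPY_ARGS", nargs)]
    if missing:
        # Defaults cover the last `ndefaults` parameters, so the ones we
        # need start at this offset in the defaults tuple.
        steps.append(("COPY_DEFAULTS", ndefaults - missing, missing))
    if nlocals > argcount:
        steps.append(("NULL_LOCALS", nlocals - argcount))
    steps.append(("PUSH_FRAME",))
    return steps
```

For a call that supplies every argument to a function with no extra locals, the result is just a copy and a frame push, with no work spent on defaults.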
Remove `f_globals` and `f_builtins` from the interpreter frame
In tier 2, we have largely eliminated access to `f_globals` and `f_builtins`.
If we were to remove these fields, we could speed up calls without slowing down access to globals in tier 2.
Doing this will slow down access to globals in tier 1, however.
In order to get an overall speedup, the ratio of tier 2 to tier 1 code will need to increase.
Once the ratio of T2 to T1 code is 3:1 or better, it should be profitable to remove these fields.
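
The trade-off can be pictured with a toy break-even model. This is only a sketch under strong assumptions: it treats the change as a fixed benefit per unit of tier 2 execution and a fixed penalty per unit of tier 1 execution, and the function names and parameters are hypothetical. No measured costs are implied. Under this model, a 3:1 break-even point would correspond to the tier 1 penalty being about three times the tier 2 benefit.

```python
def net_gain(t2_to_t1_ratio, t2_benefit, t1_penalty):
    """Net gain per unit of execution for a given T2:T1 ratio.

    t2_benefit and t1_penalty are hypothetical per-unit costs in the
    same arbitrary units; a positive result means an overall speedup.
    """
    t2_share = t2_to_t1_ratio / (t2_to_t1_ratio + 1.0)
    t1_share = 1.0 - t2_share
    return t2_share * t2_benefit - t1_share * t1_penalty

def break_even_ratio(t2_benefit, t1_penalty):
    """T2:T1 ratio at which the benefit and the penalty cancel out."""
    return t1_penalty / t2_benefit
```

Any ratio above the break-even value gives a positive `net_gain` in this model, and any ratio below it gives a negative one.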