# The bytecode interpreter

This document describes the workings and implementation of the bytecode
interpreter, the part of CPython that executes compiled Python code.

## Instruction decoding

The first task of the interpreter is to decode the bytecode instructions.
Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
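
A simplified sketch of decoding one instruction from this array, including any
`EXTENDED_ARG` prefixes, is shown below. It assumes the `_Py_CODEUNIT` union
exposes the opcode and argument bytes as `op.code` and `op.arg`, and it is not
the exact code used in [Python/ceval.c](../Python/ceval.c):

```c
/* Illustrative decode of one instruction; next_instr points at the next
   code unit to execute. */
_Py_CODEUNIT word = *next_instr++;
int opcode = word.op.code;
int oparg = word.op.arg;
/* EXTENDED_ARG prefixes widen the argument eight bits at a time. */
while (opcode == EXTENDED_ARG) {
    word = *next_instr++;
    opcode = word.op.code;
    oparg = (oparg << 8) | word.op.arg;
}
```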

For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
is rare) the actual code is different.

## Jumps

Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
already points to the next instruction.
Thus, jump instructions can be implemented by manipulating `next_instr`:

- A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
- A jump backward sets `next_instr -= oparg`.
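
For instance, a forward-jump handler in the instruction `switch` can be sketched
as follows, using the `TARGET`, `JUMPBY`, and `DISPATCH` macro names from
[Python/ceval.c](../Python/ceval.c) (the real handler may differ between versions):

```c
TARGET(JUMP_FORWARD) {
    JUMPBY(oparg);   /* next_instr += oparg */
    DISPATCH();      /* decode and execute the instruction now at next_instr */
}
```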

## Inline cache entries

Some (specialized or specializable) instructions have an associated "inline cache".
The inline cache consists of one or more two-byte entries included in the bytecode
array as additional words following the `opcode`/`oparg` pair.
The size of the inline cache for a particular instruction is fixed by its `opcode`.
Moreover, the inline cache size for all instructions in a
[family of specialized/specializable instructions](#specialization)
(for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must be
the same. Cache entries are reserved by the compiler and initialized with zeros.
Although they are represented by code units, cache entries do not conform to the
`opcode`/`oparg` format.
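
As an illustration, an instruction with three cache entries might lay them out
and skip over them as sketched below; the struct and field names here are
hypothetical, not the actual layouts defined in the CPython headers:

```c
/* Hypothetical cache layout; each field occupies one 16-bit code unit. */
typedef struct {
    uint16_t counter;   /* first entry: the adaptive counter */
    uint16_t version;   /* e.g. a type or dict-keys version tag */
    uint16_t index;     /* e.g. an index into a dict's entries */
} ExampleCache;

/* Inside the handler, the cache starts right after the opcode/oparg pair: */
ExampleCache *cache = (ExampleCache *)next_instr;
/* ... guards and the fast path read cache->version and cache->index ... */

/* Before dispatching, skip over the cache entries: */
next_instr += sizeof(ExampleCache) / sizeof(_Py_CODEUNIT);
```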

More information about the use of inline caches can be found in
[PEP 659](https://peps.python.org/pep-0659/#ancillary-data).

## The evaluation stack

Most instructions read or write some data in the form of object references (`PyObject *`).
The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
on an evaluation stack.
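
For illustration, handlers that move values between local variables and the
evaluation stack can be sketched like this, using the historical
`PUSH`/`POP`/`GETLOCAL`/`SETLOCAL` helper macros rather than the exact
generated code:

```c
/* LOAD_FAST-like: push a new reference to local variable number oparg. */
PyObject *value = GETLOCAL(oparg);
Py_INCREF(value);
PUSH(value);

/* STORE_FAST-like: pop the top of the stack into local number oparg. */
PyObject *top = POP();
SETLOCAL(oparg, top);
```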

> Do not confuse the evaluation stack with the call stack, which is used to implement calling
> and returning from functions.

## Error handling

When the implementation of an opcode raises an exception, it jumps to the
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
The exception is then handled as described in the
[exception handling documentation](exception_handling.md#handling-exceptions).
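
For example, a handler along the lines of `BINARY_SUBSCR` typically follows the
pattern sketched below; in the real interpreter the `error` label performs some
bookkeeping before control reaches `exception_unwind`:

```c
PyObject *key = POP();
PyObject *container = POP();
PyObject *res = PyObject_GetItem(container, key);   /* sets an exception on failure */
Py_DECREF(container);
Py_DECREF(key);
if (res == NULL) {
    goto error;   /* the unwinding code looks for a handler in this frame */
}
PUSH(res);
```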

## Python-to-Python calls

The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
the interpreter calls some C function that calls back into the interpreter.

A similar check is performed when an unhandled exception occurs.

## The call stack

Up through 3.10, the call stack was implemented as a singly-linked list of
[frame objects](frames.md). This was expensive because each call would require a
heap allocation for the frame.

See also the [generators](generators.md) section.

<!--

## All sorts of variables

The bytecode compiler determines the scope in which each variable name is defined,
and generates instructions accordingly. For example, loading a local variable

-->

## Introducing a new bytecode instruction

It is occasionally necessary to add a new opcode in order to implement
a new feature or change the way that existing features are compiled.

Run `make regen-importlib` for updating the
bytecode of frozen importlib files. You have to run `make` again after this
to recompile the generated C files.

## Specialization

Bytecode specialization, which was introduced in
[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
rewriting instructions based on runtime information. This is done by replacing
a generic instruction with a faster version that works for the cases that the
running program actually encounters. Each specializable instruction is responsible
for rewriting itself, using its [inline caches](#inline-cache-entries) for
bookkeeping.

When an adaptive instruction executes, it may attempt to specialize itself,
depending on its argument and the contents of its inline cache. This is done
by calling one of the `_Py_Specialize_XXX` functions in
[`Python/specialize.c`](../Python/specialize.c).

The specialized instructions are responsible for checking that the special-case
assumptions still apply, and de-optimizing back to the generic version if not.

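The general shape of an adaptive instruction can be sketched as follows, loosely
modeled on `LOAD_GLOBAL`; the macro and helper names approximate those used in
the C sources and differ between CPython versions:

```c
TARGET(LOAD_GLOBAL) {
    _PyLoadGlobalCache *cache = (_PyLoadGlobalCache *)next_instr;
    if (ADAPTIVE_COUNTER_IS_ZERO(cache->counter)) {
        /* Attempt to rewrite this instruction into LOAD_GLOBAL_MODULE or
           LOAD_GLOBAL_BUILTIN and to fill in the remaining cache entries. */
        PyObject *name = GETITEM(names, oparg);
        _Py_Specialize_LoadGlobal(GLOBALS(), BUILTINS(), next_instr - 1, name);
        DISPATCH_SAME_OPARG();   /* re-run the (possibly rewritten) instruction */
    }
    else {
        STAT_INC(LOAD_GLOBAL, deferred);
        DECREMENT_ADAPTIVE_COUNTER(cache->counter);
        /* ... generic LOAD_GLOBAL implementation ... */
    }
}
```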

## Families of instructions

A *family* of instructions consists of an adaptive instruction along with the
specialized instructions that it can be replaced by.
It has the following fundamental properties:

* It corresponds to a single instruction in the code
  generated by the bytecode compiler.
* It has a single adaptive instruction that records an execution count and,
  at regular intervals, attempts to specialize itself. If not specializing,
  it executes the base implementation.
* It has at least one specialized form of the instruction that is tailored
  for a particular value or set of values at runtime.
* All members of the family must have the same number of inline cache entries,
  to ensure correct execution.
  Individual family members do not need to use all of the entries,
  but must skip over any unused entries when executing.

The current implementation also requires the following,
although these are not fundamental and may change:

* All families use one or more inline cache entries;
  the first entry is always the counter.
* All instruction names should start with the name of the adaptive
  instruction.
* Specialized forms should have names describing their specialization.

## Example family

The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
already has an adaptive family that serves as a relatively simple example.

The `LOAD_GLOBAL` instruction performs adaptive specialization,
calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.

There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
which is specialized for global variables in the module, and
`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
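
Conceptually, `_Py_Specialize_LoadGlobal()` inspects the two namespaces and
rewrites the instruction accordingly. The outline below is a simplified sketch
of that decision; the helpers `lookup_module_index`, `lookup_builtin_index`, and
`dict_keys_version` are hypothetical, the cache field names are simplified, and
the real logic in [`Python/specialize.c`](../Python/specialize.c) handles more cases:

```c
/* Simplified outline; not the actual function in Python/specialize.c. */
static void
specialize_load_global_sketch(PyObject *globals, PyObject *builtins,
                              _Py_CODEUNIT *instr, PyObject *name)
{
    _PyLoadGlobalCache *cache = (_PyLoadGlobalCache *)(instr + 1);
    int index = lookup_module_index(globals, name);      /* hypothetical helper */
    if (index >= 0) {
        /* Remember which keys version and slot the value was found in,
           then rewrite the opcode to the specialized form. */
        cache->module_keys_version = dict_keys_version(globals);
        cache->index = (uint16_t)index;
        instr->op.code = LOAD_GLOBAL_MODULE;
        return;
    }
    index = lookup_builtin_index(builtins, name);        /* hypothetical helper */
    if (index >= 0) {
        cache->builtin_keys_version = dict_keys_version(builtins);
        cache->index = (uint16_t)index;
        instr->op.code = LOAD_GLOBAL_BUILTIN;
        return;
    }
    /* Could not specialize: record the failure and back off for a while. */
    STAT_INC(LOAD_GLOBAL, failure);
    cache->counter = adaptive_counter_backoff(cache->counter);
}
```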

## Performance analysis

The benefit of a specialization can be assessed with the ratio `Tbase/Tadaptive`,
where `Tbase` is the mean time to execute the base instruction,
and `Tadaptive` is the mean time to execute the specialized and adaptive forms:

`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss)`

Here `Ti` is the time to execute the `i`th instruction in the family and `Ni` is
the number of times that instruction is executed.
`Tmiss` is the time to process a miss, including de-optimization
and the time to execute the base instruction.
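
As a purely illustrative calculation with made-up numbers: if the specialized
forms take `Ti = 10ns` and together account for `sum(Ni) = 990` executions,
while a miss costs `Tmiss = 50ns` and occurs `Nmiss = 10` times, then
`Tadaptive = (10*990 + 50*10) / (990 + 10) = 10.4ns`; against a base
instruction with `Tbase = 20ns`, the specialization roughly halves the cost
of the instruction.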

The ideal situation is where misses are rare and the specialized
forms are much faster than the base instruction.
`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`, in which case
`Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
we would expect the specialization of `LOAD_GLOBAL` to be profitable.

## Design considerations

While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
low for all specialized instructions and `Nmiss` as low as possible.

Keeping `Nmiss` low means that there should be specializations for almost
all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
keeping `Ti` low, which means minimizing branches and dependent memory
accesses (pointer chasing). These two objectives may be in conflict,
requiring judgement and experimentation to design the family of instructions.

The size of the inline cache should be as small as possible,
without impairing performance, to reduce the number of
`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.

### Gathering data

Before choosing how to specialize an instruction, it is important to gather
some data. What are the patterns of usage of the base instruction?
Data can best be gathered by instrumenting the interpreter. Since a
specialization function and adaptive instruction are going to be required,
instrumentation can most easily be added in the specialization function.

### Choice of specializations

The performance of the specializing adaptive interpreter relies on the
quality of specialization and keeping the overhead of specialization low.

Specialized instructions must be fast. In order to be fast,
specialized instructions should be tailored for a particular
set of values that allows them to:

1. Verify that the incoming value is part of that set with low overhead.
2. Perform the operation quickly.

This requires that the set of values is chosen such that membership can be
tested quickly and that membership is sufficient to allow the operation to be
performed quickly.

For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
dictionaries that have keys with the expected version.

This can be tested quickly:

* `globals->keys->dk_version == expected_version`

and the operation can be performed quickly:

* `value = entries[cache->index].me_value;`

Because it is impossible to measure the performance of an instruction without
also measuring unrelated factors, the assessment of the quality of a
specialization will require some judgement.

As a general rule, specialized instructions should be much faster than the
base instruction.

### Implementation of specialized instructions

In general, specialized instructions should be implemented in two parts:

1. A sequence of guards, each of the form
   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
2. The operation, which should ideally have no branches and
   a minimum number of dependent memory accesses.

In practice, the parts may overlap, as data required for guards
can be re-used in the operation.

If there are branches in the operation, then consider further specialization
to eliminate the branches.
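
A sketch of this two-part structure, loosely modeled on `LOAD_GLOBAL_MODULE`,
is shown below; the cache fields and macro details are simplified and may not
match the current sources exactly:

```c
TARGET(LOAD_GLOBAL_MODULE) {
    _PyLoadGlobalCache *cache = (_PyLoadGlobalCache *)next_instr;
    /* Guards: de-optimize to the base instruction if any assumption fails. */
    DEOPT_IF(!PyDict_CheckExact(GLOBALS()), LOAD_GLOBAL);
    PyDictObject *dict = (PyDictObject *)GLOBALS();
    DEOPT_IF(dict->ma_keys->dk_version != cache->module_keys_version, LOAD_GLOBAL);
    PyObject *res = DK_UNICODE_ENTRIES(dict->ma_keys)[cache->index].me_value;
    DEOPT_IF(res == NULL, LOAD_GLOBAL);
    /* Operation: all guards passed, record the hit and take the fast path. */
    STAT_INC(LOAD_GLOBAL, hit);
    Py_INCREF(res);
    PUSH(res);
    JUMPBY(INLINE_CACHE_ENTRIES_LOAD_GLOBAL);   /* skip over the cache entries */
    DISPATCH();
}
```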

### Maintaining stats

Finally, take care that stats are gathered correctly.
After the last `DEOPT_IF` has passed, a hit should be recorded with
`STAT_INC(BASE_INSTRUCTION, hit)`.
After an optimization has been deferred in the adaptive instruction,
that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.

## Additional resources
