
Commit 0c96eb2

merge adaptive.md into interpreter.md
1 parent e5802d7 commit 0c96eb2

File tree

5 files changed: +184 -41 lines changed

InternalDocs/README.md

Lines changed: 1 addition & 3 deletions

@@ -34,9 +34,7 @@ Runtime Objects
 Program Execution
 ---

-- [The Basic Interpreter](interpreter.md)
-
-- [The Specializing Interpreter](adaptive.md)
+- [The Bytecode Interpreter](interpreter.md)

 - [The Tier 2 Interpreter and JIT](tier2.md)

InternalDocs/code_objects.md

Lines changed: 5 additions & 0 deletions

@@ -18,6 +18,11 @@ Code objects are typically produced by the bytecode [compiler](compiler.md),
 although they are often written to disk by one process and read back in by another.
 The disk version of a code object is serialized using the
 [marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+When a [`CodeObject`](code_objects.md) is created, the function
+`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is
+called to initialize the caches of all adaptive instructions. This is
+required because the on-disk format is a sequence of bytes, and
+some of the caches need to be initialized with 16-bit values.

 Code objects are nominally immutable.
 Some fields (including `co_code_adaptive` and fields for runtime
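The quickening step described in the added lines can be sketched schematically. This is an illustrative model, not CPython's actual code: the opcode number, cache size table, and initial counter value below are invented, and the real logic lives in `_PyCode_Quicken()` in `Python/specialize.c`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of quickening: the on-disk bytecode contains zeroed
 * cache entries, and quickening walks the instruction array writing initial
 * 16-bit values (such as warmup counters) into the cache slots that follow
 * each adaptive instruction.  All constants here are invented. */

typedef uint16_t codeunit;            /* stands in for _Py_CODEUNIT */

#define OP_LOAD_ATTR    1             /* hypothetical adaptive opcode */
#define OP_NOP          0
#define INITIAL_COUNTER 0x0035        /* hypothetical warmup counter */

/* cache size (in code units) per opcode; adaptive ones are nonzero */
static int cache_size(int opcode) { return opcode == OP_LOAD_ATTR ? 1 : 0; }

static void quicken(codeunit *code, int len)
{
    for (int i = 0; i < len; ) {
        int opcode = code[i] & 0xFF;
        int ncache = cache_size(opcode);
        if (ncache > 0) {
            code[i + 1] = INITIAL_COUNTER;  /* first cache entry: counter */
        }
        i += 1 + ncache;                    /* skip operands and caches */
    }
}
```

The key point the sketch illustrates is why this cannot be done on disk: marshal stores a flat byte sequence, so any cache entry that must start from a nonzero 16-bit value has to be filled in at load time.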

InternalDocs/compiler.md

Lines changed: 0 additions & 10 deletions

@@ -595,16 +595,6 @@ Objects
 * [Exception Handling](exception_handling.md): Describes the exception table


-Specializing Adaptive Interpreter
-=================================
-
-Adding a specializing, adaptive interpreter to CPython will bring significant
-performance improvements. These documents provide more information:
-
-* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/).
-* [Adding or extending a family of adaptive instructions](adaptive.md)
-
 References
 ==========

InternalDocs/interpreter.md

Lines changed: 174 additions & 24 deletions

@@ -1,8 +1,4 @@
-The bytecode interpreter
-========================
-
-Overview
---------
+# The bytecode interpreter

 This document describes the workings and implementation of the bytecode
 interpreter, the part of Python that executes compiled Python code. Its
@@ -47,8 +43,7 @@ simply calls [`_PyEval_EvalFrameDefault()`] to execute the frame. However, as pe
 `_PyEval_EvalFrameDefault()`.


-Instruction decoding
---------------------
+## Instruction decoding

 The first task of the interpreter is to decode the bytecode instructions.
 Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
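The decoding described here, including folding in `EXTENDED_ARG` prefixes, can be modeled with a short sketch. This is a simplified illustration, not the interpreter's actual code: the `EXTENDED_ARG` opcode number and the byte layout accessors are stand-ins, and (as the document notes) the real code is structured differently for efficiency.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch: each 16-bit code unit holds an 8-bit opcode and an
 * 8-bit oparg, and EXTENDED_ARG prefixes shift earlier arg bytes into the
 * higher bits of the final oparg.  Opcode value is illustrative only. */

typedef uint16_t codeunit;

#define EXTENDED_ARG 144   /* illustrative opcode number */

static int unit_opcode(codeunit u) { return u & 0xFF; }
static int unit_oparg(codeunit u)  { return u >> 8; }

/* Decode one complete instruction starting at *pc, folding in any
 * EXTENDED_ARG prefixes; returns the opcode, stores the full oparg. */
static int decode(const codeunit *code, int *pc, int *oparg)
{
    int arg = 0;
    int op;
    do {
        codeunit u = code[(*pc)++];
        op = unit_opcode(u);
        arg = (arg << 8) | unit_oparg(u);
    } while (op == EXTENDED_ARG);
    *oparg = arg;
    return op;
}
```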
@@ -110,8 +105,7 @@ snippet decode a complete instruction:
 For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
 is rare) the actual code is different.

-Jumps
-=====
+## Jumps

 Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
 already points to the next instruction.
@@ -120,15 +114,14 @@ Thus, jump instructions can be implemented by manipulating `next_instr`:
 - A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
 - A jump backward sets `next_instr -= oparg`.

-Inline cache entries
-====================
+## Inline cache entries

 Some (specialized or specializable) instructions have an associated "inline cache".
 The inline cache consists of one or more two-byte entries included in the bytecode
 array as additional words following the `opcode`/`oparg` pair.
 The size of the inline cache for a particular instruction is fixed by its `opcode`.
 Moreover, the inline cache size for all instructions in a
-[family of specialized/specializable instructions](adaptive.md)
+[family of specialized/specializable instructions](#specialization)
 (for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be
 the same. Cache entries are reserved by the compiler and initialized with zeros.
 Although they are represented by code units, cache entries do not conform to the
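The layout this hunk describes, cache words interleaved with instructions, can be sketched as follows. This is an illustrative model under invented assumptions (the opcode number and its cache size are made up); it only shows why the interpreter must know each opcode's cache size in order to step to the next instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not CPython source): cache entries are extra 16-bit
 * words placed directly after an instruction's opcode/oparg unit.  The
 * interpreter steps over them when advancing, and every member of a family
 * must agree on how many there are. */

typedef uint16_t codeunit;

#define OP_WITH_CACHE 10   /* hypothetical opcode owning 2 cache entries */

static int cache_entries(int opcode)
{
    return opcode == OP_WITH_CACHE ? 2 : 0;
}

/* Advance past one instruction, skipping its inline cache entries. */
static int next_instruction(const codeunit *code, int pc)
{
    int opcode = code[pc] & 0xFF;
    return pc + 1 + cache_entries(opcode);
}
```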
@@ -153,8 +146,7 @@ Serializing non-zero cache entries would present a problem because the serializa
 More information about the use of inline caches can be found in
 [PEP 659](https://peps.python.org/pep-0659/#ancillary-data).

-The evaluation stack
---------------------
+## The evaluation stack

 Most instructions read or write some data in the form of object references (`PyObject *`).
 The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
@@ -193,16 +185,14 @@ For example, the following sequence is illegal, because it keeps pushing items o
 > Do not confuse the evaluation stack with the call stack, which is used to implement calling
 > and returning from functions.

-Error handling
---------------
+## Error handling

 When the implementation of an opcode raises an exception, it jumps to the
 `exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
 The exception is then handled as described in the
 [`exception handling documentation`](exception_handling.md#handling-exceptions).

-Python-to-Python calls
-----------------------
+## Python-to-Python calls

 The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
 the interpreter calls some C function that calls back into the interpreter.
@@ -227,8 +217,7 @@ returns from `_PyEval_EvalFrameDefault()` altogether, to a C caller.

 A similar check is performed when an unhandled exception occurs.

-The call stack
---------------
+## The call stack

 Up through 3.10, the call stack was implemented as a singly-linked list of
 [frame objects](frames.md). This was expensive because each call would require a
@@ -262,8 +251,7 @@ See also the [generators](generators.md) section.

 <!--
-All sorts of variables
-----------------------
+## All sorts of variables

 The bytecode compiler determines the scope in which each variable name is defined,
 and generates instructions accordingly. For example, loading a local variable
@@ -297,8 +285,7 @@ Other topics

 -->

-Introducing a new bytecode instruction
---------------------------------------
+## Introducing a new bytecode instruction

 It is occasionally necessary to add a new opcode in order to implement
 a new feature or change the way that existing features are compiled.
@@ -355,6 +342,169 @@ new bytecode properly. Run `make regen-importlib` for updating the
 bytecode of frozen importlib files. You have to run `make` again after this
 to recompile the generated C files.

+## Specialization
+
+Bytecode specialization, which was introduced in
+[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
+rewriting instructions based on runtime information. This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
+
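The adaptive cycle added above, warm up, attempt specialization, fall back, can be modeled schematically. This is not the interpreter's real code: the opcode values, the counter behavior, and the single-flag stand-in for a `_Py_Specialize_XXX` function are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Schematic model of an adaptive instruction: the first cache entry is a
 * countdown counter.  While it is nonzero the generic implementation runs;
 * when it reaches zero the instruction tries to specialize itself, either
 * rewriting its opcode to a specialized form or resetting the counter and
 * staying generic.  All values here are illustrative. */

typedef uint16_t codeunit;

enum { OP_GENERIC = 1, OP_SPECIALIZED = 2 };

/* Stand-in for a _Py_Specialize_XXX() function: succeed only when the
 * runtime observation (reduced here to a flag) supports specialization. */
static void specialize(codeunit *instr, int can_specialize)
{
    if (can_specialize) {
        instr[0] = OP_SPECIALIZED;      /* rewrite the instruction */
    }
    else {
        instr[1] = 8;                   /* back off: reset the counter */
    }
}

/* One execution step of the adaptive instruction. */
static void execute_adaptive(codeunit *instr, int can_specialize)
{
    if (instr[1] == 0) {
        specialize(instr, can_specialize);
    }
    else {
        instr[1]--;                     /* still warming up */
    }
    /* ... then execute the (possibly just rewritten) implementation ... */
}
```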
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instructions that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries;
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
+
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`,
+where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms:
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimization
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`,
+in which case we have `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
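The formula above can be made concrete with a worked example. The timings and counts below are made up purely for illustration; they are not measurements of any real instruction family.

```c
#include <assert.h>

/* Worked example of the Tadaptive formula with invented numbers for a
 * family with two specialized forms.  Times are in arbitrary units. */

static double t_adaptive(const double *ti, const double *ni, int n,
                         double t_miss, double n_miss)
{
    /* Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss) */
    double num = t_miss * n_miss;
    double den = n_miss;
    for (int i = 0; i < n; i++) {
        num += ti[i] * ni[i];
        den += ni[i];
    }
    return num / den;
}
```

With hypothetical values `Ti = {2, 3}`, `Ni = {900, 99}`, `Tmiss = 20`, `Nmiss = 1` and `Tbase = 10`, this gives `Tadaptive = 2117/1000 = 2.117`, a speedup of roughly 4.7x, illustrating how rare misses barely dilute the benefit.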
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low, which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should be as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that the incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+be performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
+
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+
+1. A sequence of guards, each of the form
+   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+   a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
+
+### Maintaining stats
+
+Finally, take care that stats are gathered correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
+
 Additional resources
 --------------------

InternalDocs/tier2.md

Lines changed: 4 additions & 4 deletions

@@ -3,10 +3,10 @@
 The [basic interpreter](interpreter.md), also referred to as the `tier 1`
 interpreter, consists of a main loop that executes the bytecode instructions
 generated by the [bytecode compiler](compiler.md) and their
-[specializations](adaptive.md). Runtime optimization in tier 1 can only be
-done for one instruction at a time. The `tier 2` interpreter is based on a
-mechanism to replace an entire sequence of bytecode instructions, and this
-enables optimizations that span multiple instructions.
+[specializations](interpreter.md#specialization). Runtime optimization in tier 1
+can only be done for one instruction at a time. The `tier 2` interpreter is
+based on a mechanism to replace an entire sequence of bytecode instructions,
+and this enables optimizations that span multiple instructions.

 ## The Optimizer and Executors