Skip to content

Calls to memcpy are slow for 2^k+1 elements (AMD-specific) #137121

@bjorn-martinsson

Description

@bjorn-martinsson

Bug report

Bug description:

Consider the following two Python programs that make 14 copies of a big list

([0] * 262145) * 14
([0] * 265665) * 14

On both of my windows computers, running the latest release of CPython, version 3.13.5, it takes 5 times longer to run the first program (even though its size is smaller). But on older versions of CPython, such as 3.10.11, both programs run equally fast. From my testing, this seems to be an issue only affecting windows. More investigation points to this being an issue for certain AMD processors on both windows and linux.

I also noticed that PyPy has the same issue, starting with version PyPy v7.3.17. I've not been able to reproduce it in versions <= PyPy v7.3.16.

As for what is happening, I am not sure. I think it has to do with 262145=2^18 + 1 being one off from a power of two, which somehow makes copying the list super slow.

Edit: The culprit seems to be calls to memcpy(dst, src, count) when dst-src is slightly larger than a power of two. The algorithm used by the current versions of both CPython and PyPy to copy lists causes this to happen over and over whenever a list of size 2^k + 1 is being copied, while the previous algorithm did not. This explains why I've been unable to reproduce the bug in older versions of CPython and PyPy. See microsoft/STL#5658 for more details about the issue of memcpy being slow.

CPython versions tested on:

3.14, 3.13

Operating systems tested on:

Windows, Linux

Metadata

Metadata

Assignees

No one assigned

    Labels

    interpreter-core(Objects, Python, Grammar, and Parser dirs)pendingThe issue will be closed if no feedback is providedperformancePerformance or resource usagetype-bugAn unexpected behavior, bug, or error

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions