-
Notifications
You must be signed in to change notification settings - Fork 203
Use calloc rather than malloc + memset #519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended as an optimization for page-size-or-bigger allocations, because of the expectation that calloc will mmap the memory?
assert(count != 0 && size != 0); | ||
|
||
/* When malloc_limit is 0 (unlimited), malloc_limit - 1 will be SIZE_MAX. */ | ||
if (unlikely(s->malloc_size + (count * size) > s->malloc_limit - 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count * size
can overflow, you should probably handle that:
if (size > 0) // I know you have an assert a few lines up but just in case
if (count != (count * size) / size)
return NULL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting tidbit... I was going to suggest to do the overflow check like this:
if (size > 0)
if (count > SIZE_MAX / size)
return NULL;
...on the assumption it's faster because it's replacing a variable (count
) with a constant (SIZE_MAX
) but... it's not!
gcc at -O3 compiles it to mul + jo (jump if overflow) in both cases, but in the SIZE_MAX code, it prepends a test + jz (jump if zero). Must be a missed optimization because omitting the test + jz doesn't affect correctness.
Even more interestingly: gcc replaces malloc+memset with calloc calls!
Code:
void *alloc(size_t m, size_t n) {
#ifdef DIVIDE_CONSTANT
if (n > 0 && m > (size_t)-1 / n) return 0;
#else
if (n > 0 && m != (m * n) / n) return 0;
#endif
void *p = malloc(m*n);
if (p) memset(p, 0, m*n);
return p;
}
Assembly:
// -DDIVIDE_CONSTANT
0000000000001170 <alloc>:
1170: f3 0f 1e fa endbr64
1174: 48 85 f6 test %rsi,%rsi
1177: 74 08 je 1181 <alloc+0x11>
1179: 48 89 f8 mov %rdi,%rax
117c: 48 f7 e6 mul %rsi
117f: 70 0e jo 118f <alloc+0x1f>
1181: 48 0f af fe imul %rsi,%rdi
1185: be 01 00 00 00 mov $0x1,%esi
118a: e9 c1 fe ff ff jmp 1050 <calloc@plt>
118f: 31 c0 xor %eax,%eax
1191: c3 ret
// no -DDIVIDE_CONSTANT
0000000000001170 <alloc>:
1170: f3 0f 1e fa endbr64
1174: 48 89 f0 mov %rsi,%rax
1177: 48 f7 e7 mul %rdi
117a: 48 89 c7 mov %rax,%rdi
117d: 70 0a jo 1189 <alloc+0x19>
117f: be 01 00 00 00 mov $0x1,%esi
1184: e9 c7 fe ff ff jmp 1050 <calloc@plt>
1189: 31 c0 xor %eax,%eax
118b: c3 ret
assert(count != 0 && size != 0); | ||
|
||
/* When malloc_limit is 0 (unlimited), malloc_limit - 1 will be SIZE_MAX. */ | ||
if (unlikely(s->malloc_size + (count * size) > s->malloc_limit - 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
Some (non-scientific) results:
## macOS before
kraken-1.0/ai-astar-data.js (0.0487 seconds) Total (ms): 18922.3160 kraken-1.1/ai-astar-data.js (0.0396 seconds) Total (ms): 18878.1750 sunspider-1.0/3d-cube.js (0.0338 seconds) Total (ms): 766.0570 Richards : 1253
|
KInda yeah, that since at least one such allocation is a potential hot path (object shape allocation) maybe this would help. My results (see comment) don't seem conclusive though! |
Looks promising.
|
Thank you! |
The way libcs implement calloc can deceive short-lived benchmarks. Canonical implementation: void *calloc(size_t sz, size_t n) {
// overflow checking elided
void *p = malloc(sz * n);
if (p && (isreused(p) || !ismmapped(p))) memset(p, 0, sz * n);
return p;
} That is, libcs know when the allocation comes from a fresh all-zeroes page and omit the memset. Short-lived benchmarks are going to hit that condition often, making calloc look faster than it really it is vs. malloc+memset. Short-lived programs are a good use case for quickjs though, so it's likely still a worthwhile optimization. Bursty programs (allocate a lot, release a lot, in cycles) probably stand to benefit, too, if libc unmaps the memory between cycles. |
Thanks for the explainer! I'll fix the suggestions then! |
0f261df
to
0437e0b
Compare
@bnoordhuis Updated! |
This is good investigative work, we should document the implementation of |
I'll add a link to this issue as a comment. |
Call to our free wrapper when size is 0.
No description provided.