Conversation

Member

@cfriedt cfriedt commented Dec 12, 2022

The previous method used to calculate seconds in clock_gettime() seemed to have an inaccuracy that grew with time, causing the seconds to be off by an order of magnitude when ticks would roll over.

Edit: technically, this is likely a bug in z_tmcvt().

Fixes #52975

@zephyrbot zephyrbot added area: Kernel area: POSIX POSIX API Library labels Dec 12, 2022
@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch 2 times, most recently from add4518 to 1e4c037 Compare December 12, 2022 14:38
@cfriedt cfriedt requested a review from nordic-krch December 12, 2022 14:40
Member Author

cfriedt commented Dec 12, 2022

cc @AndreyDodonov-EH, @too1

@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch 3 times, most recently from a449e12 to f49e7bc Compare December 12, 2022 17:40
Member Author

cfriedt commented Dec 12, 2022

  • Whoops: I turned off automatic code formatting to reduce the scope of the diff, but then left a dangling `{`. It's fixed now.

@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch from f49e7bc to ae2ddbc Compare December 15, 2022 19:00
@zephyrbot zephyrbot added the area: Testsuite Testsuite label Dec 15, 2022
@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch 2 times, most recently from 1976dc3 to 9fb9849 Compare December 15, 2022 19:15
@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch from 9fb9849 to 5a43db6 Compare December 16, 2022 05:03
Contributor


This doesn't look right. The conversion function should be precise and reduce to the division already on all systems where that's possible. The bug is in the second line I think, where it feeds the result back into the reverse conversion; that's going to lose bits in configurations where the tick rate isn't divisible by ms. Can you check that just the second line fixes the issue?

Also... I haven't pulled up the whole file to check, but I have a hard time believing that that nremainder value is anything but a footgun in an implementation of clock_gettime(). There shouldn't be any state held, and if we want higher precision values we should start there instead of back-converting from coarser units like seconds.

My notion of an idealized implementation here should be as simple as uint64_t ns = k_ticks_to_us_floor64(k_uptime_ticks()), and then just doing division/modulus ops to extract tv_nsec and tv_sec.

Member Author

@cfriedt cfriedt Dec 17, 2022


@andyross - with just the second line, we get the failure below, which is off by an order of magnitude (on qemu_x86_64). Perhaps that's an indicator that z_tmcvt() needs a testsuite (Edit: I added a minimal one under tests/unit/time_units).

START - test_clock_gettime_rollover
CONFIG_SYS_CLOCK_TICKS_PER_SEC: 100
rollover_s: 184467440737095516
t-1: {18446744073709551, 596000000}

To me, straight division to get elapsed_secs makes sense.

The ticks value must always be >= elapsed_secs * CONFIG_SYS_CLOCK_TICKS_PER_SEC, so it looks like the nremainder calculation is a (potentially) cheaper way to get ticks % CONFIG_SYS_CLOCK_TICKS_PER_SEC.

Directly below, we use k_ticks_to_ns_floor32() to convert nremainder to nanoseconds, which seems ok to me.

Can you explain your issue in a bit more detail? And actually, this is the sort of thing that we should probably have a test for, so I'll implement a test for the scenario you outline if you can provide some additional detail.

Contributor


IMHO, the routines in time_units.h should include a set of k_ticks_to_sec_*() functions. There are milli, micro, and nano variants, but not full seconds. I guess the thought was that conversion to seconds is simple enough that one should write out the division, as is done here. But as soon as you toss in rates determined at run time, or non-integral clock speeds, it would be nice to not make everyone write their own code for it.

Contributor


My notion of an idealized implementation here should be as simple as uint64_t ns = k_ticks_to_us_floor64(k_uptime_ticks()),

You meant _to_ns of course. This expression does overflow before a 64-bit cycle counter overflows, for clock rates < 1 GHz. It does take some 584 years to do this. So realistically, it ought to be good enough for any reasonable value of a timeout or time delta or timestamp.

However, @cfriedt's test looks at values around where the cycle counter overflows, which is large enough to trigger the overflow.

@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch 3 times, most recently from 32c85b6 to fc4e2e0 Compare December 17, 2022 22:17
Member Author

cfriedt commented Dec 17, 2022

@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch from fc4e2e0 to 4cad493 Compare December 17, 2022 22:42
@cfriedt cfriedt requested a review from andyross December 18, 2022 17:06
The previous method used to calculate seconds in `clock_gettime()`
seemed to have an inaccuracy that grew with time, causing the
seconds to be off by an order of magnitude when ticks would roll
over.

This change fixes the method used to calculate seconds.

Signed-off-by: Chris Friedt <[email protected]>
@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch from 4cad493 to 53d9b05 Compare December 19, 2022 12:38
Member Author

cfriedt commented Dec 19, 2022

Member Author

cfriedt commented Dec 20, 2022

@andyross - if there is a doubt in your mind, this should probably help. Almost looks like a Christmas tree, so Merry Christmas (if you celebrate)! 🎅 ❄️ ☃️ 🎄
[attachment: Screen Shot 2022-12-20 at 9.54.01 AM]
[attachment: generator.cpp.txt]

That's with CONFIG_SYS_CLOCK_TICKS_PER_SEC=100. The error factor vanishes with CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000 or CONFIG_SYS_CLOCK_TICKS_PER_SEC=10000. However, it is exacerbated when CONFIG_SYS_CLOCK_TICKS_PER_SEC=10.

So with ticks/s = 100, the error factor is 10, but when ticks/s = 10, the error factor is 100. There is no error when using the method in this PR.

I'm kind of concerned about where the order of magnitude difference comes from in the old method.

cc @AndreyDodonov-EH, @too1

@cfriedt cfriedt added bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug labels Dec 20, 2022
Chris Friedt added 2 commits December 26, 2022 07:20
Accurate timekeeping is something that is often taken for granted.

However, reliability of timekeeping code is critical for most core
and subsystem code. Furthermore, many higher-level timekeeping
utilities in Zephyr work off of ticks, but there is no way to modify
ticks directly, which forces either unnecessary delays in test code
or non-ideal compromises in test coverage.

Since timekeeping is so critical, there should be as few barriers
to testing timekeeping code as possible, while preserving
integrity of the kernel's public interface.

With this, we expose `sys_clock_tick_set()` as a system call only
when `CONFIG_ZTEST` is set, declared within the ztest framework.

Signed-off-by: Chris Friedt <[email protected]>
Add a test to verify previous fixes when calculating time in
seconds for `clock_gettime()`.

Signed-off-by: Chris Friedt <[email protected]>
@cfriedt cfriedt force-pushed the add-test-for-clock-gettime-overflows branch from 53d9b05 to 662f37d Compare December 26, 2022 12:21
Member Author

cfriedt commented Dec 26, 2022

  • use uint64_t instead of int64_t in sys_clock_tick_set() - no reason it should be signed

Contributor

xyzzy42 commented Dec 26, 2022

That's with CONFIG_SYS_CLOCK_TICKS_PER_SEC=100. The error factor vanishes with CONFIG_SYS_CLOCK_TICKS_PER_SEC=1000 or CONFIG_SYS_CLOCK_TICKS_PER_SEC=10000. However, it is exacerbated when CONFIG_SYS_CLOCK_TICKS_PER_SEC=10.

I don't get how those numbers came from CONFIG_SYS_CLOCK_TICKS_PER_SEC=100?

    const uint64_t rollover_s = UINT64_MAX / CONFIG_SYS_CLOCK_TICKS_PER_SEC;
    …
    curr_tick = rollover_s * CONFIG_SYS_CLOCK_TICKS_PER_SEC;
    cout << curr_tick << "\t" << method1() << "\t" << method2() << endl;

Shouldn't the final value of curr_tick be 18446744073709551600 and not 18446744073709500000?

Anyway, the existing code has an obvious bug if CONFIG_SYS_CLOCK_TICKS_PER_SEC is less than 1000. Converting to ms as an (unnecessary) intermediate step to seconds overflows, even though neither the tick count nor the time in seconds will overflow. This is because ms will be a larger number than ticks.

Another bug is in the test code. It finds the second in which the 64-bit tick counter overflows. But the result will be stored in a timespec. This is allowed to have a 32-bit signed seconds field. This will overflow before the tick counter unless CONFIG_SYS_CLOCK_TICKS_PER_SEC is > 2,000,000,000. In other words, rollover_s can overflow ts->tv_sec.

Member Author

cfriedt commented Dec 27, 2022

I don't get how those numbers came from CONFIG_SYS_CLOCK_TICKS_PER_SEC=100?

    const uint64_t rollover_s = UINT64_MAX / CONFIG_SYS_CLOCK_TICKS_PER_SEC;
    …
    curr_tick = rollover_s * CONFIG_SYS_CLOCK_TICKS_PER_SEC;
    cout << curr_tick << "\t" << method1() << "\t" << method2() << endl;

Shouldn't the final value of curr_tick be 18446744073709551600 and not 18446744073709500000?

That's only to get an arbitrary curr_tick less than rollover_s (with up to 1s of ticks to spare).

However, I realized that my for-loop boundary conditions were not in correct units. Still illustrates the issue in any case.

for(curr_tick = 0; curr_tick < rollover_s

Anyway, the existing code has an obvious bug if CONFIG_SYS_CLOCK_TICKS_PER_SEC is less than 1000. Converting to ms as an (unnecessary) intermediate step to seconds overflows, even though neither the tick count nor the time in seconds will overflow. This is because ms will be a larger number than ticks.

👍

Another bug is in the test code. It finds the second in which the 64-bit tick counter overflows. But the result will be stored in a timespec. This is allowed to have a 32-bit signed seconds field. This will overflow

Given that Zephyr will own the POSIX headers, that's a good point. We should enforce that it's 64-bit (a recommendation for avoiding the year-2038 problem). It's not something that will be fixed here.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/time.html

Thanks for the review.

@carlescufi carlescufi merged commit 8e4137f into zephyrproject-rtos:main Jan 4, 2023
Contributor

xyzzy42 commented Jan 4, 2023

The overflow test doesn't respect CONFIG_SYS_CLOCK_MAX_TIMEOUT_DAYS. It assumes the 64-bit tick counter is the only possible overflow, but I don't think that is guaranteed.

@cfriedt cfriedt deleted the add-test-for-clock-gettime-overflows branch January 24, 2023 14:46
Successfully merging this pull request may close these issues:

posix: clock: current method of capturing elapsed time leads to loss in seconds