Skip to content

Conversation

@marcpaterno
Copy link

This small addition allows as.integer64.character to handle hexidecimal printed representation of large integers, in the style of as.integer in base R.

@MichaelChirico
Copy link
Collaborator

thanks for the PR! the idea seems great!

would you mind

  1. filing an issue that motivates the change, including the current failure mode / (optional) known workarounds
  2. including a test that actually touches the 64-bit range, i.e. that wouldn't fit in 32-bit integers (or I guess that's covered already?, I never remember what the exact boundary is... maybe make it at least 2^33 for clarity)
  3. include a few more tests, e.g. of negative #s, hex #s that don't fit in 64 bits, any edge cases you can think of

@MichaelChirico
Copy link
Collaborator

WDYT about just letting strtoll() handle this by setting base=0 instead?

https://en.cppreference.com/w/c/string/byte/strtol

@MichaelChirico
Copy link
Collaborator

I think the issue with just doing base=0 is it will not be back-compatible (c.f. as.integer64('011'), which will be 11LL on main but 9LL with base=0).

We can easily imagine there's ppl working with long-string identifiers relying on the old behavior.

That said, I think we can do this?

-ret[i] = strtoll(str, &endpointer, 10);
+int negative=str[0] == '-';
+if (!strncmp(str+negative, "0x", 2) || !strncmp(str+negative,"0X", 2)) {
+  ret[i] = strtoll(str, &endpointer, 16);
+} else {
+  ret[i] = strtoll(str, &endpointer, 10);
+}

@marcpaterno
Copy link
Author

I think the issue with just doing base=0 is it will not be back-compatible (c.f. as.integer64('011'), which will be 11LL on main but 9LL with base=0).

We can easily imagine there's ppl working with long-string identifiers relying on the old behavior.

That said, I think we can do this?

-ret[i] = strtoll(str, &endpointer, 10);
+int negative=str[0] == '-';
+if (!strncmp(str+negative, "0x", 2) || !strncmp(str+negative,"0X", 2)) {
+  ret[i] = strtoll(str, &endpointer, 16);
+} else {
+  ret[i] = strtoll(str, &endpointer, 10);
+}

I have tested this code and it yields different behavior in one edge case. I thought I had picked the appropriate behavior, but you can of course correct me.

In the case of handling the string "-0x8000000000000000", my original code yielded -9223372036854775807, the first value in the result of lim.integer64(). Modified as you propose above, the result is "NA".

I will add the new testing (that includes the handling of input "0") to the PR, and await your decision on the desired behavior for handling "-0x8000000000000000".

@MichaelChirico
Copy link
Collaborator

Let's pick whatever strtoll("-0x8000000000000000", &endpointer, 16) does, I think it's best to be consistent with & as close to the well-tested underlying C library routine.

@MichaelChirico
Copy link
Collaborator

please add a NEWS entry btw!

@marcpaterno
Copy link
Author

I do not think it is possible to match the behavior of strtoll with the current representation of integer64.
strtoll will read "-9223372036854775808" and represent it exactly. This value is 1 smaller than what lim.integer64() indicates can be represented by integer64. strtoll will read "-9223372036854775809" and yield the value -9223372036854775808.

How would you like to proceed?

@MichaelChirico
Copy link
Collaborator

Oh, great point. That's because -922...808 is how NA_integer64_ is represented (just as NA_integer_ is represented as -214...647):

#define NA_INTEGER64 LLONG_MIN

In fact you've found what I think is kind of buggy behavior in as.integer64.character for decimal string input, too: #175.

Part of the issue is, there's no 128-bit double to case into with full fidelity to work with these out-of-range values before casting back down to 64 bits; that's what's done in the 32-bit case by R:

https://github.com/r-devel/r-svn/blob/c8ab68478c5d2322956e19764c20c092916133cd/src/main/util.c#L2112

Anyway, for now, I think we should aim to be consistent between hex & decimal strings, and to bitstrings & return NA_integer64_ for -9223372036854775808

# as.bitstring(lim.integer64())
as.integer64(structure("1000000000000000000000000000000000000000000000000000000000000000", class='bitstring'))
# integer64
# [1] <NA>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants