Skip to content

Conversation

@travisstaloch
Copy link
Contributor

  • similar to scanf in c but with curly specifiers.
  • initial support for c, d, e, x, o, b, s specifiers.

Initial draft, please review.

I was bored solving some old advent of code problems and wrote this. I think the community has been wanting this in zig std lib for a while.

Perhaps this should depend on std.fmt.parseInt() rather than rolling my own parseInt()? This parseInt() takes a out pointer and directly assigns to it. Open to suggestions here and in general.

- similar to scanf in c but with curly specifiers.
- initial support for c, d, e, x, o, b, s specifiers.
@data-man
Copy link
Contributor

Crazy idea (thanks to new @Type): parse to any type.
E.g.

const myStruct = struct {
   u: u32,
   i: i32
};

will transformed to

struct {
   u: ?u32,
   i: ?i32
};

And parse to this new type.

@travisstaloch
Copy link
Contributor Author

I like this idea. So this would look like the following and return an instance of the given type?

try scan("42", "{}", struct{a: u32})

Only problem would be parsing strings / slices. This would require allocating memory. Maybe thats the right direction to go?

@data-man
Copy link
Contributor

data-man commented Nov 19, 2020

Only problem would be parsing strings / slices. This would require allocating memory.

Why?
E.g. mem.split does not require allocations.

@travisstaloch
Copy link
Contributor Author

travisstaloch commented Nov 19, 2020

Oh yeah. I guess not. A slice would just point to the input.

This would also allow setting the length of the slice. Currently the length of the parsed substring is lost. It would be easy to check for a .len field and set it.

@travisstaloch
Copy link
Contributor Author

struct {
   u: ?u32,
   i: ?i32
};

Not sure this is necessary, Just return a regular instance of the struct on success and error otherwise. No need for nullables I think.

@data-man
Copy link
Contributor

I also suggest thinking in advance about more flexible patterns.
E.g. Lua patterns are well-proven and widely used: Format Strings for Pack and Unpack

@data-man
Copy link
Contributor

Not sure this is necessary, Just return a regular instance of the struct on success and error otherwise.

And don't return the size?

@travisstaloch
Copy link
Contributor Author

travisstaloch commented Nov 19, 2020

It looks I misread scanf's return value. Rather than the number of bytes matched, its actually the number of items matched.

On  success,  these functions return the number of input items successfully 
matched and assigned; this can be fewer than provided for, or even zero, 
in the event of an early matching failure.

Maybe returning a new struct type would be better:

struct {
   items_matched: usize,
   items: T
}

This would indicate that the first items_matched fields are set, and the remainder are undefined?

Not sure if this is any better than using the nullable fields suggestion...

@travisstaloch
Copy link
Contributor Author

travisstaloch commented Nov 19, 2020

after a little thought, nullable fields seem like a better option to me. this would also allow for named specifiers in addition to in-order field assignment. maybe an items_matched field is also valuable. it just seems like a returned object with nullable fields would be easier to work with and more flexible.

inline for (fmt) |fmt_c, fmt_i| {
var neg = false;
if (i == std.math.maxInt(usize)) return error.InputMatchFailure;
switch (state) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the parser in #6411, perhaps we can extract it and use in both the print and scan features.
The idea is to support the width & precision specifiers to limit the amount of scanned input, otherwise the user has no control over how much each specifier will munch.

Copy link
Contributor Author

@travisstaloch travisstaloch Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reviews. This seems like the best way forward to me, extracting and re-using the existing parser. I'll take a closer look at that pr. Let me know if you have an idea of what this might look like once its extracted. A struct with method fields? Or maybe just fmt.format() but with method argument? I hope to work more on this tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it depends on what the specification for scan is. I doubt the alignment and fill specifiers are needed and the only place where I can see a width specifier to be useful is with the c specifier.
If you can live with this reduced set of options I'd simply extract the parser (and arg_state) struct and all you have to do is copy-paste the inline while in format and modify it to suit your needs.

Copy link
Contributor Author

@travisstaloch travisstaloch Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit that addresses many of these points. Haven't handled precision yet, but initial width support is there. There are many TODOs in the code. For one, I wasn't sure what to do if {c} has a width or precision.

'}' => if (fmt_i > 0 and fmt[fmt_i - 1] == '}') {
if (input[i] == '}') {
i += 1;
} else i = std.math.maxInt(usize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string is being parsed at compile time, you can simply stop the compilation with a @compileError telling the user what's wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for all the error cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in new commit

i += 1;
} else i = std.math.maxInt(usize);
},
else => if (std.ascii.isSpace(fmt_c)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need a comptime prefix, the isSpace call should not end up in the final binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in last commit

}

if (i < input.len) {
if (input[i] == '-') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're eating the trailing -/+ even if the argument is not a number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the last commit solves this. still need to add a test for this.

var arg = args[arg_i];
switch (fmt_c) {
'}', 'd' => {
if (i < input.len) i += try parseInt(input[i..], 10, std.ascii.isDigit, arg, neg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should split the scanning from the parsing, the latter is done by fmt.parseInt so you only have to do the former here.

Write some small helper routines such as scanSigned, scanUnsigned, scanFloat (remember of "weird" values such as NaN, infinite and their different casing/abbreviations), scanBool to return a slice of text to parse and update the current input position.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs work as i'm still mixing the scanning and parsing in new commit (look for is_negative).

's' => {
var si: usize = 0;
while (i < input.len) {
if (std.ascii.isSpace(input[i]) or (fmt_i + 2 < fmt.len and input[i] == fmt[fmt_i + 2])) break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's input[i] == fmt[fmt_i + 2] checking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i removed this behavior in new commit. it was allowing for a fmt like "{s}[{s}]" (with no spaces) to work. without this, everything (including square brackets) until a space will be gobbled up.

if (arg_i >= args.len) return error.TooFewArguments;
var arg = args[arg_i];
switch (fmt_c) {
'}', 'd' => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each specifier must check that arg is a pointer to the correct type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also believe this is happening now.

var si: usize = 0;
while (i < input.len) {
if (std.ascii.isSpace(input[i]) or (fmt_i + 2 < fmt.len and input[i] == fmt[fmt_i + 2])) break;
arg[si] = input[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, s is returning a copy of the string? Not a slice of the input?
You're not checking arg's type nor its length, this is dangerous either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attempted to address in new commit

},
}
}
if (arg_i < args.len) return error.TooManyArguments;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this can be statically checked at compile time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in new commit

@VojtechStep
Copy link

I don't think transforming the types to nullables is the better option here. If the programmer wants to scan the string for a struct, I would assume they want the struct to be returned. If they had an assortment of options that could optionally be set, shouldn't those fields be optional in the original struct to begin with?

- ArgsState now a type function with an args_len parameter
@LemonBoy
Copy link
Contributor

My advice here is to start from the documentation, write down a few lines about what specifiers are allowed, what the precision/width modifiers do, how they behave, how are literal pieces of the format string matched, how are the whitespaces treated (and what you consider as whitespace).

For one, I wasn't sure what to do if {c} has a width or precision.

Well I'd say width specifies the maximum amount of characters you can scan, precision can be ignored here.

@travisstaloch
Copy link
Contributor Author

travisstaloch commented Nov 20, 2020

I think this commit introduced a spin-loop somewhere. Locally

$ zig build test-std -Dskip-non-native
# ...
# many successes/ skips
# ...
Test [637/1619] event.loop.test "std-native-ReleaseFast-bare-multi std.event.Loop - sleep"...SKIP
Test [658/1619] fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic"... ^C

Output hangs here before i killed it.

I don't understand what the multiple sets of tests are, here's another output with test filter just for my scan tests

$ zig build test-std -Dskip-non-native -Dtest-filter=scan
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
Test [2/8] fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic"... 

(this spin-loops too)

@travisstaloch
Copy link
Contributor Author

travisstaloch commented Nov 20, 2020

Ok, it looks like this commit is spin-looping in release-fast and release-small modes.

$ zig build test-std -Dskip-non-native=true -Dtest-filter=scan -Dskip-release-fast -Dskip-release-small
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
# no hang

I have no idea why... any ideas?

@travisstaloch
Copy link
Contributor Author

Here's a backtrace from gdb:

Program received signal SIGINT, Interrupt.
0x000000000020f0a0 in io.fixed_buffer_stream.FixedBufferStream([]const u8).read (self=0x7fffffffddb0, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/fixed_buffer_stream.zig:67
67	            const size = std.math.min(dest.len, self.buffer.len - self.pos);
(gdb) bt
#0  0x000000000020f0a0 in io.fixed_buffer_stream.FixedBufferStream([]const u8).read (self=0x7fffffffddb0, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/fixed_buffer_stream.zig:67
#1  io.reader.Reader(*io.fixed_buffer_stream.FixedBufferStream([]const u8),io.fixed_buffer_stream.ReadError,io.fixed_buffer_stream.FixedBufferStream([]const u8).read).read
    (self=..., buffer=...) at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/reader.zig:32
#2  io.peek_stream.PeekStream(fifo.LinearFifoBufferType { .Static = 4},io.reader.Reader(*io.fixed_buffer_stream.FixedBufferStream([]const u8),io.fixed_buffer_stream.ReadError,io.fixed_buffer_stream.FixedBufferStream([]const u8).read)).read (self=0x7fffffffdd78, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/peek_stream.zig:71
#3  fmt.readByte (reader=0x7fffffffdd78) at fmt.zig:1983
#4  fmt.matchByteFn (reader=0x7fffffffdd78) at fmt.zig:1999
#5  fmt.scanReader (args=..., backing_reader=...) at fmt.zig:2076
#6  fmt.scan (input=..., args=...) at fmt.zig:2025
#7  fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic" () at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/fmt/scan.zig:9
#8  0x0000000000218bf7 in std.special.main () at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/special/test_runner.zig:61
#9  start.callMain () at start.zig:334
#10 start.initEventLoopAndCallMain () at start.zig:276
#11 start.callMainWithArgs (argc=<optimized out>, argv=<optimized out>, envp=...) at start.zig:239
#12 start.posixCallMainAndExit () at start.zig:230
#13 0x000000000021849f in start._start () at start.zig:162

@travisstaloch
Copy link
Contributor Author

Looks like something related to the io.peekStream. I tried upping the peekStream size from 4 to 16 but its still spinning. I hope the ci job will time out. Its been running more than an hour now i think.

- all previous tests passing
- added tests for reader, bools, floats (-nan, -inf), width
- initial width support for int, float, slice types
- expect mutable slice pointer and assign its length, don't write beyond its length
- try to stop eating +/-
- ScanErrors only add InputMatchFailure, NegationOfUnsignedInteger, and ConversionFailure
  - other previous errors are now mis-compilations
- added scanReader(). scan() is now just a wrapper around scanReader with fixedBufferStream
  - scanReader uses an io.peekStream for look ahead
- add comptime prefix to isSpace
'{' => {
if (start_index < i) {
//try writer.writeAll(fmt[start_index..i]);
// TODO not sure what to do here
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LemonBoy any idea what should be done here? I don't understand what it means when start_index < i.

@Vexu Vexu added the standard library This issue involves writing Zig code for the standard library. label Dec 25, 2020
@andrewrk
Copy link
Member

I'm cleaning up old drafts, feel free to re-open if you are still working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

standard library This issue involves writing Zig code for the standard library.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants