Introduce std.fmt.scan - Draft #7158

travisstaloch · 2020-11-18T23:11:41Z

similar to scanf in c but with curly specifiers.
initial support for c, d, e, x, o, b, s specifiers.

Initial draft, please review.

I was bored solving some old advent of code problems and wrote this. I think the community has been wanting this in zig std lib for a while.

Perhaps this should depend on std.fmt.parseInt() rather than rolling my own parseInt()? This parseInt() takes a out pointer and directly assigns to it. Open to suggestions here and in general.

- similar to scanf in c but with curly specifiers. - initial support for c, d, e, x, o, b, s specifiers.

data-man · 2020-11-19T01:55:40Z

Crazy idea (thanks to new @Type): parse to any type.
E.g.

const myStruct = struct {
   u: u32,
   i: i32
};

will transformed to

struct {
   u: ?u32,
   i: ?i32
};

And parse to this new type.

travisstaloch · 2020-11-19T02:02:42Z

I like this idea. So this would look like the following and return an instance of the given type?

try scan("42", "{}", struct{a: u32})

Only problem would be parsing strings / slices. This would require allocating memory. Maybe thats the right direction to go?

data-man · 2020-11-19T02:08:22Z

Only problem would be parsing strings / slices. This would require allocating memory.

Why?
E.g. mem.split does not require allocations.

travisstaloch · 2020-11-19T02:10:00Z

Oh yeah. I guess not. A slice would just point to the input.

This would also allow setting the length of the slice. Currently the length of the parsed substring is lost. It would be easy to check for a .len field and set it.

travisstaloch · 2020-11-19T02:13:19Z

struct {
   u: ?u32,
   i: ?i32
};

Not sure this is necessary, Just return a regular instance of the struct on success and error otherwise. No need for nullables I think.

data-man · 2020-11-19T02:30:27Z

I also suggest thinking in advance about more flexible patterns.
E.g. Lua patterns are well-proven and widely used: Format Strings for Pack and Unpack

data-man · 2020-11-19T02:33:43Z

Not sure this is necessary, Just return a regular instance of the struct on success and error otherwise.

And don't return the size?

travisstaloch · 2020-11-19T02:49:43Z

It looks I misread scanf's return value. Rather than the number of bytes matched, its actually the number of items matched.

On  success,  these functions return the number of input items successfully 
matched and assigned; this can be fewer than provided for, or even zero, 
in the event of an early matching failure.

Maybe returning a new struct type would be better:

struct {
   items_matched: usize,
   items: T
}

This would indicate that the first items_matched fields are set, and the remainder are undefined?

Not sure if this is any better than using the nullable fields suggestion...

travisstaloch · 2020-11-19T03:41:34Z

after a little thought, nullable fields seem like a better option to me. this would also allow for named specifiers in addition to in-order field assignment. maybe an items_matched field is also valuable. it just seems like a returned object with nullable fields would be easier to work with and more flexible.

LemonBoy · 2020-11-19T07:50:18Z

lib/std/fmt/scan.zig

+    inline for (fmt) |fmt_c, fmt_i| {
+        var neg = false;
+        if (i == std.math.maxInt(usize)) return error.InputMatchFailure;
+        switch (state) {


I rewrote the parser in #6411, perhaps we can extract it and use in both the print and scan features.
The idea is to support the width & precision specifiers to limit the amount of scanned input, otherwise the user has no control over how much each specifier will munch.

Thank you for the reviews. This seems like the best way forward to me, extracting and re-using the existing parser. I'll take a closer look at that pr. Let me know if you have an idea of what this might look like once its extracted. A struct with method fields? Or maybe just fmt.format() but with method argument? I hope to work more on this tomorrow.

Well, it depends on what the specification for scan is. I doubt the alignment and fill specifiers are needed and the only place where I can see a width specifier to be useful is with the c specifier.
If you can live with this reduced set of options I'd simply extract the parser (and arg_state) struct and all you have to do is copy-paste the inline while in format and modify it to suit your needs.

I just pushed a commit that addresses many of these points. Haven't handled precision yet, but initial width support is there. There are many TODOs in the code. For one, I wasn't sure what to do if {c} has a width or precision.

LemonBoy · 2020-11-19T07:52:08Z

lib/std/fmt/scan.zig

+                '}' => if (fmt_i > 0 and fmt[fmt_i - 1] == '}') {
+                    if (input[i] == '}') {
+                        i += 1;
+                    } else i = std.math.maxInt(usize);


The string is being parsed at compile time, you can simply stop the compilation with a @compileError telling the user what's wrong.

Ditto for all the error cases.

addressed in new commit

LemonBoy · 2020-11-19T07:53:51Z

lib/std/fmt/scan.zig

+                        i += 1;
+                    } else i = std.math.maxInt(usize);
+                },
+                else => if (std.ascii.isSpace(fmt_c)) {


You probably need a comptime prefix, the isSpace call should not end up in the final binary.

addressed in last commit

LemonBoy · 2020-11-19T07:56:07Z

lib/std/fmt/scan.zig

+                }
+
+                if (i < input.len) {
+                    if (input[i] == '-') {


You're eating the trailing -/+ even if the argument is not a number.

i think the last commit solves this. still need to add a test for this.

LemonBoy · 2020-11-19T08:01:35Z

lib/std/fmt/scan.zig

+                var arg = args[arg_i];
+                switch (fmt_c) {
+                    '}', 'd' => {
+                        if (i < input.len) i += try parseInt(input[i..], 10, std.ascii.isDigit, arg, neg);


You should split the scanning from the parsing, the latter is done by fmt.parseInt so you only have to do the former here.

Write some small helper routines such as scanSigned, scanUnsigned, scanFloat (remember of "weird" values such as NaN, infinite and their different casing/abbreviations), scanBool to return a slice of text to parse and update the current input position.

This still needs work as i'm still mixing the scanning and parsing in new commit (look for is_negative).

LemonBoy · 2020-11-19T08:03:45Z

lib/std/fmt/scan.zig

+                    's' => {
+                        var si: usize = 0;
+                        while (i < input.len) {
+                            if (std.ascii.isSpace(input[i]) or (fmt_i + 2 < fmt.len and input[i] == fmt[fmt_i + 2])) break;


What's input[i] == fmt[fmt_i + 2] checking?

i removed this behavior in new commit. it was allowing for a fmt like "{s}[{s}]" (with no spaces) to work. without this, everything (including square brackets) until a space will be gobbled up.

LemonBoy · 2020-11-19T08:05:47Z

lib/std/fmt/scan.zig

+                if (arg_i >= args.len) return error.TooFewArguments;
+                var arg = args[arg_i];
+                switch (fmt_c) {
+                    '}', 'd' => {


Each specifier must check that arg is a pointer to the correct type.

i also believe this is happening now.

LemonBoy · 2020-11-19T08:06:30Z

lib/std/fmt/scan.zig

+                        var si: usize = 0;
+                        while (i < input.len) {
+                            if (std.ascii.isSpace(input[i]) or (fmt_i + 2 < fmt.len and input[i] == fmt[fmt_i + 2])) break;
+                            arg[si] = input[i];


So, s is returning a copy of the string? Not a slice of the input?
You're not checking arg's type nor its length, this is dangerous either way.

attempted to address in new commit

LemonBoy · 2020-11-19T08:07:32Z

lib/std/fmt/scan.zig

+            },
+        }
+    }
+    if (arg_i < args.len) return error.TooManyArguments;


All of this can be statically checked at compile time.

addressed in new commit

VojtechStep · 2020-11-19T11:02:31Z

I don't think transforming the types to nullables is the better option here. If the programmer wants to scan the string for a struct, I would assume they want the struct to be returned. If they had an assortment of options that could optionally be set, shouldn't those fields be optional in the original struct to begin with?

- ArgsState now a type function with an args_len parameter

LemonBoy · 2020-11-20T10:01:25Z

My advice here is to start from the documentation, write down a few lines about what specifiers are allowed, what the precision/width modifiers do, how they behave, how are literal pieces of the format string matched, how are the whitespaces treated (and what you consider as whitespace).

For one, I wasn't sure what to do if {c} has a width or precision.

Well I'd say width specifies the maximum amount of characters you can scan, precision can be ignored here.

travisstaloch · 2020-11-20T10:04:54Z

I think this commit introduced a spin-loop somewhere. Locally

$ zig build test-std -Dskip-non-native
# ...
# many successes/ skips
# ...
Test [637/1619] event.loop.test "std-native-ReleaseFast-bare-multi std.event.Loop - sleep"...SKIP
Test [658/1619] fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic"... ^C

Output hangs here before i killed it.

I don't understand what the multiple sets of tests are, here's another output with test filter just for my scan tests

$ zig build test-std -Dskip-non-native -Dtest-filter=scan
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
Test [2/8] fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic"...

(this spin-loops too)

travisstaloch · 2020-11-20T10:18:40Z

Ok, it looks like this commit is spin-looping in release-fast and release-small modes.

$ zig build test-std -Dskip-non-native=true -Dtest-filter=scan -Dskip-release-fast -Dskip-release-small
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
All 8 tests passed.
# no hang

I have no idea why... any ideas?

travisstaloch · 2020-11-20T10:46:12Z

Here's a backtrace from gdb:

Program received signal SIGINT, Interrupt.
0x000000000020f0a0 in io.fixed_buffer_stream.FixedBufferStream([]const u8).read (self=0x7fffffffddb0, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/fixed_buffer_stream.zig:67
67	            const size = std.math.min(dest.len, self.buffer.len - self.pos);
(gdb) bt
#0  0x000000000020f0a0 in io.fixed_buffer_stream.FixedBufferStream([]const u8).read (self=0x7fffffffddb0, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/fixed_buffer_stream.zig:67
#1  io.reader.Reader(*io.fixed_buffer_stream.FixedBufferStream([]const u8),io.fixed_buffer_stream.ReadError,io.fixed_buffer_stream.FixedBufferStream([]const u8).read).read
    (self=..., buffer=...) at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/reader.zig:32
#2  io.peek_stream.PeekStream(fifo.LinearFifoBufferType { .Static = 4},io.reader.Reader(*io.fixed_buffer_stream.FixedBufferStream([]const u8),io.fixed_buffer_stream.ReadError,io.fixed_buffer_stream.FixedBufferStream([]const u8).read)).read (self=0x7fffffffdd78, dest=...)
    at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/io/peek_stream.zig:71
#3  fmt.readByte (reader=0x7fffffffdd78) at fmt.zig:1983
#4  fmt.matchByteFn (reader=0x7fffffffdd78) at fmt.zig:1999
#5  fmt.scanReader (args=..., backing_reader=...) at fmt.zig:2076
#6  fmt.scan (input=..., args=...) at fmt.zig:2025
#7  fmt.scan.test "std-native-ReleaseFast-bare-multi scan basic" () at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/fmt/scan.zig:9
#8  0x0000000000218bf7 in std.special.main () at /media/data/Users/Travis/Documents/Code/zig/zig/lib/std/special/test_runner.zig:61
#9  start.callMain () at start.zig:334
#10 start.initEventLoopAndCallMain () at start.zig:276
#11 start.callMainWithArgs (argc=<optimized out>, argv=<optimized out>, envp=...) at start.zig:239
#12 start.posixCallMainAndExit () at start.zig:230
#13 0x000000000021849f in start._start () at start.zig:162

travisstaloch · 2020-11-20T10:51:23Z

Looks like something related to the io.peekStream. I tried upping the peekStream size from 4 to 16 but its still spinning. I hope the ci job will time out. Its been running more than an hour now i think.

- all previous tests passing - added tests for reader, bools, floats (-nan, -inf), width - initial width support for int, float, slice types - expect mutable slice pointer and assign its length, don't write beyond its length - try to stop eating +/- - ScanErrors only add InputMatchFailure, NegationOfUnsignedInteger, and ConversionFailure - other previous errors are now mis-compilations - added scanReader(). scan() is now just a wrapper around scanReader with fixedBufferStream - scanReader uses an io.peekStream for look ahead - add comptime prefix to isSpace

travisstaloch · 2020-11-21T00:04:07Z

lib/std/fmt.zig

+                '{' => {
+                    if (start_index < i) {
+                        //try writer.writeAll(fmt[start_index..i]);
+                        // TODO not sure what to do here


@LemonBoy any idea what should be done here? I don't understand what it means when start_index < i.

andrewrk · 2021-01-11T21:01:08Z

I'm cleaning up old drafts, feel free to re-open if you are still working on it.

Introduce std.fmt.scan

86f5505

- similar to scanf in c but with curly specifiers. - initial support for c, d, e, x, o, b, s specifiers.

LemonBoy suggested changes Nov 19, 2020

View reviewed changes

move ArgsState struct and State enum out of format()

dca847d

- ArgsState now a type function with an args_len parameter

travisstaloch commented Nov 21, 2020

View reviewed changes

Vexu added the standard library This issue involves writing Zig code for the standard library. label Dec 25, 2020

andrewrk closed this Jan 11, 2021

FireFox317 mentioned this pull request Jul 19, 2022

Looking for scanf() (formatted input on streams) #12161

Open

Uh oh!

Introduce std.fmt.scan - Draft #7158

Introduce std.fmt.scan - Draft #7158

Uh oh!

Conversation

travisstaloch commented Nov 18, 2020

Uh oh!

data-man commented Nov 19, 2020

Uh oh!

travisstaloch commented Nov 19, 2020

Uh oh!

data-man commented Nov 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

travisstaloch commented Nov 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

travisstaloch commented Nov 19, 2020

Uh oh!

data-man commented Nov 19, 2020

Uh oh!

data-man commented Nov 19, 2020

Uh oh!

travisstaloch commented Nov 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

travisstaloch commented Nov 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

travisstaloch Nov 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

travisstaloch Nov 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VojtechStep commented Nov 19, 2020

Uh oh!

LemonBoy commented Nov 20, 2020

Uh oh!

travisstaloch commented Nov 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

data-man commented Nov 19, 2020 •

edited

Loading

travisstaloch commented Nov 19, 2020 •

edited

Loading

travisstaloch commented Nov 19, 2020 •

edited

Loading

travisstaloch commented Nov 19, 2020 •

edited

Loading

travisstaloch Nov 19, 2020 •

edited

Loading

travisstaloch Nov 20, 2020 •

edited

Loading

travisstaloch commented Nov 20, 2020 •

edited

Loading

travisstaloch commented Nov 20, 2020 •

edited

Loading