std.os.windows: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O #16618
Conversation
(force-pushed from 911c601 to 1a88bec)
Thanks for working on this! At some point I'd like to benchmark reading/writing to console against master to see what the performance cost of this approach is. Prior attempt at the same thing: #12400
(force-pushed from 1a88bec to 4ff8419)
Update: I also tried UTF-16 to UTF-8 conversion for raw mode, but I encountered ferocious edge cases, especially when reading the console input partially (leaving UTF-16 code units in the console input text buffer) and then switching the console input mode on/off. It should be perfectly fine as long as people don't mess with the console input mode. I would be optimistic to assume that it is effectively done.
(force-pushed from 6f18d6d to d5e15c5)
(force-pushed from 2043ba0 to fffbb92)
Is the hidden allocation here fine? Wouldn't this be better solved with a stack-allocated buffer and multiple writes?
I did that just to keep the logic simple.
(commit: "…g fixed-size temporary buffer")
Done. The temporary buffer is now 1024 bytes long, and no heap allocation is involved.
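As an illustration of the approach described above, here is a minimal sketch of chunked conversion through a fixed-size buffer with multiple writes. It is not the PR's actual code: the helper name `writeConsoleUtf8`, the inline `WriteConsoleW` extern declaration, and the error names are assumptions, and the binding in std may differ.

```zig
const std = @import("std");
const windows = std.os.windows;

// Declared locally for this sketch; the exact binding in std may differ.
extern "kernel32" fn WriteConsoleW(
    hConsoleOutput: windows.HANDLE,
    lpBuffer: [*]const u16,
    nNumberOfCharsToWrite: windows.DWORD,
    lpNumberOfCharsWritten: ?*windows.DWORD,
    lpReserved: ?*anyopaque,
) callconv(windows.WINAPI) windows.BOOL;

/// Hypothetical helper: write UTF-8 `bytes` to a console handle without any
/// heap allocation by converting chunk by chunk through a fixed-size buffer.
fn writeConsoleUtf8(handle: windows.HANDLE, bytes: []const u8) !void {
    // 512 UTF-16 code units = 1024 bytes, matching the size mentioned above.
    var utf16_buf: [512]u16 = undefined;
    var rest = bytes;
    while (rest.len > 0) {
        // N bytes of UTF-8 convert to at most N UTF-16 code units, so a chunk
        // of up to 512 bytes is guaranteed to fit in the buffer.
        var chunk_len = @min(rest.len, utf16_buf.len);
        // Back off so the chunk boundary does not split a UTF-8 sequence.
        while (chunk_len > 0 and chunk_len < rest.len and (rest[chunk_len] & 0xC0) == 0x80)
            chunk_len -= 1;
        if (chunk_len == 0) return error.InvalidUtf8;
        const n = try std.unicode.utf8ToUtf16Le(&utf16_buf, rest[0..chunk_len]);
        var written: windows.DWORD = undefined;
        if (WriteConsoleW(handle, &utf16_buf, @as(windows.DWORD, @intCast(n)), &written, null) == 0)
            return error.WriteConsoleFailed;
        rest = rest[chunk_len..];
    }
}
```

The back-off loop keeps each chunk aligned to a UTF-8 code point boundary, so no multi-byte sequence is split across two `WriteConsoleW` calls.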
(force-pushed from 8208db5 to d4fe650)
(force-pushed from d4fe650 to 9495568)
@squeek502 Just rebased onto `master`.
Will try writing some benchmarks / test cases when I get a chance. In the meantime, it might be helpful if you could take a look at #12400 and compare your implementation with its (relatively simple) implementation. What are the relevant differences? Are there things that your implementation handles that wouldn't be handled by the one in #12400?
As far as I can tell after a quick glance at #12400:
andrewrk left a comment:
This is not good. Let's take a step back and understand what is going on before adding such hacks to the codebase.
```zig
} else null;
if (kernel32.ReadFile(in_hFile, buffer.ptr, want_read_count, &amt_read, overlapped) == 0) {
    var console_mode: DWORD = undefined;
    const is_console_handle: bool = kernel32.GetConsoleMode(in_hFile, &console_mode) != FALSE;
```
Really? You want to call GetConsoleMode with every call to ReadFile?
Some context:
- In "std: Windows Console IO Fix" #12400, it used `isTty`, which uses `os.isatty`, which calls `GetConsoleMode` (so it'd call it on every read/write)
- Rust uses an `is_console` function, which is a wrapper around `GetConsoleMode` (so it calls it on every read/write AFAICT)
However, it seems Go stores the kind as "console" in newFile to skip the GetConsoleMode on read/write:
- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/os/file_windows.go#L49-L51
- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L419-L420
But still seems to have a GetConsoleMode call in a (lower level?) API? I'm not familiar with Go.
> Really? You want to call `GetConsoleMode` with every call to `ReadFile`?
No, I don't.
However, `std.fs.File` is merely a thin wrapper around the underlying OS file handle. If we want to remember whether the OS handle is a console handle at `std.fs.File` creation time, we will need to add OS-specific fields to the `std.fs.File` data structure, or keep additional records in `std.os.windows` along with a lookup scheme keyed by the `std.fs.File` handle (either a pointer or a unique ID), just like how C runtime libraries deal with POSIX file descriptors internally.
To keep the changes minimally intrusive to the platform-agnostic code (i.e. `std.fs.File`), I chose to call `GetConsoleMode` with every `ReadFile` call. I would like to hear your thoughts on the design choices before making any changes.
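For reference, a minimal sketch of the per-call check being discussed. The helper name is hypothetical and not part of the PR; it only illustrates why `GetConsoleMode` doubles as an "is this a console?" test, since the call fails for non-console handles such as pipes and regular files.

```zig
const std = @import("std");
const windows = std.os.windows;
const kernel32 = windows.kernel32;

/// Hypothetical helper: returns true if `handle` refers to a console.
/// GetConsoleMode fails for handles that are not consoles (pipes, files),
/// so its success can serve as a console check on every read/write.
fn isConsoleHandle(handle: windows.HANDLE) bool {
    var mode: windows.DWORD = undefined;
    return kernel32.GetConsoleMode(handle, &mode) != 0;
}
```

Caching this answer once at file-open time, the way Go records the file kind, would avoid the extra syscall on every read/write, at the cost of the OS-specific state in `std.fs.File` discussed above.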
```zig
// There is no reliable way to implement perfectly platform-agnostic UTF-16 to UTF-8
// conversion for raw mode, because it is impossible to know the number of pending
// code units stored in console input buffer, while in cooked mode we can rely on the
// terminating LF character. Without knowing that, ReadConsoleW() may accidentally pop
// out characters without blocking, or prompt for user input at unexpected timing.
// In the case of raw mode, redirect to kernel32.ReadFile() without conversion for now,
// just don't make things worse.
```
This is not good enough. If it cannot be implemented perfectly using the current abstractions, then the current abstractions need to change.
Let me elaborate on this:
The current goal of my implementation is to make users totally unaware of UTF-16, even if they call `std.os.windows.ReadFile` directly. Users can pass UTF-8-encoded strings everywhere, and `std.os.windows.ReadFile` does the translation job under the hood. That means I need to insert a UTF-16 → UTF-8 translation layer between `std.os.windows.ReadFile` and `kernel32.ReadConsoleW`.
Here is the key problem that gives rise to annoying edge cases: the user-provided buffer is meant to be UTF-8 encoded. Console input (which is UTF-16 encoded) must be converted to UTF-8 before being copied into the user-provided buffer, and the converted UTF-8 length cannot be easily determined in advance. When the user-provided buffer is not large enough to hold the converted console input, there are three pieces of data we have to deal with carefully:
- UTF-8 fragments. For example, a Unicode character from console input converts into a 3-byte sequence, but the user-provided buffer is only 2 bytes long. Then `ReadFile` should store the first 2 bytes into the user-provided buffer and stash the final byte. Whenever `ReadFile` is called again, this byte should be popped first before reading anything else from the console.
- A single UTF-16 high surrogate yet to form a complete Unicode code point. We should stash it separately and wait for a low surrogate from another `ReadConsoleW` call to convert into a UTF-8 byte sequence.
- UTF-16 code units left in `ReadConsoleW`'s dedicated buffer. It is completely opaque to user-mode code, causing serious troubles, which will be discussed below.
There are two modes affecting `ReadConsoleW` behaviour, determined by the `ENABLE_LINE_INPUT` flag (accessed with the `GetConsoleMode`/`SetConsoleMode` functions):
- Cooked mode (`ENABLE_LINE_INPUT` set): If a CR is encountered in the dedicated buffer, an LF is appended to the user-provided buffer and reading from the console ends immediately. Otherwise, wait for console input until a CR is entered (either a single key press or pasting multiple characters). Everything after the CR is left inside the dedicated buffer after the read.
- Raw mode (`ENABLE_LINE_INPUT` unset): If there is at least one character available in the dedicated buffer, reading from the console ends immediately. Otherwise, wait for console input (either a single key press or pasting multiple characters). The dedicated buffer is flushed after the read.
Cooked mode can be emulated perfectly, as in my implementation. However, to emulate the behaviour in raw mode, we have to try flushing the UTF-8 fragments first and then call `ReadConsoleW` to flush the dedicated buffer. The problem arises when we have popped at least one byte from the UTF-8 fragments: because we have already "read" something (from the UTF-8 fragments), `std.os.windows.ReadFile` should return immediately if the `ReadConsoleW` dedicated buffer is empty, but calling `ReadConsoleW` when the dedicated buffer is empty causes an unwanted wait for console input. As I explained before, there is no way to know whether the dedicated buffer is empty other than calling `ReadConsoleW`, so it is a stalemate.
If there are better abstractions, please let me know.
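To make the bookkeeping described above concrete, here is a minimal sketch of the per-handle state such a translation layer would have to carry between `ReadFile` calls. The type and field names are hypothetical, not the PR's actual code; they only mirror the first two pieces of data listed above (the third, the dedicated buffer, lives inside the console host and cannot be represented in user-mode state, which is the crux of the raw-mode problem).

```zig
/// Hypothetical per-handle state for a UTF-16 -> UTF-8 console translation layer.
const ConsoleReadState = struct {
    /// Stashed tail of a converted code point that did not fit into the
    /// user-provided buffer (a UTF-8 sequence is at most 4 bytes long).
    utf8_fragment: [4]u8 = undefined,
    utf8_fragment_len: usize = 0,
    /// A high surrogate returned by ReadConsoleW that still awaits its low
    /// surrogate from a later call before the pair can be converted.
    pending_high_surrogate: ?u16 = null,

    /// Pop as many stashed UTF-8 bytes as fit into `dest`; returns the count.
    fn drainFragment(self: *ConsoleReadState, dest: []u8) usize {
        const n = @min(dest.len, self.utf8_fragment_len);
        @memcpy(dest[0..n], self.utf8_fragment[0..n]);
        // Shift any remaining stashed bytes to the front of the stash.
        var i: usize = 0;
        while (n + i < self.utf8_fragment_len) : (i += 1)
            self.utf8_fragment[i] = self.utf8_fragment[n + i];
        self.utf8_fragment_len -= n;
        return n;
    }
};
```

A real implementation would drain this state before deciding whether to call `ReadConsoleW` again, and that decision is exactly where the raw-mode stalemate appears: once `drainFragment` has produced bytes, a `ReadConsoleW` call against an empty dedicated buffer would block waiting for new input.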
Relevant commit message from Go: golang/go@610d522
The readConsole implementation has changed a bit since then, the latest is here: https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L454-L521
There does not seem to be any special handling of ENABLE_LINE_INPUT in Rust or Go at all.
For what it's worth, here are my initial reactions to these points:
This is a neutral point IMO--not a pro nor a con.
This makes sense to me and seems like a good change. #12400 returned
This needs some justification. Why should Zig care about LF ↔ CRLF conversion in console input/output? It doesn't seem like Rust/Go do this AFAICT.
👍
For console output, I agree that converting LF to CRLF will not bring special benefits, as the Windows Console Host should recognize a single LF. This conversion makes more sense when writing to a text file; I implemented it this way just because the C runtime does so. But for console input, if we convert only the text encoding after `ReadConsoleW`, the CRLF line ending produced in cooked mode is passed through to the caller as-is.
Matching libc behavior is not a goal here. In my opinion the goals should be:
All right, I will remove LF → CRLF conversion for console output. But I still believe it is a good idea to keep CRLF → LF conversion for console input:
What do you think of this?
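For concreteness, here is a trivial sketch of the CRLF → LF normalization under debate for console input. The helper and test are hypothetical, not the PR's code; a real streaming implementation would also need to handle a CR that arrives at the end of one read with its LF in the next, which ties back to the stashed-state discussion above.

```zig
const std = @import("std");

/// Hypothetical in-place CRLF -> LF normalization: drops the CR of each
/// CRLF pair and returns the shortened slice.
fn normalizeCrlf(buf: []u8) []u8 {
    var out: usize = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        // Skip a CR that is immediately followed by an LF.
        if (buf[i] == '\r' and i + 1 < buf.len and buf[i + 1] == '\n') continue;
        buf[out] = buf[i];
        out += 1;
    }
    return buf[0..out];
}

test "normalizeCrlf" {
    var data = "hello\r\nworld\r\n".*;
    try std.testing.expectEqualStrings("hello\nworld\n", normalizeCrlf(&data));
}
```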
It seems like handling this via conversion within …
All right, I will remove LF ↔ CRLF conversion completely and leave the job to users, considering that there are no abstractions of "text mode" and "binary mode" for Zig's `std.fs.File`.
I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch. If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around. Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely.
All right. I have been busy recently; hopefully I will come back to this PR and rebase all the changes in a month or two.
These commits aim to eliminate text encoding problems regarding Windows native console I/O.
To-do list:
- `WriteConsoleW()` (resolves "Zig runtime should set console output mode to utf-8 by default on Windows" #7600; closes "std.start: initialize Windows console output CP to UTF-8 on exe startup" #14411)
- `ReadConsoleW()` (resolves "Cannot read unicode data from STDIN on Windows" #5148)