Skip to content

Conversation

@sideshowbarker
Copy link
Contributor

This change adds a preparsed subdirectory in the encoding directory, with tests for which the result of the encoding sniffing algorithm at https://html.spec.whatwg.org/#encoding-sniffing-algorithm is the expected result — that is, tests for which the expected result is the output of running only the encoding sniffing algorithm (of which the main sub-algorithm is the so-called “meta prescan”) — without also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the expected results for the encoding tests are, based on whether or not they’re in the preparsed subdirectory.

Without those changes, it’s unclear whether the expected results shown in the existing tests are for the output of fully parsing the test data — through the tokenization state machine and tree-construction stage — or instead just the output of the encoding sniffing algorithm only. And without those changes, we also don’t have any tests a system can use for testing only the output from the encoding sniffing algorithm.

Fixes #28

@sideshowbarker sideshowbarker changed the title Test the (meta) prescan algorithm Test the encoding sniffing algorithm (aka meta prescan) Aug 21, 2020
Copy link
Member

@gsnedders gsnedders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a part of me that would rather we just add a #preparsed section to the existing tests as and when it differs from the final state from the parser, as in the vast majority of cases they can be used for both the pre-parser and the whole system.

(also does scripting support matter for testing encoding? how do document.write and <noscript> interact here? because it so we should make sure we have expectations as required)

This change adds a `preparsed` subdirectory in the `encoding` directory,
with tests for which the result of the *encoding sniffing algorithm* at
https://html.spec.whatwg.org/#encoding-sniffing-algorithm is the
expected result — that is, tests for which the expected result is the
output of running *only* the encoding sniffing algorithm (of which the
main sub-algorithm is the so-called “meta prescan”) — without
also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the
expected results for the encoding tests are, based on whether or not
they’re in the `preparsed` subdirectory.

Without those changes, it’s unclear whether the expected results shown
in the existing tests are for the output of fully parsing the test data —
through the tokenization state machine and tree-construction stage — or
instead just the output of the encoding sniffing algorithm only. And
without those changes, we also don’t have any tests a system can use for
testing only the output from the encoding sniffing algorithm.

Fixes #28
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/preparsed-encoding-tests-add branch from bb92134 to b6c4e3f Compare August 24, 2020 17:09
@sideshowbarker
Copy link
Contributor Author

There's still a part of me that would rather we just add a #preparsed section to the existing tests as and when it differs from the final state from the parser, as in the vast majority of cases they can be used for both the pre-parser and the whole system.

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

(also does scripting support matter for testing encoding? how do document.write and <noscript> interact here? because it so we should make sure we have expectations as required)

I think that question is orthogonal to the scope of this PR. But it seems to me the answer is that for the preparsed case, scripting support definitely does not matter. For the “normal” parsed case, I don’t the answer. But since the implementation I work on doesn’t have scripting support anyway, I’m not personally super-motivated to spend time investigating it…

@gsnedders
Copy link
Member

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

How much work is that for most downstream parsers?

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

I think the majority do do recursion?

I think that question is orthogonal to the scope of this PR.

It is. Just wondering if we might potentially want further flags later, because any way we do this has a cost to some people hence it would be nice to have it happen all at once.

@sideshowbarker
Copy link
Contributor Author

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

How much work is that for most downstream parsers?

For the test harness we’re using the validator.nu parser, I guess it’d be an hour of work, versus minutes of work if we just put the tests into a different directory.

But that said, it’s not absolutely a lot of work.

However, I think what’s more important to consider than how much work it takes is instead, whether it will break anybody’s existing test harnesses. And in that case I can say that adding a new #preparsed section name that our current harness doesn’t know about will break our harness. (I recognize that our harness code should ideally by written to just ignore section names it doesn’t recognize, but that’s not the case now.)

So I personally would prefer not to make changes that have a higher risk of breaking somebody else’s harness code, and instead mitigate that risk by adding the new test types to the existing test suite in a way that seems less likely to break anybody’s existing harness.

I also recognize that it would not be a lot of work for others to update their harness code to recognize a new section name. But I would prefer not to make the choice for them unilaterally.

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

I think the majority do do recursion?

So even if they’re doing recursion, the relatively minor change that anybody could make for now — short of adding full support for the new test types — is to write a single one-to-three-line condition that just causes their harness to completely ignore the preparsed subdirectory for now. And then when they can make time to go back and actually add support for running the new tests, they remove that condition and un-ignore that subdirectory.

For the case of the validator.nu parser now, that’s what we’re already doing for the scripted subdirectories. We don’t have scripting support, so we just permanently ignore any scripted subdirectories.

So I guess I personally would be less happy if the decision had been made previously to add a #scripted section name rather than using the subdirectory mechanism. Because in that case, the test files would — from my perspective — cluttered with additional #scripted tests that I don’t care about and won’t run, but which are mixed in with the rest of the tests that I actually do care about.

@sideshowbarker
Copy link
Contributor Author

sideshowbarker commented Sep 2, 2020

@inikulin Given you’re the main person who’s been most active in this repo most recently, it’d be helpful if you weigh in with an opinion on this patch. Specifically, would it better to add a new #preparsed section name, or would you prefer the tests be put into a separate subdirectory (without a new section name being added)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Split encoding tests to those determined pre-parse and during-parse

2 participants