Add support for validating lang attribute values #150

@geoffmcl

This file -
https://github.com/html5lib/html5lib-tests/blob/master/validator/langattribute.test
contains some 1394 invalid lang attribute values, from
<span lang=roh>
to
<span lang='en '>

At present Tidy makes no test of the lang value, except for cases like -
<span lang=' '>
where it will report -
line 1 column 1 - Warning: attribute "lang" lacks value

Tidy has NO TABLE of valid lang values, so no check is made of the value given.

A simple sample case:

<!DOCTYPE html>
<html>
<head>
<title>invalid lang code</title>
<meta charset="utf-8">
</head>
<body>
<p><span lang="roh">Invalid 'roh'</span></p>
</body>
</html>

Tidy will pass this with -
No warnings or errors were found.

While the W3C validator will show an error:
Line 8, Column 20: Bad value roh for attribute lang on element span: The language subtag roh is not a valid ISO language part of a language tag.

And show additional information like:

Syntax of language tag:
An RFC 5646[1] language tag consists of hyphen-separated ASCII-alphanumeric subtags. There is a primary tag identifying a natural language by its shortest ISO 639 language code (e.g. en for English) and zero or more additional subtags adding precision. The most common additional subtag type is a region subtag which most commonly is a two-letter ISO 3166 country code (e.g. GB for the United Kingdom). IANA maintains a registry of permissible subtags[2].
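To make the syntax above concrete, here is a rough sketch in Python (not Tidy code) of the kind of check it describes: split the value on hyphens, require ASCII-alphanumeric subtags, then look the primary and region subtags up in a table. The names `check_lang`, `KNOWN_LANGUAGES` and `KNOWN_REGIONS` are made up for illustration, the tables are tiny hand-picked samples rather than the IANA registry, and extension and private-use subtags are ignored.

```python
import re

# Tiny illustrative tables (assumption); a real check would use the IANA registry.
KNOWN_LANGUAGES = {"en", "fr", "de", "rm"}
KNOWN_REGIONS = {"GB", "US", "FR", "CH"}

# A subtag is 1 to 8 ASCII letters or digits.
SUBTAG_RE = re.compile(r"[A-Za-z0-9]{1,8}")


def check_lang(value):
    """Return a list of problems found in a lang attribute value (empty if none)."""
    if value == "":
        return []  # HTML permits an empty lang value (language unknown)

    subtags = value.split("-")
    problems = [f"malformed subtag {s!r}" for s in subtags
                if not SUBTAG_RE.fullmatch(s)]
    if problems:
        return problems

    # Subtags are case-insensitive; compare in the registry's usual casing.
    primary = subtags[0].lower()
    if primary not in KNOWN_LANGUAGES:
        problems.append(f"unknown language subtag {primary!r}")

    # Simplification: any later two-letter alphabetic subtag is taken as a region.
    for s in subtags[1:]:
        if len(s) == 2 and s.isalpha() and s.upper() not in KNOWN_REGIONS:
            problems.append(f"unknown region subtag {s!r}")
    return problems


if __name__ == "__main__":
    for sample in ("en", "en-GB", "roh", "en "):
        print(repr(sample), check_lang(sample))
```

With these sample tables, `en` and `en-GB` pass, while `roh` and `en ` (trailing space) are each reported, which matches the two ends of the test file mentioned above.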

A future version of Tidy should also perform this test, using a list built from the language-subtag-registry file.
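As a rough idea of how such a list could be built, the sketch below (Python, for illustration only) reads text in the registry's record format, where records are separated by `%%` lines and each field is a `Key: value` line, and collects the subtags by type. `parse_registry` and `SAMPLE` are hypothetical names, and `SAMPLE` is a hand-abbreviated excerpt, not the real file. Range entries such as private-use ranges, and records that use `Tag:` rather than `Subtag:`, are not handled here.

```python
# Hand-abbreviated excerpt in the registry's record format (assumption:
# only the fields needed for this sketch are shown).
SAMPLE = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: rm
Description: Romansh
%%
Type: region
Subtag: GB
Description: United Kingdom
"""


def parse_registry(text):
    """Map each record type (e.g. 'language', 'region') to its set of subtags."""
    table = {}
    record = {}

    def flush():
        # Store the finished record, then start a new one.
        if "Type" in record and "Subtag" in record:
            table.setdefault(record["Type"], set()).add(record["Subtag"])
        record.clear()

    for line in text.splitlines():
        if line.strip() == "%%":
            flush()
        elif line[:1].isspace() or ":" not in line:
            continue  # blank line or folded continuation line; ignored here
        else:
            key, _, value = line.partition(":")
            # Keep the first value when a field repeats (e.g. Description).
            record.setdefault(key.strip(), value.strip())
    flush()  # the last record has no trailing %% line
    return table


if __name__ == "__main__":
    table = parse_registry(SAMPLE)
    print("roh" in table["language"], "rm" in table["language"])
```

For real use the text would come from the downloaded registry file instead of `SAMPLE`. Tidy itself would more likely compile the result into a static table.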

[1] https://tools.ietf.org/html/rfc5646
[2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
