-
-
Notifications
You must be signed in to change notification settings - Fork 32.8k
gh-138284 : urllib.parse.parse_qsl now raises ValueError if illegal characters is passed, according to RFC 3986 #138291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Lib/urllib/parse.py
Outdated
@@ -91,6 +91,9 @@ | |||
# Unsafe bytes to be removed per WHATWG spec | |||
_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n'] | |||
|
|||
# Allowed valid characters in parse_qsl | |||
_VALID_QUERY_CHARS = re.compile(r"^[A-Za-z0-9\-._~!$&'()*+,;=:@/?%]*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be replaced with str.isascii
, str.isdecimal
and a strings with the others, this should be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will do it and add new commit.
Please add a NEWS entry, and this does break existing code. |
…k for performance
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
@StanFromIreland I have added your suggestion. Can you please review it again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am the reporter of the issue which is being solved here. The changes look good to me, and I think they solve the issue very well. Approving from my end.
Note: @Davda-James do get it reviewed by Stan.
Suggestion: Squash your commits below to have a single commit. It is a good practice to have :)

cc: @StanFromIreland
I have requested the expert for this module.
Please do not, it confuses gh making it difficult to review. They will be squashed when merged anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that this change is expected, the following must be done:
- The documentation of
urllib.parse.parse_qsl
must be updated accordingly. - Test coverage must be increased.
I'm not entirely sure that we necessarily need to consider this as a bug fix. The rationale is as follows:
The urlsplit() and urlparse() APIs do not perform validation of inputs. They may not raise errors on inputs that other applications consider invalid. They may also succeed on some inputs that might not be considered URLs elsewhere. Their purpose is for practical functionality rather than purity.
I do not know whether we should consider this is a pitfall or not.
Misc/NEWS.d/next/Library/2025-08-31-13-00-22.gh-issue-138284.6MOp4k.rst
Outdated
Show resolved
Hide resolved
Lib/urllib/parse.py
Outdated
@@ -854,6 +866,11 @@ def _unquote(s): | |||
name, has_eq, value = name_value.partition(eq) | |||
if not has_eq and strict_parsing: | |||
raise ValueError("bad query field: %r" % (name_value,)) | |||
if strict_parsing: | |||
# Validate RFC3986 characters | |||
to_check = (name_value.decode() if isinstance(name_value, bytes) else name_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use _unquote
as this handles the %-encoded values and takes care of the encoding parameter as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if strict_parsing:
# Validate RFC3986 characters
to_check = _unquote(name_value)
if isinstance(to_check, (bytes, bytearray)):
to_check = to_check.decode(encoding, errors)
if not _is_valid_rfc3986_query(to_check): using like this is it good as we need to decode back as _unquote returns bytes and _is_valid_rfc3986_query accepts the string ?
…MOp4k.rst Updated it according to suggestion Co-authored-by: Bénédikt Tran <[email protected]>
urllib.parse.parse_qsl earlier it was accepting the illegal characters as well.
Proof (that I reproduce) :

Closes issue : #138284
Proof (after fixing error):

I added the test for it as well.

Test for urlparse only :
All tests:

urllib.parse.parse_qsl
is accepting illegal characters #138284