Open
Labels: stdlib, topic-regex, topic-unicode, type-bug
Description
I'm not sure whether this is a bug or expected behaviour, but it seems odd, so I figured reporting it was a good idea: a precomposed character is matched as a word character by the regex engine (specifically \w), but its decomposed form is not, because the combining diacritic is not considered a word character.
>>> import re, unicodedata
>>> s = "ö"
>>> list(s)
['ö']
>>> list(unicodedata.normalize('NFD', s))
['o', '̈']
>>> re.fullmatch(r'\w+', s)
<re.Match object; span=(0, 1), match='ö'>
>>> re.fullmatch(r'\w+', unicodedata.normalize('NFD', s))
This leads to odd effects when ingesting and filtering decomposed data.
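One possible workaround, sketched here under the assumption that the input can be recomposed before matching, is to normalize to NFC first. The helper name fullmatch_word_nfc is hypothetical and not part of the re module. This only helps for sequences that have a precomposed form; a base letter followed by a combining mark with no precomposed equivalent would still fail to match.

```python
import re
import unicodedata


# Hypothetical helper (not part of the standard library): compose the text
# to NFC before testing it against \w+.
def fullmatch_word_nfc(text):
    composed = unicodedata.normalize("NFC", text)
    return re.fullmatch(r"\w+", composed)


# U+00F6 is the precomposed "o with diaeresis"; NFD splits it into
# 'o' followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", "\u00f6")

# General categories of the two code points: a lowercase letter and a
# nonspacing mark. The mark is what \w refuses to match.
categories = [unicodedata.category(ch) for ch in decomposed]
print(categories)

# Matching the decomposed string directly fails...
print(re.fullmatch(r"\w+", decomposed))

# ...while matching after recomposition succeeds.
print(fullmatch_word_nfc(decomposed))
```

Normalizing at ingestion time, rather than at each match, would keep the rest of a filtering pipeline unchanged, at the cost of altering the stored form of the data.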
Tested on 3.8.13, 3.10.6, and 3.11.1 (all installed via pyenv) on Mint 21.1.