Closed
Description
We should agree on how we expect the string types to be used in Python 2 code.
There are at least four ways we can approach this:
1. Make `str` usually valid when `unicode` is expected. This is how mypy currently works, and it is similar to how PEP 484 defines `bytearray`/`bytes` compatibility. This corresponds to runtime semantics, but it's not safe, as non-ascii characters in `str` objects will result in programs sometimes blowing up. A 7-bit `str` instance is almost always valid at runtime when `unicode` is expected.
2. Get rid of the `str -> unicode` promotion and use `Union[str, unicode]` everywhere (or create an alias for it). This is almost like approach 1, except that we have a different name for `unicode`, more complex error messages, and a more complex programming model due to the proliferation of union types. There is potential for some additional type safety by using just `unicode` in user code.
3. Enforce explicit `str`/`unicode` distinction in Python 2 code, similar to Python 3 (`str` would behave more or less like Python 3 `bytes`), and discourage union types. This will make it harder to annotate existing Python 2 programs, which often use the two types almost interchangeably, but it will make programs safer.
4. Have three different string types: `bytes` (distinct from `str`) means 8-bit `str` instances; these aren't compatible with `unicode`. `str` means ascii `str` instances. These are compatible with `bytes` and `unicode`, but not the other way around. `unicode` means `unicode` instances and isn't special. A string literal will have implicit type `str` or `bytes` depending on whether it only has ascii characters. This approach should be pretty safe and potentially also makes it fairly easy to adapt existing code, though adapting would be harder than with approach 1.
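To make the safety concern with approach 1 concrete, here is a minimal sketch of the runtime behaviour it relies on. It is written in Python 3 and only emulates Python 2's implicit ASCII decoding when `str` and `unicode` are mixed; `py2_concat` and `try_concat` are hypothetical helpers made up for this illustration, not part of mypy or any library.

```python
# Python 3 emulation of Python 2's implicit str -> unicode coercion.
# Python 3 `bytes` stands in for Python 2 `str`, and Python 3 `str`
# stands in for Python 2 `unicode`.
from typing import Optional


def py2_concat(prefix: bytes, text: str) -> str:
    """Emulate Python 2's `str + unicode` (hypothetical helper)."""
    # Python 2 decoded the 8-bit operand with the default encoding,
    # which was ASCII unless the interpreter was reconfigured.
    return prefix.decode("ascii") + text


def try_concat(prefix: bytes, text: str) -> Optional[str]:
    """Return the concatenation, or None if the implicit decode fails."""
    try:
        return py2_concat(prefix, text)
    except UnicodeDecodeError:
        return None


# A 7-bit value works, which is why approach 1 matches runtime semantics.
ascii_result = try_concat(b"abc", "def")

# An 8-bit value (UTF-8 encoded "café") fails only when it meets unicode.
non_ascii_result = try_concat(b"caf\xc3\xa9", "!")
```

A type checker following approach 1 would accept both calls, since both pass a `str` where `unicode` is expected; only the second would fail at runtime.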
These also affect how stubs should be written and thus it would be best if every tool using typeshed could use the same approach:
- For approach 1, stubs should usually use `str`, `unicode` or `AnyStr`. This is how many stubs are written already.
- For approach 2, stubs should use `str`, `Union[str, unicode]` or `AnyStr` for attributes and function arguments, and return types could additionally use plain `unicode`. Return types would in general be hard to specify precisely, as it may be difficult to predict whether a function called with `str` or a combination of `str` and `unicode` returns `str`, `unicode` or `Union[str, unicode]`. In approach 1 we can safely fall back to `unicode` if unsure. `AnyStr` would be less useful, as we could easily have mixed function arguments like `(str, unicode)` (see the typeshed issues mentioned below for more about this).
- For approach 3, stubs would usually use either `str`, `unicode` or `AnyStr`, but `unicode` wouldn't accept plain `str` objects.
- For approach 4, stubs could use three different types (`bytes`, `str`, `unicode`) in addition to `AnyStr`, and these would all behave differently. Unlike the first three approaches, `AnyStr` would range over `str`, `unicode` and `bytes` in Python 2 mode.
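The compatibility rules of approach 4 can be written down as a small table. The following is a toy model, not how mypy implements or would implement it; `COMPATIBLE`, `is_compatible` and `literal_type` are hypothetical names, and the `unicode` row (accepted only where `unicode` is expected) is an assumption read from "isn't special".

```python
# Toy model of approach 4: which source type is accepted where a
# target type is expected. Type names are plain strings here.
COMPATIBLE = {
    "str": {"str", "bytes", "unicode"},  # ascii-only str is accepted everywhere
    "bytes": {"bytes"},                  # 8-bit data is not valid as str or unicode
    "unicode": {"unicode"},              # assumption: accepted only as unicode
}


def is_compatible(source: str, target: str) -> bool:
    """Return True if a value of type `source` is valid where `target` is expected."""
    return target in COMPATIBLE[source]


def literal_type(raw: bytes) -> str:
    """Implicit type of a Python 2 string literal, given its raw bytes."""
    # Iterating over bytes yields integers; ASCII is the range 0..127.
    return "str" if all(byte < 128 for byte in raw) else "bytes"
```

Under this model an ascii literal such as `'abc'` could be passed to a function expecting `unicode`, while a literal containing a byte such as `\xe9` would be typed as `bytes` and rejected there.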
Note that mypy currently assumes approach 1 and I don't know how well the other approaches would work in practice.
[This was adapted from a comment on #1135; see the original issue for more discussion. Also, https://github.com/python/typeshed/issues/50 is relevant.]