Why does Python 3 str(bytes) produce the literal string b'<str>'?

Question

I am using Python 3. The following example illustrates the question.

# python3
Python 3.6.8 (default, Sep 26 2019, 11:57:09) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> help(str)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.

>>> d = b'abcd'
>>> type(d)
<class 'bytes'>
>>> print(d)
b'abcd'
>>> len(d)
4
>>> m = str(d)
>>> type(m)
<class 'str'>
>>> print(m)
b'abcd'
>>> len(m)
7
>>> m.encode()
b"b'abcd'"
>>> 
>>> m = str(d, encoding='utf-8')
>>> type(m)
<class 'str'>
>>> print(m)
abcd
>>> len(m)
4
>>>

The help(str) output says "encoding defaults to sys.getdefaultencoding()", yet str(d) produces a string that includes the b'' wrapper. Note that the length of the string is now 7. My questions are:

Why does the default encoding need to be specified explicitly to make a correct string out of bytes?
How do I get back to bytes? The new type is str, and calling encode() on it keeps the extra b and quotes.
Is there a way to make pylint catch or warn about this problem?

AKX · Accepted Answer · 2020-07-28 07:54:59Z

str() on a bytes object returns the same thing as repr(), precisely so that you don't end up misusing it. Here's a more complex example, where the source string is an emoji.

>>> s = "😸"
>>> len(s)
1  # One codepoint.
>>> b = s.encode("utf-8")
>>> len(b)
4  # Four bytes.
>>> print(b)
b'\xf0\x9f\x98\xb8'  # Repr of the bytes, not to be interpreted.
>>> print(repr(b))
b'\xf0\x9f\x98\xb8'  # Same as above!
>>> s2 = b.decode("utf-8")  # Decode back to string from bytes.
>>> s == s2
True
>>>

That is, use str.encode() to get bytes from a string, bytes.decode() to get a string from bytes.
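As a minimal sketch of that round trip, the snippet below also touches the asker's second question: getting bytes back once str() has already produced the b'...' form. Parsing the repr with ast.literal_eval is one possible recovery, not something this answer prescribes, and it assumes the string is an unmodified bytes repr.

```python
import ast

data = b"abcd"

# Intended round trip: decode bytes to str, then encode str back to bytes.
text = data.decode("utf-8")
assert text.encode("utf-8") == data

# Accidental path: str() without an encoding yields the repr,
# including the b prefix and the quotes.
mangled = str(data)
assert mangled == repr(data)

# Assumption: `mangled` is an untouched bytes repr, so it can be
# parsed back as a Python literal.
recovered = ast.literal_eval(mangled)
```

If the original bytes object is still available, decoding it directly is simpler than repairing the repr afterwards.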

Giacomo Catenazzi · Accepted Answer · 2020-07-28 07:56:48Z

You are using str not as a casting function (as in C and C++) but to get a string representation of the value (one meant to be printed, so it could differ from repr()).

The problem is that there is no good printable string for a binary array, so I assume there is no specific str() implementation and it falls back to repr(), which adds some extra annotation (for the developer), like the b' prefix.

Python cannot convert binary data to a string without knowing the encoding. (Binary data is encoded: a is 0x61 in ASCII. A string is decoded: a means a.)

So you probably want d.decode('utf-8').
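To illustrate why the encoding has to be known, here is a small sketch in which the same two bytes give different text under two encodings. The byte values and variable names are chosen purely for illustration.

```python
# The letter "a" is stored as the single byte 0x61 in ASCII (and UTF-8).
assert b"a"[0] == 0x61

raw = b"\xc3\xa9"  # two bytes with no inherent textual meaning

# Interpreted as UTF-8, the two bytes form one character.
as_utf8 = raw.decode("utf-8")

# Interpreted as Latin-1, each byte is its own character.
as_latin1 = raw.decode("latin-1")
```

Both calls succeed, but they produce different strings of different lengths, so Python has no way to pick the right one on its own.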

Note: the system encoding is a different thing. It is used for terminal input and output, but not for binary arrays or, in general, for data read from disk.

deceze · Accepted Answer · 2020-07-28 07:58:04Z

If encoding [..] is specified, then the object must expose a data buffer that will be decoded using the given encoding [..]. Otherwise, returns the result of object.__str__() (if defined) or repr(object).

This pretty much answers your questions. If you omit the encoding argument, then repr(object) is used, which results in "b'...'" as the resulting string value. If you do supply the encoding argument, then it will attempt to decode the supplied object with that encoding. Those are two fundamentally different operations:

1. Produce a string representation of the object, which is pretty safe and can't really fail.
2. Decode a binary object, i.e. try to interpret its contents in some way, which may very well fail.

Those two operations are represented by two different ways to call the str function. You would not want variation #2 to be triggered implicitly by some global default, leaving you with potential error conditions to handle when all you expected was #1.
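The difference between the two call forms can be sketched as follows. The sample byte values are arbitrary; the last line relies on the help text quoted in the question, which says the encoding defaults to sys.getdefaultencoding() once encoding or errors is given.

```python
valid = b"abcd"
invalid = b"\xff\xfe"  # not a valid UTF-8 sequence

# Representation: works for any bytes value, valid text or not.
rep_valid = str(valid)
rep_invalid = str(invalid)

# Decoding: succeeds only if the bytes fit the requested encoding.
decoded = str(valid, encoding="utf-8")
try:
    str(invalid, encoding="utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing only `errors` also selects decoding, with the default encoding.
decoded_default = str(valid, errors="strict")
```

So the default encoding is applied only after one of the two keyword arguments has opted in to decoding.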

snakecharmerb · Accepted Answer · 2020-07-28 08:50:59Z

is there way that pylint catch/warn this problem

I don't think pylint would catch it, but mypy would, if you are willing to add type annotations to your code.

Python will issue a warning when str is called on a bytes instance if it is executed with the -b flag.

$ python3 -b -c 'str(b"a")'
-c:1: BytesWarning: str() on a bytes instance

Note that the warning is only raised once, AFAICT.

If executed with -bb, an exception will be raised.

$ python3 -bb -c 'str(b"a")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
BytesWarning: str() on a bytes instance
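
The same check could be wired into a test suite or CI step. The following is only a sketch: the helper name run_with_flag is made up for illustration, and it assumes the interpreter in use still honours the -b and -bb flags.

```python
import subprocess
import sys

SNIPPET = 'str(b"a")'


def run_with_flag(flag):
    """Run SNIPPET in a child interpreter started with the given flag."""
    return subprocess.run(
        [sys.executable, flag, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )


warned = run_with_flag("-b")   # warning on stderr, process still succeeds
failed = run_with_flag("-bb")  # warning promoted to an exception
```

A non-zero failed.returncode is what would make a CI job fail when code calls str() on bytes.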
