1

Using Python 3, I recently stumbled on a behavior I found quite inconsistent:

>>> str(b'test')
"b'test'"
>>> str(b'test', 'ascii')
'test'

To me, calling str() should always convert the bytes object passed to it into a string. When no encoding is given, it should either decode using the default encoding or raise an exception because no encoding was given.

Does anybody know why str() behaves like that when no encoding is given?

3 Answers

1

str with one argument calls the __str__ method of that argument. That call is expected to succeed and return a string, because it is used, among other places, by print(). Imagine if something like this could happen every time you did print(bytes_object):

>>> class Chaos:
...    def __str__(self):
...       assert False
... 
>>> print(Chaos())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __str__
AssertionError

The only gap that is tolerated is a class that doesn't define __str__ at all, in which case str falls back to repr.
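The fallback to repr can be seen in a small sketch. The class names NoStr and WithStr are made up for illustration and are not from the question:

```python
class NoStr:
    """Hypothetical class that defines only __repr__."""

    def __repr__(self):
        return "NoStr()"


class WithStr:
    """Hypothetical class that defines both __repr__ and __str__."""

    def __repr__(self):
        return "WithStr()"

    def __str__(self):
        return "a friendly description"


# Without __str__, str() ends up returning the repr.
print(str(NoStr()))
# With __str__, str() uses it instead of the repr.
print(str(WithStr()))
# For bytes, str() with one argument gives the same text as repr().
print(str(b"test") == repr(b"test"))
```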

On the other hand, str with an encoding isn't called by print, and explicitly only supports objects that expose the buffer protocol (bytes-like objects).
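As a rough illustration of that restriction, the two-argument form accepts several bytes-like types but rejects anything else. The variable names here are only for the example:

```python
# Objects that expose the buffer protocol can be decoded directly.
samples = [b"test", bytearray(b"test"), memoryview(b"test")]
decoded = [str(obj, "ascii") for obj in samples]
print(decoded)

# Anything that is not bytes-like is rejected with a TypeError,
# unlike the one-argument form, which accepts any object.
try:
    str(123, "ascii")
except TypeError as exc:
    rejected = True
    print("TypeError:", exc)
else:
    rejected = False
```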

So, that leaves the question of what str(bytes_object) should do without an encoding, given that it can't fail. Because the errors argument defaults to strict, it can't really assume any encoding: whichever encoding it picked, some input would raise. That holds even though the docs give sys.getdefaultencoding() as the default for the encoding argument. The default for errors could have been something looser like ignore, but there are good reasons not to do that. In particular, it goes against the Zen of Python:

Errors should never pass silently.
Unless explicitly silenced.

And this case is particularly bad: it is quite easy to imagine it silently doing the wrong thing while looking, on the dev's machine, like it's doing the right thing, only to randomly lose data in other environments (the bytes may come from a system that uses a different encoding; in Python 3, sys.getdefaultencoding() is always 'utf-8', while locale.getpreferredencoding() follows the locale settings Python is running under). So, str(bytes_object) without an encoding is probably an error in most cases, but it still can't fail the normal way. It therefore does the next best thing: it produces output that is clearly wrong in every case where you meant to call it with an explicit encoding.
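To make the trade-off concrete, here is a minimal sketch of the three options discussed above, assuming some Latin-1 encoded input that is not valid UTF-8. The sample data is made up for illustration:

```python
# "caf\u00e9" encoded as Latin-1 ends in the byte 0xE9, which is not valid UTF-8 here.
data = "caf\u00e9".encode("latin-1")

# Option 1: strict decoding fails loudly, which print() could not tolerate.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# Option 2: errors="ignore" never fails, but silently drops the last character.
lossy = data.decode("utf-8", errors="ignore")

# Option 3: what str() actually does, returning the repr, which is visibly not the text.
shown = str(data)

print(strict_failed, lossy, shown)
```

With the right encoding given explicitly, data.decode("latin-1") recovers the original text.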

1

str typically returns a human-readable result. So str(some_bytes) will give you something you can look at, and as a reader you'll probably want the bytes representation.

This is needed because str is often called on arbitrary objects, so the functionality has to be supported.

The real question is why someone decided to have str(x, y) mean x.decode(y). You can already do bytes.decode(x, y) if you really want that, so it seems amazingly redundant. Nevertheless, it is that way.

EDIT: Read eryksun's comment.

EDIT 2: As it seems to have disappeared, the comment stated that str(x, y) decodes arbitrary buffers, which will therefore be marginally faster than converting to a bytes object first.
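A small sketch of that point, assuming a memoryview slice as the "arbitrary buffer" (the buffer contents are made up): memoryview has no decode method, so without str(x, y) the data would first have to be copied into a bytes object.

```python
buffer = bytearray(b"header:payload")
view = memoryview(buffer)[7:]  # slicing a memoryview does not copy the data

# memoryview objects have no .decode() method of their own.
print(hasattr(view, "decode"))

# str(x, y) decodes straight from the buffer.
direct = str(view, "ascii")

# The alternative creates an intermediate bytes copy before decoding.
via_bytes = bytes(view).decode("ascii")

print(direct, via_bytes)
```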

1
  • Fixed, and thanks for the great point about arbitrary buffers.
    – Veedrac
    Commented Apr 27, 2014 at 2:34
0

str(obj) is special because it works for almost any object. Its purpose is to return a human-readable representation (to get an unambiguous representation, call repr(obj) or ascii(obj)).

You could call codecs.decode() to get the behaviour you ask for:

>>> from array import array
>>> a = array('B')
>>> a.frombytes(b'abc') # just an example, otherwise call b'abc'.decode()
>>> a
array('B', [97, 98, 99])
>>> a.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'array.array' object has no attribute 'decode'
>>> import codecs
>>> codecs.decode(a)
'abc'
>>> codecs.decode(a, 'ascii')
'abc'
