How to convert a string to unicode/byte string in Python 3?

Question

I know this works:

a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print(a) # 方法，删除存储在

But suppose I have a string from a JSON file that does not start with "u" (a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"). In Python 2 I know how to convert it: print unicode(a, encoding='unicode_escape') prints 方法，删除存储在. How do I do the same in Python 3?

Similarly, if it's a byte string loaded from a file, how do I convert it?

print("好的".encode("utf-8"))  # b'\xe5\xa5\xbd\xe7\x9a\x84'
# how to convert this?
b = '\xe5\xa5\xbd\xe7\x9a\x84'  # 好的

Python 3 uses unicode as default, therefore just print(a) (your console should support unicode). To convert byte string to unicode in Python 3, use str(b, 'utf-8'). To test your code, use IDLE (Python shell) which supports unicode. — acw1668, Commented Aug 12, 2016 at 2:13
@Lex: Are you saying the file itself contains the literal text \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728? — ShadowRanger, Commented Aug 12, 2016 at 2:23
@ShadowRanger Thanks for pointing that out, I removed my comment after you corrected my answer. Again, I was unaware how vast the changes are between Python 2 and Python 3 — Nick Bull, Commented Aug 12, 2016 at 2:23
@acw1668 print(str("\xe5\xa5\xbd\xe7\x9a\x84","utf-8")) raises an error: "TypeError: decoding str is not supported" — Lex, Commented Aug 12, 2016 at 2:29
@ShadowRanger Yes, it's JSON Unicode text. I got it to work with print(json.loads('"{}"'.format(b))), but that looks odd, and if the JSON string is very long or not quite well-formed, this method may not work — Lex, Commented Aug 12, 2016 at 2:34
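
The two points raised in these comments can be sketched as follows. This is only an illustration: the JSON payload and the key name "text" are hypothetical, and the sample values are taken from the question.

```python
import json

# A bytes literal needs the b prefix. Without it, Python 3 sees a str, and
# str(..., "utf-8") raises "TypeError: decoding str is not supported".
utf8_bytes = b"\xe5\xa5\xbd\xe7\x9a\x84"
text = str(utf8_bytes, "utf-8")  # equivalent to utf8_bytes.decode("utf-8")
print(text)

# Hypothetical JSON document whose string value uses \uXXXX escapes.
# The raw-string prefix keeps the backslashes literal, as they would be in a file.
payload = r'{"text": "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"}'
data = json.loads(payload)  # the JSON parser interprets the escapes itself
print(data["text"])
```

If the input really is valid JSON, letting json.loads parse the whole document avoids wrapping fragments in quotes by hand.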

ShadowRanger · Accepted Answer · 2016-08-12 02:27:12Z

If I understand correctly, the file contains the literal text \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728 (so it's plain ASCII, with backslash escapes that describe the Unicode ordinals the same way a Python str literal would). If so, there are two ways to handle this:

Read the file in binary mode, then call mystr = mybytes.decode('unicode-escape') to convert the bytes to str, interpreting the escapes.
Read the file in text mode and use the codecs module for the "text -> text" conversion, e.g. decodedstr = codecs.decode(encodedstr, 'unicode-escape'). Bytes-to-bytes and text-to-text codecs are now supported only by the codecs module functions; bytes.decode is purely for bytes to text and str.encode is purely for text to bytes. In Py2, calling str.encode or unicode.decode was usually a mistake, and removing those methods makes it easier to see which direction a conversion is supposed to go.
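
A minimal sketch of both approaches, assuming the data is pure ASCII made of literal backslash escapes. The sample bytes are built inline instead of being read from a file, and the variable names are illustrative only.

```python
import codecs

# Raw content as it would sit in the file: ASCII backslash escapes,
# not real CJK characters (the r prefix keeps the backslashes literal).
raw_bytes = rb"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Way 1: data read in binary mode (bytes) -> str, interpreting the escapes.
from_bytes = raw_bytes.decode("unicode-escape")

# Way 2: data read in text mode (str) -> str, via the codecs module.
raw_text = raw_bytes.decode("ascii")
from_text = codecs.decode(raw_text, "unicode-escape")

print(from_bytes)
print(from_text == from_bytes)
```

Both routes should yield the same str here. The text-mode route is only safe when the input is ASCII, since non-ASCII characters already present in the str do not survive the unicode-escape decoder intact.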


Not the OP but I tried reading from a file in binary mode that had one line \xe5\xa5\xbd\xe7\x9a\x84. This gave me b'\\xe5\\xa5\\xbd\\xe7\\x9a\\x84' and printing that with .decode('unicode-escape') gives å¥½ç ... and not '好的' as expected by OP
– Asish M.
Commented Aug 12, 2016 at 2:42
The string \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728 in the file was loaded from an HTTP request; it's a JSON Unicode string. I tried print(codecs.decode("'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'", 'unicode-escape')), and it prints 'æ¹æ³ï¼å é¤åå¨å¨', not '好的'
– Lex
Commented Aug 12, 2016 at 2:46
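
One possible reading of the two results reported in these comments, offered as a sketch and not as part of the accepted answer. It assumes that the \xNN escapes in the first case spell out UTF-8 bytes, and that the second case passed a normal Python string literal, which Python had already turned into real characters before codecs.decode saw it.

```python
import codecs

# Case 1: the file holds the ASCII text \xe5\xa5\xbd\xe7\x9a\x84.
raw = rb"\xe5\xa5\xbd\xe7\x9a\x84"
# unicode-escape maps each \xNN to the code point U+00NN, not to a byte,
# so the result is six Latin-1 range characters (mojibake).
step1 = raw.decode("unicode-escape")
# Turn those code points back into bytes, then decode them as UTF-8.
fixed = step1.encode("latin-1").decode("utf-8")
print(fixed)

# Case 2: a non-raw literal is already decoded by Python, so no further
# decoding is needed. Running it through unicode-escape corrupts it.
already_text = "\u65b9\u6cd5"
garbled = codecs.decode(already_text, "unicode-escape")
print(already_text)
print(garbled == already_text)
```

The latin-1 round trip only makes sense when the escapes are known to encode UTF-8 bytes; for \uXXXX escapes the plain unicode-escape decode is enough.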
