I'm trying to convert a string with octal-escaped Unicode back into a proper Unicode string as follows, using Python 3:

"training\345\256\214\346\210\220\345\276\214.txt" is the read-in string.

"training完成後.txt" is the string's actual representation, which I'm trying to obtain.

However, after skimming SO, the solution suggested almost everywhere I looked for Python 3 was the following:

decoded_string = bytes(myString, "utf-8").decode("unicode_escape")

Unfortunately, that seems to yield the wrong Unicode string when applied to my sample:

'trainingå®Â\x8cæÂ\x88Â\x90å¾Â\x8c.txt'

This seems easy to do with byte literals, as well as in Python 2, but unfortunately doesn't seem as easy with strings in Python 3. Help much appreciated, thanks! :)

1 Answer

Assuming your starting string is a Unicode string containing literal backslashes, the unicode-escape codec needs a byte string as input, so you have to encode first. Decoding then turns each octal escape into a single code point, but those values are really UTF-8 bytes, so you need to encode back to a byte string once more and then decode the result as UTF-8:

>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'
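The four steps in the session above can be wrapped in a small helper. This is only a sketch: the name `decode_octal_utf8` is made up for illustration, and it assumes the input contains nothing outside U+0000 to U+00FF and that every escape in it is meant to be interpreted.

```python
def decode_octal_utf8(text: str) -> str:
    """Turn literal octal escapes that spell UTF-8 bytes into real characters."""
    # Step 1: str -> bytes, one byte per code point (needs U+0000..U+00FF only).
    raw = text.encode("latin1")
    # Step 2: interpret the backslash escapes; each \ooo becomes one code point.
    unescaped = raw.decode("unicode_escape")
    # Step 3: map those code points back to the byte values they stand for.
    utf8_bytes = unescaped.encode("latin1")
    # Step 4: decode the recovered bytes as UTF-8.
    return utf8_bytes.decode("utf-8")


s = r"training\345\256\214\346\210\220\345\276\214.txt"
result = decode_octal_utf8(s)
print(result)
```

Because step 2 uses `unicode_escape`, other escapes such as `\n` or `\x41` in the input would be interpreted as well, which may or may not be what you want.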

Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
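A quick way to convince yourself of that 1:1 mapping is to round-trip all 256 byte values. The variable names here are just for illustration:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin1 yields one character per byte...
as_text = all_bytes.decode("latin1")
code_points = [ord(ch) for ch in as_text]

# ...and encoding back gives exactly the original bytes.
round_trip = as_text.encode("latin1")

print(code_points == list(range(256)))
print(round_trip == all_bytes)
```

Both comparisons should print `True`, which is why latin1 works as a lossless bridge between `str` and `bytes` in this recipe.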


2 Comments

Also, just so I understand better, would you mind elaborating a little bit more as to why the latin-1 encoding is needed before decoding to utf-8?
@coltonoscopy In Python 3, you must explicitly encode to bytes and decode to Unicode, so you can't directly .decode('unicode-escape') on a Unicode string. .encode('latin1') is the trick to convert back to a byte string with 1:1 translation of codepoints to bytes...assuming of course you only have U+0000 to U+00FF codepoints in the string. The second .encode('latin1') was needed because after the decode, you have a Unicode string with UTF-8 encoded data in it, so it had to be converted back to bytes before decoding as UTF-8.
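If the string might also contain characters above U+00FF, the first `.encode('latin1')` raises `UnicodeEncodeError`. One possible workaround, which is a different technique from the codec chain above, is to encode to UTF-8 and replace only the octal escapes with a regular expression. This is a sketch; `decode_octal_only` and the pattern are assumptions for illustration, and it deliberately handles only three-digit escapes from `\000` to `\377`:

```python
import re

# Three octal digits whose value fits in one byte (000..377).
_OCTAL = re.compile(rb"\\([0-3][0-7]{2})")


def decode_octal_only(text: str) -> str:
    """Replace \\ooo escapes with the bytes they name, then decode as UTF-8."""
    # Real non-ASCII characters become their UTF-8 bytes here.
    raw = text.encode("utf-8")
    # Each escape becomes the single byte it names; nothing else is touched.
    replaced = _OCTAL.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return replaced.decode("utf-8")


# A real CJK character followed by literal octal escapes.
mixed = "完" + r"\346\210\220\345\276\214.txt"

try:
    mixed.encode("latin1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin1_ok)
print(decode_octal_only(mixed))
```

Unlike `unicode_escape`, this leaves other backslash sequences alone, and it does not try to distinguish an escaped backslash followed by digits from a genuine octal escape.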
