
I have a mis-encoded string: »Æ¹ûÊ÷. On the http://2cyr.com/decode/?lang=en website, you can encode it with gb2312 and then decode it with iso8859 so that it displays correctly.

In C#, there's a function called Encoding.Convert, which converts bytes from one encoding to another. The process is straightforward:

encode the string into bytesA, using gb2312 encoder
Encoding.Convert bytesA from gb2312 encoding to iso8859 encoding
decode the bytes using iso8859 encoder

In Python, I have tried every combination of encoding and decoding methods I can think of, but none of them converts the given string into a form that displays correctly.


1 Answer


Your data is GB2312 text that was misread as Latin-1 and then encoded as UTF-8, at least as pasted into my UTF-8 configured terminal window:

>>> data = '»Æ¹ûÊ÷'
>>> data.decode('utf8').encode('latin1').decode('gb2312')
u'\u9ec4\u679c\u6811'
>>> print _
黄果树

Encoding to Latin 1 lets us interpret characters as bytes to fix the encoding.
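To illustrate why Latin-1 works as the bridge, here is a small Python 3 sketch (the answer above is Python 2). It assumes the garbled text arose from GB2312 bytes being read as Latin-1; the variable names are only for illustration. Latin-1 maps each code point U+0000 to U+00FF to the single byte with the same value, so the round trip loses nothing.

```python
# Python 3 sketch: reproduce the garbling, then reverse it.
original = "黄果树"

# The bytes GB2312 produces for the original text.
gb_bytes = original.encode("gb2312")

# Misreading those bytes as Latin-1 yields the garbled string from the question.
garbled = gb_bytes.decode("latin-1")
print(garbled)

# Encoding with Latin-1 turns each character back into the byte of the same value.
recovered_bytes = garbled.encode("latin-1")

# Decoding with the right codec restores the text.
recovered = recovered_bytes.decode("gb2312")
print(recovered)
```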

Rule of thumb: whenever you have double-encoded data, undo the extra 'layer' of encoding by decoding to Unicode using the codec of that outer layer, then encoding with Latin-1 to get the original bytes back.
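The rule of thumb could be wrapped in a small Python 3 helper, sketched below. The function name undo_extra_layer and its parameters are hypothetical, and the defaults assume the situation in this question (a UTF-8 outer layer around GB2312 bytes); other data would need other codecs.

```python
def undo_extra_layer(raw: bytes, outer: str = "utf-8", inner: str = "gb2312") -> str:
    """Undo one extra layer of encoding (hypothetical helper, not a library API)."""
    # Step 1: decode the outer layer to get the garbled Unicode text.
    garbled = raw.decode(outer)
    # Step 2: Latin-1 turns each character back into the byte of the same value.
    inner_bytes = garbled.encode("latin-1")
    # Step 3: decode those bytes with the codec the text was really written in.
    return inner_bytes.decode(inner)


# The bytes a UTF-8 terminal would hand over for the garbled string.
raw = "»Æ¹ûÊ÷".encode("utf-8")
print(undo_extra_layer(raw))
```

If the garbled text contains characters outside the Latin-1 range, step 2 raises UnicodeEncodeError, which is a sign that a different intermediate codec was involved.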


2 Comments

This won't work in Python 3 (str has no decode method). But this will: "»Æ¹ûÊ÷".encode("latin1").decode("gb2312"). The source file containing the string must be encoded in UTF-8; use #encoding: utf-8, for example.
@arbautjc: Note that both my method and yours require that the raw string bytes use a certain encoding, yes. My terminal used UTF-8, hence the decode from UTF-8 first.
