Try the following code. Be warned that it is only a proof of concept: it assumes the text consists solely of escape sequences. When the text also contains characters written literally (not as escape sequences), the replacement has to be done in a more complicated way (I can show it later if wanted). See the comments below.
import binascii
s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1)) # list() to emphasize what are the characters
s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))
b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))
s3 = b.decode('utf8')
print('s3 =', ascii(s3))
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)
It prints on the console:
c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'
And it writes the character to the output.txt file.
The problem is that the question combines two kinds of escaping: Unicode escaping and the escaping of binary values. In other words, the Unicode string contains a sequence that represents a binary value in text form; however, you cannot force that binary value into the Unicode string directly, because a Unicode character is actually an abstract integer (a code point), and that integer can be represented as a sequence of bytes in many ways, depending on the encoding.
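To make the point about the abstract integer concrete, here is a minimal sketch. It takes the character from the example above (U+0431) and shows its code point next to three byte representations; the choice of UTF-16-LE and cp1251 as the additional encodings is only an illustration.

```python
ch = '\u0431'                        # CYRILLIC SMALL LETTER BE

code_point = ord(ch)                 # the abstract integer: 0x431
as_utf8 = ch.encode('utf-8')         # two bytes:  b'\xd0\xb1'
as_utf16 = ch.encode('utf-16-le')    # two other bytes: b'1\x04' (0x31 0x04)
as_cp1251 = ch.encode('cp1251')      # one byte:   b'\xe1'

print(hex(code_point), as_utf8, as_utf16, as_cp1251)
```

The same character gives three different byte sequences, which is why the bytes d0 b1 only mean this character once you say they are UTF-8.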
If the Unicode string contained only escape sequences like \\n, it could be done differently, using the 'unicode_escape' encoding for bytes.decode(). However, in this case, you need to decode twice: first from the ASCII escape sequences, and then from UTF-8.
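The two decoding steps can be sketched with the standard codecs only. This is a sketch under assumptions, not the only way to do it: the function name decode_ascii_escapes is made up for illustration, and the input is assumed to be pure ASCII. The 'latin-1' step is there because it maps the code points 0 to 255 one-to-one to byte values.

```python
def decode_ascii_escapes(s):
    """Turn text like '\\xd0\\xb1' (backslash, x, two hex digits) into the
    string that those bytes encode in UTF-8.

    Hypothetical helper; assumes s contains ASCII characters only.
    """
    escaped = s.encode('ascii')               # raises UnicodeEncodeError otherwise
    chars = escaped.decode('unicode_escape')  # '\xd0' becomes the character U+00D0
    raw = chars.encode('latin-1')             # U+00D0 becomes the byte 0xD0
    return raw.decode('utf-8')                # the bytes are finally read as UTF-8


print(ascii(decode_ascii_escapes('\\xd0\\xb1whatever')))  # '\u0431whatever'
```

Note that 'unicode_escape' interprets every Python-style escape, not just \x. A \n becomes a newline, and a \uXXXX above U+00FF would make the 'latin-1' step fail.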
Update: Here is a function for converting your kind of strings when they also contain other ASCII characters (i.e. not only the escape sequences). The function uses a finite automaton; it may look too complex at first (actually it is only verbose).
def userDecode(s):
    status = 0
    lst = []      # result as list of bytes as ints
    xx = None     # variable for one byte escape conversion
    for c in s:   # unicode character
        print(status, ' c ==', c)   ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1   # escape sequence for a byte starts
            else:
                lst.append(ord(c))   # convert to integer
        elif status == 1:   # x expected
            assert(c == 'x')
            status = 2
        elif status == 2:   # first nibble expected
            xx = c
            status = 3
        elif status == 3:   # second nibble expected
            xx += c
            lst.append(int(xx, 16))   # this is a hex representation of int
            status = 0
    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')


if __name__ == '__main__':
    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))   # cannot be displayed on console that does not support unicode
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)
Also look inside the generated file. With the debug print still in place, the script displays the following on the console (remove the print once you have seen how the automaton works):
c:\__Python\user\so20210201>b.py
0 c == \
1 c == x
2 c == d
3 c == 0
0 c == \
1 c == x
2 c == b
3 c == 1
0 c == w
0 c == h
0 c == a
0 c == t
0 c == e
0 c == v
0 c == e
0 c == r
'\u0431whatever'
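If the automaton feels too verbose, the same conversion can probably be written with a regular expression working on bytes. This is only a sketch and not equivalent in every detail: the name decode_hex_escapes is made up, characters written literally are encoded as UTF-8 (so non-ASCII ones are accepted too), and a backslash that is not followed by x and two hex digits is left as it is instead of triggering the assert.

```python
import re

# A literal backslash, 'x', and exactly two hex digits, matched on bytes.
HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(s):
    """Hypothetical regex-based variant of userDecode()."""
    raw = s.encode('utf-8')   # literal characters become their UTF-8 bytes
    # Replace every escape sequence by the single byte it stands for.
    raw = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return raw.decode('utf-8')


print(ascii(decode_hex_escapes('\\xd0\\xb1whatever')))  # '\u0431whatever'
```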