The question is how to extract a string that is represented as escaped bytes inside another string. What I actually mean:

>>> s1 = '\\xd0\\xb1'  #  But this is NOT bytes of s1! s1 should be 'б'!
'\\xd0\\xb1'
>>> s1[0]
'\\'
>>> len(s1)            #  The problem is here: I thought I would see (2), but:
8
>>> type(s1)
<class 'str'>
>>> type(s1[0])
<class 'str'>
>>> s1[0] == '\\'
True

So how can I convert s1 to 'б' (a Cyrillic letter, the real representation of '\xd0\xb1')? I already asked a similar question here, but my mistake was misunderstanding how s1 is actually represented (I thought the '\' was '\', not '\\').

2 Answers

>>> s1 = b'\xd0\xb1' 
>>> s1.decode("utf8")
'б'
>>> len(s1)
2
  • Why do you put a b in there, why not r for raw string? Commented Nov 26, 2013 at 6:47
  • @GamesBrainiac because it isn't a raw string - the backslashes are meaningful. The b makes it a byte string. \xd0 is a single byte, with the value 0xD0. You can combine them (making it a raw byte string), but then you trigger the same error as the OP.
    – lvc
    Commented Nov 26, 2013 at 6:51
  • I see. Thanks, I did not know that these were byte-strings. Much appreciated :) Come to the python chatroom sometimes, I'm sure we could all learn a lot from you :) Commented Nov 26, 2013 at 6:53
  • That could be a solution to the problem, but s1 may in principle come from outside code (other sources, the internet, et cetera). The question is not how to convert '\xd0\xb1' with len == 2 to 'б', but how to convert '\\xd0\\xb1' with len == 8 to 'б'. Commented Nov 26, 2013 at 14:07

Try the following code. Warning: it is only a proof of concept. When the text also contains characters that are not written as escape sequences, the replacement must be done in a more complicated way (I can show that later if wanted). See the comments below.

import binascii

s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1))            # list() to emphasize what are the characters

s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))

b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))

s3 = b.decode('utf8')
print('s3 =', ascii(s3))

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)

It prints on the console:

c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'

And it writes the character to the output.txt file.

The problem is that the question combines Unicode escaping with escaping of binary values. In other words, a Unicode string can contain a sequence that somehow represents a binary value; however, you cannot force that binary value into the Unicode string directly, because each Unicode character is an abstract integer (a code point), and that integer can be represented in many ways as a sequence of bytes.
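To make the point about "many ways" concrete, here is a small sketch. The variable names are mine, chosen only for illustration. It shows one character encoded in two different ways, next to the escaped text from the question, which is neither of them:

```python
ch = 'б'                         # one abstract code point, U+0431

utf8 = ch.encode('utf-8')        # two bytes: 0xD0 0xB1
utf16 = ch.encode('utf-16-le')   # two different bytes: 0x31 0x04

escaped = '\\xd0\\xb1'           # the string from the question: 8 ASCII characters

print(len(ch), hex(ord(ch)))     # length 1, code point 0x431
print(list(utf8))                # byte values of the UTF-8 form
print(list(utf16))               # byte values of the UTF-16 (little-endian) form
print(len(escaped), list(escaped))
```

So the escaped string only describes the UTF-8 bytes as text; it has to be turned into real bytes before it can be decoded.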

If the Unicode string contained escape sequences like \\n, it could be done differently, using the 'unicode_escape' codec with bytes.decode(). However, in this case you need two decoding steps: first from the ASCII escape sequences, and then from UTF-8.
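The two decoding steps can be sketched with standard codecs only. The function name decode_escaped_utf8 is hypothetical, and the sketch assumes the input is pure ASCII and that every escape is a \xHH byte escape. The 'unicode_escape' codec also interprets other escapes such as \n or \uXXXX, and a \uXXXX above U+00FF would make the 'latin-1' step fail, so treat this as a sketch rather than a general solution:

```python
def decode_escaped_utf8(s):
    """Turn text like '\\xd0\\xb1' (8 characters) into 'б'.

    Assumption: s is pure ASCII and uses only \\xHH byte escapes.
    """
    raw = s.encode('ascii')                 # str -> bytes; fails on non-ASCII input
    text = raw.decode('unicode_escape')     # '\\xd0' becomes the character U+00D0
    # latin-1 maps code points 0..255 to the same byte values, which
    # recovers the original bytes; those are then decoded as UTF-8.
    return text.encode('latin-1').decode('utf-8')


if __name__ == '__main__':
    print(ascii(decode_escaped_utf8('\\xd0\\xb1')))
```

The 'latin-1' round trip is the trick here: after 'unicode_escape' each escaped byte is a character with the same numeric value, and 'latin-1' is the codec that converts such characters back to single bytes unchanged.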

Update: Here is a function for converting your kind of strings when they also contain other ASCII characters (i.e. not only the escape sequences). The function uses a finite automaton; it may look too complex at first (actually it is only verbose).

def userDecode(s):
    status = 0
    lst = []                       # result as list of bytes as ints
    xx = None                      # variable for one byte escape conversion
    for c in s:                    # unicode character
        print(status, ' c ==', c)  ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1         # escape sequence for a byte starts
            else:
                lst.append(ord(c)) # ASCII character: store its code as one byte value

        elif status == 1:          # x expected
            assert c == 'x'
            status = 2

        elif status == 2:          # first nibble expected
            xx = c
            status = 3

        elif status == 3:          # second nibble expected
            xx += c
            lst.append(int(xx, 16)) # this is a hex representation of int
            status = 0

    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')


if __name__ == '__main__':

    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))    # cannot be displayed on console that does not support unicode

    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)

Also look inside the generated file. With the debug print still in place (remove it for real use), the script displays the following on the console:

c:\__Python\user\so20210201>b.py
0  c == \
1  c == x
2  c == d
3  c == 0
0  c == \
1  c == x
2  c == b
3  c == 1
0  c == w
0  c == h
0  c == a
0  c == t
0  c == e
0  c == v
0  c == e
0  c == r
'\u0431whatever'
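For comparison, the same idea can be written more compactly with a regular expression. This is only a sketch under my own assumptions, and the names _BYTE_ESCAPE and decode_mixed are made up for the example. Unlike the automaton above, it encodes literal (non-escaped) characters as UTF-8 instead of using ord(), and it leaves a backslash that is not followed by xHH unchanged:

```python
import re

# Matches one byte escape written as text, e.g. the four characters \xd0
_BYTE_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')


def decode_mixed(s):
    """Decode text that mixes literal characters with \\xHH byte escapes."""
    out = bytearray()
    pos = 0
    for m in _BYTE_ESCAPE.finditer(s):
        out += s[pos:m.start()].encode('utf-8')   # literal text before the escape
        out.append(int(m.group(1), 16))           # the escaped byte itself
        pos = m.end()
    out += s[pos:].encode('utf-8')                # literal text after the last escape
    return out.decode('utf-8')                    # raises if the bytes are not valid UTF-8


if __name__ == '__main__':
    print(ascii(decode_mixed('\\xd0\\xb1whatever')))
```

Whether literal non-ASCII characters should be treated this way depends on how the server produced the text, which the question does not say.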
  • You are welcome :) Anyway, how did you get the string with the escape sequences?
    – pepr
    Commented Nov 26, 2013 at 14:58
  • There is a Flask server. The message (string) is encrypted with an RSA key on the server side and returned as binary data ... in a string (like s1 in the example). It is fetched using the Requests package on the client side. Bad news: I have no access to the server sources, so I cannot change the format used to send the encrypted message. Update: a few things were missing: 1. The message is encrypted with an RSA key on the server; 2. It is sent to the client as binary data in string format (like s1); 3. It is received on the client and decrypted; 4. The result is something like s1. Commented Nov 26, 2013 at 22:23
  • I see. Anyway, isn't this some "well-known" (not to me) way of escaping the transferred content? If so, there may be a module around for that purpose.
    – pepr
    Commented Nov 27, 2013 at 8:05
