How to Handle Both Encoding and Decoding with Custom Error Handler in Python?

Question

I'm working on a scenario where I need to handle encoding and decoding of strings using custom error handling in Python. Specifically, I want to create an error handler that can deal with both encoding and decoding exceptions.

The encoding process:

I have implemented a custom error handler utf8_hex_replace that converts characters which can't be encoded in a specified encoding to their UTF-8 hex representation. This works fine when encoding a string like 'Hello, 世界' into an ASCII byte string, and the non-ASCII characters are replaced by their corresponding UTF-8 hex values.

Here's the code I currently have:

import codecs

def utf8_hex_replace(exception):
    if isinstance(exception, UnicodeEncodeError):
        replacement = f"[{exception.object[exception.start].encode('UTF-8').hex(' ').upper()}]"
        next_pos = exception.start + 1
        return replacement, next_pos
    else:
        raise exception  # Re-raise if it's another type of error

codecs.register_error('utf8hexreplace', utf8_hex_replace)

text = 'Hello, 世界'

encoded_utf8hexreplace = text.encode('ASCII', errors='utf8hexreplace')
print(encoded_utf8hexreplace)

This results in:

b'Hello, [E4 B8 96][E7 95 8C]'

The question:

Can I extend this utf8_hex_replace function to handle decoding as well? I would like to decode the byte string b'Hello, [E4 B8 96][E7 95 8C]' back into the original string 'Hello, 世界'. Specifically, I want to add an elif branch in utf8_hex_replace that can catch UnicodeDecodeError and decode the hex-encoded values back to their original characters.

Is this possible with the current approach, or is there a better way to handle both encoding and decoding in this case?

Any guidance or examples would be greatly appreciated!

What's wrong on backslashreplace error handler? Try 'Hello, 世界'.encode( 'ASCII', 'backslashreplace').decode( 'unicode_escape') — JosefZ, Commented Nov 26, 2024 at 16:43

Dawg · Accepted Answer · 2024-11-26 10:35:44Z

Yes you can just handle the UnicodeDecodeError and interpret the encoded hex values correctly during the decoding process

import codecs
import re

def utf8_hex_replace(exception):
    if isinstance(exception, UnicodeEncodeError):
        replacement = f"[{exception.object[exception.start].encode('UTF-8').hex(' ').upper()}]"
        next_pos = exception.start + 1
        return replacement, next_pos
    elif isinstance(exception, UnicodeDecodeError):
        # Extract the problematic part of the input
        problematic_bytes = exception.object[exception.start:exception.end]
        hex_pattern = re.match(rb'\[([0-9A-F ]+)\]', problematic_bytes)
        if hex_pattern:
            hex_bytes = bytes.fromhex(hex_pattern.group(1).decode())
            replacement = hex_bytes.decode('UTF-8')
            next_pos = exception.start + len(hex_pattern.group(0))
            return replacement, next_pos
        else:
            raise exception
    else:
        raise exception

codecs.register_error('utf8hexreplace', utf8_hex_replace)

# Test encoding
text = 'Hello, 世界'
encoded_utf8hexreplace = text.encode('ASCII', errors='utf8hexreplace')
print(encoded_utf8hexreplace)  # b'Hello, [E4 B8 96][E7 95 8C]'

# Test decoding
# Assuming the encoded string uses the same [XX XX ...] format for non-ASCII characters
decoded_utf8hexreplace = encoded_utf8hexreplace.decode('ASCII', errors='utf8hexreplace')
print(decoded_utf8hexreplace)  # 'Hello, 世界'

'encoded_utf8hexreplace.decode('ASCII')' returns Hello, [E4 B8 96][E7 95 8C] not raising any error… — JosefZ, Commented Nov 26, 2024 at 15:33

LeapingQuantum · Accepted Answer · 2024-11-26 16:09:04Z

Thank you for the response. Although the solution provided didn't directly address my issue, it did inspire me to come up with a solution that works. Here's how I handled both encoding and decoding with a custom error handler:

The key idea is to encode non-ASCII characters as their UTF-8 hex representation wrapped in square brackets during the encoding process. After encoding, I used a regular expression to match and decode the hex-encoded values back to their original characters during the decoding process.

Here’s the updated solution:

import codecs
import re


def utf8_hex_replace(exception):
    if isinstance(exception, UnicodeEncodeError):
        replacement = f"[{exception.object[exception.start].encode('UTF-8').hex(' ').upper()}]"
        next_pos = exception.start + 1
        return replacement, next_pos
    else:
        raise exception


def decode_utf8hexreplace(encoded_text):
    hex_pattern = re.compile(r'\[([0-9A-F ]+)]')
    return hex_pattern.sub(
        lambda match: bytes.fromhex(match.group(1)).decode('UTF-8'),
        encoded_text.decode('ASCII')
    )


codecs.register_error('utf8hexreplace', utf8_hex_replace)

if __name__ == '__main__':
    text = 'Hello, 世界'
    encoded = text.encode('ASCII', errors='utf8hexreplace')
    decoded = decode_utf8hexreplace(encoded)

    print(f'{encoded = }')
    print(f'{decoded = }')

With this approach, you can encode a string such as 'Hello, 世界' into an ASCII byte string with non-ASCII characters represented by their UTF-8 hex values, and then decode it back into the original string by replacing the hex-encoded values with the corresponding characters.

Collectives™ on Stack Overflow

How to Handle Both Encoding and Decoding with Custom Error Handler in Python?

The encoding process:

The question:

2 Answers 2

Hot Network Questions

Collectives™ on Stack Overflow

The encoding process:

The question:

2 Answers 2

Related