I have the text "confrères" in a text file that I believe is encoded as "ISO-8859-2". I want to re-encode this value as UTF-8 in Python.

I used the following code in Python 2.7 to convert it, but the converted value ["confrčres"] differs from the original value ["confrères"].

# -*- coding: utf-8 -*-

import chardet
import codecs

a1=codecs.open('.../test.txt', 'r')

a=a1.read()

b = a.decode(chardet.detect(a)['encoding']).encode('utf8')

a1=codecs.open('.../test_out.txt', 'w').write(b)

Any idea how to get the actual value, encoded as UTF-8, into the output file?

Thanks

    Note that ISO-8859-2 does not have any codepoint for è. You cannot encode that character with that codec. Commented Aug 14, 2015 at 12:39

1 Answer

If you know the codec used, don't use chardet. Character detection is never foolproof, and the library guessed wrong for your file.

Note that ISO-8859-2 is the wrong codec, as that codec cannot even encode the letter è. You have ISO-8859-1 (Latin-1) or Windows codepage 1252 data instead; è is encoded as 0xE8 in both 8859-1 and cp1252, and 0xE8 in 8859-2 is č:

>>> print u'confrčres'.encode('iso-8859-2').decode('iso-8859-1')
confrères

Was 8859-2 perhaps the guess chardet made?
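To make the byte-level argument concrete, here is a minimal sketch that decodes the same bytes with both codecs. The variable names are purely illustrative, and the byte string is assumed to match what is in your file:

```python
# -*- coding: utf-8 -*-
# The file content as raw bytes; 0xE8 is the byte in question.
raw = b'confr\xe8res'

# Decoding with the right codec yields U+00E8 (e with grave accent).
as_latin1 = raw.decode('iso-8859-1')

# Decoding with the wrong codec yields U+010D (c with caron) instead.
as_latin2 = raw.decode('iso-8859-2')

# Once decoded correctly, encoding to UTF-8 gives the two-byte sequence C3 A8.
utf8_bytes = as_latin1.encode('utf-8')

# ISO-8859-2 has no code point for U+00E8, so encoding it fails.
try:
    u'\xe8'.encode('iso-8859-2')
    latin2_has_e_grave = True
except UnicodeEncodeError:
    latin2_has_e_grave = False
```

The same byte gives two different characters depending on the codec, which is why a wrong guess produces "confrčres" without raising any error.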

You can use the io library to handle decoding and encoding on the fly; it is the same codebase that handles all I/O in Python 3 and has fewer issues than codecs:

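from io import open  # Python 2.7: the built-in open() has no encoding argument
from shutil import copyfileobj

# Decode as Latin-1 while reading, encode as UTF-8 while writing.
with open('test.txt', 'r', encoding='iso-8859-1') as inf:
    with open('test_out.txt', 'w', encoding='utf8') as outf:
        copyfileobj(inf, outf)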

I used shutil.copyfileobj() to handle the copying across of data.

2 Comments

Thanks Martjin, it worked perfectly. I have around 300 files in a folder with different encodings (for example TIS-620, ASCII, ISO-8859-1, EUC-JP). Can you help me find the existing encoding of each file and use it dynamically in the above code?
@annamalaimuthuraman: if you have many files, then chardet may be an option, just take into account that it'll get at least some of the guesses wrong.
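
For the many-files case, something along these lines might work as a starting point. It is only a sketch: `convert_to_utf8` and `min_confidence` are hypothetical names, the 0.8 threshold is an arbitrary assumption, and the detector is passed in as a parameter (for example `chardet.detect`, which returns a dict with `'encoding'` and `'confidence'` keys) so that low-confidence guesses can be skipped and reviewed by hand:

```python
import io


def convert_to_utf8(src_path, dst_path, detect, min_confidence=0.8):
    """Re-encode src_path as UTF-8 into dst_path using a guessed codec.

    `detect` is any callable that takes raw bytes and returns a dict with
    'encoding' and 'confidence' keys (chardet.detect has this shape).
    Returns the codec name used, or None if the guess was rejected.
    """
    with io.open(src_path, 'rb') as f:
        raw = f.read()

    guess = detect(raw)
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0

    # Skip files where the detector is unsure; these need manual review.
    if encoding is None or confidence < min_confidence:
        return None

    # A wrong guess can still decode without error, so spot-check the output.
    text = raw.decode(encoding)

    # newline='' writes line endings through unchanged.
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)
    return encoding
```

You would call it once per file, e.g. `convert_to_utf8('test.txt', 'test_out.txt', chardet.detect)`, and collect the files for which it returns None. Keep in mind that a high confidence value does not guarantee the guess is right.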
