Python unicode decoding encoding

Question

Here is my full code. It works fine with ASCII, but when "unicode" characters come into the picture... I hate my life...

I know this is not English, but let me explain:

I have got 2 input files (realmek, nevek), and 1 result file (osszes).

I have got a working page in (html).

Like I said, with ANSI characters this works.

BUT when I try to use unusual characters such as "űáéđĐ", I need to save the 2 input files and the 1 output file as Unicode. But then my program throws an "encoding decoding" error. And I know that is normal.

So my question is: How can I solve this? Where do I need to handle the decoding and encoding?

I have been thinking about this for 3 days... I tried many ways of decoding, like "u = unicode( s, "utf-8" )", $ export LANG=en_US.UTF-8, etc., but none of them worked.

from urllib import urlopen
import re

faj = "hiba"
cast = "hiba"
pont = 0
szint = 0

fj = open("C:\Users\Rendszergazda\Desktop\Achievements\Realmek.txt", "r")
tombr = fj.readline()
realmek = tombr.split(" ")
fj.close()

fh = open("C:\Users\Rendszergazda\Desktop\Achievements\Nevek.txt", "r")
tomb = fh.readline()
nevek = tomb.split(" ")
fh.close()

osszes = open("C:\Users\Rendszergazda\Desktop\Achievements\Osszes.txt", "a")

for x in realmek:
    realm = x
    for y in nevek:
        nev = y
        lap = urlopen("http://eu.battle.net/wow/en/character/"+str(realm)+"/"+str(nev)+"/achievement").read()
        letezik = re.compile('<div id="server-erro(.*)">')
        letez = re.findall(letezik,lap)
        if (letez != []):   
            a = 0    
        else:

            lapn = lap.split("\n")      
            mapo = lapn[1087]
            pontos = re.compile('\t\t\t\t\t(.*)\r')
            pont = re.findall(pontos,mapo)

            mapom = lapn[1322]
            feastn = re.compile('<div class="bar-contents">\t\t\t\t\t\t\t\t\t\t\t\t(.*)\r')
            feast = re.findall(feastn,mapom)

            fajkeres = re.compile('</strong></span> <a href="/wow/en/game/race/(.*)" class="race">')
            castkeres = re.compile('</a> <a href="/wow/en/game/class/(.*)" class="class">')
            szintkeres = re.compile('<span class="level"><strong>(.*)</strong></span> <a href="/wow/en/game/')

            faj  = re.findall(fajkeres,lap)
            cast = re.findall(castkeres,lap)
            szint = re.findall(szintkeres,lap)        
            link = "http://eu.battle.net/wow/en/character/"+str(realm)+"/"+str(nev)+"/advanced"

            ccast = cast [0]
            ffaj = faj [0]
            sszint = szint [0]
            ppont = pont [0]
            ffeast = feast [0]

            osszes.write(str(nev)+" "+str(realm)+" "+str(ppont)+" "+str(ffeast)+" "+str(ffaj)+" "+str(ccast)+" "+str(sszint)+" "+str(link)+"\n")      

osszes.close()

Define "it doesn't work". Do you get unexpected output, an exception, a crash, ...? Please edit the error details into your question.
– Fred Foo, Commented Mar 26, 2012 at 13:32
It doesn't give any error, it just gives a stack overflow, because with Unicode characters the data is no longer in the same spot. For example, mapo = lapn[1087] gives '16' with ASCII input, but with Unicode source input it gives unicode strings, and I don't know where I need to decode before I use .write. So the problem is not the error, because I know what the error is. The problem is the encoding: I don't know where I need to encode the inputs first.
– user1292883, Commented Mar 26, 2012 at 13:42
I had some problems like these and resolved them. Encoding is, in a way, very problematic. Take a look at my question about it and try to extract suggestions from it. Good luck!
– DonCallisto, Commented Mar 26, 2012 at 13:54
Your code isn't a Short, Self Contained, Correct Example, so it's hard for us to help.
– agf, Commented Mar 26, 2012 at 13:56

alexis · Accepted Answer · 2012-03-26 15:31:46Z

2

Instead of plain open, use codecs.open to read and write your files. It takes an optional argument that specifies which encoding to use. First make sure that you can read, print and write non-ASCII text correctly (it will be seen as unicode inside your script), and afterwards check whether you're using any regexps that need adjustment.
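As a rough sketch of what that could look like (not the asker's actual files): the paths below are hypothetical stand-ins for Nevek.txt and Osszes.txt created in a temporary directory, the sample names are made up, and "utf-8" is an assumption. The encoding name has to match how the files were really saved; a file saved as "Unicode" by some Windows editors may need "utf-16" instead.

```python
import codecs
import os
import tempfile

# Hypothetical stand-ins for Nevek.txt and Osszes.txt, placed in a
# temporary directory so the sketch is self-contained.
workdir = tempfile.mkdtemp()
names_path = os.path.join(workdir, "Nevek.txt")
result_path = os.path.join(workdir, "Osszes.txt")

# Create a sample input file with two made-up non-ASCII names.
# Escapes are used so the sketch does not depend on the source encoding.
with codecs.open(names_path, "w", encoding="utf-8") as f:
    f.write(u"\u0170r\u00e1k \u0110\u00e9nes\n")

# Reading through codecs.open decodes the bytes into unicode text.
# split() without arguments also drops the trailing newline.
with codecs.open(names_path, "r", encoding="utf-8") as fh:
    nevek = fh.readline().split()

# Writing through codecs.open encodes unicode text back to bytes,
# so no str() conversion is needed (or wanted) before write().
with codecs.open(result_path, "a", encoding="utf-8") as osszes:
    for nev in nevek:
        osszes.write(nev + u"\n")
```

The point of the design is that text is decoded once on the way in and encoded once on the way out, and stays unicode everywhere in between. Wrapping a unicode value in str() under Python 2 would try to encode it as ASCII, which is one place the question's write call could fail.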

Also, if you're using any non-ASCII characters in your Python source, declare your script's encoding by adding something like this as the first or second line:

# -*- coding: utf-8 -*-
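A minimal illustration of that declaration in use, assuming the script file itself is saved as UTF-8 (the variable names are only for illustration). Under Python 2 the default source encoding is ASCII, so the literal below would be rejected without the declaration; Python 3 already defaults to UTF-8.

```python
# -*- coding: utf-8 -*-
# The declaration must sit on the first or second line of the file,
# and the file must really be saved in the declared encoding.

# A unicode literal holding the characters from the question.
sample = u"űáéđĐ"

# Encoding turns text into bytes; decoding turns the bytes back into text.
encoded = sample.encode("utf-8")
decoded = encoded.decode("utf-8")

print(repr(encoded))
```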

edited Mar 26, 2012 at 15:31

answered Mar 26, 2012 at 13:59
