
I'm making a web crawler using HTMLParser from the html.parser library. I'm extracting some strings from each HTML page and modifying them with the function below, which is meant to delete apostrophized Italian prepositions and articles from each word of the input string:

import re

def delApostrophedPrepositions(string):
    # split the input into words (runs of characters that are not whitespace, '/' or '-')
    p = re.compile(r'\b[^\s/-]+\b')
    words = p.findall(string)
    apostrophedPrepArt = ["d'", "all'", "dall'", "tr'", "s'", "sull'", "dell'", "nell'", "l'"]
    for i in range(len(words)):
        # strip any apostrophized preposition/article from the start of the word
        for ap in apostrophedPrepArt:
            if words[i].startswith(ap):
                words[i] = words[i][len(ap):]
    return " ".join(words)

If I pass the function phrases that I've written directly in the code, it works, but I've detected a weird behaviour that I can neither explain nor resolve. I noticed that the phrase "Dati aggregati dell’attività amministrativa" was never modified while parsing the website, so I took the following steps:

  • 1) I opened a file called "apostroph.txt".
  • 2) I wrote "Dati aggregati dell’attività amministrativa" in it.
  • 3) I called my function with the phrase from step 2 as input, then wrote the result to another file.
  • 4) I copied (Ctrl+C) the same quoted phrase from view-source:http://www.regione.emilia-romagna.it/trasparenza/attivita-e-procedimenti and pasted it (Ctrl+V) into a new file. Then I called my function with that phrase as input.

Finally, I noticed that the result of step 3 was correctly "Dati aggregati attività amministrativa", but the result of step 4 was incorrectly "Dati aggregati dell’attività amministrativa".

I should specify that convert_charrefs is set to True in the HTMLParser.

1 Answer


The apostrophe in the webpage is not what you are expecting it to be:

>>> phrase = 'Dati aggregati dell’attività amministrativa'
>>> phrase[19]
'’'
>>> print(ascii(phrase[19]))
'\u2019'

That's a U+2019 RIGHT SINGLE QUOTATION MARK codepoint, not the U+0027 APOSTROPHE codepoint your code looks for.

You'll need to either normalise your inputs to use one character or expand your matching to take the many different Unicode alternatives into account.
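
For the second option, a minimal sketch (assuming the curly U+2019 is the only extra variant you need to cover; real pages may use others, such as U+02BC):

>>> import re
>>> # same prefixes as in the question, but accept either apostrophe
>>> prefixes = re.compile(r"\b(?:all|dall|dell|nell|sull|tr|s|d|l)['’]", re.IGNORECASE)
>>> prefixes.sub('', 'Dati aggregati dell’attività amministrativa')
'Dati aggregati attività amministrativa'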

In this case Unidecode could help:

>>> from unidecode import unidecode
>>> unidecode(phrase)
"Dati aggregati dell'attivita amministrativa"

but take into account that à has been replaced by a now too.

Another approach would be to use str.translate() to map such characters; you'd then have to produce your own table first:

>>> apostrophes = dict.fromkeys(
...     (0x2013, 0x2018, 0x2019, 0x201b, 0x2035, 0x275b, 0x275c),
...     "'")
>>> phrase.translate(apostrophes)
"Dati aggregati dell'attività amministrativa"

6 Comments

How can I normalise my input?
@UtenteStack: you could use str.translate() to map many 'foreign' apostrophes to ' as well.
@Pieters: thank you. But I think I could spend a few weeks if I wanted to map every character to all the characters it can be confused with. I have to find specific phrases in each website, so I think I'll use unidecode to pre-emptively convert both the phrase I'm looking for and the phrase found to ASCII, and finally compare them, in order to tell whether I've found the right phrase. What do you think about my choice?
That may be a better approach, yes; if you are merely collecting stats on phrases, unidecode is probably the right tool.
Not exactly stats... I have to find specific phrases in order to find specific links that are required by law on Italian government websites. I have to say whether those links are present and whether they work.
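
A minimal sketch of the comparison strategy discussed in these comments (same_phrase is a hypothetical helper name):

>>> from unidecode import unidecode
>>> def same_phrase(found, target):
...     # fold both sides to ASCII before comparing
...     return unidecode(found) == unidecode(target)
...
>>> same_phrase('Dati aggregati dell’attività amministrativa',
...             "Dati aggregati dell'attivita amministrativa")
True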
