I'm writing a web crawler using HTMLParser from the html.parser library. I extract some strings from each HTML page and modify them with the function below, which is meant to strip Italian apostrophed prepositions and articles from each word of the input string:
import re

def delApostrophedPrepositions(string):
    # Split the input into tokens; whitespace, "/" and "-" act as separators
    p = re.compile(r'\b[^\s/-]+\b')
    string = p.findall(string)
    apostrophedPrepArt = ["d'", "all'", "dall'", "tr'", "s'", "sull'", "dell'", "nell'", "l'"]
    i = 0
    while i < len(string):
        for ap in apostrophedPrepArt:
            # Drop the leading preposition/article if the token starts with it
            if string[i].startswith(ap):
                string[i] = string[i][len(ap):]
        i = i + 1
    # Rebuild the phrase from the cleaned tokens
    return " ".join(string)
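To make the tokenization step concrete, here is a small sketch of what the regular expression returns for a phrase typed directly in the code. The sample phrase and the variable names are only for illustration, and the apostrophe here is the plain keyboard one:

```python
import re

# Same pattern as in the function: whitespace, "/" and "-" separate tokens
pattern = re.compile(r'\b[^\s/-]+\b')

# Illustrative phrase typed by hand ("\u00e0" is the letter "à")
sample = "Dati aggregati dell'attivit\u00e0 amministrativa"

tokens = pattern.findall(sample)
print(tokens)
```

The apostrophe is not a separator for this pattern, so the preposition stays attached to the word and is only removed later by the startswith() check.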
If I pass the function phrases that I have written directly in the code, it works. However, I have found a strange behaviour that I can neither explain nor fix. I noticed that the phrase "Dati aggregati dell’attività amministrativa" was never modified while parsing the website, so I took the following steps:
- 1) I opened a file called "apostroph.txt".
- 2) I wrote "Dati aggregati dell’attività amministrativa" in it.
- 3) I called my function with the phrase from step 2 as input, then wrote the result to another file.
- 4) I copied (Ctrl+C) the same phrase quoted in step 2 from the following page:
view-source:http://www.regione.emilia-romagna.it/trasparenza/attivita-e-procedimenti and pasted it (Ctrl+V) into a new file. Then I called my function with that phrase as input.
Finally, I noticed that the result of step 3 was correct: "Dati aggregati attività amministrativa", but the result of step 4 was incorrect: "Dati aggregati dell'attività amministrativa".
Note that convert_charrefs is set to True in the HTMLParser.
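One way to check whether the typed and the pasted phrases really contain the same characters would be to print the code point of every non-ASCII character in each. The sketch below is only a diagnostic: the helper name describe_non_ascii and the two sample strings are assumptions, and the second string uses U+2019 purely as a guess at what the copied page might contain.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# Assumed inputs: the first uses the keyboard apostrophe (U+0027),
# the second a typographic apostrophe (U+2019), as a guess for the pasted text
typed = "dell'attivit\u00e0"
pasted = "dell\u2019attivit\u00e0"

print(describe_non_ascii(typed))
print(describe_non_ascii(pasted))

# The prefix check used in the function only matches the keyboard apostrophe
print(typed.startswith("dell'"), pasted.startswith("dell'"))
```

If the pasted text does show a different apostrophe character, that would explain why startswith() never matches for the strings coming from the website.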