Issue #1: grepping for "Flyers: Video Center" returns no result:
In the hexadecimal dump of the file, notice the two bytes C2A0 between the words Flyers: and Video. This is the UTF-8 encoding of a non-breaking space (NBSP). A pattern typed with an ordinary space does not match it, which is why grep finds nothing. For more information, read "How to remove special 'M-BM-' character with sed" and "use sed to replace ...Hex c2a0". The short answer is:
sed -i.bak -e 's/\xc2\xa0/ /g' /path/to/file
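To see the effect without touching a real file, here is a minimal sketch that builds a sample line in a variable. The variable names (`nbsp`, `line`, `fixed`) are only illustrative, and the NBSP is written with octal escapes (`\302\240`) because not every `printf` or `sed` understands `\x..`.

```shell
# NBSP is the byte pair C2 A0, i.e. octal 302 240.
nbsp=$(printf '\302\240')

# Sample line with a NBSP between "Flyers:" and "Video".
line=$(printf 'Flyers:\302\240Video Center')

# A pattern typed with an ordinary space does not match the NBSP.
# (grep -c prints 0 and exits non-zero, hence the "|| true".)
plain_hits=$(printf '%s\n' "$line" | grep -c 'Flyers: Video Center' || true)

# Matching the NBSP bytes explicitly does find the line.
nbsp_hits=$(printf '%s\n' "$line" | grep -c "Flyers:${nbsp}Video Center" || true)

# Replace every NBSP with a regular space, then search again.
fixed=$(printf '%s\n' "$line" | sed "s/${nbsp}/ /g")
fixed_hits=$(printf '%s\n' "$fixed" | grep -c 'Flyers: Video Center' || true)

echo "plain=$plain_hits nbsp=$nbsp_hits fixed=$fixed_hits"
```

This should print `plain=0 nbsp=1 fixed=1`: the search only succeeds once the NBSP is either matched explicitly or replaced.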
Issue #2: 'America’s' shows as 'Americaâs' (??):
Here, the dump contains the three bytes e28099, the UTF-8 encoding of RIGHT SINGLE QUOTATION MARK (’, U+2019). Actually, there should be no problem here! You probably got distracted by the problem above (could you confirm?).
If you use grep, sed and other tools with expressions that respect your locale (UTF-8!), then it will work:
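If you want to check those bytes yourself, a small sketch along these lines prints the word as one hex string. The helper name `hex_bytes` is made up for this example, and the quotation mark is written in octal (`\342\200\231`) for portability.

```shell
# Print stdin as one continuous lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# "America’s" with the quote given as the bytes E2 80 99.
word_hex=$(printf 'America\342\200\231s' | hex_bytes)
echo "$word_hex"
```

The output should be `416d6572696361e2809973`, with `e28099` sitting between the `a` (61) and the `s` (73).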
printf 'America\xe2\x80\x99s\n' | grep --only-matching "[[:punct:]]"
printf 'America\xe2\x80\x99s\n' | sed -e "s/[[:punct:]]/?/"
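For contrast, here is a sketch of what happens when the tool does not treat the input as UTF-8. It forces the C locale for a single command, so sed works byte by byte; the variable name `bytewise` is just for illustration.

```shell
# In the C locale every byte is its own character, so the three bytes
# of the quotation mark (E2 80 99) are each replaced separately.
bytewise=$(printf 'America\342\200\231s\n' | LC_ALL=C sed 's/[^ -~]/?/g')
echo "$bytewise"
```

This should print `America???s`: three replacements instead of one. In a UTF-8 locale the same three bytes are read as a single character, which is why the locale-aware commands above behave as expected.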
If you want to get rid of all those UTF-8 "special" characters, you can use the tips above or iconv (but nowadays, there are few excuses not to support UTF-8).
Replace all non-ASCII chars with an ASCII approximation:
cat a.txt | iconv -f utf8 -t ASCII//TRANSLIT
Or to preserve chars from one locale:
cat a.txt | iconv -f utf8 -t iso8859-15//TRANSLIT | iconv -f iso8859-15 -t utf8
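As a rough illustration of that round trip, the sketch below sends a word whose accented letter exists in ISO-8859-15 through both conversions and compares the bytes. It leaves out `//TRANSLIT` because what transliteration produces differs between iconv implementations; the helper `hex_bytes` and the sample word are assumptions made for this example.

```shell
# Print stdin as one continuous lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# "café" in UTF-8: the é is the two bytes C3 A9 (octal 303 251).
utf8_hex=$(printf 'caf\303\251' | hex_bytes)

# In ISO-8859-15 the same é is the single byte E9.
latin_hex=$(printf 'caf\303\251' | iconv -f UTF-8 -t ISO-8859-15 | hex_bytes)

# Converting back to UTF-8 restores the original bytes.
round_hex=$(printf 'caf\303\251' | iconv -f UTF-8 -t ISO-8859-15 | iconv -f ISO-8859-15 -t UTF-8 | hex_bytes)

echo "utf8=$utf8_hex latin=$latin_hex round=$round_hex"
```

Characters that exist in the intermediate charset survive unchanged; anything else is where `//TRANSLIT` steps in with an approximation.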