
Issue #1: grepping "Flyers: Video Center"... I don't see the result:

In the hexadecimal dump of the file, notice the two bytes C2A0 between the words Flyers: and Video. This is the UTF-8 encoding of a non-breaking space (NBSP, U+00A0). A pattern typed with a regular space does not match an NBSP, so the grep finds nothing. For more information, read "How to remove special 'M-BM-' character with sed" and "use sed to replace ...Hex c2a0". The short answer is:

sed -i.bak -e 's/\xc2\xa0/ /g' /path/to/file
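
The `\x` escapes in the command above depend on the sed implementation. As a rough sketch of the same idea for use in a pipeline, the helper below builds the two NBSP bytes with octal `printf` escapes and passes them to sed literally. The function name `nbsp_to_space` and the sample line are assumptions made up for illustration; they are not taken from the question's file.

```shell
# Hypothetical helper: replace every NBSP (UTF-8 bytes C2 A0) on stdin with a plain space.
nbsp_to_space() {
    # Build the two raw bytes with octal escapes (\302\240 is 0xC2 0xA0).
    nbsp=$(printf '\302\240')
    sed "s/$nbsp/ /g"
}

# A sample line with an NBSP between "Flyers:" and "Video".
sample=$(printf 'Flyers:\302\240Video Center')

# After normalization, a pattern typed with a regular space matches.
printf '%s\n' "$sample" | nbsp_to_space | grep 'Flyers: Video Center'
```

Unlike `sed -i.bak`, this filter leaves the original file untouched, which may be preferable while you are still investigating.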

Issue #2: 'America’s' shows as 'Americaâs' (??):

Here, the dump contains the three bytes E28099, which are the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK (’). Actually, there should be no problem here! You probably got distracted by the problem above (could you confirm?)
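
To check which bytes a string really contains, a small hex-dump filter can help. This is only a sketch: the name `hexbytes` is hypothetical, and it assumes an `od` that accepts `-An -tx1`.

```shell
# Hypothetical helper: print stdin as one continuous string of lowercase hex bytes.
hexbytes() {
    od -An -tx1 | tr -d ' \n'
}

# The word from the question, with U+2019 written as octal escapes (\342\200\231).
printf 'America\342\200\231s' | hexbytes
echo    # end the output with a newline
# prints 416d6572696361e2809973 (note the e28099 before the final 73)
```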

If you use grep, sed, and other tools with expressions that respect your locale (UTF-8!), then it will work:

printf 'America\xe2\x80\x99s\n' | grep --only-matching "[[:punct:]]"
printf 'America\xe2\x80\x99s\n' | sed -e "s/[[:punct:]]/?/"

If you want to get rid of all those UTF-8 "special" characters, you can use the tips above or iconv (but nowadays, there are few excuses not to support UTF-8).

Convert all non-ASCII chars to their closest ASCII equivalents:

iconv -f utf8 -t ASCII//TRANSLIT a.txt

Or, to preserve only the chars that exist in one encoding:

iconv -f utf8 -t iso8859-15//TRANSLIT a.txt | iconv -f iso8859-15 -t utf8
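
Note that //TRANSLIT substitutes an approximation where it can, rather than removing the character. If the goal is strictly to delete every non-ASCII byte, one alternative is a byte-level filter with tr. This is a different technique from the iconv commands above, shown only as a sketch; the function name `strip_non_ascii` is hypothetical.

```shell
# Hypothetical helper: delete every byte outside the 7-bit ASCII range from stdin.
strip_non_ascii() {
    # LC_ALL=C makes tr work on single bytes; -c complements the set, -d deletes.
    LC_ALL=C tr -cd '\000-\177'
}

# The three bytes of U+2019 are removed, leaving "Americas".
printf 'America\342\200\231s\n' | strip_non_ascii
```

This loses information (the apostrophe disappears entirely), so it is mostly useful when the consumer of the text really cannot handle anything beyond ASCII.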

Franklin Piat