Revisions to Unix character set conversion - Unix & Linux Stack Exchange

replaced http://askubuntu.com/ with https://askubuntu.com/

edited Apr 13, 2017 at 12:22

Issue #1: grepping "Flyers: Video Center"... I don't see the result :

In the hexadecimal dump of the file, notice the two bytes C2 A0 between the words Flyers: and Video. This is the UTF-8 encoding of a non-breaking space (NBSP, U+00A0). Grepping for a pattern that contains an ordinary space will not match it. For more information, read How to remove special 'M-BM-' character with sed and use sed to replace ...Hex c2a0. The short answer is:

sed -i.bak -e 's/\xc2\xa0/ /g' /path/to/file
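As a rough illustration of why the search fails and how the replacement helps, here is a small sketch. The sample line is made up, and the NBSP is built with printf octal escapes so the sed call does not depend on the `\x` escape, which is a GNU sed extension.

```shell
# Hypothetical sample line: "Flyers:" and "Video" are separated by a
# no-break space (UTF-8 bytes C2 A0, written here as octal \302\240).
sample=$(printf 'Flyers:\302\240Video Center')
nbsp=$(printf '\302\240')

# A pattern containing an ordinary space does not match the NBSP.
if printf '%s\n' "$sample" | grep -q 'Flyers: Video'; then
    matched_before=yes
else
    matched_before=no
fi

# Replace every NBSP with a plain space, then search again.
fixed=$(printf '%s\n' "$sample" | sed "s/$nbsp/ /g")
if printf '%s\n' "$fixed" | grep -q 'Flyers: Video'; then
    matched_after=yes
else
    matched_after=no
fi

printf 'matched before: %s, after: %s\n' "$matched_before" "$matched_after"
```

Building the byte sequence with printf is only one way to do it; with GNU sed, writing `\xc2\xa0` directly in the expression should behave the same.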

Issue #2 `America’s' shows as 'Americaâs' (??):

Here, the dump contains the three bytes E2 80 99, the UTF-8 encoding of RIGHT SINGLE QUOTATION MARK (’, U+2019). Actually, there should be no problem here! You probably got distracted by the problem above (could you confirm?)
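To check those bytes yourself, something along these lines may help. It assumes od and iconv are available, and the sample string is made up for illustration. It also hints at one plausible origin of the 'â': read as ISO 8859-1, byte E2 is â and the two following bytes are invisible control characters. That is a guess about the display in the question, not a confirmed diagnosis.

```shell
# Hypothetical sample: "America’s" with U+2019 written as octal escapes.
sample=$(printf 'America\342\200\231s')

# Dump the raw bytes as one continuous hex string.
utf8_hex=$(printf '%s' "$sample" | od -An -tx1 | tr -d ' \n')
printf 'UTF-8 bytes:       %s\n' "$utf8_hex"

# Misread the same bytes as ISO 8859-1 and re-encode them as UTF-8.
# E2 becomes "â" (C3 A2); 80 and 99 become C1 control characters.
latin1_hex=$(printf '%s' "$sample" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
printf 'Misread as Latin-1: %s\n' "$latin1_hex"
```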

If you use grep, sed and other tools with expressions that respect your locale (UTF-8!), then it will work:

printf 'America\xe2\x80\x99s\n' | grep --only-matching "[[:punct:]]"
printf 'America\xe2\x80\x99s\n' | sed -e "s/[[:punct:]]/?/"

If you want to get rid of all those UTF-8 "special" characters, you can use the tips above or iconv (but nowadays, there are few excuses not to support UTF-8).

Convert everything to ASCII, transliterating non-ASCII chars where possible:

iconv -f utf8 -t ASCII//TRANSLIT < a.txt

Or, to keep only the chars that exist in one legacy charset (here ISO 8859-15):

iconv -f utf8 -t iso8859-15//TRANSLIT < a.txt | iconv -f iso8859-15 -t utf8
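A quick way to try the ASCII conversion on a small input is sketched below. The sample string is made up, and the exact replacement characters are not shown because `//TRANSLIT` results can differ between iconv implementations and locales; the sketch only checks that nothing outside printable ASCII is left.

```shell
# Hypothetical sample: "America’s" with U+2019 written as octal escapes.
sample=$(printf 'America\342\200\231s')

# Convert to ASCII, letting iconv pick an approximation for U+2019.
ascii=$(printf '%s\n' "$sample" | iconv -f UTF-8 -t ASCII//TRANSLIT)
printf '%s\n' "$ascii"

# Count the bytes that are not printable ASCII (space through tilde).
leftover=$(printf '%s' "$ascii" | LC_ALL=C tr -d ' -~' | wc -c)
printf 'non-ASCII bytes left: %s\n' "$leftover"
```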

replaced http://superuser.com/ with https://superuser.com/

edited Mar 20, 2017 at 10:04

Community Bot

answered Mar 8, 2015 at 20:51

Franklin Piat
