Revisions to How to replace decoded Non-breakable space (nbsp)

Added more information, improved grammar.

Source Link

edited Jun 9, 2021 at 17:07

25.8k
12
88
73

Problem Explanation

The reason it's not working is that you are specifying the non-breaking space incorrectly.

The non-breaking space (U+00A0) is encoded in UTF-8 as 0xC2A0. It consists of two bytes, 0xC2 (194) and 0xA0 (160), so technically you're specifying only half of the character's code.
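To make the two-byte point concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and it assumes the failed attempt searched only for the 0xA0 byte:

```php
<?php
// U+00A0 (non-breaking space) written as its two UTF-8 bytes.
$nbsp = "\xC2\xA0";

// strlen() counts bytes, not characters.
$byteCount = strlen($nbsp);          // 2
$hex = strtoupper(bin2hex($nbsp));   // "C2A0"

// Hypothetical sample text containing one non-breaking space.
$sample = "10\xC2\xA0km";

// Assumption: the failing attempt replaced only the second byte (0xA0).
// That leaves a stray 0xC2 byte behind, so the result is no longer valid UTF-8.
$halfReplaced = str_replace("\xA0", ' ', $sample);

// Replacing both bytes together gives the expected result.
$fullyReplaced = str_replace($nbsp, ' ', $sample);   // "10 km"
```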

A Bit of Theory

Legacy character encodings used a constant number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character and extended ASCII used 8 bits.

UTF-8 is a so-called variable-width character encoding, which means that the number of bits used to represent individual characters varies: in UTF-8, a character code consists of one to four 8-bit bytes (octets). The length depends on the character's code point, so, loosely similar to Huffman coding, characters with low code points (such as ASCII letters and digits, which are also among the most frequently used) have shorter codes while rarer characters generally have longer ones. That helps reduce the data size of the average text.
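The variable width can be seen with a short sketch like the one below. The four sample characters are arbitrary picks, not taken from the question, and the `\u{...}` escape syntax assumes PHP 7.0 or newer:

```php
<?php
// Sample characters written as Unicode code point escapes (PHP 7.0+).
$samples = [
    'A'       => "\u{41}",     // U+0041, plain ASCII
    'e-acute' => "\u{E9}",     // U+00E9
    'euro'    => "\u{20AC}",   // U+20AC
    'emoji'   => "\u{1F600}",  // U+1F600
];

// strlen() counts bytes, so it reveals how wide each UTF-8 code is.
// With a single input array, array_map() preserves the string keys.
$widths = array_map('strlen', $samples);
// $widths is ['A' => 1, 'e-acute' => 2, 'euro' => 3, 'emoji' => 4]
```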

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Notes

Note that with str_replace you have to enclose the search string in double quotes (") because the function doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: double-quoted strings are processed, and escape sequences (e.g. the newline character \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.

In contrast, the preg_replace function understands the textual representation of character codes itself, so you don't need PHP to convert them into actual characters, and you can enclose the search pattern in apostrophes (single quotes, ') in this case.
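The quoting difference can be checked with a small sketch. The sample string is hypothetical, and the last line (the `u` modifier with `\x{00A0}`) is an addition that is not part of the original answer; it is shown only as an alternative when the pattern is treated as UTF-8:

```php
<?php
// Hypothetical sample text containing one non-breaking space.
$original = "10\xC2\xA0km";

// In single quotes PHP keeps the escape sequences as literal text:
// a backslash, 'x', 'c', '2', and so on (8 characters in total).
$literalLength = strlen('\xc2\xa0');   // 8

// str_replace() with single quotes searches for that literal text and finds nothing.
$unchanged = str_replace('\xc2\xa0', ' ', $original);

// Double quotes: PHP converts the escapes into the two bytes before the call.
$replaced = str_replace("\xc2\xa0", ' ', $original);   // "10 km"

// The regex engine interprets \xc2\xa0 itself, so single quotes are fine here.
$viaRegex = preg_replace('/\xc2\xa0/', ' ', $original);   // "10 km"

// Addition: with the u modifier the pattern works on code points, not bytes,
// so the character is written as \x{00A0} instead.
$viaUnicodeRegex = preg_replace('/\x{00A0}/u', ' ', $original);   // "10 km"
```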

Extended description, fixed grammar, organized to sections.

Source Link

edited May 10, 2021 at 9:58

David Ferenczy Rogožan


Problem Explanation

The problem is that you are specifying the non-breaking space incorrectly. The proper code of the non-breaking space in the UTF-8 encoding is 0xC2A0. It consists of two bytes, C2 (194) and A0 (160), so you're specifying only half of the character's code.

The UTF-8 encoding is a so-called variable-width character encoding, which means that the number of bytes used to encode individual characters varies: in UTF-8, character codes consist of one to four 8-bit bytes. In general, loosely similar to Huffman coding, more frequently used characters have shorter codes while rarer characters have longer codes.

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Notes

Note that with str_replace you have to enclose the search string in double quotes (") because the function doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: double-quoted strings are processed, and escape sequences (e.g. the newline character \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.

In contrast, the preg_replace function understands the textual representation of character codes itself, so you don't need PHP to convert them into actual characters, and you can enclose the search pattern in apostrophes (single quotes, ') in this case.

Upper-cased character hex code, added general information about UTF-8 encoding.

Source Link

edited Dec 27, 2019 at 0:38

David Ferenczy Rogožan


The problem is that you are specifying the non-breakable space in a wrong way. The proper code of the non-breakable space in UTF-8 encoding is 0xC2A0, so you're specifying only half of the character's code.

You can replace it using a simple (and fast) str_replace or a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Note that with str_replace you have to enclose the search string in double quotes (") because the function doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: double-quoted strings are processed, and escape sequences (e.g. the newline character \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.

In contrast, the preg_replace function understands the textual representation of character codes itself, so you don't need PHP to convert them into actual characters, and you can enclose the search pattern in apostrophes (single quotes, ') in this case.

Note how the UTF-8 character code is specified as two separate numbers.

The problem is that you are specifying the non-breakable space in a wrong way. The proper code of the non-breakable space in UTF-8 encoding is 0xC2A0. It consists of two bytes, C2 (194) and A0 (160), so you're specifying only half of the character's code.

You can replace it using a simple (and fast) str_replace or a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Note that with str_replace you have to enclose the search string in double quotes (") because the function doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: double-quoted strings are processed, and escape sequences (e.g. the newline character \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.

In contrast, the preg_replace function understands the textual representation of character codes itself, so you don't need PHP to convert them into actual characters, and you can enclose the search pattern in apostrophes (single quotes, ') in this case.

The UTF-8 encoding is a so-called variable-width character encoding, which means that character codes consist of one to four 8-bit bytes. In general, more frequently used characters have shorter codes while more exotic characters have longer codes.

Upper-cased character hex code.

Source Link

edited Dec 27, 2019 at 0:26

David Ferenczy Rogožan


Slightly improved, added link to Wikipedia.

Source Link

edited Dec 27, 2019 at 0:18

David Ferenczy Rogožan


Improved formatting.

Source Link

edited Aug 27, 2018 at 15:33

David Ferenczy Rogožan


Clarify that double quotes are needed

Source Link

edit approved Aug 27, 2018 at 9:05

Martin Smith

4.1k
2
34
59


Improved grammar, added explanation how preg_replace works.

Source Link

edited Jul 11, 2018 at 16:38

David Ferenczy Rogožan


Improved grammar, added explanation how preg_replace works.

Source Link

edited Jul 11, 2018 at 16:28

David Ferenczy Rogožan


Improved grammar.

Source Link

edited Jul 11, 2018 at 16:23

David Ferenczy Rogožan


added 9 characters in body

Source Link

edited Apr 20, 2018 at 10:01

David Ferenczy Rogožan


Fixed grammar.

Source Link

edited Mar 14, 2017 at 13:00

David Ferenczy Rogožan


Added another solution suggested by Simon.

Source Link

edited Nov 21, 2016 at 16:38

David Ferenczy Rogožan


Source Link

answered Nov 21, 2016 at 16:26

David Ferenczy Rogožan
