
The Unicode Standard states:

Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.

The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.

I have trouble understanding the definition of code unit (first paragraph above) and I'm not sure if it's maybe because English is not my native language or I am missing something else.

  • I couldn't find a definition for "unit of encoded text" but I guess what is meant here is "code point"? Is this true?
  • Does "The minimal bit combination that can represent a unit of encoded text for processing" mean "The minimal bit combination for which at least one code point exists which can be represented with this bit combination"?

I'm asking because the phrasing in the standard ("can represent a") gives me the impression - though maybe I'm misreading it here - that any (i.e. every) code point can be represented with this bit combination, which would be at odds with UTF-8's 1-byte code unit size.

  • Yeah, no, UTF-8 is a variable-length encoding. Simple characters, like these, are represented in 8 bits, but all the special ones have longer encodings. Does that help? Commented Oct 28 at 14:09
  • Thank you for your reply! I'm aware of that, I think my main questions probably are: Is my rephrased definition (second bullet point) correct? And am I the only one who got a bit confused by the original definition? Commented Oct 28 at 14:36
  • UTF-16 needs an intermediate form: first you explain how to get the surrogates, and then how to build the code. The last step is independent of the byte representation (little-endian or big-endian), hence the need for the code unit notation. UTF-16 works in chunks of 16 bits (code units), but some characters (code points) may be represented by 2 code units. Commented Oct 28 at 14:48
  • To illustrate: if we take two extremes, the letters a and 𝼃, then 𝼃 in UTF-8 code units would be 0xF0 0x9D 0xBC 0x83 (four code units representing four bytes), in UTF-16-BE it would be 0xD837 0xDF03 (two 16-bit code units representing four bytes), and in UTF-32-BE 0x0001DF03 (one code unit representing four bytes). Commented Oct 29 at 2:08
  • At the other end of the spectrum, a in UTF-8 code units: 0x61 (a single code unit, representing a single byte), in UTF-16-BE it would be 0x0061 (a single code unit representing two bytes), and in UTF-32-BE the code unit would be 0x00000061 (a single code unit representing four bytes). Commented Oct 29 at 2:11
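For anyone who wants to double-check the byte values quoted in these comments, here is a quick Python sketch (hex_units is just a made-up helper name here; it groups the bytes produced by Python's built-in codecs into code units of the given size):

    def hex_units(data, unit_size):
        """Group the encoded bytes into code units of unit_size bytes each."""
        return " ".join("0x" + data[i:i + unit_size].hex().upper()
                        for i in range(0, len(data), unit_size))

    for ch in ("a", "\U0001DF03"):  # "𝼃" is U+1DF03
        print(f"U+{ord(ch):04X}:",
              "UTF-8:", hex_units(ch.encode("utf-8"), 1),
              "| UTF-16-BE:", hex_units(ch.encode("utf-16-be"), 2),
              "| UTF-32-BE:", hex_units(ch.encode("utf-32-be"), 4))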

2 Answers


All Unicode code points can be represented by all three encoding forms. They just differ in how they do it.

In UTF-32, every code point is represented by a chunk of data that is 4 bytes (or 32 bits) long. No matter the code point, even if it’s the smallest possible value zero, UTF-32 always uses exactly 32 bits to encode it. That is why UTF-32 is said to use 32-bit code units. Nothing smaller than 32 bits can have any meaning in this encoding form – it can never represent a code point – and therefore no chunk of data that is shorter than 32 bits can be interchanged or processed in a useful way when working with UTF-32.

In UTF-16, every code point is represented by a chunk of data that is either 2 bytes (16 bits) or 4 bytes (32 bits) long, depending on which code point it is; smaller code points use fewer bytes. The smallest amount of data that has any meaning in this encoding form is 16 bits long, so UTF-16 is said to use 16-bit code units. If you have 16 bits of UTF-16 data, that would already be enough to represent a whole code point. It’s just not enough to represent any code point, because some of them require 32 bits instead.

And in UTF-8, every code point is represented by a chunk of data that is either 1 byte (8 bits), 2 bytes (16 bits), 3 bytes (24 bits) or 4 bytes (32 bits) long, also depending on how large the code point is. The smallest amount of data that has any meaning in this encoding form is 8 bits long, so UTF-8 is said to use 8-bit code units. 8 bits are enough to fit a whole code point, just not any code point, but only those that are small enough.
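To make this concrete, here is a small Python sketch along those lines (it relies only on Python's built-in utf-8, utf-16-be and utf-32-be codecs; the sample code points are arbitrary picks) that counts how many code units each encoding form needs:

    # Each UTF-8 code unit is 1 byte, each UTF-16 code unit is 2 bytes,
    # and each UTF-32 code unit is 4 bytes.
    for cp in (0x0000, 0x0041, 0x07FF, 0xFFFD, 0x1F600):
        ch = chr(cp)
        print(f"U+{cp:04X}:",
              len(ch.encode("utf-8")), "UTF-8 code unit(s),",
              len(ch.encode("utf-16-be")) // 2, "UTF-16 code unit(s),",
              len(ch.encode("utf-32-be")) // 4, "UTF-32 code unit(s)")

In the output, UTF-32 always reports 1 code unit, UTF-16 reports 1 or 2, and UTF-8 reports anywhere from 1 to 4.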


2 Comments

Translating those bytes to code units: each character is represented in UTF-8 by one to four 8-bit code units, in UTF-16 by one or two 16-bit code units, and in UTF-32 by one 32-bit code unit.
What @CharlotteBuff says is correct. It might also be helpful to think of code units in relation to how one maps between code points and byte sequences, and Unicode's encoding form concept. Each character has a code point (or "Unicode scalar value"), which is an integer value from 0 to 10FFFF. In each of the three encoding forms, the code point is mapped to a machine representation using values of a specific size: 32-bit, 16-bit or 8-bit. Unicode Technical Report 17 may be helpful to read.

I couldn't find a definition for "unit of encoded text" but I guess what is meant here is "code point"? Is this true?

No. A codepoint is a numeric value that Unicode defines for a given unique character, whereas a codeunit is a unit of encoded data for a given codepoint in a given UTF (Unicode Transformation Format):

  • UTF-8 uses 8-bit codeunits. The bits of the codepoint value are encoded (divided up) across 1-4 codeunits, depending on the value (a hand-rolled encoder sketch follows after this list):

    • U+0000..U+007F = 1 CU
    • U+0080..U+07FF = 2 CUs
    • U+0800..U+FFFF = 3 CUs
    • U+010000..U+10FFFF = 4 CUs

    For example:

    Codepoint | Binary of codepoint        | Binary of UTF representation            | Encoded UTF Codeunits
    U+0057    | 0000 0000 0101 0111        | 0101 0111                               | 57
    U+0392    | 0000 0011 1001 0010        | 1100 1110 1001 0010                     | CE 92
    U+C704    | 1100 0111 0000 0100        | 1110 1100 1001 1100 1000 0100           | EC 9C 84
    U+10345   | 0 0001 0000 0011 0100 0101 | 1111 0000 1001 0000 1000 1101 1000 0101 | F0 90 8D 85
  • UTF-16 uses 16-bit codeunits. The bits of the codepoint value are encoded (divided up) across 1 or 2 codeunits, depending on the value:

    • U+0000..U+D7FF = 1 CU
    • U+D800..U+DFFF = reserved/illegal
    • U+E000..U+FFFF = 1 CU
    • U+010000..U+10FFFF = 2 CUs

    For example:

    Codepoint | Binary of codepoint      | Binary of UTF representation            | Encoded UTF Codeunits
    U+0024    | 0000 0000 0010 0100      | 0000 0000 0010 0100                     | 0024
    U+20AC    | 0010 0000 1010 1100      | 0010 0000 1010 1100                     | 20AC
    U+10437   | 0001 0000 0100 0011 0111 | 1101 1000 0000 0001 1101 1100 0011 0111 | D801 DC37
    U+24B62   | 0010 0100 1011 0110 0010 | 1101 1000 0101 0010 1101 1111 0110 0010 | D852 DF62
  • UTF-32 uses 32-bit codeunits. Each codeunit is large enough to hold an entire codepoint value (21 bits max), so there is no need to divide up the bits, regardless of value.
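The bit-splitting rules above can also be written down directly. The sketch below is hand-rolled for illustration only (utf8_code_units and utf16_code_units are made-up names, there is no validation of the reserved surrogate range, and real code should use the built-in codecs); it reproduces the rows of the two tables above:

    def utf8_code_units(cp):
        """Split a codepoint into 1-4 8-bit codeunits (UTF-8 ranges above)."""
        if cp <= 0x7F:                                    # U+0000..U+007F: 1 CU
            return [cp]
        if cp <= 0x7FF:                                   # U+0080..U+07FF: 2 CUs
            return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
        if cp <= 0xFFFF:                                  # U+0800..U+FFFF: 3 CUs
            return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
        return [0xF0 | (cp >> 18),                        # U+10000..U+10FFFF: 4 CUs
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]

    def utf16_code_units(cp):
        """Split a codepoint into 1 or 2 16-bit codeunits (surrogate pair)."""
        if cp <= 0xFFFF:                                  # BMP: 1 CU
            return [cp]
        cp -= 0x10000                                     # 20 bits left to split
        return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]

    # Cross-check against the two tables above.
    for cp in (0x0057, 0x0392, 0xC704, 0x10345, 0x0024, 0x20AC, 0x10437, 0x24B62):
        print(f"U+{cp:04X}:",
              "UTF-8", " ".join(f"{u:02X}" for u in utf8_code_units(cp)),
              "| UTF-16", " ".join(f"{u:04X}" for u in utf16_code_units(cp)))

Running it prints the same "Encoded UTF Codeunits" values as in the tables, e.g. F0 90 8D 85 for U+10345 and D852 DF62 for U+24B62.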

Does "The minimal bit combination that can represent a unit of encoded text for processing" mean "The minimal bit combination for which at last one code point exists which can be represented with this bit combination"?

No. A codeunit is simply the smallest building block of the encoded data itself; the definition is not phrased in terms of which codepoints happen to fit into a single codeunit.

Comments
