The Unicode Standard states:
> Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
> The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
I have trouble understanding the definition of code unit (the first paragraph above), and I'm not sure whether that's because English is not my native language or because I'm missing something else.
- I couldn't find a definition for "unit of encoded text", but I guess what is meant here is "code point". Is that correct?
- Does "The minimal bit combination that can represent a unit of encoded text for processing" mean "The minimal bit combination for which at last one code point exists which can be represented with this bit combination"?
I'm asking because the phrase "can represent a" in the standard's definition gives me the impression (though maybe I'm misreading it) that any (i.e. all) code points can be represented with this bit combination, which would be at odds with UTF-8's 1-byte code unit size.
For example, take the characters a and 𝼃. 𝼃 in UTF-8 code units would be 0xF0 0x9D 0xBC 0x83 (four code units, representing four bytes), in UTF-16-BE it would be 0xD837 0xDF03 (two 16-bit code units representing four bytes), and in UTF-32-BE 0x0001DF03 (one code unit representing four bytes). a in UTF-8 code units would be 0x61 (a single code unit, representing a single byte), in UTF-16-BE it would be 0x0061 (a single code unit representing two bytes), and in UTF-32-BE the code unit would be 0x00000061 (a single code unit representing four bytes).
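
If it helps to verify these values, here is a minimal sketch in Python (assuming Python 3.8+ for `bytes.hex()` with a separator) that encodes both characters and prints the resulting bytes in each encoding form:

```python
# Print the encoded bytes of "a" (U+0061) and "𝼃" (U+1DF03)
# in the UTF-8, UTF-16-BE, and UTF-32-BE encoding forms.
for ch in ("a", "\U0001DF03"):          # "\U0001DF03" is the 𝼃 character
    print(f"U+{ord(ch):04X}")
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        data = ch.encode(enc)           # serialize to the chosen encoding form
        print(f"  {enc}: {data.hex(' ').upper()} ({len(data)} bytes)")
```

Note that for UTF-16-BE and UTF-32-BE the output shows the big-endian serialization, so each pair (or quadruple) of bytes corresponds to one 16-bit (or 32-bit) code unit.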