Quick simple one.
A Unicode character is... well, for the purposes of this enumeration, it is an integer from 0 to 1,114,111 inclusive, which we express in hexadecimal as U+0000 through U+10FFFF. The interpretations and interactions of these characters are extremely complex, but we can actually ignore these here. In particular, I'm going to ignore issues relating to surrogates.
There are several ways to encode Unicode characters as bytes (octets). Three such encodings are UTF-8, UTF-16 and UTF-32. Let's start with the most interesting and popular of these:
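In UTF-8:

- the 128 characters U+0000 to U+007F are encoded as 1 byte each
- the 1,920 characters U+0080 to U+07FF are encoded as 2 bytes each
- the 63,488 characters U+0800 to U+FFFF are encoded as 3 bytes each
- the 1,048,576 characters U+10000 to U+10FFFF are encoded as 4 bytes each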
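This allows us to create a fairly simple Fibonacci-esque recurrence relation for the number a(N) of possible UTF-8 byte sequences of length N bytes, taking a(0) = 1 and a(N) = 0 for negative N:

$$a(N) = 128\,a(N-1) + 1920\,a(N-2) + 63488\,a(N-3) + 1048576\,a(N-4)$$

Or...

$$\begin{pmatrix} a(N) \\ a(N-1) \\ a(N-2) \\ a(N-3) \end{pmatrix} = \begin{pmatrix} 128 & 1920 & 63488 & 1048576 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} a(N-1) \\ a(N-2) \\ a(N-3) \\ a(N-4) \end{pmatrix}$$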
| N | Number of valid UTF-8 byte sequences of length N |
|---|---|
| 0 | 1 |
| 1 | 128 |
| 2 | 18,304 |
| 3 | 2,652,160 |
| 4 | 383,795,200 |
| 5 | 55,514,234,880 |
| 6 | 8,030,282,317,824 |
| 7 | 1,161,610,848,632,832 |
| 8 | 168,031,256,854,921,216 |
| 9 | 24,306,331,164,952,494,080 |
| 10 | 3,515,999,113,145,063,833,600 |
A table of values up to N = 100 is here.
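The table can be reproduced with a few lines of Python. This is only a sketch: it assumes the standard UTF-8 length ranges (128 one-byte, 1,920 two-byte, 63,488 three-byte and 1,048,576 four-byte characters, with surrogates counted as ordinary three-byte characters), and the names `UTF8_COUNTS` and `utf8_sequences` are made up for illustration.

```python
# Number of characters whose UTF-8 encoding is 1, 2, 3 and 4 bytes long.
# Assumption: surrogates U+D800..U+DFFF count as ordinary 3-byte characters.
UTF8_COUNTS = {1: 128, 2: 1920, 3: 63488, 4: 1048576}


def utf8_sequences(max_len):
    """Return [a(0), ..., a(max_len)]: valid UTF-8 sequences per byte length."""
    a = [1]  # the empty sequence is the only sequence of length 0
    for n in range(1, max_len + 1):
        # A sequence of n bytes ends in a character of some width w <= n,
        # preceded by any valid sequence of n - w bytes.
        a.append(sum(count * a[n - width]
                     for width, count in UTF8_COUNTS.items()
                     if width <= n))
    return a


for n, value in enumerate(utf8_sequences(10)):
    print(f"{n:>2}  {value:,}")
```

Python integers are arbitrary precision, so the same function should carry on well past N = 10 without overflow.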
Each entry in this sequence is approximately 144.654 times the previous one. This is the sole positive eigenvalue of the 4 × 4 matrix above - a closed form expression for this value does exist but it's pretty ghastly.
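If you'd rather not touch the closed form, the growth factor can be approximated numerically. The sketch below assumes the recurrence coefficients are 128, 1,920, 63,488 and 1,048,576 (the numbers of characters that UTF-8 encodes as 1, 2, 3 and 4 bytes), so that the growth factor is the positive root of x⁴ − 128x³ − 1920x² − 63488x − 1048576. The function names and the bisection bracket [128, 256] are my own choices, not anything from the original derivation.

```python
def char_poly(x):
    """Characteristic polynomial of the assumed UTF-8 recurrence."""
    return x**4 - 128 * x**3 - 1920 * x**2 - 63488 * x - 1048576


def growth_factor(lo=128.0, hi=256.0, iterations=100):
    """Bisect for the root of char_poly between lo and hi.

    char_poly is negative at 128 and positive at 256, so the
    single positive root lies somewhere in between.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if char_poly(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(growth_factor(), 3))
```

Bisection is slow but hard to get wrong, and 100 halvings is far more than double precision needs.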
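UTF-16 is a lot simpler (and rather less popular).

In UTF-16:

- the 65,536 characters U+0000 to U+FFFF are encoded as 2 bytes each
- the 1,048,576 characters U+10000 to U+10FFFF are encoded as 4 bytes each

The recurrence relation is:

$$a(N) = 65536\,a(N-2) + 1048576\,a(N-4)$$

Or...

$$\begin{pmatrix} a(N) \\ a(N-2) \end{pmatrix} = \begin{pmatrix} 65536 & 1048576 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a(N-2) \\ a(N-4) \end{pmatrix}$$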
Or, as we can ignore the sequences with an odd number of bytes...
| N | Number of valid UTF-16 byte sequences of length 2N |
|---|---|
| 0 | 1 |
| 1 | 65,536 |
| 2 | 4,296,015,872 |
| 3 | 281,612,415,664,128 |
| 4 | 18,460,255,972,103,290,880 |
| 5 | 1,210,106,627,408,128,699,793,408 |
| 6 | 79,324,904,915,185,326,649,998,573,568 |
| 7 | 5,199,905,857,288,526,673,293,821,089,939,456 |
| 8 | 340,864,208,454,757,229,430,061,207,854,549,827,584 |
| 9 | 22,344,329,261,775,181,962,073,467,059,698,981,855,559,680 |
| 10 | 1,464,715,384,527,942,980,583,053,593,085,519,767,325,967,908,864 |
A table of values up to N = 100 is here.
Each non-zero entry in this sequence is approximately 65551.996 times the previous one. That number is the positive eigenvalue of the 2 × 2 matrix, which is exactly 1024 (32 + 5√41).
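The closed form is easy to sanity-check against the table. A minimal sketch, assuming the per-two-bytes recurrence b(N) = 65536·b(N−1) + 1048576·b(N−2), where b(N) counts sequences of 2N bytes; `utf16_sequences` is a name invented here.

```python
import math


def utf16_sequences(max_units):
    """Return [b(0), ..., b(max_units)]: valid UTF-16 sequences of 2N bytes."""
    b = [1, 65536]
    for n in range(2, max_units + 1):
        # Last character is either one 2-byte unit or one 4-byte pair.
        b.append(65536 * b[n - 1] + 1048576 * b[n - 2])
    return b[:max_units + 1]


closed_form = 1024 * (32 + 5 * math.sqrt(41))
b = utf16_sequences(10)
print(closed_form)   # closed-form growth factor
print(b[10] / b[9])  # ratio of the last two table entries
```

The other eigenvalue, 1024 (32 − 5√41), is roughly −16, so its contribution dies away almost immediately and the ratio settles onto the closed form within a handful of terms.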
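UTF-32 is seldom used and extremely inefficient, but also about as simple as it gets. Every one of the 1,114,112 characters is encoded as exactly 4 bytes, so:

$$a(N) = 1114112\,a(N-4)$$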
Or, as we can ignore the sequences with a number of bytes not divisible by 4...
| N | Number of valid UTF-32 byte sequences of length 4N |
|---|---|
| 0 | 1 |
| 1 | 1,114,112 |
| 2 | 1,241,245,548,544 |
| 3 | 1,382,886,560,579,452,928 |
| 4 | 1,540,690,511,780,295,460,519,936 |
| 5 | 1,716,501,787,460,568,536,110,786,936,832 |
| 6 | 1,912,375,239,431,268,932,903,461,055,767,773,184 |
| 7 | 2,130,600,202,753,249,893,374,940,803,763,545,317,572,608 |
| 8 | 2,373,727,253,089,828,745,207,742,048,762,611,000,851,453,444,096 |
| 9 | 2,644,598,017,394,415,282,980,887,909,431,010,067,380,614,499,508,682,752 |
| 10 | 2,946,378,386,355,326,799,752,402,990,552,001,488,189,551,181,276,617,558,196,224 |
A table of values up to N = 100 is here.
Each non-zero entry in this sequence is exactly 1,114,112 times the previous one.
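For completeness, here is the UTF-32 count indexed by raw byte length, zeros included. Again just a sketch: `CHARACTERS` and `utf32_sequences` are illustrative names, and byte order is ignored, as it is everywhere else in this enumeration.

```python
CHARACTERS = 0x10FFFF + 1  # 1,114,112 characters, U+0000 through U+10FFFF


def utf32_sequences(n_bytes):
    """Count valid UTF-32 byte sequences of exactly n_bytes bytes."""
    if n_bytes % 4 != 0:
        return 0  # a partial 4-byte unit can never be valid
    return CHARACTERS ** (n_bytes // 4)


for n_bytes in range(0, 13):
    print(n_bytes, f"{utf32_sequences(n_bytes):,}")
```

Only every fourth line of the output is non-zero, and those lines match the first four rows of the table.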