How many valid UTF-8 (or UTF-16, or UTF-32) byte sequences are there?

Quick simple one.

A Unicode character is... well, for the purposes of this enumeration, it is an integer from 0 to 1,114,111 inclusive, which we express in hexadecimal as U+0000 through U+10FFFF. The interpretations and interactions of these characters are extremely complex, but we can actually ignore these here. In particular, I'm going to ignore issues relating to surrogates.

There are several ways to encode Unicode characters as bytes (octets). Three such encodings are UTF-8, UTF-16 and UTF-32. Let's start with the most interesting and popular of these:

UTF-8

In UTF-8:

  • Characters U+0000 to U+007F inclusive are encoded using 1 byte - there are 128 of these 1-byte sequences.
  • Characters U+0080 to U+07FF inclusive are encoded using 2 bytes - there are 1,920 of these 2-byte sequences.
  • Characters U+0800 to U+FFFF inclusive are encoded using 3 bytes - there are 63,488 of these 3-byte sequences.
  • Characters U+10000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,048,576 of these 4-byte sequences.

This allows us to create a fairly simple Fibonacci-esque recurrence relation for the number of possible UTF-8 byte sequences of length N bytes:

a0 = 1 a1 = 128a0 a2 = 1920a0 + 128a1 a3 = 63488a0 + 1920a1 + 128a2 aN = 1048576aN-4 + 63488aN-3 + 1920aN-2 + 128aN-1 (N4)

Or...

( aN-3 aN-2 aN-1 aN ) = ( 0 1 0 0 0 0 1 0 0 0 0 1 1048576 63488 1920 128 ) N ( 0 0 0 1 )

Table of values

N Number of valid UTF-8 byte sequences of length N
01
1128
218,304
32,652,160
4383,795,200
555,514,234,880
68,030,282,317,824
71,161,610,848,632,832
8168,031,256,854,921,216
924,306,331,164,952,494,080
103,515,999,113,145,063,833,600

A table of values up to N = 100 is here.

Each entry in this sequence is approximately 144.654 times the previous one. This is the sole positive eigenvalue of the 4 × 4 matrix above - a closed form expression for this value does exist but it's pretty ghastly.

UTF-16

UTF-16 is a lot simpler (and rather less popular).

  • Characters U+0000 to U+FFFF inclusive are encoded using 2 bytes - there are 65,536 of these 2-byte sequences.
  • Characters U+10000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,048,576 of these 4-byte sequences.

The recurrence relation is:

a0 = 1 a1 = 0 a2 = 65536a0 a3 = 0 aN = 1048576aN-4 + 65536aN-2 (N4)

Or...

( aN-3 aN-2 aN-1 aN ) = ( 0 1 0 0 0 0 1 0 0 0 0 1 1048576 0 65536 0 ) N ( 0 0 0 1 )

Or, as we can ignore the sequences with an odd number of bytes...

( a2(N-1) a2N ) = ( 0 1 1048576 65536 ) N ( 0 1 )

Table of values

N Number of valid UTF-16 byte sequences of length 2N
01
165,536
24,296,015,872
3281,612,415,664,128
418,460,255,972,103,290,880
51,210,106,627,408,128,699,793,408
679,324,904,915,185,326,649,998,573,568
75,199,905,857,288,526,673,293,821,089,939,456
8340,864,208,454,757,229,430,061,207,854,549,827,584
922,344,329,261,775,181,962,073,467,059,698,981,855,559,680
101,464,715,384,527,942,980,583,053,593,085,519,767,325,967,908,864

A table of values up to N = 100 is here.

Each non-zero entry in this sequence is approximately 65551.996 times the previous one. That number is the positive eigenvalue of the 2 × 2 matrix, which is exactly 1024 (32 + 5√41).

UTF-32

UTF-32 is seldom-used and extremely inefficient, but also about as simple as it gets:

  • Characters U+0000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,114,112 of these 4-byte sequences.
a0 = 1 a1 = 0 a2 = 0 a3 = 0 aN = 1114112aN-4 (N4)

Or, as as we can ignore the sequences with a number of bytes not divisible by 4...

a4N = 1114112N (N0)

Table of values

N Number of valid UTF-32 byte sequences of length 4N
01
11,114,112
21,241,245,548,544
31,382,886,560,579,452,928
41,540,690,511,780,295,460,519,936
51,716,501,787,460,568,536,110,786,936,832
61,912,375,239,431,268,932,903,461,055,767,773,184
72,130,600,202,753,249,893,374,940,803,763,545,317,572,608
82,373,727,253,089,828,745,207,742,048,762,611,000,851,453,444,096
92,644,598,017,394,415,282,980,887,909,431,010,067,380,614,499,508,682,752
102,946,378,386,355,326,799,752,402,990,552,001,488,189,551,181,276,617,558,196,224

A table of values up to N = 100 is here.

Each non-zero entry in this sequence is exactly 1,114,112 times the previous one.

Discussion (3)

2025-10-11 16:19:41 by me:

"i'm going to ignore issues relating to surrogates" why? isn't that the only Fun part of unicode?

2025-10-11 18:02:57 by qntm:

If we start including the fun parts of Unicode we'll never stop!

2025-10-11 20:58:22 by you:

As much as I like the previous ones, this one is a rather flat, trivial table of numbers.