How many valid UTF-8 (or UTF-16, or UTF-32) byte sequences are there? @ Things Of Interest

Things Of Interest Code

How many valid UTF-8 (or UTF-16, or UTF-32) byte sequences are there?

2025-10-11 by qntm

Quick simple one.

A Unicode character is... well, for the purposes of this enumeration, it is an integer from 0 to 1,114,111 inclusive, which we express in hexadecimal as U+0000 through U+10FFFF. The interpretations and interactions of these characters are extremely complex, but we can actually ignore these here. In particular, I'm going to ignore issues relating to surrogates.

There are several ways to encode Unicode characters as bytes (octets). Three such encodings are UTF-8, UTF-16 and UTF-32. Let's start with the most interesting and popular of these:

UTF-8

In UTF-8:

Characters U+0000 to U+007F inclusive are encoded using 1 byte - there are 128 of these 1-byte sequences.
Characters U+0080 to U+07FF inclusive are encoded using 2 bytes - there are 1,920 of these 2-byte sequences.
Characters U+0800 to U+FFFF inclusive are encoded using 3 bytes - there are 63,488 of these 3-byte sequences.
Characters U+10000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,048,576 of these 4-byte sequences.

This allows us to create a fairly simple Fibonacci-esque recurrence relation for the number of possible UTF-8 byte sequences of length N bytes:

\begin{matrix} a_{0} & = & 1 \\ a_{1} & = & 128 a_{0} \\ a_{2} & = & 1920 a_{0} + 128 a_{1} \\ a_{3} & = & 63488 a_{0} + 1920 a_{1} + 128 a_{2} \\ a_{N} & = & 1048576 a_{N - 4} + 63488 a_{N - 3} + 1920 a_{N - 2} + 128 a_{N - 1} & (N \geq 4) \end{matrix}

Or...

(\begin{matrix} a_{N - 3} \\ a_{N - 2} \\ a_{N - 1} \\ a_{N} \end{matrix}) = {(\begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1048576 & 63488 & 1920 & 128 \end{matrix})}^{N} (\begin{matrix} 0 \\ 0 \\ 0 \\ 1 \end{matrix})

Table of values

`N`	Number of valid UTF-8 byte sequences of length `N`
0	1
1	128
2	18,304
3	2,652,160
4	383,795,200
5	55,514,234,880
6	8,030,282,317,824
7	1,161,610,848,632,832
8	168,031,256,854,921,216
9	24,306,331,164,952,494,080
10	3,515,999,113,145,063,833,600

A table of values up to N = 100 is here.

Each entry in this sequence is approximately 144.654 times the previous one. This is the sole positive eigenvalue of the 4 × 4 matrix above - a closed form expression for this value does exist but it's pretty ghastly.

UTF-16

UTF-16 is a lot simpler (and rather less popular).

Characters U+0000 to U+FFFF inclusive are encoded using 2 bytes - there are 65,536 of these 2-byte sequences.
Characters U+10000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,048,576 of these 4-byte sequences.

The recurrence relation is:

\begin{matrix} a_{0} & = & 1 \\ a_{1} & = & 0 \\ a_{2} & = & 65536 a_{0} \\ a_{3} & = & 0 \\ a_{N} & = & 1048576 a_{N - 4} + 65536 a_{N - 2} & (N \geq 4) \end{matrix}

Or...

(\begin{matrix} a_{N - 3} \\ a_{N - 2} \\ a_{N - 1} \\ a_{N} \end{matrix}) = {(\begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1048576 & 0 & 65536 & 0 \end{matrix})}^{N} (\begin{matrix} 0 \\ 0 \\ 0 \\ 1 \end{matrix})

Or, as we can ignore the sequences with an odd number of bytes...

(\begin{matrix} a_{2 (N - 1)} \\ a_{2 N} \end{matrix}) = {(\begin{matrix} 0 & 1 \\ 1048576 & 65536 \end{matrix})}^{N} (\begin{matrix} 0 \\ 1 \end{matrix})

Table of values

`N`	Number of valid UTF-16 byte sequences of length 2`N`
0	1
1	65,536
2	4,296,015,872
3	281,612,415,664,128
4	18,460,255,972,103,290,880
5	1,210,106,627,408,128,699,793,408
6	79,324,904,915,185,326,649,998,573,568
7	5,199,905,857,288,526,673,293,821,089,939,456
8	340,864,208,454,757,229,430,061,207,854,549,827,584
9	22,344,329,261,775,181,962,073,467,059,698,981,855,559,680
10	1,464,715,384,527,942,980,583,053,593,085,519,767,325,967,908,864

A table of values up to N = 100 is here.

Each non-zero entry in this sequence is approximately 65551.996 times the previous one. That number is the positive eigenvalue of the 2 × 2 matrix, which is exactly 1024 (32 + 5√41).

UTF-32

UTF-32 is seldom-used and extremely inefficient, but also about as simple as it gets:

Characters U+0000 to U+10FFFF inclusive are encoded using 4 bytes - there are 1,114,112 of these 4-byte sequences.

\begin{matrix} a_{0} & = & 1 \\ a_{1} & = & 0 \\ a_{2} & = & 0 \\ a_{3} & = & 0 \\ a_{N} & = & 1114112 a_{N - 4} & (N \geq 4) \end{matrix}

Or, as as we can ignore the sequences with a number of bytes not divisible by 4...

\begin{matrix} a_{4 N} & = & 1114112^{N} & (N \geq 0) \end{matrix}

Table of values

`N`	Number of valid UTF-32 byte sequences of length 4`N`
0	1
1	1,114,112
2	1,241,245,548,544
3	1,382,886,560,579,452,928
4	1,540,690,511,780,295,460,519,936
5	1,716,501,787,460,568,536,110,786,936,832
6	1,912,375,239,431,268,932,903,461,055,767,773,184
7	2,130,600,202,753,249,893,374,940,803,763,545,317,572,608
8	2,373,727,253,089,828,745,207,742,048,762,611,000,851,453,444,096
9	2,644,598,017,394,415,282,980,887,909,431,010,067,380,614,499,508,682,752
10	2,946,378,386,355,326,799,752,402,990,552,001,488,189,551,181,276,617,558,196,224

A table of values up to N = 100 is here.

Each non-zero entry in this sequence is exactly 1,114,112 times the previous one.

Back to Code

Back to Things Of Interest

Show discussion (3)

Discussion (3)

2025-10-11 16:19:41 by me:

"i'm going to ignore issues relating to surrogates" why? isn't that the only Fun part of unicode?

2025-10-11 18:02:57 by qntm:

If we start including the fun parts of Unicode we'll never stop!

2025-10-11 20:58:22 by you:

As much as I like the previous ones, this one is a rather flat, trivial table of numbers.

Add comment

Hide discussion