Adding up sizes of directories using bash is inconsistent

Question

First, I got the total size of folder /my-downloads:

$ du -sh /my-downloads
304G    /my-downloads

As you can see, it's 304G.

Then, I wanted to find out the total size of all immediate directories inside the /my-downloads folder, using find, du and awk like this:

$ find /my-downloads/* -maxdepth 1 -type d -exec du -sb {} \; | awk 'BEGIN {t_size=0} {t_size+=$1} END {print t_size / 1024 / 1024 / 1024 "GB"}'
350.309GB

As you can see the output is different.

My question is, how come the total size of immediate directories inside /my-downloads folder (not the files) is calculated to be 350.309GB which is larger than the total size of /my-downloads folder as reported by du -sh /my-downloads? How do I explain this discrepancy?

Thanks.

Stéphane Chazelas · Accepted Answer · 2024-12-09 06:36:26Z

Several things:

du -sh /my-downloads

Gives the cumulative disk usage (not size) of the /my-downloads directory file and all the unique files and directories found by a recursive descent. It's intended to give you an indication of how much disk space you would reclaim were you to remove that directory and its contents recursively¹.

In:

find /my-downloads/* -maxdepth 1 -type d -exec du -sb {} \;

The shell expands /my-downloads/* to the list of files (of any type, including directory) whose name doesn't start with . (considered hidden by shell globs by default and some other tools such as ls, but not find nor du themselves), and passes them to find.

For those that are of type directory, find will descend into them but one level only; the ones of other types will be discarded, as -type d filters them out.

The files that are not of type directory in any of those directories will also be omitted.

For instance if you have:

size	disk usage	path
4096	4096	`/my-downloads`
4096	4096	`/my-downloads/.dir`
1	4096	`/my-downloads/file`
4096	4096	`/my-downloads/dir1`
10000000	0	`/my-downloads/dir1/file`
4096	4096	`/my-downloads/dir1/subdir1`
10000	12288	`/my-downloads/dir1/subdir1/file`
4096	4096	`/my-downloads/dir1/subdir2`
10000	12288	`/my-downloads/dir1/subdir2/file`

find will run:

du -sb /my-downloads/dir1
du -sb /my-downloads/dir1/subdir1
du -sb /my-downloads/dir1/subdir2

The /my-downloads directory file itself is omitted as it's never passed as argument to any du invocation.
/my-downloads/.dir is omitted because its name starts with .
/my-downloads/file and /my-downloads/dir1/file are omitted because they're not of type directory
/my-downloads/dir1/subdir{1,2}/file are omitted because they're at depth 2
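One way to check which du commands find would run is to put echo in front of du on a scratch copy of the example tree. This is a minimal sketch, assuming a find that supports -maxdepth (GNU or BSD) and a throwaway directory from mktemp; the variable names top and listed are only for this illustration, and the paths are relative to the scratch directory rather than absolute.

```shell
# Recreate the example tree under a scratch directory (names are illustrative).
top=$(mktemp -d)
mkdir -p "$top/my-downloads/.dir" \
         "$top/my-downloads/dir1/subdir1" \
         "$top/my-downloads/dir1/subdir2"
touch "$top/my-downloads/file" \
      "$top/my-downloads/dir1/file" \
      "$top/my-downloads/dir1/subdir1/file" \
      "$top/my-downloads/dir1/subdir2/file"

# "echo" in front of du prints each command instead of running it.
# sort is only there because find lists entries in directory order.
listed=$(cd "$top" &&
  find my-downloads/* -maxdepth 1 -type d -exec echo du -sb {} \; |
  LC_ALL=C sort)
printf '%s\n' "$listed"

rm -rf "$top"
```

Only dir1 and its two subdirectories should show up: .dir is skipped by the glob, and everything that is not a directory is dropped by -type d.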

du -sb file (-b, like -h being a GNU extension), gives the apparent size (not disk usage) in bytes of the file, and for those of type directory also includes the size (not disk usage) of every unique file and directory underneath.

See how /my-downloads/file (which you can create with echo > /my-downloads/file) has an apparent size of 1 byte but takes up 4KiB of disk space (as is common on ext4 file systems, where file data is usually allocated in blocks of 4KiB), while /my-downloads/dir1/file (which you can create with truncate -s10000000 /my-downloads/dir1/file) appears to be 10MB (of all-null bytes) large but doesn't take any space on disk, as it's a fully sparse file.
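The difference between apparent size and disk usage can be reproduced in a scratch directory. This is a rough sketch assuming GNU du and truncate; the directory and variable names are made up for the illustration, and the disk usage figures depend on the file system (4096 and 0 are what one would typically see on ext4, but they are not guaranteed elsewhere).

```shell
# Scratch directory; "demo" and the file names are only for this illustration.
demo=$(mktemp -d)
echo > "$demo/file"                    # 1 byte of data (a newline)
truncate -s 10000000 "$demo/sparse"    # 10 MB apparent size, no data written

# Apparent size in bytes: GNU du -b means --apparent-size --block-size=1.
small_apparent=$(du -b "$demo/file" | cut -f1)
sparse_apparent=$(du -b "$demo/sparse" | cut -f1)

# Disk usage in bytes: -B1 sets the unit without switching to apparent size.
small_usage=$(du -B1 "$demo/file" | cut -f1)
sparse_usage=$(du -B1 "$demo/sparse" | cut -f1)

# The three-field format is reused for each row.
printf '%s\t%s\t%s\n' size 'disk usage' path \
  "$small_apparent" "$small_usage" file \
  "$sparse_apparent" "$sparse_usage" sparse

rm -rf "$demo"
```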

The size of the /my-downloads/dir1/subdir{1,2} and /my-downloads/dir1/subdir{1,2}/file files will be counted twice, once as part of the cumulative size of /my-downloads/dir1 and once as part of that of /my-downloads/dir1/subdir{1,2}. /my-downloads/dir1/file itself will be counted once (unless for instance there's another hardlink to it in /my-downloads/dir2, see below).

Since you're running a separate du invocation for each directory at depth 0 and 1, any file found in more than one directory is counted again each time: if /my-downloads/dir1/subdir1/file is a hard link to /my-downloads/dir1/subdir2/file, its size will be counted once for the /my-downloads/dir1 cumulative size, once for that of /my-downloads/dir1/subdir1 and once for that of /my-downloads/dir1/subdir2, so 3 times instead of just once.
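The hard link effect can be seen by comparing one du invocation covering both directories with two separate invocations. This is a small sketch, assuming GNU du (for -b) and a file system that supports hard links; the scratch directory and the 100000-byte file are arbitrary choices for the illustration.

```shell
# Scratch tree with one file hard-linked into two directories (names illustrative).
work=$(mktemp -d)
mkdir -p "$work/dir1/subdir1" "$work/dir1/subdir2"
head -c 100000 /dev/zero > "$work/dir1/subdir1/file"
ln "$work/dir1/subdir1/file" "$work/dir1/subdir2/file"

# One invocation: du remembers the file and counts it once.
# -c adds a final "total" line, which is the one kept here.
together=$(du -scb "$work/dir1/subdir1" "$work/dir1/subdir2" | tail -n 1 | cut -f1)

# Separate invocations: each du sees the file for the first time.
separate=0
for d in "$work/dir1/subdir1" "$work/dir1/subdir2"; do
  size=$(du -sb "$d" | cut -f1)
  separate=$((separate + size))
done

echo "one du invocation:  $together bytes"
echo "two du invocations: $separate bytes"

rm -rf "$work"
```

The two totals should differ by exactly the size of the linked file, since the directory sizes are the same in both cases.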

To sum up, the many reasons why they're different:

disk usage vs apparent size
top level hidden files and directories omitted
top level non-directory files omitted.
some files and dirs counted several times because you're passing directories at both depth 0 and 1.
some hardlinks counted several times because they can't be deduplicated as they're passed to separate invocations of du.
Also beware that if there are files with newline characters in their path, that can throw off the computation.

If you wanted a closer match, you'd do something like:

find /my-downloads -mindepth 1 -maxdepth 1 -print0 |
  du -sB1 --files0-from=- --null |
  awk -v RS='\0' '
    {sum += $1}
    END {print sum / 1024 / 1024 / 1024 "GiB"}'

(assuming an awk implementation that supports using byte 0 as the Record Separator, such as GNU awk or recent versions of mawk).

Where:

we list all (not just the non-hidden, directory ones) files in my-downloads
we use -B1 instead of -b: it sets the block size to 1 byte but without switching to apparent size.
we call du only once, by passing the list on standard input rather than as arguments (whose size is limited), so du can do its deduplication.
we tell du to print the list null-delimited so it can work with arbitrary file paths.

It's still missing the disk usage of the /my-downloads directory file itself.

In any case, the only involvement of bash (the shell) in your shell code is just:

the expansion of /my-downloads/* into the list of matching files in your case.
the starting of two concurrent processes, one in which it executes find and one in which it executes awk, with the output of one connected to the input of the other via a pipe (a kernel IPC system, not a shell one).

After those are started, the shell is not involved at all, it just waits for them to finish. Other than the initial glob expansion, the shell is not involved in what files are found, what du commands are executed or the calculation.

/my-downloads/* is simple enough a glob that its expansion would be the same in every shell², whether better or worse than bash, even in non-Bourne-like shells³.

With my alternative command, even the glob expansion by the shell is removed.

Also, careful not to confuse

Gigabyte, nowadays abbreviated GB for 10⁹ (1000 × 1000 × 1000) bytes with
Gibibyte, nowadays abbreviated GiB or still often G for 2³⁰ (1024 × 1024 × 1024) bytes.

The GNU implementation of du, when passed the -h option, uses suffixes in the latter category unless passed the --si option (though unfortunately using the same abbreviations, without B or iB suffix, in both cases).
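To get a feel for the gap between the two units, the 304G from the question can be converted both ways. This is only an arithmetic sketch, assuming 304G stands for exactly 304 × 2³⁰ bytes (du -h rounds, so the real byte count is not known); the variable names are illustrative.

```shell
# 304 GiB expressed in bytes (exact value assumed for the illustration).
bytes=$((304 * 1024 * 1024 * 1024))

# Same byte count in binary (GiB) and decimal (GB) units, one decimal place.
# LC_ALL=C keeps "." as the decimal separator.
gib=$(LC_ALL=C awk -v b="$bytes" 'BEGIN {printf "%.1f", b / 1024 / 1024 / 1024}')
gb=$(LC_ALL=C awk -v b="$bytes" 'BEGIN {printf "%.1f", b / 1000 / 1000 / 1000}')

echo "$bytes bytes = $gib GiB = $gb GB"
```

Note that the awk command in the question divides by 1024 three times, so its 350.309 figure is in GiB despite the GB label, and can be compared directly with du's 304G.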

^{¹ in practice, that might not be the case if there are more hard links to the files within outside the directory, or some of their contents is reflinked in other files, or there are some forms of snapshotting in place at filesystem level, or there is some data not accounted by du such as some extended attributes on some file systems, etc.}

^{² The only differences you might find with other shells would be in the order those files are expanded, some listing them in locale collation order, some using a simpler order where file names are compared byte to byte; another difference could arise if there's no matching file, where some shells share the misdesign of bash (inherited from the Bourne shell) where the pattern is passed literally to find, and some report an error instead and don't run find.}

^{³ That command line is portable to and would work the same in most shells. An exception would be shells of the rc family where {} needs to be quoted (as '{}'; same in older versions of fish) and where \ is not a quoting operator (except in es) where you'd need ';' instead of \;.}

Logan Lee · Accepted Answer · 2024-12-09 02:53:45Z

I'll try to clarify this using an example. Let's say we have directory structure:

└── X
    ├── A
    │   ├── a1
    │   └── a2
    │       └── a3
    ├── B
    │   ├── b1
    │   └── b2
    │       └── b3
    └── C
        ├── c1
        └── c2

find X -maxdepth 1
```
-> X
   X/A
   X/B
   X/C
```

find X/* -maxdepth 1

-> X/A
   X/A/a1
   X/A/a2
-> X/B
   X/B/b1
   X/B/b2
-> X/C
   X/C/c1
   X/C/c2

So if I blindly do find X/* -maxdepth 1 -type d -exec du -sb {} \; | awk 'BEGIN {t_size=0} {t_size+=$1} END {print t_size}' this will count the subdirectories of X/{A,B,C} twice: once as part of X/A, X/B or X/C, and once on their own.
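The same tree can be used to compare that command with one that lists depth 1 only. This is a sketch under the assumption that the entries with children (A, B, C, a2, b2) are directories and the leaves are regular files, which the tree above does not actually state; -mindepth and -maxdepth are GNU/BSD find extensions, and the variable names are made up.

```shell
# Build the example tree in a scratch directory (file/directory split is assumed).
scratch=$(mktemp -d)
mkdir -p "$scratch/X/A/a2" "$scratch/X/B/b2" "$scratch/X/C"
touch "$scratch/X/A/a1" "$scratch/X/A/a2/a3" \
      "$scratch/X/B/b1" "$scratch/X/B/b2/b3" \
      "$scratch/X/C/c1" "$scratch/X/C/c2"

# Original form: X/A/a2 and X/B/b2 are listed in addition to their parents.
both_levels=$(cd "$scratch" && find X/* -maxdepth 1 -type d | LC_ALL=C sort)

# Depth 1 only: each immediate subdirectory of X is listed exactly once.
one_level=$(cd "$scratch" && find X -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort)

printf 'find X/* -maxdepth 1 -type d:\n%s\n\n' "$both_levels"
printf 'find X -mindepth 1 -maxdepth 1 -type d:\n%s\n' "$one_level"

rm -rf "$scratch"
```

Unlike X/*, the second form would also pick up hidden directories directly under X, since no shell glob is involved.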
