
I did a website scrape for a conversion project. I'd like to do some statistics on the types of files in there -- for instance, 400 .html files, 100 .gif, etc. What's an easy way to do this? It has to be recursive.

Edit: With the script that maxschlepzig posted, I'm having some problems due to the architecture of the site I've scraped. Some of the files have names like *.php?blah=blah&foo=bar with various arguments, so it counts them all as unique. So the solution needs to treat *.php* as all being the same type, so to speak.

6 Answers


You could use find and uniq for this, e.g.:

$ find . -type f | sed 's/.*\.//' | sort | uniq -c
   16 avi
   29 jpg
  136 mp3
    3 mp4

Command explanation

  • find recursively prints all filenames
  • sed strips from every filename everything up to and including the last dot, leaving only the extension
  • uniq assumes sorted input, which is why sort runs first
    • -c does the counting (like a histogram).
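
For the query-string problem mentioned in the question's edit, one small tweak (a sketch, not part of the original answer) is to delete everything from the first ? before extracting the extension; in sed's basic regex syntax the ? needs no escaping:

```shell
# Strip query strings first, so index.php?page=2 and index.php?page=3
# both count toward .php; then extract extensions as before.
find . -type f | sed -e 's/?.*$//' -e 's/.*\.//' | sort | uniq -c
```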
  • I have a similar script. Simple and fast. Commented Aug 10, 2011 at 19:10
  • Some of the files are of the name *.php?blah=blah&foo=bar with various arguments, so it counts them all as unique. How can I modify it to look for *.php*? Commented Aug 11, 2011 at 13:35
  • You can try to use a different sed expression, e.g. sed 's/^.*\(\.[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]\).*$/\1/' Commented Aug 11, 2011 at 14:35
  • @bela83, the prune variants rely on short-circuit evaluation - thus, my first version find -name '.*' -prune -o -type f -print evaluates like: if a directory entry matches .*, prune it; otherwise, if it is a file, print it. Since .* also matches ., i.e. the CWD, everything is pruned, i.e. find does not even descend into the first directory. Perhaps find versions from two years ago behaved differently, or it was just an oversight on my part back then. Anyhow, find -name '.*' -not -name . -prune -o -type f -print fixes this. Commented May 4, 2015 at 20:16
  • @MechEthan you can use explainshell Commented May 4, 2022 at 2:21

This one-liner seems to be a fairly robust method:

find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c

The find . -type f -printf '%f\n' prints the basename of every regular file in the tree, with no directories. That eliminates having to worry, in your sed regex, about directories that may have .'s in their names.

The sed -r -n 's/.+(\..*)$/\1/p' replaces the incoming filename with only its extension. E.g., .somefile.ext becomes .ext. Note the initial .+ in the regex; this results in any match needing at least one character before the extension's .. This prevents filenames like .gitignore from being treated as having no name at all and the extension '.gitignore', which is probably what you want. If not, replace the .+ with a .*.
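
To see the difference concretely, feed a couple of sample names through both variants (a quick illustration, not part of the original answer):

```shell
# With .+ the leading-dot name .gitignore produces no match, so it is
# filtered out entirely; with .* it is treated as having the extension .gitignore.
printf 'archive.tar.gz\n.gitignore\n' | sed -r -n 's/.+(\..*)$/\1/p'
# prints only: .gz
printf 'archive.tar.gz\n.gitignore\n' | sed -r -n 's/.*(\..*)$/\1/p'
# prints: .gz and .gitignore
```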

The rest of the line is from the accepted answer.

Edit: If you want a nicely-sorted histogram in Pareto chart format, just add another sort to the end:

find . -type f -printf '%f\n' | sed -r -n 's/.+(\..*)$/\1/p' | sort | uniq -c | sort -bn

Sample output from a built Linux source tree:

    1 .1992-1997
    1 .1994-2004
    1 .1995-2002
    1 .1996-2002
    1 .ac
    1 .act2000
    1 .AddingFirmware
    1 .AdvancedTopics
    [...]
 1445 .S
 2826 .o
 2919 .cmd
 3531 .txt
19290 .h
23480 .c

With zsh¹:

print -rl -- **/?*.*(D.:e) | uniq -c |sort -n

The pattern **/?*.* matches all files that have an extension, in the current directory and its subdirectories recursively. The D glob qualifier lets zsh traverse even hidden directories and consider hidden files, and the . qualifier selects only regular files. The :e modifier retains only the file extension. print -rl prints one match per line. uniq -c counts consecutive identical items (the glob result is already sorted). The final sort orders the extensions by use count.


¹ and assuming file extensions don't contain newline characters.
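
For shells without zsh's glob qualifiers, a rough approximation with plain find (an assumption on my part, not an exact equivalent of the qualifiers above) is:

```shell
# -name '?*.*' keeps only basenames with a non-empty name part and an
# extension (so .gitignore is excluded, mirroring the ?*.* pattern), and
# find descends into hidden directories by default, like the D qualifier.
find . -type f -name '?*.*' | sed 's/.*\.//' | sort | uniq -c | sort -n
```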


I've put a bash script into my ~/bin folder called exhist with this content:

#!/bin/bash

for d in */ ; do
        echo "$d"
        find "$d" -type f | sed -r 's/.*\/([^\/]+)/\1/' | sed 's/^[^\.]*$//' | sed -r 's/.*(\.[^\.]+)$/\1/' | sort | uniq -c | sort -nr
#       files only       | keep filename only          | no ext -> '' ext   | keep part after . (i.e. ext) | count | sort by count desc
done

Whichever directory I'm in, I just type 'exh', tab auto-completes it, and I see something like this:

$ exhist
src/
      7 .java
      1 .txt
target/
     42 .html
     10 .class
      4 .jar
      3 .lst
      2 
      1 .xml
      1 .txt
      1 .properties
      1 .js
      1 .css

P.S. Trimming the part after the question mark should be simple to do with another sed command probably after the last one (I haven't tried it): sed 's/\?.*//'


I know this thread is old, but it is one of the top results when searching for "bash count file extensions".

I ran into the same problem as you and created a script similar to maxschlepzig's.

Here is the command I made. It counts the extensions of all files in the working directory recursively, merging UPPER and lower case, filtering out false positives, and counting the occurrences.

find . -type f \
  | tr '[:upper:]' '[:lower:]' \
  | grep -E ".*\.[a-zA-Z0-9]*$" \
  | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' \
  | sort \
  | uniq -c \
  | sort -n
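
As a quick sanity check of the case-merging step (sample names of my own, not from the original post), the same stages behave like this:

```shell
# tr lowercases everything first, so .PDF and .pdf land in one .pdf bucket.
printf './a.PDF\n./b.pdf\n./c.Jpg\n' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -E ".*\.[a-zA-Z0-9]*$" \
  | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' \
  | sort | uniq -c | sort -n
# prints: 1 .jpg, then 2 .pdf
```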

Here is the github link if you'd like to see more documentation.

https://github.com/Hoppi164/list_file_extensions


Here is an improved version of maxschlepzig's answer:

find . -type f -printf "%f\n" | sed 's/.*\(\.\)/\1/' | sort | uniq -c | sort -k 1nr

Or if your find doesn't support -printf, you can use the following instead:

find . -type f | sed 's/.*\///' | sed 's/.*\(\.\)/\1/' | sort | uniq -c | sort -k 1nr 

The result is sorted by count in descending order, like this:

3500 .html
524 .pdf
167 .in
160 .ans
156 .ppt
144 .doc
71 .png
65 .js
56 .jpg
47 .css
44 .pas
38 .gif
34 .PDF
32 .txt
1 RELEASENOTES
1 .bak
