
Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).

I know this is not trivial, because grep writes its search results to STDOUT. My default workflow is to redirect the results to a file, and I would like the progress bar/status written to STDOUT or STDERR.

Would this require modifying source code of grep?

The ideal command would be:

grep -e "STRING" --results="FILE.txt"

and the progress:

[curr file being searched], number x/total number of files

written to STDOUT or STDERR


6 Answers


This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.

If you are grepping "thousands of files" with a single invocation of grep, you are most likely using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it has explored the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).

In any case, a simple but slightly inaccurate progress bar can be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches allow more accurate progress reports, but they also increase overhead, since each one requires an additional grep process start-up, and the start-up time can exceed the time needed to grep a small file. The progress report is updated once per batch, so you want a batch size that gives regular updates without adding too much overhead. Basing the batch size on the total size of the files (using, for example, stat to get the file size) would make the progress report more exact, but adds the extra cost of statting every file up front.
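For illustration, here is one way the size-based variant might look. This is only a sketch, not the exact code described above: it assumes GNU stat -c %s and bash >= 4.3, and it generates its own tiny demo files so it can be run as-is (maxBytes is absurdly small just to force several batches).

```shell
#!/bin/bash
# Sketch: batch by cumulative file size instead of file count.
# Demo data so the script is self-contained:
dir=$(mktemp -d)
for i in 1 2 3 4; do printf 'needle line %d\n' "$i" > "$dir/f$i.txt"; done
pattern=needle
maxBytes=30                            # tiny threshold, just for the demo

shopt -s globstar nullglob
cd "$dir" || exit 1
files=(**) sizes=() totalBytes=0
for f in "${files[@]}"; do
    [[ -f $f ]] || continue
    sizes+=("$(stat -c %s "$f")")      # stat each file once, up front
    (( totalBytes += sizes[-1] ))
done

batch=() batchBytes=0 doneBytes=0 j=0
flush() {
    (( ${#batch[@]} )) || return 0
    grep -e "$pattern" "${batch[@]}" >> results.txt
    (( doneBytes += batchBytes ))
    printf '%d%%\n' $(( 100 * doneBytes / totalBytes )) >&2   # progress report
    batch=() batchBytes=0
}
for f in "${files[@]}"; do
    [[ -f $f ]] || continue
    batch+=("$f")
    (( batchBytes += sizes[j++] ))
    (( batchBytes >= maxBytes )) && flush
done
flush                                   # final partial batch
wc -l < results.txt                     # 4 matching lines in total
```

Because progress is measured in bytes rather than files, one huge file no longer stalls the bar at a misleading count.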

One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.


In broad terms, here is a simple script (which divides the files by count rather than size, and makes no attempt to parallelize):

# Requires bash 4 and GNU grep
shopt -s globstar
files=(**)
total=${#files[@]}
batchSize=100
for ((i=0; i<total; i+=batchSize)); do
  echo "$i/$total" >&2
  grep -d skip -e "$pattern" "${files[@]:i:batchSize}" >>results.txt
done

For simplicity, I use globstar (**) to safely put all the files in an array. If your version of bash is too old, you can build the list by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no globstar expression I know of which matches only files (**/ matches only directories). Fortunately, GNU grep provides the -d skip option, which silently skips directories. That means the file count will be slightly inaccurate, since directories are counted too, but it probably doesn't make much difference.
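For reference, a sketch of that find-based fallback: -print0 together with read -d '' keeps filenames containing spaces (or even newlines) intact, and it works on bash versions well before globstar existed. The demo files are generated so the snippet runs as-is.

```shell
#!/bin/bash
# Build the file list with find instead of **.
dir=$(mktemp -d)
printf x > "$dir/plain"
printf x > "$dir/name with spaces"

files=()
while IFS= read -r -d '' f; do      # read null-delimited records
    files+=("$f")
done < <(find "$dir" -type f -print0)

total=${#files[@]}
echo "collected $total files"       # prints: collected 2 files
rm -rf "$dir"
```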

You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
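As a starting point for those console codes, a small sketch (it assumes an ANSI terminal: \r returns to column 0 and \e[K clears the rest of the line, so the status overwrites itself instead of scrolling):

```shell
#!/bin/bash
# Self-overwriting one-line status for the batch loop above.
progress() { printf '\r\e[K%d/%d (%d%%)' "$1" "$2" $(( 100 * $1 / $2 )); }

# In the loop, instead of `echo "$i/$total"`:
#   progress "$i" "$total" >&2
progress 150 1000 >&2; echo >&2    # displays: 150/1000 (15%)
```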

The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:

find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt

(Here -0 tells parallel that its input records are null-delimited, matching find's -print0 so unusual filenames survive; -L 100 specifies that up to 100 files should be given to each grep instance; and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)


5 Comments

Very good and almost complete answer. Please post an example of how to use the find, parallel, and grep commands to accomplish the task and I'll mark it as accepted.
@adrian: it would help to know how you are currently invoking grep: the -r thing was just a guess.
my usual grep command is grep -e "STRING" * -r. Doing a batch of <num_cores>*X files at a time is a perfect idea.
@Adrian: Ok, added some concrete examples, but you will still probably want to fiddle around with them. Good luck.
I did it! I'm proud to announce my ppGrep.sh, shown in my answer; it's even useful on a small number of very big files! (Note: feel free to compare with perl's parallel, which is sometimes heavy! ;)

Progress bar for grep in bash:

ppGrep.sh - progress parallel grep - a bash script

I did it! See the bottom of this post.

Preamble

grep-ing thousands of files

Not only that! As this answer shows, my method is even useful on a small number of very big files.

How does this work?

By using /proc kernel variables and parallelization to monitor the tasks.

  • Under /proc/$pid/fd/ you can see which file descriptors are held open by any currently running process (identified by $pid), as long as you have permission (same user).
  • Under /proc/$pid/fdinfo/$fd you can read the current offset into the file pointed to by /proc/$pid/fd/$fd.

With this information, you can show the progress of a currently running grep process!

Here is a sample view of output:

cd /usr/lib/i386-linux-gnu/
myDualProgressGrep.sh "some pattern"

[Screenshot: grep progress display]
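The two /proc lookups can be tried in isolation. Here is a minimal, self-contained sketch (Linux only; dd is used to consume a known number of bytes from the descriptor so the reported offset is predictable):

```shell
#!/bin/bash
# Watch a file descriptor's target and read offset via /proc.
tmp=$(mktemp)
head -c 100000 /dev/zero > "$tmp"        # 100 kB test file
exec 3< "$tmp"                           # open it on fd 3 of this shell ($$)

readlink "/proc/$$/fd/3"                 # -> path of the open file
dd bs=4096 count=1 <&3 > /dev/null 2>&1  # consume exactly 4096 bytes from fd 3
read -r _ pos < "/proc/$$/fdinfo/3"      # first fdinfo line is "pos:<TAB><offset>"
echo "offset: $pos"                      # offset: 4096

exec 3<&-
rm -f "$tmp"
```

The real script does the same thing, but against the fd table of the background grep's PID instead of its own.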

Intro

  • For building a nice progress bar, I take functions from How to add a progress bar to a shell script?
  • As shopt -s globstar followed by files=(**) can take an unpredictable amount of time, I show a spinner during this step.
  • I use parallelization to run grep as a background task, so that I can monitor it.
  • For testing, I ran everything on my desktop, using, as root:
    # sync && echo 3 > /proc/sys/vm/drop_caches
    to clear all caches between tests.
  • Finally, instead of just writing output to a file, I pipe it through zstd to compress the final result.
  • This script is also useful for a small number of big files; see the second part, with two progress bars!

Begin of script, display functions:

#!/bin/bash
shopt -s globstar

percentBar ()  { 
    local prct totlen=$((8*$2)) lastchar barstring blankstring;
    printf -v prct %.2f "$1"
    ((prct=10#${prct/.}*totlen/10000, prct%8)) &&
        printf -v lastchar '\\U258%X' $(( 16 - prct%8 )) ||
            lastchar=''
    printf -v barstring '%*s' $((prct/8)) ''
    printf -v barstring '%b' "${barstring// /\\U2588}$lastchar"
    printf -v blankstring '%*s' $(((totlen-prct)/8)) ''
    printf -v "$3" '%s%s' "$barstring" "$blankstring"
}
percent(){ local p=00$(($1*100000/$2));printf -v "$3" %.2f ${p::-3}.${p: -3};}
startSpinner () {
    tput civis >&2
    exec {doSpinner}> >(spinner "$@")
}
stopSpinner () {
    echo >&"$doSpinner" &&
        exec {doSpinner}>&-
    tput cnorm >&2
    printf '\r\e[K\e[A\e[K' >&2
    doSpinner=0
}
spinner() {
    local str shs
    printf -v str '\e[A%s\e[B\e[4D%s,' ⠉⠉⠉⢹ ⠀⠀⠀⢸ ⠈⠉⠉⢹ ⠀⠀⠀⣸ ⠀⠉⠉⢹ ⠀⠀⢀⣸ ⠀⠈⠉⢹\
           ⠀⠀⣀⣸ ⠀⠀⠉⢹ ⠀⢀⣀⣸ ⠀⠀⠈⢹ ⠀⣀⣀⣸ ⠀⠀⠀⢹ ⢀⣀⣀⣸ ⠀⠀⠀⢸ ⣀⣀⣀⣸ ⠀⠀⠀⢰ ⣄⣀⣀⣸ \
           ⠀⠀⠀⢠ ⣆⣀⣀⣸ ⠀⠀⠀⢀ ⣇⣀⣀⣸ ⡀⠀⠀⠀ ⣇⣀⣀⣸ ⡄⠀⠀⠀ ⣇⣀⣀⣰ ⡆⠀⠀⠀ ⣇⣀⣀⣠ ⡇⠀⠀⠀ \
           ⣇⣀⣀⣀ ⡏⠀⠀⠀ ⣇⣀⣀⡀ ⡏⠁⠀⠀ ⣇⣀⣀⠀ ⡏⠉⠀⠀ ⣇⣀⡀⠀ ⡏⠉⠁⠀ ⣇⣀⠀⠀ ⡏⠉⠉⠀ ⣇⡀⠀⠀ \
           ⡏⠉⠉⠁ ⣇⠀⠀⠀ ⡏⠉⠉⠉ ⡇⠀⠀⠀ ⡏⠉⠉⠙ ⠇⠀⠀⠀ ⡏⠉⠉⠹ ⠃⠀⠀⠀ ⡏⠉⠉⢹ ⠁⠀⠀⠀ ⡏⠉⠉⢹ \
           ⠀⠀⠀⠈ ⠏⠉⠉⢹ ⠀⠀⠀⠘ ⠋⠉⠉⢹ ⠀⠀⠀⠸
    IFS=, read -a shs <<<$str
    local -i pnt
    printf '\e7' 1>&2
    while ! read -rsn1 -t "${1:-.02}"; do
        printf '%s\e8' "${shs[pnt++%${#shs[@]}]}" 1>&2
    done
}
declare -i doSpinner

# Main script:

printf '\nScan filesystem... ' >&2
startSpinner
bunch=5000
files=(**)
filecnt=${#files[@]}
col=$(tput cols)
exec {OutFD}> >(zstd >/tmp/grepResult.zst)
for (( i=0 ; i <= 1 + filecnt / bunch ; i++ )); do
    sfiles=("${files[@]: bunch * i : bunch}")
    (( ${#sfiles[@]} )) || continue
    exec {grepFd}< <(grep -d skip -D skip "$1" "${sfiles[@]}" >&$OutFD 2>&1;
                     echo done)
    declare -i bpid=$! gpid=0
    printf -v sfmt '[%q]=cnt++ ' "${sfiles[@]}"
    declare -Ai idxs="($sfmt)"
    while [[ -d /proc/$bpid ]]; do
        ((gpid))|| gpid=$(ps --ppid $bpid ho pid)
        file=$(readlink /proc/$gpid/fd/3) 
        if [[ $file ]] && (( crt=${idxs["${file#$PWD/}"]:-0}, crt >1 )); then
            (( doSpinner )) && stopSpinner
            percent $crt $filecnt pct
            percentBar $pct $((col-8-2*${#filecnt})) bar
            printf '\r%*s/%s\e[44;32m%s\e[0m%6.2f%%' \
                   ${#filecnt} $crt $filecnt "$bar" "$pct" >&2
            read -ru $grepFd -t .02 _
        fi
    done
done
exec {OutFD}>&-
(( doSpinner )) && stopSpinner

Some explanations:

  • This script accepts only one argument: the grep pattern.

  • You must cd to the root of the search tree before running it.

  • If you want to store the output uncompressed, replace:

    exec {OutFD}> >(zstd >/tmp/grepResult.zst)
    

    with:

    exec {OutFD}>/tmp/grepResult
    

    or, to send results to standard output (the monitor display uses stderr):

    exec {OutFD}>&1
    
  • To prevent argument list too long errors, I split the file list stored in $files into batches of 5,000; this can be tuned...
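    The ceiling behind that error can be inspected directly; getconf ARG_MAX (POSIX) reports the byte limit for one exec's arguments plus environment, which is why 5,000 short paths usually fit comfortably:

```shell
# POSIX guarantees at least 4096 bytes; on Linux it is typically ~2 MiB or more.
argmax=$(getconf ARG_MAX)
echo "ARG_MAX = $argmax bytes"
```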

  • Immediately after scanning the filesystem with (**), I loop over the file list, batch by batch,

  • then, first, run grep as a background task.

  • With the two lines printf -v sfmt... and declare -Ai idxs=...cnt++, I quickly (with no loop) build an associative array indexed by filename, giving each file's position within the whole $files list.

  • While grep is running, its parent PID has to exist!

  • The grep PID is the only child of the backgrounded bash started by exec {grepFd}< <(....

  • And grep always opens the file it is reading on file descriptor 3 (except when grepping from stdin).

  • So readlink returns the full path of the file grep is currently accessing.

  • We have to strip "$PWD/" to use the file name as an $idxs key (note that files=(**) starts from $PWD): crt=${idxs["${file#$PWD/}"]}.
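The printf -v sfmt / declare -Ai index trick above can be tried on its own. It is a sketch of the same idea: with the -i attribute, every value in the compound assignment is arithmetic-evaluated at assignment time, so cnt++ numbers the keys 0, 1, 2, ... in a single pass, and printf %q protects names containing spaces.

```shell
#!/bin/bash
# Loop-free positional index over a file list (bash >= 4 associative arrays).
files=(alpha.c beta.c 'name with space.c')
cnt=0
printf -v sfmt '[%q]=cnt++ ' "${files[@]}"   # -> "[alpha.c]=cnt++ [beta.c]=cnt++ ..."
declare -Ai idx="($sfmt)"                    # cnt++ evaluated once per key, in order
echo "${idx[beta.c]}"                        # position of beta.c in the list
```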


One step further: add a second progress bar showing the read position within big files.

As I plan to call stat repeatedly, I use the loadable stat bash builtin. This requires the bash-builtins package to be installed! On Debian-based distros, use apt install bash-builtins (methods for other distros welcome in comments!).

Then replace # Main script by:

enable -f /usr/lib/bash/stat stat

printf '\nScan filesystem... ' >&2
startSpinner
bunch=10000
files=(**)
filecnt=${#files[@]}
col=$(tput cols)
exec {OutFD}> >(zstd >/tmp/grepResult.zst)
# exec {OutFD}>&1
for (( i=0 ; i <= 1 + filecnt / bunch ; i++ )); do
    sfiles=("${files[@]: bunch * i : bunch}")
    (( ${#sfiles[@]} )) || continue
    exec {grepFd}< <(grep -d skip -D skip "$1" "${sfiles[@]}" >&$OutFD 2>&1;
                     echo done)
    declare -i bpid=$! gpid=0
    printf -v sfmt '[%q]=cnt++ ' "${sfiles[@]}"
    declare -Ai idxs="($sfmt)"
    while [[ -d /proc/$bpid ]]; do
        ((gpid)) || gpid=$(ps --ppid $bpid ho pid)
        file=$(readlink /proc/$gpid/fd/3) 
        if [[ $file ]] && (( crt=${idxs["${file#$PWD/}"]:-0}, crt >1 )); then
            (( doSpinner )) && stopSpinner
            { read -r _ fpos </proc/$gpid/fdinfo/3 ;} 2>/dev/null || fpos=0
            stat -A fstat "$file"
            percent $((fpos<fstat[size]?fpos:fstat[size])) ${fstat[size]} fpct
            percentBar $fpct $((col-47)) fbar
            percent $crt $filecnt pct
            percentBar $pct $((col-8-2*${#filecnt})) bar
            file=${file##*/}
            printf '\r%-40s\e[44;32m%s\e[0m%6s%%\n%*s/%s\e[44;32m%s\e[0m%6s%%\e[A'\
                   "${file::40}" "$fbar" "$fpct" \
                   ${#filecnt} $crt $filecnt "$bar" "$pct" >&2
            read -ru $grepFd -t .02 _
        fi
    done
done
exec {OutFD}>&-
printf '\n\n' >&2

  • I use read -r _ fpos </proc/$gpid/fdinfo/3 to learn the exact position at which the current grep PID is reading $file...
  • Then stat gives $file's size.
    • This is subject to a kind of race condition: there is a good chance the current grep has already closed the file returned by readlink by the time fdinfo/3 is read. This is trapped by { ... ;} 2>/dev/null and fpos < fstat[size] ? fpos : fstat[size].

Then I see two progress bars!

Movie-0023.mp4                          █████                             15.35%
79943/80170█████████████████████████████████████████████████████████████▊ 99.72%
  • The first line shows the file currently being read and grep's position within it.
  • The second line shows progress over the whole file count.

About performance

All tests were done after clearing caches with the following command, run as root between each test:

sync && echo 3 > /proc/sys/vm/drop_caches

I compared execution of both versions with:

time grep -d skip -D skip -r --exclude='.*' 'some pattern' * |& zstd >/tmp/grepResult-00.zst

I ran this several times and report the best result here:

real    4m43.832s
user    0m5.701s
sys     0m7.375s

(zstd is used here so the result file can be compared with the /tmp/grepResult.zst generated by the script.)

On my host, with my configuration:

time myProgressGrep.sh 'some pattern'
80166/80170██████████████████████████████████████████████████████████████100.00%
real    4m24.766s
user    0m30.360s
sys     0m55.247s
time myDualProgressGrep.sh 'some pattern'
Xterm.log.host.2023.11.21.18.55.11.340  ████████████████████████████████▎ 98.08%
80168/80170██████████████████████████████████████████████████████████████100.00%

real    4m21.073s
user    0m27.864s
sys     0m50.262s

Yes, surprisingly, my script seems quicker than a plain grep -r command. (I suspect the -r option is not as performant as bash's globstar.)

ppGrep.sh bash script

You will find this here: ppGrep.sh

Usage:

Usage: ppGrep.sh [OPTIONS] <PATTERN> [FILE] [FILE...]
Options [-E|-F|-G|-P] [-l|-L] [-H] [-Z] [-a] [-c] [-b] [-i] [-l] [-n] [-o] [-s]
      and  [-v] [-w] [-x], as [-e "PATTERN"] and [-f "PATTERN FILE"] are bind
      to 'grep' tasks (see man grep!).
   -j NUM   Max job to run together (default: "3").
   -C PATH  'cd' to PATH before running (instead of "/home/felix/Work/Devel/bash"). 
   -T FILE  Files list from FILE.
   -z       Files list are null bytes separated.
   -d         Dump both STDOUT and STDERR as soon as possible (in right order)
                    default is to keep everything in memory until last job finish.
   -W         Display Warnings when killing some subpid.
   -h       Show this.
Note: FILE cannot be else than a file! There are no '-r' option.

Requirements

Note: This is a modern bash script that uses recent features, and it uses zstandard compression to store each process's output so it can be returned in the correct order. It therefore depends on some external binaries:

For Debian-based distributions, this requires:

 coreutils     readlink, cat, sync, rm, stat
 libc-bin      getent, getconf
 ncurses-bin   tput
 procps        ps
 zstd          zstd, zstdcat
 grep          grep

Test and performance

My tests show that this is quicker than plain grep, even with only one process!

But only while data has to be read from the filesystem. If the data is already in the memory cache, the script's overhead becomes visible.

Here is a small comparison of plain grep vs. ppGrep.sh with 1 to 6 parallel processes, on my host:

                   No cache                Cached
grep          10' 58.352902"             7.646374"
ppGrep 1p.    10' 33.748950"            18.293024"   
ppGrep 2p.     9' 23.600785"            16.992925"
ppGrep 3p.     8' 21.149429"            14.622010"    
ppGrep 4p.     8'  8.025807"            14.604110"    
ppGrep 5p.     7' 52.778152"            14.888845"   
ppGrep 6p.     7' 12.015095"            12.992566" 

When data isn't cached, on a ~10-minute job, ppGrep can be about 30 seconds quicker.

When data is cached, ppGrep uses ~7 more seconds...

By running 3 parallel processes, you gain ~2 minutes. That becomes noticeable.

You will find the test script and my full test results in the same directory on my website.

Here is a sample, accelerated ~3×:

[Screenshot: progress parallel grep]

TODO

  • grep in compressed files, like zgrep, zstdgrep...
  • add an option for bar color
  • test on more environments (fs, hdd, ssd, iscsi...)
  • shuffle the file list, to reduce the chance of one folder of big files landing on a single process
  • add a '-q' option: print the first file found, then quit and end all tasks

(Help and suggestions welcome! ;-)


Try the parallel program

find * -name \*.[ch] | parallel -j5 --bar  '(grep grep-string {})' > output-file

Though I found this to be slower than a simple

find * -name \*.[ch] | xargs grep grep-string > output-file

1 Comment

Both of these are buggy when dealing with files with spaces/quotes/newlines/etc in their names. Use -print0 in find and -0 on xargs to work correctly with all possible names.

This command shows progress (speed and offset) but not the total amount, though that can be estimated manually.

dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"

1 Comment

For me, it only works after removing the bs=1c part. I.e. dd if=/input/file | pv | grep -an "string"

I'm pretty sure you would need to alter the grep source code, and those changes would be substantial.

Currently, grep does not know how many lines a file has until it has finished parsing the whole file. For your requirement, it would need to parse each file twice, or at least determine the full line count some other way.

The first pass would determine the line count for the progress bar. The second would actually do the work and search for your pattern.

This would not only increase the runtime but also violate one of the main UNIX philosophies:

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". (source)

There might be other tools out there for your need, but afaik grep won't fit here.

3 Comments

OP doesn't say anything about line counts, only files. And it's not even clear that line counts would be useful; a simpler statistic to gather would be total bytes (which you can get from call to stat), and that would be a more accurate statistic as well, since grep actually reads in blocks, not lines. However, I agree with the basic philosophy of your answer.
Sorry, I misunderstood the output 'number x' and thought he meant line x in file y.
Without modifying grep code, under Linux, you could monitor file count and file position using bash! See my answer

I normally use something like this:

grep | tee "FILE.txt" | cat -n | sed 's/^/match: /;s/$/     /' | tr '\n' '\r' 1>&2

It is not perfect, as it only displays the matches, and if they are too long or differ too much in length the display breaks, but it should give you the general idea.

Or simple dots:

grep | tee "FILE.txt" | sed 's/.*//' | tr '\n' '.' 1>&2

2 Comments

How does this indicate status?
grep -e "STRING" | tee "FILE.txt" is hopefully the answer to your grep -e "STRING" --results="FILE.txt", but it is not meant to be a full status like x/total number of files; it just shows the number of matches processed so far.
