Progress bar for grep in bash:
ppGrep.sh - Progress parallel grep - bash script
I did it, see at the bottom of this post!
Preamble
grep-ing thousands of files
Not only that! Following this answer, you will see my method is even useful on a small number of very big files.
How does this work?
By using /proc kernel variables and parallelization, in order to monitor the running tasks.
- Under /proc/$pid/fd/ you can see which file descriptors are used by any currently running process (identified by $pid), as long as you have permission (same user).
- Under /proc/$pid/fdinfo/$fd you can read the current pointer position in the file pointed to by /proc/$pid/fd/$fd.
With this information, you can show the progress of a currently running grep process!
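A quick way to see these two /proc entries in action; a self-contained sketch where the background subshell and the fd number 3 are chosen purely for the demo:

```shell
# Open a temporary file on fd 3 inside a background subshell,
# then inspect that subshell's /proc entries from outside.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
( exec 3< "$tmp"; sleep 1 ) &   # hold fd 3 open for a moment
pid=$!
sleep .2
target=$(readlink "/proc/$pid/fd/3")    # which file is open on fd 3
read -r _ fpos < "/proc/$pid/fdinfo/3"  # first line is "pos:<TAB><offset>"
echo "fd 3 -> $target (offset: $fpos)"
wait "$pid"
rm -f "$tmp"
```

From there, offset divided by file size gives a per-file progress ratio.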
Here is a sample view of output:
cd /usr/lib/i386-linux-gnu/
myDualProgressGrep.sh "some pattern"

Intro
- For building a nice progress bar, I will take functions from How to add a progress bar to a shell script?
- As, with shopt -s globstar, files=(**) will take some unpredictable time, I will use a spinner during this step.
- I will use parallelization, running grep as a background task, in order to be able to monitor it.
- For testing this, I ran my tests on my desktop, using, as root:
# sync && echo 3 > /proc/sys/vm/drop_caches
to clean all caches between each test.
- Finally, instead of just writing output to a file, I use zstd to compress the final result.
- This script is useful for a small number of big files; see the second part, with two progress bars!
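The output-collector plumbing can be tried in isolation; a minimal sketch using cat as a stand-in for zstd (same redirection mechanics, no compression; the file name comes from mktemp for the demo):

```shell
out=$(mktemp)
exec {OutFD}> >(cat > "$out")   # stand-in for: exec {OutFD}> >(zstd >/tmp/grepResult.zst)
echo 'matched line' >&"$OutFD"  # everything grep writes would land here
exec {OutFD}>&-                 # closing the fd lets the background writer finish
sleep .2                        # small grace period for the flush
result=$(<"$out")
echo "$result"                  # → matched line
rm -f "$out"
```

The {OutFD} automatic fd allocation requires bash 4.1+.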
Beginning of the script: display functions:
#!/bin/bash
shopt -s globstar
percentBar () {
    local prct totlen=$((8*$2)) lastchar barstring blankstring
    printf -v prct %.2f "$1"
    ((prct=10#${prct/.}*totlen/10000, prct%8)) &&
        printf -v lastchar '\\U258%X' $(( 16 - prct%8 )) ||
            lastchar=''
    printf -v barstring '%*s' $((prct/8)) ''
    printf -v barstring '%b' "${barstring// /\\U2588}$lastchar"
    printf -v blankstring '%*s' $(((totlen-prct)/8)) ''
    printf -v "$3" '%s%s' "$barstring" "$blankstring"
}
percent(){ local p=00$(($1*100000/$2));printf -v "$3" %.2f ${p::-3}.${p: -3};}
startSpinner () {
    tput civis >&2
    exec {doSpinner}> >(spinner "$@")
}
stopSpinner () {
    echo >&"$doSpinner" &&
        exec {doSpinner}>&-
    tput cnorm >&2
    printf '\r\e[K\e[A\e[K' >&2
    doSpinner=0
}
spinner() {
    local str shs
    printf -v str '\e[A%s\e[B\e[4D%s,' ⠉⠉⠉⢹ ⠀⠀⠀⢸ ⠈⠉⠉⢹ ⠀⠀⠀⣸ ⠀⠉⠉⢹ ⠀⠀⢀⣸ ⠀⠈⠉⢹ \
        ⠀⠀⣀⣸ ⠀⠀⠉⢹ ⠀⢀⣀⣸ ⠀⠀⠈⢹ ⠀⣀⣀⣸ ⠀⠀⠀⢹ ⢀⣀⣀⣸ ⠀⠀⠀⢸ ⣀⣀⣀⣸ ⠀⠀⠀⢰ ⣄⣀⣀⣸ \
        ⠀⠀⠀⢠ ⣆⣀⣀⣸ ⠀⠀⠀⢀ ⣇⣀⣀⣸ ⡀⠀⠀⠀ ⣇⣀⣀⣸ ⡄⠀⠀⠀ ⣇⣀⣀⣰ ⡆⠀⠀⠀ ⣇⣀⣀⣠ ⡇⠀⠀⠀ \
        ⣇⣀⣀⣀ ⡏⠀⠀⠀ ⣇⣀⣀⡀ ⡏⠁⠀⠀ ⣇⣀⣀⠀ ⡏⠉⠀⠀ ⣇⣀⡀⠀ ⡏⠉⠁⠀ ⣇⣀⠀⠀ ⡏⠉⠉⠀ ⣇⡀⠀⠀ \
        ⡏⠉⠉⠁ ⣇⠀⠀⠀ ⡏⠉⠉⠉ ⡇⠀⠀⠀ ⡏⠉⠉⠙ ⠇⠀⠀⠀ ⡏⠉⠉⠹ ⠃⠀⠀⠀ ⡏⠉⠉⢹ ⠁⠀⠀⠀ ⡏⠉⠉⢹ \
        ⠀⠀⠀⠈ ⠏⠉⠉⢹ ⠀⠀⠀⠘ ⠋⠉⠉⢹ ⠀⠀⠀⠸
    IFS=, read -ra shs <<<"$str"
    local -i pnt
    printf '\e7' 1>&2
    while ! read -rsn1 -t "${1:-.02}"; do
        printf '%s\e8' "${shs[pnt++%${#shs[@]}]}" 1>&2
    done
}
declare -i doSpinner
# Main script:
printf '\nScan filesystem... ' >&2
startSpinner
bunch=5000
files=(**)
filecnt=${#files[@]}
col=$(tput cols)
exec {OutFD}> >(zstd >/tmp/grepResult.zst)
for (( i=0 ; i <= 1 + filecnt / bunch ; i++ )); do
    sfiles=("${files[@]: bunch * i : bunch}")
    (( ${#sfiles[@]} )) || continue
    exec {grepFd}< <(grep -d skip -D skip "$1" "${sfiles[@]}" >&$OutFD 2>&1;
                     echo done)
    declare -i bpid=$! gpid=0
    printf -v sfmt '[%q]=cnt++ ' "${sfiles[@]}"
    declare -Ai idxs="($sfmt)"
    while [[ -d /proc/$bpid ]]; do
        ((gpid)) || gpid=$(ps --ppid $bpid ho pid)
        file=$(readlink /proc/$gpid/fd/3)
        if [[ $file ]] && (( crt=${idxs["${file#$PWD/}"]:-0}, crt > 1 )); then
            (( doSpinner )) && stopSpinner
            percent $crt $filecnt pct
            percentBar $pct $((col-8-2*${#filecnt})) bar
            printf '\r%*s/%s\e[44;32m%s\e[0m%6.2f%%' \
                ${#filecnt} $crt $filecnt "$bar" "$pct" >&2
            read -ru $grepFd -t .02 _
        fi
    done
done
exec {OutFD}>&-
(( doSpinner )) && stopSpinner
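The percent helper can be exercised standalone; a quick sketch (function copied verbatim from the script above; LC_ALL=C is pinned so that printf's %f parsing and output both use a decimal dot, and a variable name other than p is used, since p is local to the function):

```shell
LC_ALL=C
percent(){ local p=00$(($1*100000/$2));printf -v "$3" %.2f ${p::-3}.${p: -3};}
percent 50 200 pc; echo "$pc"   # → 25.00
percent 3 7 pc;    echo "$pc"   # → 42.86
```

The trick: $1*100000/$2 computes the percentage with three extra integer digits, then the string is re-split into integer and fractional parts, avoiding any floating-point arithmetic.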
Some explanations:
This script accepts only one argument: the grep pattern.
You must cd to the root of your search tree before running it.
If you want to store the output uncompressed, replace:
exec {OutFD}> >(zstd >/tmp/grepResult.zst)
with:
exec {OutFD}>/tmp/grepResult
or, to send output to standard output (as the monitor display uses stderr):
exec {OutFD}>&1
In order to prevent an "argument list too long" error, I split the file list stored in $files into bunches of 5,000; this could be tuned...
Immediately after scanning the filesystem with (**), I run a loop over the number of files, by bunch,
then, first, run grep as a background task.
With the two lines printf -v sfmt... and declare -Ai idxs=...cnt++, I build (quickly, because there is no loop) an associative array used as a file index, to know the position of every file of the bunch within the whole $files list.
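That index trick can be sketched in isolation (hypothetical file names; with the -i attribute, each cnt++ value is evaluated arithmetically at assignment time, and %q quotes names containing spaces):

```shell
sfiles=('a.log' 'b c.log' 'd.log')          # hypothetical bunch
printf -v sfmt '[%q]=cnt++ ' "${sfiles[@]}"
declare -Ai idxs="($sfmt)"
echo "${idxs[a.log]}"   # → 0
k='b c.log'
echo "${idxs[$k]}"      # → 1
echo "${idxs[d.log]}"   # → 2
```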
While grep is running, its parent PID has to exist!
The grep PID is the only child of the backgrounded bash run by exec {grepFd}< <(....
And grep always opens the file it is reading on file descriptor 3 (except when grepping from stdin).
So readlink will return the full path of the file currently being read by grep.
We have to drop "$PWD/" to use the file name as the $idxs key (note that files=(**) starts from $PWD): crt=${idxs["${file#$PWD/}"]}.
One step further: add another progress bar, showing the read position within big files.
As I plan to use stat repeatedly, I will use the loadable bash stat builtin. This requires the bash-builtins package to be installed! Under Debian-based distros, use apt install bash-builtins (methods for other distros welcome as comments!).
Then replace everything from # Main script: onward with:
enable -f /usr/lib/bash/stat stat
printf '\nScan filesystem... ' >&2
startSpinner
bunch=10000
files=(**)
filecnt=${#files[@]}
col=$(tput cols)
exec {OutFD}> >(zstd >/tmp/grepResult.zst)
# exec {OutFD}>&1
for (( i=0 ; i <= 1 + filecnt / bunch ; i++ )); do
    sfiles=("${files[@]: bunch * i : bunch}")
    (( ${#sfiles[@]} )) || continue
    exec {grepFd}< <(grep -d skip -D skip "$1" "${sfiles[@]}" >&$OutFD 2>&1;
                     echo done)
    declare -i bpid=$! gpid=0
    printf -v sfmt '[%q]=cnt++ ' "${sfiles[@]}"
    declare -Ai idxs="($sfmt)"
    while [[ -d /proc/$bpid ]]; do
        ((gpid)) || gpid=$(ps --ppid $bpid ho pid)
        file=$(readlink /proc/$gpid/fd/3)
        if [[ $file ]] && (( crt=${idxs["${file#$PWD/}"]:-0}, crt > 1 )); then
            (( doSpinner )) && stopSpinner
            { read -r _ fpos </proc/$gpid/fdinfo/3 ;} 2>/dev/null || fpos=0
            stat -A fstat "$file"
            percent $((fpos<fstat[size]?fpos:fstat[size])) ${fstat[size]} fpct
            percentBar $fpct $((col-47)) fbar
            percent $crt $filecnt pct
            percentBar $pct $((col-8-2*${#filecnt})) bar
            file=${file##*/}
            printf '\r%-40s\e[44;32m%s\e[0m%6s%%\n%*s/%s\e[44;32m%s\e[0m%6s%%\e[A' \
                "${file::40}" "$fbar" "$fpct" \
                ${#filecnt} $crt $filecnt "$bar" "$pct" >&2
            read -ru $grepFd -t .02 _
        fi
    done
done
exec {OutFD}>&-
printf '\n\n' >&2
- Where I use read -r _ fpos </proc/$gpid/fdinfo/3 to know the exact position at which the current grep PID is reading $file...
- Then I use stat to know the size of $file.
- This is subject to a kind of race condition: there is a good chance the current grep has already closed the file returned by readlink by the time fdinfo/3 is accessed. This is trapped by { ... ;} 2>/dev/null and by fpos < fstat[size] ? fpos : fstat[size].
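The error trap can be checked against a PID that (almost certainly) does not exist; the PID value below is a hypothetical one chosen near the default pid_max:

```shell
gpid=4194304   # hypothetical PID, at/above default pid_max: very unlikely to be live
{ read -r _ fpos </proc/$gpid/fdinfo/3 ;} 2>/dev/null || fpos=0
echo "$fpos"   # → 0
```

The braces make 2>/dev/null also cover the redirection failure itself, and the || fallback guarantees fpos holds a usable value.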
Then I will see two progress bars!
Movie-0023.mp4 █████ 15.35%
79943/80170█████████████████████████████████████████████████████████████▊ 99.72%
- the first line shows the name of the file currently being read and the position of the grep pointer within it.
- the second line shows the progression over the whole file count.
About performance
All tests were done using the
sync && echo 3 > /proc/sys/vm/drop_caches
command, as root, between each test.
I've compared the execution of both versions with:
time grep -d skip -D skip -r --exclude='.*' 'some pattern' * |& zstd >/tmp/grepResult-00.zst
I ran this several times and post the best result here:
real 4m43.832s
user 0m5.701s
sys 0m7.375s
in order to compare the result file with the /tmp/grepResult.zst generated by the script.
On my host, with my configuration:
time myProgressGrep.sh 'some pattern'
80166/80170██████████████████████████████████████████████████████████████100.00%
real 4m24.766s
user 0m30.360s
sys 0m55.247s
time myDualProgressGrep.sh 'some pattern'
Xterm.log.host.2023.11.21.18.55.11.340 ████████████████████████████████▎ 98.08%
80168/80170██████████████████████████████████████████████████████████████100.00%
real 4m21.073s
user 0m27.864s
sys 0m50.262s
Yes, surprisingly, my script seems quicker than the plain grep -r command.
(I suspect the -r option is not as performant as bash's globstar.)
ppGrep.sh bash script
You will find it here: ppGrep.sh
Usage:
Usage: ppGrep.sh [OPTIONS] <PATTERN> [FILE] [FILE...]
Options [-E|-F|-G|-P] [-l|-L] [-H] [-Z] [-a] [-c] [-b] [-i] [-l] [-n] [-o] [-s]
and [-v] [-w] [-x], as [-e "PATTERN"] and [-f "PATTERN FILE"] are bind
to 'grep' tasks (see man grep!).
-j NUM Max job to run together (default: "3").
-C PATH 'cd' to PATH before running (instead of "/home/felix/Work/Devel/bash").
-T FILE Files list from FILE.
-z Files list are null bytes separated.
-d Dump both STDOUT and STDERR as soon as possible (in right order)
default is to keep everything in memory until last job finish.
-W Display Warnings when killing some subpid.
-h Show this.
Note: FILE cannot be else than a file! There are no '-r' option.
Requirements
Note: This is a modern bash script: it uses recent bash features, and it uses zstandard compression to store each process's output so it can return them in the correct order. So this script depends on external binaries.
For Debian-based distributions, it requires:
coreutils: readlink, cat, sync, rm, stat
libc-bin: getent, getconf
ncurses-bin: tput
procps: ps
zstd: zstd, zstdcat
grep: grep
Tests and performance
My tests show that using this is quicker than plain grep, even with only one process!
But only while data has to be read from the filesystem. If data is already in the memory cache, my script's footprint becomes visible.
Here is a little comparison of plain grep vs ppGrep.sh, using 1 to 6 parallel processes, on my host:
No cache Cached
grep 10' 58.352902" 7.646374"
ppGrep 1p. 10' 33.748950" 18.293024"
ppGrep 2p. 9' 23.600785" 16.992925"
ppGrep 3p. 8' 21.149429" 14.622010"
ppGrep 4p. 8' 8.025807" 14.604110"
ppGrep 5p. 7' 52.778152" 14.888845"
ppGrep 6p. 7' 12.015095" 12.992566"
When data isn't cached, on a ~10-minute job, ppGrep can be 30" quicker.
When data is cached, ppGrep uses ~7 more seconds...
By running 3 parallel processes, you gain ~2 minutes. This becomes noticeable.
You will find the test script and my full test results in the same directory, on my website.
Here is a sample, accelerated ~3×:

TODO
- grep on compressed files, like zgrep, zstdgrep...
- add an option for the bar color
- run tests in more different environments (fs, hdd, ssd, iscsi...)
- shuffle the file list, to reduce the chance that one folder of big files all goes to a single process
- add a '-q' option: print the first file found, then quit and end all tasks
(Help and suggestions welcome! ;-)