
I usually use grep -rIn pattern_str big_source_code_dir to find things, but grep is not parallel. How do I make it parallel? My system has 4 cores; if grep could use all of them, it would be faster.


4 Answers


There will be no speed improvement if you are using an HDD to store the directory you are searching. Hard drives are pretty much single-threaded access units.

But if you really want to do parallel grep, then this website gives two hints on how to do it with find and xargs, e.g.:

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
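Spelled out flag by flag, a close variant of that command (a sketch that keeps the -In options from the question; the /dev/null trick forces grep to print a filename prefix even when a batch ends up holding a single file):

```shell
# -print0 / -0 : NUL-separate filenames, safe for spaces and quotes
# -P 4         : run up to 4 grep processes concurrently (one per core here)
# -n 40        : hand each grep at most 40 filenames per invocation
# /dev/null    : extra "file" so grep always prefixes matches with a filename
find . -type f -print0 | xargs -0 -P 4 -n 40 grep -In foobar /dev/null
```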

4 Comments

I copied the wrong example from the source website, sorry. I'll fix the answer.
Be aware that with xargs you risk getting mixed output. To see this in action see: gnu.org/software/parallel/…
The benefit of parallel grep will be noticeable when doing repeated greps on the same directory tree, for example for different keywords in the same source code. The file contents will be cached in RAM (depending, of course, on how much RAM and source code you have).
people don't use hdds anymore, welcome to 2022

The GNU parallel command is really useful for this.

sudo apt-get install parallel # if not available on debian based systems

Then, the parallel man page provides an example:

EXAMPLE: Parallel grep
       grep -r greps recursively through directories. 
       On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}

       This will run 1.5 jobs per core, and give 1000 arguments to grep.

In your case it could be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}

Finally, the GNU parallel man page also provides a section describing the differences between xargs and the parallel command, which should help explain why parallel is a better fit in your case:

DIFFERENCES BETWEEN xargs AND GNU Parallel
       xargs offer some of the same possibilities as GNU parallel.

       xargs deals badly with special characters (such as space, ' and "). To see the problem try this:

         touch important_file
         touch 'not important_file'
         ls not* | xargs rm
         mkdir -p "My brother's 12\" records"
         ls | xargs rmdir

       You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n),
       locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z).

       So GNU parallel's newline separation can be emulated with:

       cat | xargs -d "\n" -n1 command

       xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.

       xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be
       done reliably with xargs because of this.
       ...
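One gap noted above has a common workaround: GNU xargs cannot compute a per-core job count itself, but it can be handed one from nproc (part of GNU coreutils; assuming it is installed, as on most Linux systems). A sketch, using the pattern from the question:

```shell
# Approximate parallel's per-core job count with xargs by querying nproc.
# -print0/-0 keep filenames with spaces or quotes intact;
# /dev/null forces grep to prefix every match with its filename.
find . -type f -print0 | xargs -0 -P "$(nproc)" -n 100 grep -In pattern_str /dev/null
```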

5 Comments

I disagree. When grepping, your limiting factor is IO throughput, not CPU time. Throwing more cores at the problem doesn't make your disks spin any faster.
I disagree with your disagreement: time grep -E 'invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' /var/log/auth.log shows 10 seconds on my i7. Then test the drive's speed: dd if=/var/log/auth.log of=/dev/null bs=1M gives 4 seconds for 600 MB at 130 MB/s. But the grep above takes about 3x as long, reading near 40 MB/s, so here the regular-expression processing is the most expensive part. Running in parallel: parallel --pipe --block 16M grep -E 'invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' </var/log/auth.log takes 3 seconds in place of 10.
This seems like the best answer, why is it not higher up? O_o
Ah - because it doesn't work.. luckily I was able to use comm for my particular problem, which was soooo much faster than grep in the same situation..
parallel grep would be handy for network mounts with high latency.

Note that you need to escape special characters in your parallel grep search term, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

Using standalone grep, grep -F 'PostTypeId="1"' would work without escaping the double quotes. It took me a while to figure that out!

Also note the use of LC_ALL=C and the -F flag (if you're just searching full strings) for additional speed-ups.
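A minimal way to see both points with standalone grep, where the inner double quotes need no escaping (the sample input line is invented for illustration):

```shell
# -F treats the pattern as a fixed string (no regex engine),
# and LC_ALL=C avoids slower multibyte locale handling;
# -c counts the matching lines.
printf 'row PostTypeId="1" end\n' | LC_ALL=C grep -cF 'PostTypeId="1"'
# prints: 1
```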



Here are 3 ways to do it, but you can't get line numbers for two of them.

(1) Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prepend the filename to the matching line, because you'll want to know which file matched. Adjust the number of processes (-P) for your machine.

find . -type f | xargs -n 1 -P 4 grep -n <grep-args> /dev/null

(2) Run grep on multiple files in serial but process 10M blocks in parallel. Adjust the block size for your machine and files. Here are two ways to do that.

# for-loop (while-read avoids mangling filenames that contain spaces)
find . -type f | while IFS= read -r filename
do
  parallel --pipepart --block 10M -a "$filename" -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done

# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

(3) Combine (1) and (2): run grep on multiple files in parallel and process their contents in blocks in parallel. Adjust block size and xargs parallelism for your machine.

find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

Beware that (3) may not be the best use of resources.

I've got a longer write-up, but that's the basic idea.

