diff consuming huge amounts of memory and CPU

I have two files, all.txt (525,953,272 records) and subset.txt (525,298,281 records). Each record is nothing but a 17-digit ASCII integer. Both files have been sorted, and duplicate records within each file have been removed. Every value in subset.txt also exists in all.txt. I wish to find the records in all.txt that are not in subset.txt.

I'm trying to run a diff between these two files, thinking it will write out the rows that are in all.txt but not in subset.txt. The machine has 64GB of memory. The diff has been running for half an hour and is currently using about 75% of the memory.

Can anyone speculate on what might be going on, and whether there are arguments to diff that might help? Is this just not what diff was meant to do, and is there a different approach I should use?
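Since both files are already sorted and deduplicated, one streaming alternative (a sketch, not necessarily the only answer) is `comm`, which walks two sorted files line by line in constant memory instead of computing an edit script the way diff does. The small sample files below are invented for illustration:

```shell
# Both inputs must be sorted; create tiny sample files for illustration
printf '%s\n' 10000000000000001 10000000000000002 10000000000000003 > all.txt
printf '%s\n' 10000000000000001 10000000000000003 > subset.txt

# comm prints three columns: lines only in file 1, lines only in file 2,
# and lines in both. -2 suppresses the second column, -3 the third,
# leaving only the lines unique to all.txt.
comm -23 all.txt subset.txt
```

Because `comm` reads both files sequentially, its memory use does not grow with file size, which matters at half a billion records.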