diff consuming huge amounts of memory and CPU

I have two files, all.txt (525,953,272 records) and subset.txt (525,298,281 records). Each record is nothing but a 17-digit ASCII integer. Both files have been sorted, and duplicate records within each file have been removed. Every value in subset.txt also exists in all.txt. I wish to find the records in all.txt that are not in subset.txt.

I'm trying to run a diff between these two files, thinking it will write out the rows that are in all.txt but not in subset.txt. The machine has 64GB of memory. The diff has been running for half an hour and is currently using about 75% of the memory.

Can anyone speculate on what might be going on, and whether there are arguments to diff that might help? Is this just not what diff was meant to do, and is there a different approach I should use?
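Since both files are already sorted and deduplicated, one streaming alternative (a sketch, not necessarily the only answer) is `comm`, which walks two sorted files line by line in constant memory instead of computing an edit script the way diff does. The small sample files below are invented for illustration:

```shell
# Both inputs must be sorted; create tiny sample files for illustration
printf '%s\n' 10000000000000001 10000000000000002 10000000000000003 > all.txt
printf '%s\n' 10000000000000001 10000000000000003 > subset.txt

# comm prints three columns: lines only in file 1, lines only in file 2,
# and lines in both. -2 suppresses the second column, -3 the third,
# leaving only the lines unique to all.txt.
comm -23 all.txt subset.txt
```

Because `comm` reads both files sequentially, its memory use does not grow with file size, which matters at half a billion records.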