diff consuming huge amount of memory and cpu

Question

I have two files, all.txt (525,953,272 records) and subset.txt (525,298,281 records). Each record is nothing but a 17-digit ASCII integer. Both files have been sorted, and duplicate records within each file have been removed. Every value in subset.txt also exists in all.txt. I wish to find the records in all.txt that are not in subset.txt.

I'm trying to run a diff between these two files, thinking it will write out the rows that are in all.txt but not in subset.txt. The machine has 64GB of memory. The diff has been running for a half hour and currently has acquired about 75% of the memory.

Can anyone speculate on what might be going on, and whether there are arguments to diff that might help? Is this just not what diff was meant to do, and is there a different approach I should use?

525953272 × 17 bytes × 2 ≈ 16 GiB, and 64 × 0.75 / 16 = 3, as a starting point. It is not unlikely that a diff utility manages to use 2x. Which diff is this? What does something like diff --version report?
– user367890, Commented Jan 24, 2016 at 0:41
Did you consider writing a simple program to compute that? Take into account that each file is sorted: you'll advance linearly through both files at different paces. (Take inspiration from mergesort if you don't see what I mean.)
– Basile Starynkevitch, Commented Jan 24, 2016 at 1:30
@user367890 $ diff --version diff (GNU diffutils) 2.8.1 Copyright (C) 2002 Free Software Foundation, Inc.
– Chap, Commented Jan 24, 2016 at 3:20
@JeffSchaller - comm appears to do exactly what I need. Give it as an answer and I'll accept it. Thx.
– Chap, Commented Jan 24, 2016 at 3:41

Jeff Schaller · Accepted Answer · 2016-01-24 14:12:27Z

Can anyone speculate on what might be going on, and whether there are arguments to diff that might help? Is this just not what diff was meant to do, and is there a different approach I should use?

This is not what diff was meant to do; when the inputs have been sorted (as yours have), the tool for the job is comm.

$ seq 10 15 > subset.txt
$ seq 10 20 > all.txt
$ comm -13 subset.txt all.txt
16
17
18
19
20

The options to comm are a little unusual in that they turn off output: each of -1, -2 and -3 suppresses the corresponding column. Column 1 has lines that are unique to file 1; column 2 has lines that are unique to file 2; and column 3 has lines that are "comm"on to both. By using -13 we suppress columns 1 and 3, leaving only column 2: the lines found only in the second file, all.txt.
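To make the three-column idea concrete, here is a minimal Python sketch of how a comm-like classification over two sorted inputs could work. It is an illustration only, not comm's actual implementation: the function name comm_columns is made up for this example, and it assumes plain string comparison matches the order the inputs were sorted in (comm itself follows the locale's collation order).

```python
def comm_columns(file1_lines, file2_lines):
    """Yield (column, line) pairs for two sorted inputs, in the spirit of comm."""
    it1, it2 = iter(file1_lines), iter(file2_lines)
    a, b = next(it1, None), next(it2, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a < b):
            yield 1, a                      # only in file 1
            a = next(it1, None)
        elif a is None or b < a:
            yield 2, b                      # only in file 2
            b = next(it2, None)
        else:
            yield 3, a                      # common to both files
            a, b = next(it1, None), next(it2, None)


# Same data as the seq example: keeping only column 2 mirrors comm -13.
subset = [str(n) for n in range(10, 16)]
everything = [str(n) for n in range(10, 21)]
only_in_all = [line for col, line in comm_columns(subset, everything) if col == 2]
print(only_in_all)
```

Each input is read once and only one line from each is held at a time, which is why this style of comparison stays cheap on very large sorted files.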

replay · Answer · 2016-01-24 11:06:35Z

diff might not be the most suitable tool to do that. I would try to write a simple script which does specifically what you want.

All in memory

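This is a very simple and generic solution. It loads both files into Python sets, subtracts the records of subset.txt from the records of all.txt and writes out the remainder. Be aware that a set of strings needs considerably more memory than the raw file size, so with files of this size it may not fit in RAM.

#!/usr/bin/env python

# Load both files into sets and subtract one from the other.
with open('all.txt', 'r') as all_txt, open('subset.txt', 'r') as subset_txt:
    remainder = set(all_txt) - set(subset_txt)

with open('diff.txt', 'w') as diff:
    # Sets are unordered, so sort before writing to keep the file order.
    diff.writelines(sorted(remainder))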

Save that into a file like create_diff.py, then chmod +x create_diff.py and run it in the directory where your two files are.
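Before trying the in-memory variant on half a billion records, it may be worth estimating how much memory the sets would need. The sketch below extrapolates from a small synthetic sample; estimate_set_bytes is a hypothetical helper written for this illustration, the sample records are made up, and sys.getsizeof reports CPython's own accounting, so treat the result as a rough figure only.

```python
import sys


def estimate_set_bytes(sample_lines, total_records):
    """Roughly extrapolate the memory a set of line strings would need."""
    sample = set(sample_lines)
    # The set's own hash table plus every string object it references.
    sample_bytes = sys.getsizeof(sample) + sum(sys.getsizeof(s) for s in sample)
    return sample_bytes * total_records // len(sample)


# Assumption: 100,000 synthetic 17-digit records, each with a trailing newline.
sample_lines = ["%017d\n" % n for n in range(100000)]
estimate = estimate_set_bytes(sample_lines, 525953272)
print("roughly %.1f GiB for all.txt alone" % (estimate / 2.0 ** 30))
```

If the estimate for both files together approaches the machine's RAM, one of the streaming variants is the safer choice.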

Only subset in memory

To lower the memory footprint, only subset.txt needs to be held in memory; all.txt can simply be iterated over once, line by line.

#!/usr/bin/env python

# Keep subset.txt in a set so that each membership test is O(1) on average;
# testing against a list would rescan the whole list for every record.
with open('subset.txt', 'r') as subset_txt:
    subset = set(subset_txt)

with open('diff.txt', 'w') as diff_txt:
    with open('all.txt', 'r') as all_txt:
        for line in all_txt:
            if line not in subset:
                diff_txt.write(line)

I/O based

This should be the slowest variant because it is heavily reliant on I/O: subset.txt is re-read from the start for every record of all.txt. It has a low memory footprint because neither file is loaded into memory, and it works whether or not your files are sorted and deduplicated.

#!/usr/bin/env python

with open('diff.txt', 'w') as diff_txt:
    with open('all.txt', 'r') as all_txt:
        with open('subset.txt', 'r') as subset_txt:
            for all_line in all_txt:
                # Rewind so that every record is checked against all of subset.txt.
                subset_txt.seek(0)
                found = False

                for sub_line in subset_txt:
                    if all_line == sub_line:
                        found = True
                        break

                if not found:
                    diff_txt.write(all_line)

Only for sorted files without duplicates <- recommended in your case

If you're sure that both your files are sorted and contain no duplicates, and that every record in subset.txt also exists in all.txt (as you state), this should be the best solution. Each file is read only once, and neither needs to be loaded into memory completely.

#!/usr/bin/env python

diff_txt = open('diff.txt', 'w')

with open('all.txt', 'r') as all_txt:
    with open('subset.txt', 'r') as subset_txt:
        subset_line = subset_txt.readline()

        for all_line in all_txt:
            if all_line == subset_line:
                subset_line = subset_txt.readline()
            else:
                diff_txt.write(all_line)

diff_txt.close()
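The script above relies on every record of subset.txt also being present in all.txt. If that cannot be guaranteed, a slightly more defensive merge could look like the sketch below. The name sorted_difference is made up for this illustration, and it assumes both inputs are sorted in plain string order, which holds for fixed-width digit records.

```python
def sorted_difference(all_lines, subset_lines):
    """Yield records of all_lines that do not occur in subset_lines.

    Both inputs must be sorted in the same (string) order.
    """
    subset_iter = iter(subset_lines)
    subset_rec = next(subset_iter, None)
    for all_rec in all_lines:
        # Skip subset records that sort before the current record,
        # including any that never appear in all_lines at all.
        while subset_rec is not None and subset_rec < all_rec:
            subset_rec = next(subset_iter, None)
        if subset_rec != all_rec:
            yield all_rec


# Small in-memory demonstration; "09" is in the subset but not in the full list.
records = ["10\n", "11\n", "12\n", "13\n"]
subset = ["09\n", "11\n", "12\n"]
result = list(sorted_difference(records, subset))
print(result)
```

Open file objects can be passed in place of the lists, for example diff_txt.writelines(sorted_difference(all_txt, subset_txt)), so both files are still streamed only once.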
