diff might not be the most suitable tool to do that. I would try to write a simple script which does specifically what you want.
All in memory
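This is a very simple and generic solution. It loads both files into sets, subtracts the records of subset.txt from the records of all.txt, and writes out the remainder. Because sets are unordered and hold each record only once, the output order is arbitrary and duplicate lines are written a single time.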
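    #!/usr/bin/env python
    # Set difference: records of all.txt that do not occur in subset.txt.
    with open('all.txt') as all_txt, open('subset.txt') as subset_txt:
        remainder = set(all_txt) - set(subset_txt)
    with open('diff.txt', 'w') as diff:
        diff.writelines(remainder)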
Save that into a file like create_diff.py, then chmod +x create_diff.py and run it in the directory where your two files are.
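Since a set has no order, the script above writes the remaining records in arbitrary order. If the order matters, sorting the remainder before writing is one option. A minimal sketch of that idea follows; the names difference_sorted and demo and the sample data are made up for illustration, and it assumes Python 3.

```python
import os
import tempfile


def difference_sorted(all_path, subset_path, out_path):
    """Write the lines of all_path that are not in subset_path, sorted."""
    with open(all_path) as all_txt, open(subset_path) as subset_txt:
        remainder = set(all_txt) - set(subset_txt)
    with open(out_path, 'w') as out:
        out.writelines(sorted(remainder))


def demo():
    """Run difference_sorted on two tiny sample files; return the output."""
    with tempfile.TemporaryDirectory() as tmp:
        all_path = os.path.join(tmp, 'all.txt')
        subset_path = os.path.join(tmp, 'subset.txt')
        out_path = os.path.join(tmp, 'diff.txt')
        with open(all_path, 'w') as f:
            f.write('c\na\nb\na\n')   # unsorted, with a duplicate
        with open(subset_path, 'w') as f:
            f.write('b\n')
        difference_sorted(all_path, subset_path, out_path)
        with open(out_path) as f:
            return f.read()


if __name__ == '__main__':
    print(demo())
```

Like the script above, this still holds both files in memory, so it only helps with ordering, not with memory use.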
Only subset in memory
If you need a lower memory footprint, you can avoid loading both files into memory. Only subset.txt has to be held in memory; all.txt can simply be iterated over once, line by line.
    #!/usr/bin/env python
    # Hold only subset.txt in memory; a set keeps each lookup fast.
    with open('subset.txt') as subset_txt:
        subset = set(subset_txt)
    # Stream all.txt and keep every line that is not in the subset.
    with open('diff.txt', 'w') as diff_txt:
        with open('all.txt') as all_txt:
            for line in all_txt:
                if line not in subset:
                    diff_txt.write(line)
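One thing to watch for with any line-by-line comparison: if the last line of one file has no trailing newline, it will not compare equal to the same record followed by a newline in the other file. A possible way around that is to compare the lines with their line endings stripped. This is only a sketch; the function names filter_lines and write_diff are hypothetical.

```python
def filter_lines(all_lines, subset_lines):
    """Yield the lines of all_lines whose content is not in subset_lines.

    Line endings are stripped before comparing, so a final line without a
    trailing newline still matches. Yielded lines are left unchanged.
    """
    subset = {line.rstrip('\r\n') for line in subset_lines}
    for line in all_lines:
        if line.rstrip('\r\n') not in subset:
            yield line


def write_diff(all_path='all.txt', subset_path='subset.txt',
               out_path='diff.txt'):
    """Stream all_path through filter_lines into out_path."""
    with open(subset_path) as subset_txt, open(all_path) as all_txt, \
            open(out_path, 'w') as diff_txt:
        # The generator is consumed here, while all three files are open.
        diff_txt.writelines(filter_lines(all_txt, subset_txt))
```

Calling write_diff() in the directory with your two files would produce diff.txt just like the script above, still holding only the subset in memory.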
I/O based
This should be the slowest variant because it relies heavily on I/O: subset.txt is re-read for every line of all.txt. In return it has a low memory footprint, because neither file has to be loaded into memory completely. It works whether or not your files are sorted or deduplicated.
    #!/usr/bin/env python
    diff_txt = open('diff.txt', 'w')
    with open('all.txt', 'r') as all_txt:
        with open('subset.txt', 'r') as subset_txt:
            for all_line in all_txt:
                # Scan subset.txt for the current line of all.txt.
                found = False
                for sub_line in subset_txt:
                    if all_line == sub_line:
                        found = True
                        break
                if not found:
                    diff_txt.write(all_line)
                # Rewind subset.txt for the next line of all.txt.
                subset_txt.seek(0)
    diff_txt.close()
Only for sorted files without duplicates <- recommended in your case
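If you're sure that both your files are sorted in the same order and contain no duplicates, this should be the best solution. Both files are read only once and neither needs to be loaded into memory completely. It also relies on every line of subset.txt actually occurring in all.txt; a subset line that is missing from all.txt would never match, and every later line would be written out.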
    #!/usr/bin/env python
    diff_txt = open('diff.txt', 'w')
    with open('all.txt', 'r') as all_txt:
        with open('subset.txt', 'r') as subset_txt:
            subset_line = subset_txt.readline()
            for all_line in all_txt:
                if all_line == subset_line:
                    # Matched: drop this line and advance in subset.txt.
                    subset_line = subset_txt.readline()
                else:
                    diff_txt.write(all_line)
    diff_txt.close()
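If you are not entirely sure about the "no duplicates" and "true subset" conditions, the same single-pass idea can be made a bit more forgiving by comparing the lines with < instead of only testing for equality. The following is a sketch under one assumption: both files are sorted in the order in which Python compares strings (by code point), which may differ from a locale-aware sort. The names sorted_difference and _next_key are made up for this example.

```python
def _next_key(lines):
    """Return the next line without its newline, or None at end of input."""
    line = next(lines, None)
    return None if line is None else line.rstrip('\n')


def sorted_difference(all_lines, subset_lines):
    """Yield the lines of all_lines that do not occur in subset_lines.

    Both inputs must be sorted by code point. Lines in subset_lines that
    are missing from all_lines are skipped, and repeated lines in
    all_lines are all dropped if they occur in subset_lines.
    """
    subset_iter = iter(subset_lines)
    subset_key = _next_key(subset_iter)
    for all_line in all_lines:
        key = all_line.rstrip('\n')
        # Skip subset lines that sort before the current line; they
        # cannot match this line or any later one.
        while subset_key is not None and subset_key < key:
            subset_key = _next_key(subset_iter)
        if subset_key != key:
            yield all_line


if __name__ == '__main__':
    with open('all.txt') as all_txt, open('subset.txt') as subset_txt, \
            open('diff.txt', 'w') as diff_txt:
        diff_txt.writelines(sorted_difference(all_txt, subset_txt))
```

It still reads each file only once and keeps just one line of each in memory at a time.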
525953272 × 17 bytes × 2 ≈ 16 GiB and 64 × 0.75 / 16 = 3. As a starting point. Not unlikely a diff utility manages to use 2x. What diff? Something like diff --version comm