How to diff two files in bash efficiently?

Question

I have two files that contain similar patterns:

cmd1 [cmd2 {xx/xx[7] x/x[0] ...}] cmd3 [cmd4 {xx/x[12] ...}]
cmd5 [cmd6 {x/x[1] xx ...}]

I don't need to compare all the cmds in the two files; I only need to normalise the order of the string lists inside the braces. I sort the string lists in each file, then sort the two files separately, and finally use comm to output the similarities and differences. I use the same flow for both files. The flow is below:

matchedBraces=$(grep -o '\{[^}]*\}' $fileA)   #grep all the braces and strings in them

while read perMatch
do
sort_now "$perMatch" $fileA
done <<< "$matchedBraces"

function sort_now {
beforeSort=$(echo "$1" | sed 's?\[?\\[?g' | sed 's?\]?\\]?g')   #in case string has square brackets in it, change [...] to \[...\] for later use
afterSort=$(echo "$1" | grep -o '[^{} ]*' | sort | tr '\n' ' ')   #get all the individual strings in brace, sort them, put them as a string with a trailing white space 
afterSort={$(echo $afterSort)}   #delete the trailing white space,  add the brace back to the string list
afterSort=$(echo "$afterSort" | sed 's?\[?\\[?g' | sed 's?\]?\\]?g')   #in case string has square brackets in it, change [...] to \[...\] for later use
#the variables may be very large, so I have to pass the sed script on stdin this way
sed -i -f - $2 << eof
s?${beforeSort}?${afterSort}?g
eof
sort $2 -o $2
}

I changed the order of the code for clarity. It works flawlessly, but if a file contains 50k lines of the form cmd1 [cmd2 {xx/xx}], each with only one string in the braces, it is very time-consuming. Even with 8 CPUs and 200G of memory, it is still running after several hours. I know that in Tcl, appending to a string with the append command is much more efficient than using set. I'm wondering whether bash has similar commands or features. Or can someone optimise my code to save time?

People are more likely to make an effort to answer if you provide a minimal, testable example of your input and desired output. My first thought is that you could use perl to non-greedily match brace contents, split on whitespace, and sort the result - something like perl -pe 's/(?<=\{).*?(?=\})/join " ", sort split " ", $&/ge' file — steeldriver, Commented May 25, 2024 at 11:22
Yes, please add an example of your input and the output you want from that example. We can't help you parse data that we do not see. You might also want to read Why is using a shell loop to process text considered bad practice?. — terdon, Commented May 25, 2024 at 12:20
should afterSort=$(echo "$1" ... be afterSort=$(echo "$beforeSort" ...? — markp-fuso, Commented May 25, 2024 at 13:00
please provide 3-4 sample lines from both files, making sure to cover all of your possible patterns, some patterns only in file1, some patterns only in file2, some patterns in both files; update the question with the grep/sed/sort results for both files (as opposed to making us reverse engineer the code to figure out the desired sorted output); update the question with the final results — markp-fuso, Commented May 25, 2024 at 13:02
invoking 10 (?) subshells for each pattern is a massive time consumer; assuming all of those sed calls are actually required it's probably possible to consolidate many together; then again, eliminating the row-by-row processing of a bash loop and replacing with a more appropriate tool (eg, awk, perl, python, etc) is going to be another big time saver; I'm guessing most (if not all) of this code could be replaced with a single awk/perl/python script ... but we'll need a robust set of sample data and better description of your sorting algorithm — markp-fuso, Commented May 25, 2024 at 13:08
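
To illustrate the single-script idea raised in the comments, here is a minimal sketch, assuming the input looks like the two sample lines in the question. The flow in the question runs one sed -i and one full sort per brace group, so its cost grows roughly quadratically with the number of lines; the sketch below reads each file once instead. The function name normalize_braces is hypothetical, and tokens are ordered by awk's string comparison, which may differ from the locale order that sort uses. That should not matter as long as both files go through the same function.

```shell
#!/bin/bash
# normalize_braces FILE (hypothetical helper name)
# Prints FILE with the tokens inside every {...} group sorted,
# then sorts the resulting lines so that comm can consume them.
normalize_braces() {
    awk '
    # Sort the whitespace-separated tokens of s with an insertion sort.
    # Appending "" forces string comparison even for numeric-looking tokens.
    function sort_tokens(s,    n, i, j, t, tok, out) {
        n = split(s, tok, " ")
        for (i = 2; i <= n; i++) {
            t = tok[i]
            for (j = i - 1; j >= 1 && (tok[j] "") > (t ""); j--)
                tok[j + 1] = tok[j]
            tok[j + 1] = t
        }
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? " " : "") tok[i]
        return out
    }
    {
        rest = $0
        line = ""
        # Walk the line one {...} group at a time, left to right.
        while (match(rest, /[{][^}]*[}]/)) {
            inner = substr(rest, RSTART + 1, RLENGTH - 2)
            line = line substr(rest, 1, RSTART - 1) "{" sort_tokens(inner) "}"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print line rest
    }' "$1" | LC_ALL=C sort
}
```

With that in place, the comparison could be run without modifying either file, for example: LC_ALL=C comm -3 <(normalize_braces "$fileA") <(normalize_braces "$fileB"). Without real sample data this is untested against the actual files, so treat it as a starting point.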

user611494 · Accepted Answer · 2024-05-25 12:10:45Z

#!/bin/bash

# Function to sort string lists within braces
sort_string_lists() {
    local file="$1"
    local matchedBraces=$(grep -o '\{[^}]*\}' "$file")
    
    while IFS= read -r perMatch; do
        sort_now "$perMatch" "$file"
    done <<< "$matchedBraces"
}

# Helper: sort the strings inside one brace group, rewrite that group in the file, then re-sort the file's lines
sort_now() {
    local match="$1"
    local file="$2"
    
    beforeSort=$(echo "$match" | sed 's?\[?\\[?g' | sed 's?\]?\\]?g')
    afterSort=$(echo "$match" | grep -o '[^{} ]*' | sort | tr '\n' ' ')
    afterSort="{$(echo $afterSort)}"
    afterSort=$(echo "$afterSort" | sed 's?\[?\\[?g' | sed 's?\]?\\]?g')
    
    sed -i -f - "$file" << EOF
s?${beforeSort}?${afterSort}?g
EOF
    
    sort "$file" -o "$file"
}

# Main script starts here
fileA="path/to/fileA.txt"
fileB="path/to/fileB.txt"

# Sort string lists in both files
sort_string_lists "$fileA"
sort_string_lists "$fileB"

# Compare sorted files using comm
comm -23 "$fileA" "$fileB"

This script does the following:

Defines a function sort_string_lists to extract and sort the string lists in curly braces for a given file.
Defines a helper function sort_now to perform the actual sorting of the string lists.
Calls sort_string_lists on both input files to make sure they are sorted according to the criteria you specify.
Uses comm -23 to compare the sorted files and output the differences (lines that appear only in fileA).
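
As a quick reference for the last step, comm prints three columns (lines only in the first file, lines only in the second file, lines in both), and the -1, -2 and -3 flags suppress the corresponding column. A minimal sketch follows; the three wrapper names are hypothetical, and both inputs are assumed to be already sorted in the C locale.

```shell
#!/bin/bash
# Hypothetical wrappers around comm; each takes two sorted files.
# LC_ALL=C keeps the expected ordering consistent with LC_ALL=C sort.
only_in_first()  { LC_ALL=C comm -23 "$1" "$2"; }  # suppress columns 2 and 3
only_in_second() { LC_ALL=C comm -13 "$1" "$2"; }  # suppress columns 1 and 3
in_both()        { LC_ALL=C comm -12 "$1" "$2"; }  # suppress columns 1 and 2
```

To see both kinds of difference at once, comm -3 prints the lines unique to each file and leaves out the common ones.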

Thanks for your patience. But since you optimised it based on my code, it will still be very time-consuming. I'm considering using python to do it. Thanks for your help anyway:) — Alan Gatsby, Commented Aug 14, 2024 at 9:07
@AlanGatsby if you were to update the question with the requested data (samples from both files, matching and not-matching lines, expected results) you would improve your chances of getting an answer that improves on accuracy and performance — markp-fuso, Commented Aug 14, 2024 at 13:47
@markp-fuso Thanks for your patient reply. I've read the other comments above, and I know that shell commands may not be very efficient when looping through a large file. So I'll try python to solve my challenge. — Alan Gatsby, Commented Aug 16, 2024 at 10:34

user2918098 · Accepted Answer · 2024-05-25 17:01:22Z

The answer to the question "How to diff two files in bash efficiently?" - use the diff command - a quick google will tell you that.

https://askubuntu.com/questions/515900/how-to-compare-two-files

I'm not sure we should be helping you with your hw assignment.

That cannot solve my problem. But thanks for your kind response anyway:) — Alan Gatsby, Commented Jul 27, 2024 at 3:29
