Skip to main content
added 315 characters in body
Source Link
slm
  • 380.1k
  • 127
  • 793
  • 897
FNR == NR {
  a[$1f1[$1,$2,$4] = $0
  p[$1f1_c14[$1,$2,$4] = 1
  q[$1f1_c5[$1,$2,$4] = $5
  next
}  

p[$1f1_c14[$1,$2,$4] {
  if ($5 != q[$1f1_c5[$1,$2,$4]) print a[$1f1[$1,$2,$4];
}
  
a[$1f1[$1,$2,$4] {
  if ($5 != q[$1f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. This block creates 3 arrays, f1, f1_c14, and f1_c5. f1 contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (1, 2, & 4's contents) and a value of 1. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
} 

The next block is responsible for printing lines from the 1st file, file1 under the conditions that the columns 1, 2, & 4 match the columns from file2, AND it will onlu print the line from file1 if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block is responsible for printing the associated line from file2 there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of awk's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file, file, we want to run a particular block of code on this line from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5

other blocks

The other blocks, f1_c14[$1,$2,$4] AND f1[$1,$2,$4] only run when the values from those array elements has a value.

FNR == NR {
  a[$1,$2,$4] = $0
  p[$1,$2,$4] = 1
  q[$1,$2,$4] = $5
  next
}  

p[$1,$2,$4] {
  if ($5 != q[$1,$2,$4]) print a[$1,$2,$4];
}
  
a[$1,$2,$4] {
  if ($5 != q[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4
FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
  
f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. This block creates 3 arrays, f1, f1_c14, and f1_c5. f1 contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (1, 2, & 4's contents) and a value of 1. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
} 

The next block is responsible for printing lines from the 1st file, file1 under the conditions that the columns 1, 2, & 4 match the columns from file2, AND it will onlu print the line from file1 if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block is responsible for printing the associated line from file2 there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of awk's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file, file, we want to run a particular block of code on this line from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5

other blocks

The other blocks, f1_c14[$1,$2,$4] AND f1[$1,$2,$4] only run when the values from those array elements has a value.

added 315 characters in body
Source Link
slm
  • 380.1k
  • 127
  • 793
  • 897

You can use awkawk. Put the following in a script, script.awk:

$FNR awk== 'NR {print 
 $1 a[$1,$2,$4] = $0
  p[$1,$2,$4] $4= 1
  q[$1,$2,$4] = $5
  next
}' file2 |

p[$1,$2,$4] grep{
 -vf if ($5 != q[$1,$2,$4]) print a[$1,$2,$4];
}
  
a[$1,$2,$4] {
  if ($5 != q[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/6080         5520      .        GA       T C        76 80      PASS     N=2     F=5;U=4  F=5;U=4
sc2/6860         2055      .        TG       CT         7176       PASS     N=2     F=5;U=4 
sc2/2460         2455      .        T G      G  C       31 72      PASS     N=2       F=5;U=4

You can use awk:

$ awk '{print $1, $2, $4}' file2 | grep -vf - file1
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/68         20      .        T       C         71       PASS     N=2     F=5;U=4
sc2/24         24      .        T       G         31       PASS     N=2     F=5;U=4

You can use awk. Put the following in a script, script.awk:

FNR == NR { 
  a[$1,$2,$4] = $0
  p[$1,$2,$4] = 1
  q[$1,$2,$4] = $5
  next
}  

p[$1,$2,$4] {
  if ($5 != q[$1,$2,$4]) print a[$1,$2,$4];
}
  
a[$1,$2,$4] {
  if ($5 != q[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4
Post Undeleted by slm
Post Deleted by slm
Source Link
slm
  • 380.1k
  • 127
  • 793
  • 897

You can use awk:

$ awk '{print $1, $2, $4}' file2 | grep -vf - file1
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/68         20      .        T       C         71       PASS     N=2     F=5;U=4
sc2/24         24      .        T       G         31       PASS     N=2     F=5;U=4