I wrote a few lines to find the smallest value in each of my files. The script gives the correct result, but some lines are repeated twice. Can you find the bug?

What I am doing:

  • Looping over all files
  • Removing the header
  • Sorting on column nine as scientific notation (general numeric sort)
  • Taking the first line after the sort, which holds the smallest value, and printing it with awk
  • Printing $i as well, because I want the file name

Script:

#!/bin/bash

for i in `ls -v *.txt` 
do 
smallestPValue=`sed 1d $i | sort -k9 -g | head -1 | awk '{print $0}'` 

echo  $i  $smallestPValue >> smallesttPvalueAll.txt
done

output

U1.txt 4 rsxxx 1672175 A ADD 759 0.0751 4.918 1.074e-06
U1.txt 4 rsxxxx 1672175 A ADD 759 0.0751 4.918 1.074e-06
U2.txt  16 rsxxxx 596342 T ADD 734 -0.05458 -5.204 2.535e-07
U2.txt 16 rsxxxx 596342 T ADD 734 -0.05458 -5.204 2.535e-07
U3.txt 2 rsxxxx 12426 T ADD 722 0.06825 5.285 1.669e-07

I am getting repeated lines for a few files, while others are fine: U3 above appears once, which is what I want. I can easily get rid of the duplicated lines with uniq or sort -u, but I am curious what is causing this.

Desired output: each line appears only once.

  • What is the output of ls -v *.txt?
    – cherdt
    Commented Jul 28, 2017 at 16:49
  • My guess is that you're probably getting dupes because smallesttPvalueAll.txt matches *.txt, so it is processed along with all the other .txt files. But there are so many things wrong with the way you're trying to do this that it's not even worth trying to fix. See my answer below for a better method.
    – cas
    Commented Jul 29, 2017 at 4:49
  • Well, in my folder I have just those thousand files I want to process.
    – star
    Commented Jul 29, 2017 at 22:40

1 Answer

If I'm interpreting it right, you can probably do what you're trying to do with just awk and sort - no need for a loop, or parsing ls (subtle hint: DON'T DO THAT!), or head or sed.

awk 'FNR > 1 {print FILENAME, $0}' *.txt | sort -k10 -g | sort -u -k1,1

This skips the first line of each file, then prints all remaining lines prefixed with the filename and a space (awk's default output field separator, OFS). It then pipes the result through sort to do a general numeric sort on field 10. Finally, it does a unique sort on the first field only (-k1,1, the filename), so that only the first line with that filename is output.

Note that we have to sort on field 10 here, not field 9, because we've added the filename as the first field, so all other field numbers are incremented by 1.

FNR and FILENAME are built-in awk variables. FNR is the line number ("input record number" in awk-lingo) of the current file, and FILENAME is the current filename.
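
To see the pipeline at work, here is a small sketch that can be run in a scratch directory. The two data files and their contents are made up for illustration (only the nine-column layout follows the question), and it assumes GNU sort, where -g is available and -u keeps the first line of each run of equal keys. If your sort does not guarantee that, adding -s (stable) to the second sort is the explicit way to ask for it.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing else matches the glob.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Made-up sample data: one header line, then nine whitespace-separated columns.
printf '%s\n' 'CHR SNP BP A1 TEST NMISS BETA STAT P' \
  '1 rsA 100 A ADD 700 0.01 1.2 3.0e-02' \
  '1 rsB 200 T ADD 700 0.05 4.9 1.0e-06' > U1.txt
printf '%s\n' 'CHR SNP BP A1 TEST NMISS BETA STAT P' \
  '2 rsC 300 G ADD 700 0.02 5.2 2.5e-07' \
  '2 rsD 400 C ADD 700 0.03 2.0 4.0e-02' > U2.txt

# Prefix each data line with its filename, sort by p-value (now field 10),
# then keep only the first line seen for each filename.
result=$(awk 'FNR > 1 {print FILENAME, $0}' U*.txt | sort -k10 -g | sort -u -k1,1)
echo "$result"
```

Each input file contributes exactly one output line, the one with its smallest value in the last column.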


Here's another way of doing it, this time using only awk:

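#!/usr/bin/awk -f

# Skip each file's header line (FNR == 1). Update when this file has no
# stored value yet, or when field 9 is numerically smaller than the stored one.
# Testing "FILENAME in s" (rather than !s[FILENAME]) also copes with a value of 0.
FNR > 1 && (!(FILENAME in s) || $9 + 0 < s[FILENAME]) {
  s[FILENAME] = $9 + 0   # smallest 9th-field value seen so far in this file
  l[FILENAME] = $0       # the whole line that holds that value
}

# After all input has been read: one line per file (for-in order is unspecified).
END {
  for (f in s) {
    print f, l[f]
  }
}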

Save it as, e.g., smallest-pvalue.awk, make it executable with chmod +x smallest-pvalue.awk, and run it as ./smallest-pvalue.awk *.txt.

This awk script keeps track of the smallest value seen for field 9 of each input file in an array called s, and also keeps the matching input line in array l.

Once it has processed all the files, it prints out the filename and the line containing the smallest 9th field for each file.
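
If running the script directly gives trouble, the same logic can be tried through awk -f. The sketch below is only an illustration: the sample files are made up, and the awk program is a slight variation that tests FILENAME in s and forces a numeric comparison with $9 + 0. It also writes the result to a file that does not match the *.txt glob, so a second run cannot pick up its own output, and it pipes through sort because awk does not promise any particular order for for (f in s).

```shell
#!/bin/bash
# Work in a throwaway directory with made-up sample data.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' 'CHR SNP BP A1 TEST NMISS BETA STAT P' \
  '1 rsA 100 A ADD 700 0.01 1.2 3.0e-02' \
  '1 rsB 200 T ADD 700 0.05 4.9 1.0e-06' > U1.txt
printf '%s\n' 'CHR SNP BP A1 TEST NMISS BETA STAT P' \
  '2 rsC 300 G ADD 700 0.02 5.2 2.5e-07' \
  '2 rsD 400 C ADD 700 0.03 2.0 4.0e-02' > U2.txt

# Write the awk program to a file; the quoted 'EOF' stops the shell
# from expanding $9 and $0 inside the here-document.
cat > smallest-pvalue.awk <<'EOF'
FNR > 1 && (!(FILENAME in s) || $9 + 0 < s[FILENAME]) {
  s[FILENAME] = $9 + 0
  l[FILENAME] = $0
}
END {
  for (f in s) print f, l[f]
}
EOF

# The output file is created by mktemp, outside the *.txt glob.
out=$(mktemp)
awk -f smallest-pvalue.awk U*.txt | sort -k1,1 > "$out"
cat "$out"
```

With a thousand numbered files, GNU sort's -V option can be used in place of -k1,1 to get the natural ordering that ls -v produced in the question.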

  • Well, I need to understand the second part with the awk arrays, as it's a little more advanced. I appreciate your comment; I will try to understand it and use it. I am doing it that way because I need to make three files using the smallest, five percent and ten percent values from my thousand files, and I only shared the first part of my script. I can get rid of these duplicated lines with sort -u, but I thought there might be another way.
    – star
    Commented Jul 29, 2017 at 22:56
  • The arrays aren't hard to understand. Instead of using numbers as array indices, they use strings (the filenames), e.g. s[U1.txt] and l[U1.txt]. The awk-only version goes through each file, and if the line number is > 1 (FNR > 1) and either s[FILENAME] doesn't exist or the current line's pvalue ($9) is smaller than s[FILENAME], then it sets s[FILENAME] to the current line's pvalue and l[FILENAME] to the entire current line ($0). The END {...} block is run when there's nothing left to do and prints out each filename along with its stored input line.
    – cas
    Commented Jul 30, 2017 at 2:00
  • What do you mean by "the current line's pvalue ($9) is smaller than s[FILENAME]"? I want to sort in ascending order and record the smallest value from each of the 1000 files, so the resulting file should have 1000 lines.
    – star
    Commented Jul 31, 2017 at 20:00
  • I thought you wanted the smallest such value from each file - that's what the sort -u -k1,1 and your original head -1 do. The standalone awk script is just another way to do that - instead of printing all lines and then using sort -u to throw away everything but the smallest for each input file, it only prints the smallest values.
    – cas
    Commented Aug 1, 2017 at 1:50
  • I tried to run this script. I saved it like you said and executed it, but it gives an error; even with awk -f scriptname it doesn't work.
    – star
    Commented Aug 2, 2017 at 2:49
