I need to use grep and awk in order to match two types of patterns but I cannot figure out the syntax.

My file has values such as:

sample1,gicode1,123,4541,221,3661,Sodalis sp.1
sample2,gicode1,123,0322,12,112342,Sodalis sp.2
sample3,gicode1,112,4541,00,2342,Candidatus sp.
sample4,gicode1,2342,4541,00,9606,Homo sapiens

I need to count the lines that refer to Sodalis. A line qualifies either by name (the 7th column) or by taxid (the 6th column), since the name that comes up is not always accurate.

My issue is that an ID from the 6th column can also appear in other columns, where it is not an ID. If I want the Sodalis species with the ID 2342, it shows up properly in sample 3, but 2342 is also the scoring value (3rd column) in sample 4.

I can grab the ID in the proper column using awk -F, '$6==2342' or simply the name using grep 'Sodalis' but I am having an issue combining both as in the following:

cat myfile.txt | grep "Sodalis" OR awk -F, '$6==2342' | wc -l

The return should be 3, but I get either 2 (for grep) or just 1 (for awk). I've tried many variations of this with || or & even:

cat myfile.txt | grep "Sodalis" || cat myfile.txt | awk -F, '$6==2342'

But it gives the answer 1.

I know that with grep I can also use grep -E 'Sodalis|2342', but this unfortunately returns 4 because the second pattern also matches sample 4, where the scoring value happens to be 2342. Is there a way to grep a value based on a certain column? I also need the full line to appear, because I want to save those results to a separate file called Sodalis.txt.

  • Regarding "I need to use grep and awk": no, you never need grep when you're using awk.
    – Ed Morton
    Commented Jul 26, 2021 at 21:41

1 Answer

No need for grep here - awk is perfectly capable of matching patterns:

awk -F, '/Sodalis/ || $6==2342' myfile.txt | wc -l

or

awk -F, '/Sodalis/ || $6==2342 {c++} END{print c+0}' myfile.txt
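
The question also asks for the matching lines to be saved to Sodalis.txt. A minimal sketch of doing the save and the count in one pass, assuming the sample rows from the question (with 2342 as the 3rd-column score in sample4, as the question describes); the scratch directory and the variable names tmp, out and count are only for illustration:

```shell
# Scratch directory so the example does not touch existing files (illustrative).
tmp=$(mktemp -d)

# Sample rows adapted from the question.
cat > "$tmp/myfile.txt" <<'EOF'
sample1,gicode1,123,4541,221,3661,Sodalis sp.1
sample2,gicode1,123,0322,12,112342,Sodalis sp.2
sample3,gicode1,112,4541,00,2342,Candidatus sp.
sample4,gicode1,2342,4541,00,9606,Homo sapiens
EOF

# Both conditions live inside the single-quoted awk program, joined by ||.
# Each matching line is written to the file named by "out" and counted;
# c+0 makes the END block print 0 rather than an empty line on no match.
count=$(awk -F, -v out="$tmp/Sodalis.txt" '
  /Sodalis/ || $6==2342 { print > out; c++ }
  END { print c+0 }
' "$tmp/myfile.txt")

echo "$count"
```

With these rows the count is 3 and sample4 is left out, because 2342 is only tested against the 6th field. Note that when nothing matches, the output file is never opened, so an older Sodalis.txt would be left as it was.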

(Responding to comments:) if you want to restrict the Sodalis match to the 7th column only, and perhaps read the list of 6th-column IDs, one per line, from a file ids.txt:

awk -F, 'NR==FNR{ids[$1]; next} $7 ~ /Sodalis/ || $6 in ids' ids.txt myfile.txt
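
For what it's worth, here is a rough walk-through of that two-file idiom on the question's sample rows. The contents of ids.txt (2342 and 310) are made up for the example, as are the scratch directory and the variable names:

```shell
# Scratch directory so the example does not touch existing files (illustrative).
tmp=$(mktemp -d)

# Sample rows adapted from the question.
cat > "$tmp/myfile.txt" <<'EOF'
sample1,gicode1,123,4541,221,3661,Sodalis sp.1
sample2,gicode1,123,0322,12,112342,Sodalis sp.2
sample3,gicode1,112,4541,00,2342,Candidatus sp.
sample4,gicode1,2342,4541,00,9606,Homo sapiens
EOF

# Hypothetical ID list, one ID per line.
printf '%s\n' 2342 310 > "$tmp/ids.txt"

# NR==FNR is true only while the first file (ids.txt) is being read:
# each ID becomes an array key, and "next" skips the second condition.
# For myfile.txt, a line is printed when the 7th field contains Sodalis
# or the 6th field is one of the stored keys.
matches=$(awk -F, '
  NR==FNR { ids[$1]; next }
  $7 ~ /Sodalis/ || $6 in ids
' "$tmp/ids.txt" "$tmp/myfile.txt")

printf '%s\n' "$matches"
```

Two things to keep in mind: `$6 in ids` compares the field as a string, so an ID written as 02342 in the data would not match 2342 in ids.txt, and if ids.txt is empty then NR==FNR stays true for myfile.txt and nothing is printed.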
  • Oh I knewww I had to use || but I kept using it out of the quotes! Thanks so much!
    – KBwonder
    Commented Jul 26, 2021 at 15:58
  • Or even $7 ~ /Sodalis/ if there's ever need to restrict that condition to a particular column/field too.
    – ilkkachu
    Commented Jul 26, 2021 at 16:12
  • Is there also a way in awk to make it match a long list of strings? In some cases I have 10 IDs, and I think it would be tedious to write $6==2342 || $6==310 || etc. for each ID I want to match
    – KBwonder
    Commented Jul 26, 2021 at 16:16
  • @KBwonder, regex match with alternation in the pattern: $6 ~ /^(2342|310|1234)$/
    – ilkkachu
    Commented Jul 26, 2021 at 16:17
  • @KBwonder if you have more than a few, get them into an array (are the IDs in a file somewhere?), then you can use $6 in arr
    – rowboat
    Commented Jul 26, 2021 at 17:10
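
On the alternation suggested in the comments, the ^ and $ anchors are what keep it an exact match on the field. A small sketch on the question's sample rows, comparing the anchored pattern with an unanchored one (the IDs 310 and 1234 come from the comment; the scratch directory and variable names are only for illustration):

```shell
# Scratch directory so the example does not touch existing files (illustrative).
tmp=$(mktemp -d)

# Sample rows adapted from the question.
cat > "$tmp/myfile.txt" <<'EOF'
sample1,gicode1,123,4541,221,3661,Sodalis sp.1
sample2,gicode1,123,0322,12,112342,Sodalis sp.2
sample3,gicode1,112,4541,00,2342,Candidatus sp.
sample4,gicode1,2342,4541,00,9606,Homo sapiens
EOF

# Anchored: the whole 6th field must be exactly one of the listed IDs.
anchored=$(awk -F, '$6 ~ /^(2342|310|1234)$/ {c++} END{print c+0}' "$tmp/myfile.txt")

# Unanchored: 2342 is also found inside 112342 (sample2's 6th field).
unanchored=$(awk -F, '$6 ~ /2342|310|1234/ {c++} END{print c+0}' "$tmp/myfile.txt")

echo "anchored=$anchored unanchored=$unanchored"
```

Here the anchored pattern counts only sample3, while the unanchored one also picks up sample2 through the substring match.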
