
I have a text file with FASTA headers for genes, and several gene sequences appear under the same species. I extracted the headers into headers.txt and copied that file to uniqueheaders.txt, from which I then removed all duplicates.

I am trying to read uniqueheaders.txt line by line and, for each line, loop over headers.txt to check for duplicates. The if statement detects a duplicate, increments a counter, and appends it to the header. This should number all the headers in headers.txt so that I can insert them back into my FASTA file. My code is here:

while IFS= read -r uniqueline
do
    counter=0
    while IFS= read -r headline
    do
        if [ "$uniqueline" == "$headline" ]
        then
            let "counter++"
            #append counter to the headline variable to number it.
            sed "$headline s/$/$counter/" -i headers
        fi
    done < headers.txt
done < uniqueheaders.txt

The issue is that the terminal keeps printing the errors

sed: -e expression #1, char 1: unknown command: 'M'

and

sed: -e expression #1, char 2: extra characters after command
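
For context, here is an illustrative reproduction of the failure (not part of the original script; the file name is a stand-in). When $headline expands, the header text itself becomes sed's script, so sed tries to read M as a command:

```shell
# With headline='Mus musculus' and counter=1, the loop effectively runs:
#   sed "Mus musculus s/$/1/" -i headers
# sed parses "Mus musculus s/$/1/" as its editing script; 'M' is not a
# valid sed command, hence "unknown command: `M'".
headline='Mus musculus'
counter=1
tmp=$(mktemp)
printf 'x\n' > "$tmp"
sed "$headline s/\$/$counter/" "$tmp"   # fails with the reported error
rm -f "$tmp"
```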

Both files contain unique header names:

Mus musculus
Homo sapiens
Rattus norvegicus

How do I modify the sed command to prevent this error? Is there a better way of doing this in bash?

Example input (note that a gene sequence has no fixed pattern for how many lines it occupies; all gene sequences are in one file):

Mus musculus 
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF

Desired output:

Mus musculus1 
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus2
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
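
For reference, the intended numbering logic can be sketched in pure bash without sed at all (a sketch only, assuming bash 4+ for associative arrays; the combined file name sequence.txt is a placeholder, since the question does not name it):

```shell
#!/usr/bin/env bash
# Sketch: number duplicate headers without calling sed.
declare -A total seen

# Pass 1: count how often each header occurs (headers.txt holds all
# headers, including duplicates).
while IFS= read -r line; do
    total[$line]=$(( ${total[$line]:-0} + 1 ))
done < headers.txt

# Pass 2: walk the combined file; append a running number only to
# headers that occur more than once. Sequence lines never appear in
# the total array, so they print unchanged.
while IFS= read -r line; do
    if (( ${total[$line]:-0} > 1 )); then
        seen[$line]=$(( ${seen[$line]:-0} + 1 ))
        printf '%s%s\n' "$line" "${seen[$line]}"
    else
        printf '%s\n' "$line"
    fi
done < sequence.txt
```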
  • What do you think the resulting sed command is?
    – choroba
    Commented Nov 5, 2020 at 9:27
  • Can you show us a few lines of both files and the output you are expecting? Doing this in the shell is incredibly inefficient. You probably just need a simple awk one-liner that will run in seconds (your loop will take several minutes for larger files).
    – terdon
    Commented Nov 5, 2020 at 9:54
  • I edited the post, that is basically all I wanted... @terdon
    – Jerry
    Commented Nov 5, 2020 at 10:15
  • Thanks for the clarification. Do I understand correctly that there is no "blank line" in between gene sequences, or anything else that would identify a header line (apart from gene sequences being all uppercase ;) )
    – AdminBee
    Commented Nov 5, 2020 at 10:47
  • Do you have two files or three? You only use two files in your script (headers.txt and uniqueheaders.txt) but you also seem to have a file that has both headers and sequences. Is that file headers.txt or is it a third file? And what do you mean that "both files contain unique header names"? Isn't the whole point that one of the files has duplicate header names?
    – terdon
    Commented Nov 5, 2020 at 11:01

1 Answer


An awk-based solution comes to mind.

Since the header lines in your file are not identifiable by anything particular, we will take a "two-file double-pass" approach, in which the file containing the unique headers is read first, and then the actual sequences file is read twice in order to append a disambiguation number only to those headers that occur more than once:

awk 'NR==FNR{tot[$0]=0;next}
     !final {if ($0 in tot) {tot[$0]++};next}
      final && ($0 in tot) {if (tot[$0]>1) $0=$0 (++cnt[$0])}1' uniqueheaders.txt sequence.txt final="1" sequence.txt 
  • If NR, the global line counter, is equal to FNR, the per-file line counter, we know that we are in the first file to be processed (here uniqueheaders.txt). Then, we simply "record" the headers in an array tot that will later hold the total counts of header occurrences.
  • Since we don't know how many header lines and sequence lines there are, the line-counter variables will not help us any further in identifying which file we are on (at least if we don't want to rely on any specific awk implementation). Since we have to process the "sequences file" (your example input) twice in order to suppress appending a 1 to those headers that only occur once (see this Q&A for a discussion), we pass it twice as an argument to awk, but set a flag final for the second pass.
  • In the first pass over the sequences file (final is not yet set), we only look at those lines that are headers (i.e. $0, the entire line, is among the indices of the array tot) and increase the total occurrence counter.
  • In the second pass of the sequences file (final is now set to 1) we generally print all lines, but on those lines that are header lines (again, indicated by $0 being among the indices of the array tot) we append a counter (stored in an array cnt), if we know that there are duplicates of this header (tot[$0]>1).
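
Run against files matching the question's sample data (reproduced here without the stray trailing spaces), the invocation and its effect look like this:

```shell
# Recreate the two input files from the question's example.
cat > uniqueheaders.txt <<'EOF'
Mus musculus
Homo sapiens
Rattus norvegicus
EOF
cat > sequence.txt <<'EOF'
Mus musculus
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
EOF

# The answer's command: only the duplicated 'Mus musculus' lines get
# numbered; the unique 'Rattus norvegicus' header stays untouched.
awk 'NR==FNR{tot[$0]=0;next}
     !final {if ($0 in tot) {tot[$0]++};next}
      final && ($0 in tot) {if (tot[$0]>1) $0=$0 (++cnt[$0])}1' \
    uniqueheaders.txt sequence.txt final="1" sequence.txt
```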

Note that if the header lines of your sequences file could be singled out by some criterion (e.g. all gene sequences are separated by blank lines) then we would not need the external uniqueheaders.txt and could do it all in one awk call.
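
For instance, if one may assume (as hinted in the comments) that header lines are the only lines containing lowercase letters, a single awk call could drop uniqueheaders.txt entirely — a sketch under that assumption, not part of the original answer:

```shell
# Sketch: identify headers by the presence of a lowercase letter.
# The sequences file is still read twice, but within one awk call
# and without needing uniqueheaders.txt.
awk '!final { if (/[a-z]/) tot[$0]++; next }
     ($0 in tot) && tot[$0]>1 { $0 = $0 (++cnt[$0]) }
     1' sequence.txt final=1 sequence.txt
```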

  • Very nice solution. Also, I guess the OP would generally be OK with appending 1 to a unique header, which would make the problem much simpler. If only they knew...
    – Quasímodo
    Commented Nov 5, 2020 at 11:31
