I have a text file with headers for genes, and there are different gene sequences under the same species. I extracted the headers (headers.txt), copied them into another file (uniqueheaders.txt), and removed all the duplicates in uniqueheaders.txt.
I am trying to read uniqueheaders.txt line by line and, for each line, loop over headers.txt to check for duplicates. The if statement detects the duplicate and increments a counter, which should be appended to the header. This will number all the headers in headers.txt so that I can insert them back into my FASTA file.
My code is here:
while IFS= read -r uniqueline
do
    counter=0
    while IFS= read headline
    do
        if [ "$uniqueline" == "$headline" ]
        then
            let "counter++"
            #append counter to the headline variable to number it.
            sed "$headline s/$/$counter/" -i headers
        fi
    done < headers.txt
done < uniqueheaders.txt
The issue is that the terminal keeps spitting out the error
sed: -e expression #1, char 1: unknown command: 'M'
and
sed: -e expression #1, char 2: extra characters after command
Both files contain unique header names:
Mus musculus
Homo sapiens
Rattus norvegicus
How do I modify the sed command to prevent this error? Is there a better way of doing this in bash?
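For context, one likely reading of the errors: sed treats the expanded $headline as the start of its script, so "Mus musculus s/$/1/" begins with M, which is not a sed command, while "Homo sapiens ..." begins with H, which is a command and therefore triggers "extra characters after command" at character 2. A minimal sketch of one way around this, assuming GNU sed (for -i without a suffix) and using hypothetical scratch files in place of headers.txt, is to address the line by its number rather than by its text:

```shell
#!/usr/bin/env bash
# Hypothetical scratch files standing in for headers.txt; not the real data.
src=$(mktemp)
out=$(mktemp)
printf '%s\n' 'Mus musculus' 'Rattus norvegicus' 'Mus musculus' > "$src"
cp "$src" "$out"

target='Mus musculus'   # one line from uniqueheaders.txt
counter=0
lineno=0
# Read the untouched copy so the in-place edits never disturb the loop input.
while IFS= read -r headline; do
    lineno=$((lineno + 1))
    if [ "$headline" = "$target" ]; then
        counter=$((counter + 1))
        # Builds e.g. "1s/$/1/": a numeric address followed by the s command.
        sed -i "${lineno}s/\$/${counter}/" "$out"
    fi
done < "$src"

cat "$out"
```

This keeps the structure of the original loop but spawns one sed process per match, so it may be slow on large files.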
Example of inputs (note that the gene sequences don't follow any pattern in terms of how many lines they take up; the gene sequences are all in one file):
Mus musculus
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
Desired output:
Mus musculus1
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus2
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
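The transformation from the example input to the desired output could be sketched in a single awk call, without the nested loops. This is only an illustration under stated assumptions: the file names genes.txt and uniqueheaders.txt inside a temporary directory are hypothetical, a line counts as a header only if it appears verbatim in uniqueheaders.txt, and headers that occur once are left unnumbered, as in the desired output.

```shell
#!/usr/bin/env bash
# Hypothetical sample files built from the example in the question.
workdir=$(mktemp -d)
cat > "$workdir/genes.txt" <<'EOF'
Mus musculus
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
Mus musculus
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
EOF
printf '%s\n' 'Mus musculus' 'Rattus norvegicus' > "$workdir/uniqueheaders.txt"

# The gene file is listed twice: once to count headers, once to print.
numbered=$(awk '
    FNR == 1  { pass++ }                                  # a new input file starts
    pass == 1 { is_header[$0] = 1; next }                 # load the known header names
    pass == 2 { if ($0 in is_header) total[$0]++; next }  # count each header
    ($0 in is_header) && total[$0] > 1 {                  # duplicated header: number it
        print $0 (++seen[$0]); next
    }
    { print }                                             # sequences and unique headers
' "$workdir/uniqueheaders.txt" "$workdir/genes.txt" "$workdir/genes.txt")

printf '%s\n' "$numbered"
```

Reading the gene file twice is a deliberate trade-off: it allows single-occurrence headers to stay unnumbered without holding the whole file in memory.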
headers.txt and uniqueheaders.txt) but you also seem to have a file that has both headers and sequences. Is that file headers.txt or is it a third file? And what do you mean that "both files contain unique header names"? Isn't the whole point that one of the files has duplicate header names?