2
\$\begingroup\$

I had a sample file with a long string >50M like this.

CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

For everything substring with length k = 50 (which means there are length(file)-k+1 substring)

if the A||T||a||t (Upper & Lower case) is >40%,

replace every character in that substring with N or n (preserving case).

Sample output:

CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I was using AWK in command line for ease, but it just runs extremely slow with string replacement... and consume only <5% CPU somehow.

Code here:

# Method 1
cat chr.fa | head -n1 > chr.edited.fa
cat chr.fa | tail -n+2 | awk -v k=100 -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        if (i == 1) {
            rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        } else {
            prevx = substr($0,i-1,1)
            if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
                rate -= 1
            nextx = substr(x,k,1)
            if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
                rate += 1
        }
        if (rate>r*k/100) {
            h++
            highGC[i] = i
        }
        printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
        i += 1
    }
    printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
    for (j in highGC) {
        y = highGC[j]
        SUB++
        printf("sub-r:%f%% \r",SUB/h*100)
        x = substr($0, y, k)
        gsub (/[AGCT]/,"N",x)
        gsub (/[agct]/,"n",x)
        $0 = substr($0,1,y-1) x substr($0,y+k)
    }
    printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
    printf("%s",$0) >> "chr.edited.fa"
}'

# Method 2
cat chr.fa | head -n1 > chr.edited2.fa
cat chr.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
        if (rate>r/k*100) {
            h++
            gsub (/[GC]/,"N",x)
            gsub (/[gc]/,"n",x)
            $0 = substr($0,1,i-1) x substr($0,i+k)
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    gsub (/X/,"N",$0)
    gsub (/x/,"n",$0)
    printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",$0) >> "chr.edited2.fa"
}'

# Method 3
cat chr.fa | head -n1 > chr.edited3.fa
cat chr.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        if (rate>r/k*100) {
          h++
          gsub(/[ACGT]/,"N",x)
          gsub(/[acgt]/,"n",x)
          if (i == 1) {
            s = x
          } else {
            s = substr(s,1,length(s)-k+1) x
          }
        } else  {
          if (i == 1) {
            s = x
          } else {
            s = s substr(x,k,1)
          }
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",s) >> "chr.edited3.fa"
}'

# Method 4
cat chr.fa | awk -v k=100 -v r=40 '{
  if (gsub(/>/,">",$0) > 0) {
    printf("%s\n",$0) > "chr.edited4.fa"
  } else {
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
      x = substr($0, i, k)
      if (i == 1) {
          rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
      } else {
          prevx = substr($0,i-1,1)
          rate -= gsub(/[ATat]/,"",prevx)
          nextx = substr(x,k,1)
          rate += gsub(/[ATat]/,"",nextx)
      }
      if (rate > r/k*100) {
        h++
        gsub(/[ACGT]/,"N",x)
        gsub(/[acgt]/,"n",x)
        if (i > 1) {
          s = substr(s,1,length(s)-k+1) x
        }
      } else  { 
        if (i > 1) {
          s = s substr(x,k,1)
        }
      }
      if (i == 1) {
         s = x
      }
      printf("index-r:%f%% sub-r:%f%% \r",(i-1)/(length($0)-k+1)*100,h/544*100)
      i += 1
    }
    printf("index-r:%f%% sub-r:%f%% \n\n",(i-1)/(length($0)-k+1)*100,h/544*100)
    printf("%s",s) >> "chr.edited4.fa"        
  }
}'

Are there any faster algorithm for this problem? If no, are there any language can perform string replacement faster?

More info: The AWK command consume ~30% CPU at WSL & GitBash, but only ~5% on windows cmd with an OpenSSH client, where the progress rate is similar.


The algorithm had improved from method 1 to method 4, and the estimated runtime decreased from ~3 days to ~16h.

\$\endgroup\$
5
  • \$\begingroup\$ Is this part of a programming challenge? Do you know what the type of input is called? \$\endgroup\$ Commented Sep 16, 2020 at 7:29
  • \$\begingroup\$ It is a part of my assignment... which the input is fasta .fa file. This problem is called as high-AT content masking, which is related to DNA and human genomes \$\endgroup\$ Commented Sep 16, 2020 at 10:58
  • \$\begingroup\$ Right, FASTA format. At first glance it looked a bit inconsistent for Fasta. \$\endgroup\$ Commented Sep 16, 2020 at 11:21
  • \$\begingroup\$ First step is understand the requirements. That is your job. Can you clarify the question? I think the inputfile only has the characters ACGTacgt, can help with a solution. The condition if the A||T||a||t (Upper & Lower case) is >40% is unclear, what do you mean with ||? Are you asking for substrings where at least 40% of the characters are in [ATat]? \$\endgroup\$ Commented Apr 18, 2021 at 18:34
  • \$\begingroup\$ How handle overlapping substrings? When in the inputstring starts with 60 characters A, and the ret is a G, what to do? The first 50 match the condition and should be replaced by 50 N's. In the original string the next 10 characters where part of 50 characters A-string, but after the replacement of the first 50 characters, they are just 10 isaloted A's. Should these A's be replaced by N's too? \$\endgroup\$ Commented Apr 18, 2021 at 18:40

0

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.