Faster algorithm/language for substring sliding window replacement

Question

I had a sample file with a long string >50M like this.

CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

For everything substring with length k = 50 (which means there are length(file)-k+1 substring)

if the A||T||a||t (Upper & Lower case) is >40%,

replace every character in that substring with N or n (preserving case).

Sample output:

CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I was using AWK in command line for ease, but it just runs extremely slow with string replacement... and consume only <5% CPU somehow.

Code here:

# Method 1
cat chr.fa | head -n1 > chr.edited.fa
cat chr.fa | tail -n+2 | awk -v k=100 -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        if (i == 1) {
            rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        } else {
            prevx = substr($0,i-1,1)
            if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
                rate -= 1
            nextx = substr(x,k,1)
            if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
                rate += 1
        }
        if (rate>r*k/100) {
            h++
            highGC[i] = i
        }
        printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
        i += 1
    }
    printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
    for (j in highGC) {
        y = highGC[j]
        SUB++
        printf("sub-r:%f%% \r",SUB/h*100)
        x = substr($0, y, k)
        gsub (/[AGCT]/,"N",x)
        gsub (/[agct]/,"n",x)
        $0 = substr($0,1,y-1) x substr($0,y+k)
    }
    printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
    printf("%s",$0) >> "chr.edited.fa"
}'

# Method 2
cat chr.fa | head -n1 > chr.edited2.fa
cat chr.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
        if (rate>r/k*100) {
            h++
            gsub (/[GC]/,"N",x)
            gsub (/[gc]/,"n",x)
            $0 = substr($0,1,i-1) x substr($0,i+k)
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    gsub (/X/,"N",$0)
    gsub (/x/,"n",$0)
    printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",$0) >> "chr.edited2.fa"
}'

# Method 3
cat chr.fa | head -n1 > chr.edited3.fa
cat chr.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        if (rate>r/k*100) {
          h++
          gsub(/[ACGT]/,"N",x)
          gsub(/[acgt]/,"n",x)
          if (i == 1) {
            s = x
          } else {
            s = substr(s,1,length(s)-k+1) x
          }
        } else  {
          if (i == 1) {
            s = x
          } else {
            s = s substr(x,k,1)
          }
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",s) >> "chr.edited3.fa"
}'

# Method 4
cat chr.fa | awk -v k=100 -v r=40 '{
  if (gsub(/>/,">",$0) > 0) {
    printf("%s\n",$0) > "chr.edited4.fa"
  } else {
    printf("chr.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
      x = substr($0, i, k)
      if (i == 1) {
          rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
      } else {
          prevx = substr($0,i-1,1)
          rate -= gsub(/[ATat]/,"",prevx)
          nextx = substr(x,k,1)
          rate += gsub(/[ATat]/,"",nextx)
      }
      if (rate > r/k*100) {
        h++
        gsub(/[ACGT]/,"N",x)
        gsub(/[acgt]/,"n",x)
        if (i > 1) {
          s = substr(s,1,length(s)-k+1) x
        }
      } else  { 
        if (i > 1) {
          s = s substr(x,k,1)
        }
      }
      if (i == 1) {
         s = x
      }
      printf("index-r:%f%% sub-r:%f%% \r",(i-1)/(length($0)-k+1)*100,h/544*100)
      i += 1
    }
    printf("index-r:%f%% sub-r:%f%% \n\n",(i-1)/(length($0)-k+1)*100,h/544*100)
    printf("%s",s) >> "chr.edited4.fa"        
  }
}'

Are there any faster algorithm for this problem? If no, are there any language can perform string replacement faster?

More info: The AWK command consume ~30% CPU at WSL & GitBash, but only ~5% on windows cmd with an OpenSSH client, where the progress rate is similar.

The algorithm had improved from method 1 to method 4, and the estimated runtime decreased from ~3 days to ~16h.

Is this part of a programming challenge? Do you know what the type of input is called? — Mast
– Mast ♦, Commented Sep 16, 2020 at 7:29
It is a part of my assignment... which the input is fasta .fa file. This problem is called as high-AT content masking, which is related to DNA and human genomes — hey0god
– hey0god, Commented Sep 16, 2020 at 10:58
Right, FASTA format. At first glance it looked a bit inconsistent for Fasta. — Mast
– Mast ♦, Commented Sep 16, 2020 at 11:21
First step is understand the requirements. That is your job. Can you clarify the question? I think the inputfile only has the characters ACGTacgt, can help with a solution. The condition if the A||T||a||t (Upper & Lower case) is >40% is unclear, what do you mean with ||? Are you asking for substrings where at least 40% of the characters are in [ATat]? — Walter A
– Walter A, Commented Apr 18, 2021 at 18:34
How handle overlapping substrings? When in the inputstring starts with 60 characters A, and the ret is a G, what to do? The first 50 match the condition and should be replaced by 50 N's. In the original string the next 10 characters where part of 50 characters A-string, but after the replacement of the first 50 characters, they are just 10 isaloted A's. Should these A's be replaced by N's too? — Walter A
– Walter A, Commented Apr 18, 2021 at 18:40

Stack Exchange Network

Faster algorithm/language for substring sliding window replacement

0

You must log in to answer this question.

Hot Network Questions

Faster algorithm/language for substring sliding window replacement

0

You must log in to answer this question.

Related

Hot Network Questions