1

I am looking for python code to perform a run length encoding to obtain a regex-like summary of a string s, for a known length k for the blocks. How should I tackle this?

e.g.

s=TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA

with k=5 could become

(TATTT)3(TATGT)9TACATTATTTTA
10
  • Why not (TATTT)3(TATGT)9TACATTA(T)4A? Commented Jun 17, 2024 at 20:16
  • 1
    You can use a regexp like (.{5})\1* to look for all the repetitions of a 5-character sequence. Commented Jun 17, 2024 at 20:17
  • 1
    I thought DNA worked in blocks of 3 (codons). Commented Jun 17, 2024 at 20:18
  • 2
    DNA/RNA works in blocks of 3 to be translated into aminoacids, but a DNA tandem repeat can have different motif lengths if it is non-coding. Commented Jun 17, 2024 at 20:19
  • 1
    @Barmar I think it would be (.{5})\1+, no? Also, for OP I think the pseudo-code would be capture using k, surround in parenthesis, divide capture by k, append number after close parenthesis Commented Jun 17, 2024 at 20:21

4 Answers 4

7

Instead of a regex pattern, you could split and group with itertools.groupby:

from itertools import groupby

s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"

k = 5
parts = [s[i:i+k] for i in range(0, len(s), k)]

for k, g in groupby(parts):
    print(k, len(list(g)))
    

For your specific string this would yield

TATTT 3
TATGT 9
TACAT 1
TATTT 1
TA 1

Or - if you need to stick with your specific format as well:

lst = []
for k, g in groupby(parts):
    _len = len(list(g))
    if _len > 1:
        lst.append(f"({k}){_len}")
    else:
        lst.append(k)
    

print("".join(lst))
# (TATTT)3(TATGT)9TACATTATTTTA
Sign up to request clarification or add additional context in comments.

1 Comment

This is very elegant, awesome! Now I realized that, depending on the start position of the first block, more compact representations can be obtained (see stackoverflow.com/a/78634653/6631639). And even then it is not the best possible representation, as it can be that leaving, between two repeated blocks, a set of <k characters improves the compactness of the final result.
4

Use a capture group and back-reference to match repeated strings.

k=5
s='TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA'
rx = re.compile(rf'(.{{{k}}})\1*')
matches = rx.finditer(s)

output = ''
total_len = 0
for match in matches: 
    repeats = len(match.group()) // k
    total_len += len(match.group())
    if repeats == 1:
        output += match.group(1)
    else:
        output += f'({match.group(1)}){repeats}'

# Get the unmatched part at the end
output += s[total_len:]

print(output)

4 Comments

Thanks! It seems your solution lost the last "TA" of the string, though. Regexes hurt my brain :-)
Yeah, this assumes the input is a multiple of k. You could calculate the total length of all the matches and then grab the slice of the input after that.
I added that to my answer.
Awesome, yes, that solves it. I did realize that this is not guaranteed to be the most compact representation, as it may be that leaving a sequence of length < k between blocks improves (minimizes) the total length of the string. But that was not the primary objective :)
3

I slightly modified the answer from @Jan to provide the shortest out of the k possible encodings (with k different start sites for the first block). This can make quite a difference, for some cases.

from itertools import groupby

def rle(sequence, motif_length):
    solutions = []
    for start in range(0, motif_length):
        parts = [sequence[i : i + motif_length] for i in range(start, len(sequence), motif_length)]
        lst = [sequence[0:start] if start > 0 else '']
        for k, g in groupby(parts):
            _len = len(list(g))
            if _len > 1:
                lst.append(f"({k}){_len}")
            else:
                lst.append(k)
        solutions.append("".join(lst))
    # return the shortest solution
    return min(solutions, key=len)    

rle("TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA", 5)
rle("TCTTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTA", 5)

Comments

0

I don't know Python but you could translate this PHP code if you'd like:

$s = 'TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA';
$k = 5;

echo preg_replace_callback( '/(.{'.$k.'})\1+/', function( $matches ) use ( $k ) {
    return '('.$matches[ 1 ].')'.( strlen( $matches[ 0 ] )  / $k );
}, $s );

1 Comment

Note that the question is explicitly asking for a Python solution.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.