Create regex-like (run length encoding) of string s for blocks of a given length k

Question

I am looking for python code to perform a run length encoding to obtain a regex-like summary of a string s, for a known length k for the blocks. How should I tackle this?

e.g.

s=TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA

with k=5 could become

(TATTT)3(TATGT)9TACATTATTTTA

You can use a regexp like (.{5})\1* to look for all the repetitions of a 5-character sequence. — Barmar
– Barmar, Commented Jun 17, 2024 at 20:17
DNA/RNA works in blocks of 3 to be translated into aminoacids, but a DNA tandem repeat can have different motif lengths if it is non-coding. — Wouter De Coster
– Wouter De Coster, Commented Jun 17, 2024 at 20:19
@Barmar I think it would be (.{5})\1+, no? Also, for OP I think the pseudo-code would be capture using k, surround in parenthesis, divide capture by k, append number after close parenthesis — MonkeyZeus
– MonkeyZeus, Commented Jun 17, 2024 at 20:21

Jan · Accepted Answer · 2024-06-17 20:24:13Z

7

Instead of a regex pattern, you could split and group with itertools.groupby:

from itertools import groupby

s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"

k = 5
parts = [s[i:i+k] for i in range(0, len(s), k)]

for k, g in groupby(parts):
    print(k, len(list(g)))

For your specific string this would yield

TATTT 3
TATGT 9
TACAT 1
TATTT 1
TA 1

Or - if you need to stick with your specific format as well:

lst = []
for k, g in groupby(parts):
    _len = len(list(g))
    if _len > 1:
        lst.append(f"({k}){_len}")
    else:
        lst.append(k)
    

print("".join(lst))
# (TATTT)3(TATGT)9TACATTATTTTA

answered Jun 17, 2024 at 20:24

Jan

43.3k11 gold badges57 silver badges87 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Wouter De Coster Over a year ago

This is very elegant, awesome! Now I realized that, depending on the start position of the first block, more compact representations can be obtained (see stackoverflow.com/a/78634653/6631639). And even then it is not the best possible representation, as it can be that leaving, between two repeated blocks, a set of <k characters improves the compactness of the final result.

Barmar · Accepted Answer · 2024-06-17 21:07:12Z

4

Use a capture group and back-reference to match repeated strings.

k=5
s='TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA'
rx = re.compile(rf'(.{{{k}}})\1*')
matches = rx.finditer(s)

output = ''
total_len = 0
for match in matches: 
    repeats = len(match.group()) // k
    total_len += len(match.group())
    if repeats == 1:
        output += match.group(1)
    else:
        output += f'({match.group(1)}){repeats}'

# Get the unmatched part at the end
output += s[total_len:]

print(output)

edited Jun 17, 2024 at 21:07

answered Jun 17, 2024 at 20:33

Barmar

789k57 gold badges555 silver badges669 bronze badges

4 Comments

Wouter De Coster Over a year ago

Thanks! It seems your solution lost the last "TA" of the string, though. Regexes hurt my brain :-)

Barmar Over a year ago

Yeah, this assumes the input is a multiple of k. You could calculate the total length of all the matches and then grab the slice of the input after that.

Barmar Over a year ago

I added that to my answer.

Wouter De Coster Over a year ago

Awesome, yes, that solves it. I did realize that this is not guaranteed to be the most compact representation, as it may be that leaving a sequence of length < k between blocks improves (minimizes) the total length of the string. But that was not the primary objective :)

Wouter De Coster · Accepted Answer · 2024-06-17 21:01:03Z

I slightly modified the answer from @Jan to provide the shortest out of the k possible encodings (with k different start sites for the first block). This can make quite a difference, for some cases.

from itertools import groupby

def rle(sequence, motif_length):
    solutions = []
    for start in range(0, motif_length):
        parts = [sequence[i : i + motif_length] for i in range(start, len(sequence), motif_length)]
        lst = [sequence[0:start] if start > 0 else '']
        for k, g in groupby(parts):
            _len = len(list(g))
            if _len > 1:
                lst.append(f"({k}){_len}")
            else:
                lst.append(k)
        solutions.append("".join(lst))
    # return the shortest solution
    return min(solutions, key=len)    

rle("TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA", 5)
rle("TCTTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTATTTTA", 5)

MonkeyZeus · Accepted Answer · 2024-06-17 20:32:01Z

0

I don't know Python but you could translate this PHP code if you'd like:

$s = 'TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA';
$k = 5;

echo preg_replace_callback( '/(.{'.$k.'})\1+/', function( $matches ) use ( $k ) {
    return '('.$matches[ 1 ].')'.( strlen( $matches[ 0 ] )  / $k );
}, $s );

answered Jun 17, 2024 at 20:32

MonkeyZeus

20.8k4 gold badges41 silver badges83 bronze badges

1 Comment

Alex Reynolds Over a year ago

Note that the question is explicitly asking for a Python solution.

Collectives™ on Stack Overflow

Create regex-like (run length encoding) of string s for blocks of a given length k

4 Answers 4

1 Comment

4 Comments

Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

4 Comments

Comments

1 Comment

Linked

Related