
I have a large amount of DNA sequences formatted in a text file like this:

//
BLOCK_1
AATTCAT
AGTTCAT
AATTCGT
//
BLOCK_2
AGAGGA
AGAGGA
AGAGGA

Each line in a block corresponds to one sample. The real dataset has 120 characters per sequence line, 250 samples (lines) per block, and roughly 10,000 such blocks in total.

I now need to remove those positions in each block that are the same in all samples. The desired output looks like this:

//
BLOCK_1
AA
GA
AG
//
BLOCK_2

Getting rid of the blocks where all sequences are identical was quickly done by computing a hash for each line and following this on SO: Python: determine if all items of a list are the same item. However, I'm now struggling to find an efficient way to identify the positions at which at least one of the strings differs.
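For reference, the "all items the same" check mentioned above can be done with a set, since a set of identical strings collapses to a single element. This is a minimal sketch; the function name `block_is_uniform` is my own, not from the linked answer:

```python
def block_is_uniform(block):
    """Return True if every sequence line in the block is identical.

    A set built from the lines has exactly one element when all lines
    are the same (and zero elements for an empty block).
    """
    return len(set(block)) <= 1

print(block_is_uniform(['AGAGGA', 'AGAGGA', 'AGAGGA']))  # True
print(block_is_uniform(['AATTCAT', 'AGTTCAT', 'AATTCGT']))  # False
```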


1 Answer


This is one way of doing it. Note that it only processes a single block, so you'll have to extend it to handle multiple blocks:

>>> block1 = ['AATTCAT', 'AGTTCAT', 'AATTCGT']
>>> [''.join(b) for b in zip(*(r for r in zip(*block1) if len(set(r)) > 1))]
['AA', 'GA', 'AG']

This basically transposes the 2D array with `zip(*block1)` so that each column of letters becomes a row, filters out the rows whose letters are all the same (i.e. the columns that are constant across all samples), and then re-transposes the result and joins each row back into a string.
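To apply this across the whole file, the one-liner can be wrapped in a function and driven by a small parser for the `//`-separated format shown in the question. This is a sketch under that assumption; the names `strip_constant_columns` and `process_blocks` are my own:

```python
def strip_constant_columns(block):
    """Keep only the positions where the sequences in a block differ.

    zip(*block) yields the columns; columns with more than one distinct
    letter are kept; zip(*...) transposes back to rows. If every column
    is constant, the result is an empty list.
    """
    return [''.join(row)
            for row in zip(*(col for col in zip(*block)
                             if len(set(col)) > 1))]


def process_blocks(lines):
    """Yield (block_name, filtered_sequences) pairs from the file lines.

    Assumes the format from the question: '//' starts a block, the next
    line is the block name, and the following lines are the sequences.
    """
    name, block = None, []
    for line in lines:
        line = line.strip()
        if line == '//':
            if name is not None:
                yield name, strip_constant_columns(block)
            name, block = None, []
        elif name is None:
            name = line
        elif line:
            block.append(line)
    if name is not None:  # flush the final block
        yield name, strip_constant_columns(block)
```

Usage with a file would be `for name, seqs in process_blocks(open('data.txt')): ...`; since everything is generator-based, only one block is held in memory at a time, which matters with ~10,000 blocks.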

    Thank you very much, this did indeed solve my problem. I already had each block as a list of strings in my code, so this was easily extended and it ran efficiently on the entire dataset. I'm still struggling to understand how zip() works, but I'll do some reading there.
    – Sam
    Commented May 8, 2018 at 4:46
