
I have a large amount of DNA sequences formatted in a text file like this:

//
BLOCK_1
AATTCAT
AGTTCAT
AATTCGT
//
BLOCK_2
AGAGGA
AGAGGA
AGAGGA

Each line in a block corresponds to one sample. The real dataset has 120 characters per sequence line, 250 samples (lines) per block, and roughly 10,000 such blocks in total.

I now need to remove those positions in each block that are the same in all samples. The desired output looks like this:

//
BLOCK_1
AA
GA
AG
//
BLOCK_2

Getting rid of the blocks where all sequences are identical was quickly done by computing a hash for each line and following this on SO: Python: determine if all items of a list are the same item. However, I'm now struggling to find an efficient way to identify the positions at which at least one of the strings differs.
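For reference, the "all items the same" check mentioned above can be done with a set, since a set of identical strings collapses to a single element. This is a minimal sketch; the function name `block_is_uniform` is my own, not from the linked answer:

```python
def block_is_uniform(block):
    """Return True if every sequence line in the block is identical.

    A set built from the lines has exactly one element when all lines
    are the same (and zero elements for an empty block).
    """
    return len(set(block)) <= 1

print(block_is_uniform(['AGAGGA', 'AGAGGA', 'AGAGGA']))  # True
print(block_is_uniform(['AATTCAT', 'AGTTCAT', 'AATTCGT']))  # False
```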


1 Answer


This is one way of doing it. Note that it only processes a single block, so you'll have to extend it to handle multiple blocks:

>>> block1 = ['AATTCAT', 'AGTTCAT', 'AATTCGT']
>>> [''.join(b) for b in zip(*(r for r in zip(*block1) if len(set(r)) > 1))]
['AA', 'GA', 'AG']

This basically transposes the 2D array with `zip(*block1)` so that each column of letters becomes a row, filters out the rows whose letters are all the same (i.e. the columns that are constant across all samples), and then re-transposes the result and joins each row back into a string.
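To apply this across the whole file, the one-liner can be wrapped in a function and driven by a small parser for the `//`-separated format shown in the question. This is a sketch under that assumption; the names `strip_constant_columns` and `process_blocks` are my own:

```python
def strip_constant_columns(block):
    """Keep only the positions where the sequences in a block differ.

    zip(*block) yields the columns; columns with more than one distinct
    letter are kept; zip(*...) transposes back to rows. If every column
    is constant, the result is an empty list.
    """
    return [''.join(row)
            for row in zip(*(col for col in zip(*block)
                             if len(set(col)) > 1))]


def process_blocks(lines):
    """Yield (block_name, filtered_sequences) pairs from the file lines.

    Assumes the format from the question: '//' starts a block, the next
    line is the block name, and the following lines are the sequences.
    """
    name, block = None, []
    for line in lines:
        line = line.strip()
        if line == '//':
            if name is not None:
                yield name, strip_constant_columns(block)
            name, block = None, []
        elif name is None:
            name = line
        elif line:
            block.append(line)
    if name is not None:  # flush the final block
        yield name, strip_constant_columns(block)
```

Usage with a file would be `for name, seqs in process_blocks(open('data.txt')): ...`; since everything is generator-based, only one block is held in memory at a time, which matters with ~10,000 blocks.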

    Thank you very much, this did indeed solve my problem. I already had each block as a list of strings in my code, so this was easily extended and it ran efficiently on the entire dataset. I'm still struggling to understand how zip() works, but I'll do some reading there.
    – Sam
    Commented May 8, 2018 at 4:46
