I have a large number of DNA sequences in a text file formatted like this:
//
BLOCK_1
AATTCAT
AGTTCAT
AATTCGT
//
BLOCK_2
AGAGGA
AGAGGA
AGAGGA
Each line in a block corresponds to one sample. In the real dataset, each sequence line has 120 characters, each block has 250 samples (lines), and there are roughly 10,000 blocks in total.
I now need to remove from each block the positions (columns) that are identical in all samples, keeping only the positions where at least one sample differs. The desired output looks like this:
//
BLOCK_1
AA
GA
AG
//
BLOCK_2
Getting rid of the blocks in which all sequences are identical was quick: I created a list of hashes, one per line, and followed this SO question: "Python: determine if all items of a list are the same item". However, I'm now struggling to find an efficient way to identify the positions in those strings where at least one of them differs.
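To make the goal concrete, here is a minimal sketch of the behaviour I'm after, not a tuned solution. It assumes that all sequences within a block have the same length, that `//` always sits alone on a line, and that the line right after `//` is the block name. The function names `variable_columns` and `filter_blocks` are made up for illustration.

```python
def variable_columns(seqs):
    """Reduce each sequence to the columns where the samples differ."""
    # zip(*seqs) walks the block column by column; a column is kept
    # only if it holds more than one distinct character.
    # Assumption: all sequences have equal length (zip stops at the shortest).
    kept = [col for col in zip(*seqs) if len(set(col)) > 1]
    if not kept:
        return []  # block with no variable positions (or no sequences)
    # Transpose the kept columns back into one string per sample.
    return ["".join(row) for row in zip(*kept)]


def filter_blocks(lines):
    """Return the output lines for input in the '//' + block-name format."""
    out, seqs, expect_name = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # A new block starts: flush the sequences collected so far.
            out.extend(variable_columns(seqs))
            seqs = []
            out.append(line)
            expect_name = True
        elif expect_name:
            # Assumption: the line directly after '//' is the block name.
            out.append(line)
            expect_name = False
        else:
            seqs.append(line)
    # Flush the last block.
    out.extend(variable_columns(seqs))
    return out


example = """//
BLOCK_1
AATTCAT
AGTTCAT
AATTCGT
//
BLOCK_2
AGAGGA
AGAGGA
AGAGGA
""".splitlines()

print("\n".join(filter_blocks(example)))
```

On the sample above this prints the desired output, with BLOCK_2 left empty. What I don't know is whether this column-by-column pass is fast enough for 250 samples x 120 characters x roughly 10,000 blocks, or whether there is a better way.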