I have a big file whose entries I read in Python like this:

import csv

fh_in = open('/xzy/abc', 'r')
parsed_in = csv.reader(fh_in, delimiter=',')
for element in parsed_in:
    print(element)

RESULT:

['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU']

['DEF', 'chr9', '14855289', 'NAME19', 'UCG', 'GUC']

['TTC', 'chr9', '793946', 'NAME178', 'CAG', 'GUC']

['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU']

I need to keep only the unique entries, removing any entry that has the same values in col1, col2 and col3 as an earlier one. In this case the last line duplicates line 1 on the basis of col1, col2 and col3.

I have tried two methods, but both failed:

Method 1:

outlist=[]

for element in parsed_in:     
  if element[0:3] not in outlist[0:3]:
    outlist.append(element)

Method 2:

outlist=[]
parsed_list=list(parsed_in)
for element in range(0,len(parsed_list)):
  if parsed_list[element] not in parsed_list[element+1:]:
    outlist.append(parsed_list[element])

Both of these give back all the entries rather than only the entries that are unique on the first 3 columns.

Please suggest a way to do this.

AK

2 Answers

You probably want an O(1) lookup so that you avoid a full scan of the already-collected elements on every insert and, like Caol Acain said, a set is a good way to do it.

What you want to do is something like:

outlist=[]
added_keys = set()

for row in parsed_in:
    # We use tuples because they are hashable
    lookup = tuple(row[:3])    
    if lookup not in added_keys:
        outlist.append(row)
        added_keys.add(lookup)

You could alternatively use a dictionary mapping the key to the row. The caveat is that before Python 3.7 a plain dict does not guarantee insertion order, so you would lose the ordering of the input there; keeping the list plus the key set preserves the in-file ordering on any version.
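For what it's worth, here is a minimal sketch of that dictionary variant. It assumes Python 3.7 or later, where dicts keep insertion order, and it uses a small in-memory sample in place of the real file. The names `sample` and `unique_by_key` are made up for illustration and are not from the question.

```python
import csv
import io

# Hypothetical sample standing in for the real file from the question
sample = io.StringIO(
    "ABC,chr9,3468582,NAME1,UGA,GGU\n"
    "DEF,chr9,14855289,NAME19,UCG,GUC\n"
    "TTC,chr9,793946,NAME178,CAG,GUC\n"
    "ABC,chr9,3468582,NAME272,UGT,GCU\n"
)


def unique_by_key(rows, key_len=3):
    """Keep the first row seen for each distinct prefix of key_len columns."""
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if the key is not already present,
        # so later duplicates are ignored and the first occurrence wins
        first_seen.setdefault(tuple(row[:key_len]), row)
    # Assumes Python 3.7+, where dict values come back in insertion order
    return list(first_seen.values())


outlist = unique_by_key(csv.reader(sample))
for row in outlist:
    print(row)
```

On older interpreters the list-plus-set version above is the safer choice, since it does not rely on dict ordering at all.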

  • First good answer, much better than the one I was going to post. +1
    – MitMaro
    Commented Mar 1, 2012 at 21:09

Convert your lists to sets!

http://docs.python.org/tutorial/datastructures.html#sets

  • I thought of this first as well, but if you read the problem more closely you will see that plain sets won't work: the rows have to be unique only on the first three elements of each sub-list.
    – MitMaro
    Commented Mar 1, 2012 at 21:06
