I have a sample data set:

import pandas as pd
import re

df = {'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
             'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
      'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1]}

df = pd.DataFrame(df)

It looks like this:

df
Out[12]: 
     Pro    READID        VG
0   1.00       1      LV5-F*01
1   1.00       2      LV5-F*01
2   1.00       3      LV5-A*02
3   0.33       4      LV5-D*01
4   0.59       5      LV5-E*01
5   1.00       6      LV5-C*01
6   0.96       7      LV5-D*01
7   1.00       8      LV5-E*01
8   1.00       9      LV5-F*01

I want to group by column 'VG', using only the part of each value before the '*', and then write each resulting group to a separate file.

My concept is:

  1. group the dataset 'df' by column 'VG'
  2. for each row of column 'VG' look at only the part before the '*', e.g. 'LV5-F', 'LV5-A', 'LV5-D', etc.
  3. group the dataset once again but this time for the same values from step 2
  4. output each different grouped set to a separate file.

Desired output, as individual files:

'LV5-F.txt':
     Pro    READID        VG
0   1.00       1      LV5-F*01
1   1.00       2      LV5-F*01
8   1.00       9      LV5-F*01


'LV5-A.txt':
     Pro    READID        VG
2   1.00       3      LV5-A*02


'LV5-D.txt':
     Pro    READID        VG
3   0.33       4      LV5-D*01
6   0.96       7      LV5-D*01


'LV5-E.txt':
     Pro    READID        VG
4   0.59       5      LV5-E*01
7   1.00       8      LV5-E*01


'LV5-C.txt':
     Pro    READID        VG
5   1.00       6      LV5-C*01

My attempt:

(df.groupby('VG')
   .apply(lambda x: re.findall('([0-9A-Z-]+)\*',x) )
   .groupby('VG')
   .apply(lambda gp: gp.to_csv('{}.txt'.format(gp.name), sep='\t',   index=False))
 )

But it failed at the '.apply(lambda x: re.findall('([0-9A-Z-]+)\*',x))' step, and I'm not sure why: when I ran the same re.findall call on its own, outside the lambda, it worked fine.
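One likely cause, sketched below on a shortened copy of the data: re.findall expects a string, while the lambda inside groupby(...).apply receives a whole sub-DataFrame for each group. The three-row frame and the variable names are illustrative assumptions, not part of the original code.

```python
import re

import pandas as pd

# Shortened, illustrative copy of the sample data.
df = pd.DataFrame({
    'READID': [1, 2, 3],
    'VG': ['LV5-F*01', 'LV5-A*02', 'LV5-D*01'],
    'Pro': [1, 1, 0.33],
})

pattern = r'([0-9A-Z-]+)\*'

# On a plain string the pattern behaves as expected.
on_string = re.findall(pattern, 'LV5-F*01')

# Iterating a groupby yields (key, sub-DataFrame) pairs; apply hands the
# same kind of sub-DataFrame to the lambda, not a string.
key, group = next(iter(df.groupby('VG')))

try:
    re.findall(pattern, group)
    error = None
except TypeError as exc:
    # re.findall rejects anything that is not a string or bytes-like object.
    error = exc

print(on_string)
print(type(group).__name__, '->', type(error).__name__)
```

This is why the same call works when tested on a single value but not inside the lambda.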

2 Answers

You'll have to adjust the to_csv function below to suit your needs. In particular, instead of printing the result, pass a file name to df.to_csv.

But I'd structure it this way:

def to_csv(df):
    # Prints the group as CSV text; pass a path to df.to_csv to write a file instead.
    print(df.to_csv())

#    extract
#     within
#     parens
#    /------\
# r'^([^\*]+)'
#   ^ \----/
#   |   \__________________________
# match       |          |         |
# beginning  [^this]    \*        '+'
# of string  matches   have to    match
#            not this  escape *   one or more
#
df.groupby(df.VG.str.extract(r'^([^\*]+)', expand=False)).apply(to_csv)

,Pro,READID,VG
2,1.0,3,LV5-A*02

,Pro,READID,VG
2,1.0,3,LV5-A*02

,Pro,READID,VG
5,1.0,6,LV5-C*01

,Pro,READID,VG
3,0.33,4,LV5-D*01
6,0.96,7,LV5-D*01

,Pro,READID,VG
4,0.59,5,LV5-E*01
7,1.0,8,LV5-E*01

,Pro,READID,VG
0,1.0,1,LV5-F*01
1,1.0,2,LV5-F*01
8,1.0,9,LV5-F*01
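
To write files instead of printing, one option is to loop over the groups and pass a path to to_csv. The following is a minimal sketch assuming a recent pandas with Python 3; the temporary output directory and the names prefix, out_dir and written are illustrative. Passing sep='\t' gives tab-separated files rather than the comma-separated text shown above.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
           'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
    'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1],
})

# Group key: everything before the first '*'.
# expand=False makes extract return a Series, which groupby accepts as a key.
prefix = df['VG'].str.extract(r'^([^\*]+)', expand=False)

# Illustrative output location; replace with a real directory as needed.
out_dir = Path(tempfile.mkdtemp())

written = {}
for name, group in df.groupby(prefix):
    path = out_dir / f'{name}.txt'
    # Tab-separated, without the index column.
    group.to_csv(path, sep='\t', index=False)
    written[name] = path

print(sorted(written))
```

Each file keeps the full 'VG' values (for example 'LV5-F*01'); only the file name uses the shortened prefix.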

3 Comments

I got an error: "TypeError: extract() got an unexpected keyword argument 'expand'". Also, why does the output you show contain commas? Is there a way to produce the output I wanted?
@Jessica drop that argument. It'll complain if you don't have it in pandas version 0.18.1. Prior to that, it complains that you have it at all.
Can you explain the regex part, r'^([^\*]+)'? Thanks.
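
To illustrate the pattern outside pandas, here is a small sketch using the standard re module. The sample strings 'NOSTAR' and '*01' are made up to show the edge cases.

```python
import re

# ^        anchor at the start of the string
# [^\*]+   one or more characters that are not '*'
# ( ... )  capture that run so it can be extracted
pattern = re.compile(r'^([^\*]+)')

for value in ['LV5-F*01', 'LV5-A*02', 'NOSTAR']:
    match = pattern.match(value)
    # group(1) is the captured text before the first '*'
    print(value, '->', match.group(1))

# A value that starts with '*' has nothing to capture, so there is no match.
no_match = pattern.match('*01')
print(no_match)
```

Note that a value with no '*' at all is captured whole, since every character satisfies "not '*'".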

I modified my code with help from @piRSquared and it worked:

df.groupby(df.VG.str.extract(r'^([^\*]+)')).apply(lambda gp: gp.to_csv('{}.txt'.format(gp.name), sep='\t', index=False))
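
A caveat for readers on newer pandas: as far as I know, the default of expand has since changed to True, in which case str.extract returns a DataFrame even for a single capture group, and that is not usable as a groupby key in the same way. The sketch below only shows the difference in return types; the two-element Series is illustrative.

```python
import pandas as pd

vg = pd.Series(['LV5-F*01', 'LV5-A*02'], name='VG')

# Without expand, recent pandas versions return a one-column DataFrame.
default_result = vg.str.extract(r'^([^\*]+)')

# With expand=False and one capture group, the result is a Series,
# which is the shape groupby expects for a grouping key.
series_result = vg.str.extract(r'^([^\*]+)', expand=False)

print(type(default_result).__name__)
print(type(series_result).__name__)
print(series_result.tolist())
```

So on a current installation, passing expand=False explicitly, as in the first answer, is probably the safer form of the one-liner.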
