I have a sample data set:
import pandas as pd
import re
df = {'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
             'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
      'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1]}
df = pd.DataFrame(df)
It looks like this:
df
Out[12]:
    Pro  READID        VG
0  1.00       1  LV5-F*01
1  1.00       2  LV5-F*01
2  1.00       3  LV5-A*02
3  0.33       4  LV5-D*01
4  0.59       5  LV5-E*01
5  1.00       6  LV5-C*01
6  0.96       7  LV5-D*01
7  1.00       8  LV5-E*01
8  1.00       9  LV5-F*01
I want to group by column 'VG', but using only the part of each value before the '*', and then write each resulting group to a separate file.
My idea is:
- group the dataset 'df' by column 'VG'
- for each row of column 'VG', look at only the part before the '*', e.g. 'LV5-F', 'LV5-A', 'LV5-D', etc.
- group the dataset once again, this time by the shortened values from the previous step
- write each resulting group to a separate file.
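For illustration, the steps above can probably be collapsed into a single grouping on a derived key. This is only a minimal sketch, assuming every 'VG' value contains a '*'; the names `prefix` and `groups` are made up for the example and it stops short of writing any files.

```python
import pandas as pd

df = pd.DataFrame({
    'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
           'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
    'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1],
})

# Grouping key: everything before the first '*' in each VG value.
# 'prefix' is a hypothetical name, not a column of the original data.
prefix = df['VG'].str.split('*').str[0]

# Group by the derived key; the original 'VG' column is left untouched.
groups = {name: group for name, group in df.groupby(prefix)}
```

Here `groups['LV5-F']` would hold the rows whose 'VG' starts with 'LV5-F', with the full 'VG' values still intact.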
Desired output, as individual separate files:
'LV5-F.txt':
    Pro  READID        VG
0  1.00       1  LV5-F*01
1  1.00       2  LV5-F*01
8  1.00       9  LV5-F*01
'LV5-A.txt':
    Pro  READID        VG
2  1.00       3  LV5-A*02
'LV5-D.txt':
    Pro  READID        VG
3  0.33       4  LV5-D*01
6  0.96       7  LV5-D*01
'LV5-E.txt':
    Pro  READID        VG
4  0.59       5  LV5-E*01
7  1.00       8  LV5-E*01
'LV5-C.txt':
    Pro  READID        VG
5  1.00       6  LV5-C*01
My attempt:
(df.groupby('VG')
.apply(lambda x: re.findall('([0-9A-Z-]+)\*',x) )
.groupby('VG')
.apply(lambda gp: gp.to_csv('{}.txt'.format(gp.name), sep='\t', index=False))
)
But it fails at the `.apply(lambda x: re.findall('([0-9A-Z-]+)\*',x) )` step, and I'm not sure why. When I ran that re.findall call by itself, outside of a lambda function, it worked fine.
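A likely explanation, sketched under assumptions: `groupby(...).apply` passes each group to the lambda as a whole sub-DataFrame, while `re.findall` expects a string, so the call raises a TypeError. The snippet below reproduces that and then applies the same pattern through the vectorized `Series.str.extract` instead. The names `key`, `out_dir`, and `written` are hypothetical, and the files go to a temporary directory rather than the working directory.

```python
import os
import re
import tempfile

import pandas as pd

df = pd.DataFrame({
    'READID': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'VG': ['LV5-F*01', 'LV5-F*01', 'LV5-A*02', 'LV5-D*01', 'LV5-E*01',
           'LV5-C*01', 'LV5-D*01', 'LV5-E*01', 'LV5-F*01'],
    'Pro': [1, 1, 1, 0.33, 0.59, 1, 0.96, 1, 1],
})

# re.findall needs a string; handing it a DataFrame (which is what the
# lambda receives inside groupby(...).apply) raises TypeError.
try:
    re.findall(r'([0-9A-Z-]+)\*', df)
    failed = False
except TypeError:
    failed = True

# Apply the same pattern to every 'VG' value at once; expand=False
# returns a Series holding the captured prefix for each row.
key = df['VG'].str.extract(r'([0-9A-Z-]+)\*', expand=False)

out_dir = tempfile.mkdtemp()  # assumed output location for this sketch
written = []
for name, group in df.groupby(key):
    path = os.path.join(out_dir, '{}.txt'.format(name))
    # index=False mirrors the attempt above; drop it to keep the row index.
    group.to_csv(path, sep='\t', index=False)
    written.append(os.path.basename(path))
```

Looping over the groups, rather than calling `apply` for its side effects, keeps the file writing explicit and gives one tab-separated file per prefix.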