Storing output data in CSV using Python

Question

I have extracted data from Excel sheets spread across different folders. The folders are organized numerically from 2015 to 2019, and each folder has twelve subfolders (from 1 to 12). Here's my code:

import os
from os import walk
import pandas as pd 

path = r'C:\Users\Sarah\Desktop\IOMTest'
my_files = []
for (dirpath, dirnames, filenames) in walk(path):
    my_files.extend([os.path.join(dirpath, fname) for fname in filenames])


all_sheets = []
for file_name in my_files:

    #Display sheets names using pandas
    pd.set_option('display.width',300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0,1,2])

    #Read Excel and Select columns

    mosul_file = pd.read_excel(file_name, sheet_name = 0 , 
    index_col=None, na_values= ['NA'], usecols = "A, E, G, H , L , M" )

    #Remove NaN values

    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    print(data_mosul_df)

Then I saved the extracted columns to a CSV file:

def save_frames(frames, output_path):
    for frame in frames:
        frame.to_csv(output_path, mode='a+', header=False)

if __name__ == '__main__':
    frames = [pd.DataFrame(data_mosul_df)]
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')

My problem is that when I open the CSV file, it doesn't contain all the data, only the last Excel sheet that was read, or sometimes the last two. However, when I print the data in the console (in Spyder), I can see that all the data is processed:

    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    print(data_mosul_df)

The picture below shows the output CSV that was created. I am wondering whether this happens because the information in columns A to E is the same, and that is why it gets overwritten?

I would like to know how to modify the code so that it extracts and stores the data chronologically from the folders (2015 to 2019), taking into account the subfolders (1 to 12) in each folder, and how to create a CSV that stores all the data. Thank you.

You are overwriting data_mosul_df in your loop; you need to collect all data_mosul_df results...
– 576i, Commented Jan 16, 2020 at 8:51
First, initialize df = pd.DataFrame(). Second, read df_ = pd.read_excel(). Last, df = pd.concat([df, df_]). Do the second and last steps inside the loop. Return the result from your function.
– Sergey Bushmanov, Commented Jan 16, 2020 at 8:52
@SergeyBushmanov Thank you. Sorry, I am a bit confused: should these steps be done after data_mosul_df?
– sf61, Commented Jan 16, 2020 at 8:59

Sergey Bushmanov · Accepted Answer · 2020-01-16 09:25:47Z

Rewrite your loop:

# Start with an empty list so every file's frame can be collected
all_sheets = []
for file_name in my_files:

    #Display sheets names using pandas
    pd.set_option('display.width',300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0,1,2])

    #Read Excel and Select columns
    mosul_file = pd.read_excel(file_name, sheet_name = 0 , 
    index_col=None, na_values= ['NA'], usecols = "A, E, G, H , L , M" )

    #Remove NaN values
    #Note: the to_numeric result is replaced by the next line, so only dropna() takes effect
    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()

    #Make a list of df's
    all_sheets.append(data_mosul_df)
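
The loop processes my_files in whatever order os.walk happens to return them, which is not guaranteed to be chronological (and a plain string sort would put folder 10 before folder 2). A minimal sketch for sorting the paths first, assuming the layout is root/year/month/file with purely numeric folder names; the helper names chronological_key and sort_chronologically are made up for illustration:

```python
from pathlib import Path


def chronological_key(file_path, root):
    """Sort key (year, month, file name) for a path like root/2015/3/report.xlsx."""
    # Assumption: the first two folders below root are the year and the month.
    parts = Path(file_path).relative_to(root).parts
    year, month = int(parts[0]), int(parts[1])
    return (year, month, parts[-1])


def sort_chronologically(paths, root):
    # Comparing integers puts month 2 before month 10, unlike string sorting.
    return sorted(paths, key=lambda p: chronological_key(p, root))
```

You could then call my_files = sort_chronologically(my_files, path) before the loop. A file sitting directly in the root folder, or a folder whose name is not a number, would raise an error with this sketch, so it may need adjusting to your actual layout.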

Rewrite your save_frames:

def save_frames(frames, output_path):
    frames.to_csv(output_path, mode='a+', header=False)
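
One thing to watch for: with mode='a+' the rows are appended to whatever is already in the file, so running the script twice stores everything twice. Since all the data is now written in a single call, a variant that overwrites the file may be closer to what you want. This is only a sketch; writing the header and dropping the index are my assumptions, not something your question asks for:

```python
import pandas as pd


def save_frames(frames, output_path):
    # mode='w' replaces an existing file, so re-running does not duplicate rows.
    # header=True writes the column names once; index=False omits the row index.
    frames.to_csv(output_path, mode='w', header=True, index=False)
```

If you do need the index column or a header-less file, switch those two arguments back.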

Rewrite your main:

if __name__ == '__main__':
    frames = pd.concat(all_sheets)
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')
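
To see why the original version kept only the last sheet, here is a small self-contained illustration. It uses made-up in-memory frames instead of your Excel files, and the column names A and E are just placeholders:

```python
import pandas as pd

# Three small stand-ins for the frames read from the Excel files (made-up values).
sheets = [pd.DataFrame({"A": [i], "E": [i * 10]}) for i in range(3)]

last_only = None
all_sheets = []
for frame in sheets:
    last_only = frame         # rebinding: only the most recent frame survives
    all_sheets.append(frame)  # accumulating: every frame is kept

# Stack the collected frames into one DataFrame with a fresh 0..n-1 index.
combined = pd.concat(all_sheets, ignore_index=True)
print(len(last_only), len(combined))
```

After the loop, last_only holds a single row while combined holds all three, which mirrors what happened with data_mosul_df in your script.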
