
I have extracted data from different Excel sheets spread across different folders. The folders are organized numerically by year, from 2015 to 2019, and each folder has twelve subfolders (from 1 to 12). Here is my code:

import os
from os import walk
import pandas as pd 

path = r'C:\Users\Sarah\Desktop\IOMTest'
my_files = []
for (dirpath, dirnames, filenames) in walk(path):
    my_files.extend([os.path.join(dirpath, fname) for fname in filenames])


all_sheets = []
for file_name in my_files:

    #Display sheets names using pandas
    pd.set_option('display.width',300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0,1,2])

    #Read Excel and Select columns

    mosul_file = pd.read_excel(file_name, sheet_name = 0 , 
    index_col=None, na_values= ['NA'], usecols = "A, E, G, H , L , M" )

    #Remove NaN values

    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    print(data_mosul_df)

Then I saved the extracted columns to a CSV file:

def save_frames(frames, output_path):

    for frame in frames:
        frame.to_csv(output_path, mode='a+', header=False)

if __name__ == '__main__':
    frames =[pd.DataFrame(data_mosul_df)]
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')

My problem is that when I open the CSV file, it does not contain all the data, only the last Excel sheet that was read, or sometimes the last two. However, when I print the data in the console (in Spyder), I can see that all of it is processed:

    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    print(data_mosul_df) 

The picture below shows the output CSV that was created. I am wondering whether this happens because the information in columns A to E is the same, so that it gets overwritten?

[image: output CSV file]

I would like to know how to modify the code so that it extracts and stores the data chronologically by folder (2015 to 2019), taking into account the subfolders (from 1 to 12) in each folder, and how to create a CSV that stores all the data. Thank you.

  • You are overwriting data_mosul_df in your loop, you need to collect all data_mosul_df results... Commented Jan 16, 2020 at 8:51
  • First, initialize df = pd.DataFrame(). Second, read df_ = pd.read_excel(). Last, df = pd.concat([df, df_]). Do the second and last steps inside the loop. Return the result from your function. Commented Jan 16, 2020 at 8:52
  • @SergeyBushmanov Thank you. Sorry, I am a bit confused: should these steps be done after data_mosul_df? Commented Jan 16, 2020 at 8:59

1 Answer


Rewrite your loop:

for file_name in my_files:

    #Display sheets names using pandas
    pd.set_option('display.width',300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0,1,2])

    #Read Excel and Select columns
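    mosul_file = pd.read_excel(file_name, sheet_name = 0 , 
    index_col=None, na_values= ['NA'], usecols = "A, E, G, H , L , M" )

    #Remove NaN values
    #Note: the coerced frame from the first line is overwritten by the second,
    #so dropna() runs on mosul_file exactly as it was read
    data_mosul_df = mosul_file.apply (pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()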

    #Make a list of df's
    all_sheets.append(data_mosul_df)

Rewrite your save_frames:

def save_frames(frames, output_path):
    frames.to_csv(output_path, mode='a+', header=False)
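
One thing to keep in mind with `mode='a+'`: the file is appended to, so rows from an earlier run of the script stay in the CSV. That may be one reason the output sometimes seemed to hold "the two last" sheets. Below is a minimal sketch of the difference between append mode and the default write mode; the frame contents and file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the concatenated data.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    appended = os.path.join(tmp_dir, "appended.csv")
    overwritten = os.path.join(tmp_dir, "overwritten.csv")

    # Simulate running the script twice against the same output files.
    for _ in range(2):
        frame.to_csv(appended, mode="a", header=False)     # keeps old rows
        frame.to_csv(overwritten, mode="w", header=False)  # starts fresh

    with open(appended) as fh:
        appended_rows = len(fh.readlines())
    with open(overwritten) as fh:
        overwritten_rows = len(fh.readlines())

print(appended_rows, overwritten_rows)
```

If the CSV should reflect only the current run, writing once with `mode='w'` (the default) after `pd.concat` is probably the simpler choice.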

Rewrite your main:

if __name__ == '__main__':
    frames = pd.concat(all_sheets)
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')
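
The question also asks for chronological order. `os.walk` does not guarantee any particular ordering, and a plain string sort would put folder `10` before folder `2`. Here is a rough sketch of one way to sort the collected paths by year and month before the loop. It assumes every file sits at `<root>/<year>/<month>/<file>` with purely numeric folder names; `chronological_key`, `sort_chronologically`, and the example paths are hypothetical names introduced for illustration.

```python
import os


def chronological_key(file_path, root):
    """Return (year, month, file name) for a path shaped like root/<year>/<month>/<file>.

    Assumption: the first two folders below root are numeric (e.g. 2015 and 7).
    A path that does not follow this layout raises ValueError or IndexError.
    """
    rel_parts = os.path.relpath(file_path, root).split(os.sep)
    year, month = int(rel_parts[0]), int(rel_parts[1])
    return (year, month, rel_parts[-1])


def sort_chronologically(file_paths, root):
    """Sort paths by numeric year, then numeric month, then file name."""
    return sorted(file_paths, key=lambda p: chronological_key(p, root))


# Hypothetical paths mimicking the folder layout described in the question.
root = "IOMTest"
example = [
    os.path.join(root, "2016", "1", "a.xlsx"),
    os.path.join(root, "2015", "10", "b.xlsx"),
    os.path.join(root, "2015", "2", "c.xlsx"),
]
ordered = sort_chronologically(example, root)
print(ordered)
```

In the code above this would be applied as `my_files = sort_chronologically(my_files, path)` before the loop. Since `pd.concat` keeps the frames in list order, the rows in the CSV should then follow the same year and month order.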