Parsing csv in python

Question

I'm trying to parse a CSV file in Python and print the sum of order_total for each day. Below is the sample CSV file:

order_total   created_datetime
24.99         2015-06-01 00:00:12
0             2015-06-01 00:03:15
164.45        2015-06-01 00:04:05
24.99         2015-06-01 00:08:01
0             2015-06-01 00:08:23
46.73         2015-06-01 00:08:51
0             2015-06-01 00:08:58
47.73         2015-06-02 00:00:25
101.74        2015-06-02 00:04:11
119.99        2015-06-02 00:04:35
38.59         2015-06-02 00:05:26
73.47         2015-06-02 00:06:50
34.24         2015-06-02 00:07:36
27.24         2015-06-03 00:01:40
82.2          2015-06-03 00:12:21
23.48         2015-06-03 00:12:35

My objective here is to print the sum(order_total) for each day. For example the result should be

2015-06-01 -> 261.16
2015-06-02 -> 415.76
2015-06-03 -> 132.92

I have written the code below. It does not perform the summing logic yet; for now I'm printing some sample statements to check that it parses the file and loops as required.

def sum_orders_test(self,start_date,end_date):
        initial_date = datetime.date(int(start_date.split('-')[0]),int(start_date.split('-')[1]),int(start_date.split('-')[2]))
        final_date = datetime.date(int(end_date.split('-')[0]),int(end_date.split('-')[1]),int(end_date.split('-')[2]))
        day = datetime.timedelta(days=1)
        with open("file1.csv", 'r') as data_file:
            next(data_file)
            reader = csv.reader(data_file, delimiter=',')
            if initial_date <= final_date:
                for row in reader:
                    if str(initial_date) in row[1]:
                        print 'initial_date : ' + str(initial_date)
                        print 'Date : ' + row[1]
                    else:
                        print 'Else'
                        initial_date = initial_date + day

With my current logic I'm running into this issue:

As you can see in the sample CSV, there are 7 rows for 2015-06-01, 6 rows for 2015-06-02 and 3 rows for 2015-06-03.
The code above, however, prints 7 values for 2015-06-01, 5 for 2015-06-02 and 2 for 2015-06-03.

Calling the function using sum_orders_test('2015-06-01','2015-06-03');

I know there is some silly logical issue, but being new to programming and python I'm unable to figure it out.

delimiter=',')... Please tell me where the commas in the file are.
– OneCricketeer, Commented Sep 3, 2017 at 8:22
It's a CSV file, hence the ',', but there are no commas in the file.
– Firstname, Commented Sep 3, 2017 at 8:24
That's exactly your problem... Python does not care about file extensions. Change the delimiter so you can actually read the data correctly.
– OneCricketeer, Commented Sep 3, 2017 at 8:25
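
To illustrate the point made in these comments, here is a small sketch of what csv.reader returns when the delimiter does not match the data. It assumes the file is tab-separated, which the question does not confirm. The two sample lines are copied from the question, and the variable names are only illustrative.

```python
import csv
import io

# Assumption: the columns are separated by a tab character.
sample = "order_total\tcreated_datetime\n24.99\t2015-06-01 00:00:12\n"

# Wrong delimiter: no commas are found, so each line stays a single field.
comma_rows = list(csv.reader(io.StringIO(sample), delimiter=','))

# Matching delimiter: each line is split into two fields.
tab_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

print(comma_rows[1])
print(tab_rows[1])
```

With the comma delimiter every row arrives as one field, so an expression such as row[1] would raise an IndexError.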

afagarap · Accepted Answer · 2017-09-03 09:05:48Z


I've re-read the question. If your data is really tab-separated, the following code does the job (using pandas):

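import pandas as pd

# header=0 replaces the file's own header row with the given column names
df = pd.read_csv('file.csv', sep='\t', header=0, names=['order_total', 'created_datetime'])
# Keep only the calendar date so that rows group by day
df['created_datetime'] = pd.to_datetime(df['created_datetime']).dt.date
df = df.groupby('created_datetime').sum()
print(df)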

Gives the following result:

                  order_total
created_datetime             
2015-06-01             261.16
2015-06-02             415.76
2015-06-03             132.92

Less code, and probably lower algorithmic complexity.

edited Sep 3, 2017 at 9:05

answered Sep 3, 2017 at 8:27


3 Comments

Firstname Over a year ago

It looks much easier, but my file is a CSV file, although there isn't any tab or comma in it. It's a normal Excel file saved as CSV. When I replace the '\t' with ',' and run it, I get the error below:
    df['created_datetime'] = pd.to_datetime(df.created_datetime).dt.date
  File "/Library/Python/2.7/site-packages/pandas/core/tools/datetimes.py", line 509, in to_datetime
    values = _convert_listlike(arg._values, False, format)
  File "/Library/Python/2.7/site-packages/pandas/core/tools/datetimes.py", line 447, in _convert_listlike
    raise e
ValueError: Unknown string format
@Abien

afagarap Over a year ago

Will you please give a link to a sample of your data?

OneCricketeer Over a year ago

It certainly is :)

Pythonist · Accepted Answer · 2017-09-03 08:39:34Z

This one should do the job.

The csv module has DictReader, which accepts fieldnames. Instead of accessing columns by index (row[0]), you can predefine column names and access them by name (row['date']).

import csv
from collections import defaultdict
from datetime import datetime


def sum_orders_test(self, start_date, end_date):
    FIELDNAMES = ['orders', 'date']
    sum_of_orders = defaultdict(float)  # missing keys start at 0.0

    initial_date = datetime.strptime(start_date, '%Y-%m-%d').date()
    final_date = datetime.strptime(end_date, '%Y-%m-%d').date()
    with open("file1.csv", 'r') as data_file:
        next(data_file)  # Skip the headers
        # Assumes tab-separated data; change the delimiter if your file differs
        reader = csv.DictReader(data_file, fieldnames=FIELDNAMES, delimiter='\t')
        for row in reader:
            # Key each row by its own date so no row is lost at a day boundary
            row_date = datetime.strptime(row['date'].strip(), '%Y-%m-%d %H:%M:%S').date()
            if initial_date <= row_date <= final_date:
                # Totals are decimals such as 24.99, so parse them as float
                sum_of_orders[str(row_date)] += float(row['orders'])
    return sum_of_orders
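
A minimal sketch of why the loop in the question reports 7, 5 and 2 rows instead of 7, 6 and 3: when a row does not match the current date, the else branch advances the date, but that row itself is never processed. The dates are taken from the sample in the question; the function names are hypothetical.

```python
import datetime
from collections import Counter

# Dates of the 16 sample rows from the question (times omitted).
ROW_DATES = ["2015-06-01"] * 7 + ["2015-06-02"] * 6 + ["2015-06-03"] * 3


def count_with_else_advance(row_dates, start):
    """Mimic the question's loop: advance the date in the else branch."""
    current = datetime.date.fromisoformat(start)
    counts = Counter()
    for row_date in row_dates:
        if str(current) in row_date:
            counts[str(current)] += 1
        else:
            # The row that triggered this branch is never counted.
            current += datetime.timedelta(days=1)
    return dict(counts)


def count_by_row_date(row_dates):
    """Count each row under its own date, so none is skipped."""
    return dict(Counter(row_dates))


buggy = count_with_else_advance(ROW_DATES, "2015-06-01")
fixed = count_by_row_date(ROW_DATES)
print(buggy)
print(fixed)
```

Grouping on the date stored in each row avoids the problem, because no row is consumed just to move the loop on to the next day.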

How does defaultdict work? When I try to print sum_of_orders it shows defaultdict(<type 'int'>, {}) @Pythonist
Simply put, it lets you use new keys in a dictionary without first checking whether they exist; a missing key starts with the default value of the given type. The docs explain it better than I can.
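
As a small illustration of that explanation, the sketch below contrasts defaultdict with a plain dict. The two amounts come from the sample data in the question and the variable names are illustrative. An empty result such as defaultdict(<type 'int'>, {}) would simply mean that no key was ever added.

```python
from collections import defaultdict

totals = defaultdict(float)  # a missing key starts at float(), i.e. 0.0
keys_before = list(totals)   # nothing has been added yet

# No existence check needed: the first += creates the key at 0.0.
totals["2015-06-01"] += 24.99
totals["2015-06-01"] += 164.45

# A plain dict raises KeyError for the same operation.
plain = {}
plain_raised = False
try:
    plain["2015-06-01"] += 24.99
except KeyError:
    plain_raised = True

print(keys_before, dict(totals), plain_raised)
```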

Thirupathi Thangavel · Accepted Answer · 2017-09-03 08:51:39Z

You might have a .csv file extension, but your file actually seems to be tab-separated.

You can load it as a pandas DataFrame by specifying the separator.

import pandas as pd
data = pd.read_csv('file.csv', sep='\t')

Then split the datetime column into date and time:

# Add the two parts as new columns so that order_total is kept
data[['date', 'time']] = data.created_datetime.str.strip().str.split(' ', n=1, expand=True)

Then, for each unique date, compute its order_total sum:

for i in data.date.unique():
    # str.format works on both Python 2 and Python 3
    print('{} -> {:.2f}'.format(i, data[data['date'] == i].order_total.sum()))
