Problem statement: given the access log below, count the requests per timestamp minute (hh:mm), sort the counts by minute, and write them to a CSV file.
access.log:
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:05:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:05:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
Expected output:
# cat output.csv:
14:04, 4
14:05, 2
My solution:
import re
import csv

# Compile the pattern once, outside the loop; a raw string avoids
# invalid-escape warnings. Groups: ip, timestamp, request, status,
# size, referrer, user agent.
log_pattern = re.compile(
    r'([\d.]+) - - \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"'
)

minute_counts = {}
with open('access.log', 'r') as loglines:
    for line in loglines:
        re_match = log_pattern.match(line)
        if re_match is None:               # skip blank or malformed lines
            continue
        timestamp = re_match.group(2)      # e.g. 25/Sep/2002:14:04:19 +0200
        hh_mm = timestamp[12:17]           # slice out the hh:mm part
        if hh_mm in minute_counts:
            minute_counts[hh_mm] += 1
        else:
            minute_counts[hh_mm] = 1

srtd_list = sorted(minute_counts.items(), key=lambda i: i[0])

# newline='' keeps the csv module from emitting blank rows on Windows.
with open('output.csv', 'w', newline='') as parsed_file:
    csv_writer = csv.writer(parsed_file)
    for item in srtd_list:
        csv_writer.writerow(item)
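For comparison, the counting step collapses into a few lines with collections.Counter. This is only a sketch of the same logic, keeping the file names from above, with the regex narrowed to capture just the hh:mm part of the timestamp:

import csv
import re
from collections import Counter

# Capture only what we need: the hh:mm part of the timestamp.
pattern = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}:\d{2})')

with open('access.log') as loglines:
    matches = map(pattern.search, loglines)
    counts = Counter(m.group(1) for m in matches if m is not None)

with open('output.csv', 'w', newline='') as out:
    csv.writer(out).writerows(sorted(counts.items()))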
Follow-up questions:
- Is there a more efficient way than either version above?
- How can I improve the bookkeeping of the counts if the file is 10 GB, or if it keeps growing forever?
- What is an efficient way to do the same for a file that is rotated hourly? (A sketch of the kind of loop I have in mind follows this list.)
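To make the last two questions concrete, here is a rough sketch of one possible follow-the-file loop. It assumes a Unix-like system where rotation replaces the file (so its inode changes), and it glosses over partially written lines and the details of periodically flushing the counts:

import os
import time

def follow(path, poll_interval=1.0):
    """Yield lines appended to path, reopening it after a rotation."""
    f = open(path)
    inode = os.fstat(f.fileno()).st_ino
    while True:
        line = f.readline()
        if line:
            yield line
            continue
        try:
            if os.stat(path).st_ino != inode:   # the file was rotated
                f.close()
                f = open(path)
                inode = os.fstat(f.fileno()).st_ino
                continue
        except FileNotFoundError:
            pass                                # rotation in progress
        time.sleep(poll_interval)               # no new data yet

# Each yielded line would update the per-minute counts, with the CSV
# rewritten periodically instead of once at the end:
# for line in follow('access.log'):
#     ...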