Problem statement: given the access log below, count the requests per timestamp minute (hh:mm), sort the counts by minute, and write them to a CSV file.
access.log:
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:04:19 +0200] "GET /api/endpoint HTTP/1.1" 401 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:05:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
172.16.1.3 - - [25/Sep/2002:14:05:19 +0200] "GET /api/endpoint HTTP/1.1" 200 80500 "domain" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
Expected output:
# cat output.csv:
14:04, 4
14:05, 2
My solution:
import re
import csv

# Compile the pattern once, outside the loop; a raw string avoids
# invalid-escape warnings. Groups: ip, timestamp, request, status,
# size, referrer, user agent.
log_pattern = re.compile(
    r'([\d.]+) - - \[(.*?)\] "(.*?)" (.*?) (.*?) "(.*?)" "(.*?)"'
)

minute_counts = {}
with open('access.log', 'r') as loglines:
    for line in loglines:
        re_match = log_pattern.match(line)
        if re_match is None:               # skip blank or malformed lines
            continue
        timestamp = re_match.group(2)      # e.g. 25/Sep/2002:14:04:19 +0200
        hh_mm = timestamp[12:17]           # slice out the hh:mm part
        if hh_mm in minute_counts:
            minute_counts[hh_mm] += 1
        else:
            minute_counts[hh_mm] = 1

srtd_list = sorted(minute_counts.items(), key=lambda i: i[0])

# newline='' keeps the csv module from emitting blank rows on Windows.
with open('output.csv', 'w', newline='') as parsed_file:
    csv_writer = csv.writer(parsed_file)
    for item in srtd_list:
        csv_writer.writerow(item)
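For comparison, the counting step collapses into a few lines with collections.Counter. This is only a sketch of the same logic, keeping the file names from above, with the regex narrowed to capture just the hh:mm part of the timestamp:

import csv
import re
from collections import Counter

# Capture only what we need: the hh:mm part of the timestamp.
pattern = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}:\d{2})')

with open('access.log') as loglines:
    matches = map(pattern.search, loglines)
    counts = Counter(m.group(1) for m in matches if m is not None)

with open('output.csv', 'w', newline='') as out:
    csv.writer(out).writerows(sorted(counts.items()))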
Follow-up questions:
- Is there a more efficient way than either version above?
- How can I improve the bookkeeping of the counts if the file is 10 GB, or if it keeps growing forever?
- What is an efficient way to do the same for a file that is rotated hourly? (A sketch of the kind of loop I have in mind follows this list.)
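To make the last two questions concrete, here is a rough sketch of one possible follow-the-file loop. It assumes a Unix-like system where rotation replaces the file (so its inode changes), and it glosses over partially written lines and the details of periodically flushing the counts:

import os
import time

def follow(path, poll_interval=1.0):
    """Yield lines appended to path, reopening it after a rotation."""
    f = open(path)
    inode = os.fstat(f.fileno()).st_ino
    while True:
        line = f.readline()
        if line:
            yield line
            continue
        try:
            if os.stat(path).st_ino != inode:   # the file was rotated
                f.close()
                f = open(path)
                inode = os.fstat(f.fileno()).st_ino
                continue
        except FileNotFoundError:
            pass                                # rotation in progress
        time.sleep(poll_interval)               # no new data yet

# Each yielded line would update the per-minute counts, with the CSV
# rewritten periodically instead of once at the end:
# for line in follow('access.log'):
#     ...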