I am working on a school project in which I am supposed to clip unnecessary data from a log file, and we were told to use Python. First, I need to split the log data into columns, and then proceed with everything else.
So this is the solution I came up with, though it's not quite working:
import pandas as pd
import pytz
from datetime import datetime
import re


def parse_str(x):
    if x is None:
        return '-'
    else:
        return x[1:-1]


def parse_datetime(x):
    try:
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    except ValueError:
        return datetime.now()


def parse_int(x):
    return int(x) if x is not None else 0


data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=['ip', 'request', 'status', 'size', 'referer', 'user_agent'],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})

print(data.head())
Basically I need each part of the log to be split into the mentioned columns.
The log file looks like this:
193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"
193.87.12.30 - - [19/Feb/2020:06:25:55 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"
193.87.12.30 - - [19/Feb/2020:06:25:56 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"
193.87.12.30 - - [19/Feb/2020:06:25:57 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"
193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"
23.100.232.233 - - [19/Feb/2020:06:25:49 +0100] "GET /media-a-marketing/dianie-na-univerzite/kalendar-udalosti/815-den-otvorenych-dveri-2018 HTTP/1.1" 200 26802 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"
193.87.12.30 - - [19/Feb/2020:06:25:46 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"
I tried this:
usecols=[0, 3, 4, 5, 6, 7, 8]
But with that I get a ParserError.
To sum up: I need to break the log data down into columns. My code does that only partly. It creates the columns but does not distribute the data between them; everything ends up in the first column. Naming the columns didn't help, and selecting them by number caused a ParserError.
I know I can just open the file in Excel and split it into columns, but I would really like to do it the "proper", hard way. Is that possible, though? Thanks in advance for any advice.
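To make the target concrete, here is a minimal sketch of the split I am after, done line by line with the standard re module instead of read_csv. It assumes every line follows the format shown in the sample above (IP, two placeholder fields, bracketed timestamp, quoted request, status, size, quoted referer, quoted user agent). LOG_PATTERN and parse_line are hypothetical names used only for illustration, not part of pandas or any other library.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample lines above:
# ip - - [time] "request" status size "referer" "user_agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # %z parses offsets such as +0100, so no manual offset arithmetic is needed.
    row['time'] = datetime.strptime(row['time'], '%d/%b/%Y:%H:%M:%S %z')
    row['status'] = int(row['status'])
    # A size of "-" is treated as 0 here (an assumption, mirroring parse_int above).
    row['size'] = 0 if row['size'] == '-' else int(row['size'])
    return row


sample = ('193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] '
          '"GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"')
row = parse_line(sample)
print(row['ip'], row['status'], row['size'], row['user_agent'])
```

A list of such dicts, one per matching line of the file, could then be passed to pd.DataFrame to get the columns. Lines that do not fit the assumed pattern come back as None and would need to be filtered out or inspected separately.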


