I am working on a school project in which I am supposed to clip unnecessary data from a log file, and we were told to use Python. First, I am supposed to split the log data into columns and then proceed with everything else.

So this is the solution I came up with, though it's not quite working:

import pandas as pd
import pytz
from datetime import datetime
import re

def parse_str(x):
    if x is None:
        return '-'
    else:
        return x[1:-1]

def parse_datetime(x):
    try:
        dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
        dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
        return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
    except ValueError:
        return datetime.now()

def parse_int(x):
    return int(x) if x is not None else 0

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=['ip', 'request', 'status', 'size', 'referer', 'user_agent'],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
print(data.head())

This is what I get: Output

Basically I need each part of the log to be split into the mentioned columns.

Log file looks like this:

193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:55 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:56 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:57 +0100] "GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"

193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

23.100.232.233 - - [19/Feb/2020:06:25:49 +0100] "GET /media-a-marketing/dianie-na-univerzite/kalendar-udalosti/815-den-otvorenych-dveri-2018 HTTP/1.1" 200 26802 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"

193.87.12.30 - - [19/Feb/2020:06:25:46 +0100] "GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"

I tried this:

usecols=[0, 3, 4, 5, 6, 7, 8]

But then I get an error:

ParserError

To sum up: I need to break the log data down into columns. My code somewhat does that, but incorrectly. It creates the columns but does not divide the data between them; it puts everything into the first column. Naming the columns didn't help, and referring to them by number caused a ParserError.

I know I can just open the file in Excel and split it into columns, but I would really like to do it the "proper" hard way. Is it possible, though? Thanks in advance for any advice.

  • You need to provide a sample of a log file (as text - not an image) and explain what output you expect (from that sample) versus what you're actually getting Commented Mar 9 at 11:04
  • The log file sample provided Commented Mar 10 at 12:10
  • Please avoid images whenever you can. Useful website: idownvotedbecau.se/imageofcode. Please format your log file, possibly also in a code block. Commented Mar 16 at 10:01

1 Answer

data = pd.read_csv(
    'Log_jeden_den.log',
    sep=r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': parse_int,
                'size': parse_int,
                'referer': parse_str,
                'user_agent': parse_str})
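
One way to check a separator before handing it to read_csv is to apply it to a single line with re.split and count the fields. The sketch below does that for the first sample line from the question, comparing the separator from the question with the one used above. The names QUESTION_SEP, ANSWER_SEP and count_fields are made up for illustration, and the result only describes this one sample line; other lines in the real log may split differently.

```python
import re

# First sample line from the question.
LINE = ('193.87.12.30 - - [19/Feb/2020:06:25:50 +0100] '
        '"GET /navbar/navbar-ukf.html HTTP/1.0" 200 7584 "-" "-"')

# Separator as posted in the question (note the `/` in `[^"]/"`).
QUESTION_SEP = r'\s(?=(?:[^"]*"[^"]/")*[^"]*$)(?![^\\[]*\\])'

# Separator used in this answer.
ANSWER_SEP = (r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)'
              r'(?=(?:[^\[]*\[[^\]]*\])*[^\]]*$)')

def count_fields(sep, line):
    """Return how many fields `line` splits into with separator `sep`."""
    return len(re.split(sep, line))

print(count_fields(QUESTION_SEP, LINE))  # 1
print(count_fields(ANSWER_SEP, LINE))    # 9
```

On this line the question's pattern never matches, so everything stays in one field, which is consistent with all the data landing in the first column. The pattern above yields nine fields, and usecols=[0, 3, 4, 5, 6, 7, 8] then picks seven of them.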

3 Comments

As I mentioned, I did that and got a ParserError.
Use the same regex I used, @o_d3n1s_o: colab.research.google.com/drive/…
So I uploaded my log file into your colab file but I got "ValueError: invalid literal for int() with base 10: '"GET /studium/11-studium/176-zabezpecenie-zdravotnej-starostlivosti HTTP/1.1"'"
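
If the separator keeps misbehaving on some lines, as the ValueError above suggests, another option is to skip the separator altogether and match each line against a single regular expression. This is a minimal sketch using only the standard library, not what the linked notebook does. It assumes every line follows the layout of the sample in the question and that quoted fields contain no escaped quotes; LOG_LINE and parse_line are hypothetical names. Lines that do not match return None, so they can be inspected instead of raising.

```python
import re
from datetime import datetime

# One pattern for the whole line; the two "-" fields after the IP are skipped.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    # %z understands offsets such as +0100, so pytz is not needed here.
    # %b expects English month abbreviations (the default C locale).
    row['time'] = datetime.strptime(row['time'], '%d/%b/%Y:%H:%M:%S %z')
    row['status'] = int(row['status'])
    row['size'] = 0 if row['size'] == '-' else int(row['size'])
    return row

sample = ('193.87.12.30 - - [19/Feb/2020:06:25:49 +0100] '
          '"GET / HTTP/1.1" 200 20925 "-" "libwww-perl/6.08"')
row = parse_line(sample)
print(row['ip'], row['status'], row['size'], row['user_agent'])
# 193.87.12.30 200 20925 libwww-perl/6.08
```

A list of these dicts can be passed to pd.DataFrame to get one column per field, and collecting the lines for which parse_line returns None shows exactly which entries break the expected layout.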
