How to parse this custom log file in Python

Question

I am using Python's logging module to generate log files during processing, and I am trying to read those log files into a list or dict that will then be converted to JSON and loaded into a NoSQL database.

The file gets generated with the following format.

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

NOTE: There is actually a \n line break before each new date you see, but I can't seem to represent that here.

Basically, I am trying to read in this text file and produce a JSON object that looks like this:

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

The problem I am having:

I can add each line to a list or dict, but the ERROR message sometimes spans multiple lines, so I end up splitting it incorrectly.

Tried:

I have tried code like the below to split the lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't find the right regex to use. I have no clue how they work, so I tried a lot of copy and paste without any success.

with open(filename,'r') as f:
    for key,group in it.groupby(f,lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)

Tried some crazy regex but no luck here either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)

Would appreciate any help...thanks

EDIT:

Posted a Solution below for anyone else struggling with the same thing.

@PralhadNarsinhSonar thanks for the link, but I'm not sure you follow what I am asking. I already create the log file using the logging library. What I need to do is read each log record into a list or dict so that I can generate the JSON as described. — steven.levey, Commented Jun 4, 2015 at 6:37

steven.levey · Accepted Answer · 2015-06-04 14:50:13Z

Using @Joran Beasley's answer I came up with the following solution and it seems to work:

Main Points:

My log files ALWAYS follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed. For example, the {Date} is always 23 characters and I only want the first 19.
Using line.startswith("2015") is fragile because the dates will change eventually, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
The file is passed to the first function, generateDicts(), which calls matchDate() to check whether the line being processed starts with a {Date} in the format I am looking for.
A new dict is created every time a valid {Date} is found, and everything that follows is added to it until the next valid {Date} is encountered.

Function to split up the log files.

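# Requires matchDate() (shown below) to be defined before this code runs.
def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        # matchDate() returns the leading timestamp, or "NONE" if there is none
        if line.startswith(matchDate(line)):
            if currentDict:
                yield currentDict
            # Splitting on "-" gives: year, month, day+time, logger name, level, message
            fields = line.split("-", 5)
            currentDict = {"date": line.split("__")[0][:19], "type": fields[4].strip(), "text": fields[-1]}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the current record
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))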

Function to see if the line being processed starts with a {Date} that matches the format I am looking for

import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp, or "NONE" if the line does not start with one
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        return matched.group()
    return "NONE"
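
The question ultimately asks for JSON, so here is a minimal sketch that carries the same idea through to that step. It is not the code above: it swaps the regex date check for datetime.strptime on the first 23 characters, and the names starts_with_timestamp, iter_records and the in-memory sample are hypothetical. It assumes every record line has the form {Date} - {Name} - {Type} - {Message}.

```python
import io
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # assumed: logging's default asctime layout
TIMESTAMP_LENGTH = 23                      # e.g. "2015-05-22 16:46:46,985"


def starts_with_timestamp(line):
    """Return True if the first 23 characters parse as a log timestamp."""
    try:
        datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
    except ValueError:
        return False
    return True


def iter_records(lines):
    """Yield one dict per log record, folding continuation lines into Message."""
    current = None
    for line in lines:
        if starts_with_timestamp(line):
            if current is not None:
                yield current
            # Split on the first three " - " separators only, so a message
            # that itself contains " - " stays in one piece.
            date, _logger, level, message = line.rstrip("\n").split(" - ", 3)
            current = {"Date": date, "Type": level, "Message": message}
        elif current is not None:
            # Traceback or other continuation line: append to the open record.
            current["Message"] += "\n" + line.rstrip("\n")
    if current is not None:
        yield current


# Hypothetical sample standing in for the real log file.
sample = io.StringIO(
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
)
records = list(iter_records(sample))
print(json.dumps(records, indent=2))
```

Parsing the timestamp instead of pattern-matching it also rejects impossible dates, at the cost of tying the check to one exact timestamp layout.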

It probably would be type":line.split("-",5)[4] instead of [3]? — Phi, Commented Jan 27, 2017 at 13:57

Joran Beasley · Accepted Answer · 2015-06-03 18:34:38Z


Create a generator (I'm on a generator bend today):

def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"): #you might want a better check here
            if currentDict:
                yield currentDict
            currentDict = {"date":line.split("-")[0],"type":line.split("-")[2],"text":line.split("-")[-1]}
        else:
            currentDict["text"] += line
    yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))

There may be a few minor typos; I didn't actually run this.


{'date': '2015', 'text': ' 0 and FileCount 1\n', 'type': '22 16:49:39,329 '} The date itself is separated by hyphens.
– heinst
Commented Jun 3, 2015 at 18:45
Oh, whoops ... well, I'm sure you can get the date out (as you were before) ... the question wasn't really how to parse the date part ...
– Joran Beasley
Commented Jun 3, 2015 at 18:49
@JoranBeasley Thanks for this. It gets me closer. I need to replace line.startswith("2015") with something like line.startswith(validDateFormat) or line.startswith(dateFormat(yyyy-MM-dd HH:mm:ss)). Would you have any ideas on that? I tried regex but failed.
– steven.levey
Commented Jun 4, 2015 at 6:44


hoefling · Accepted Answer · 2021-05-20 07:29:31Z

I recently had a similar task: parsing log records, along with their exception tracebacks, for further analysis. Instead of banging my head against home-brewed regular expressions, I used two wonderful libraries: parse for parsing records (a very cool library, practically an inverse of the stdlib's str.format) and boltons for parsing tracebacks. Here is sample code extracted from my implementation and adapted to the log in question:

import datetime
import logging
import os
from pathlib import Path
from boltons.tbutils import ParsedException
from parse import parse, with_pattern


LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"


# TODO better pattern
@with_pattern(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d")
def parse_logging_time(raw):
    return datetime.datetime.strptime(raw, LOGGING_DEFAULT_DATEFMT)


def from_log(file: os.PathLike, fmt: str):
    chunk = ""
    custom_parsers = {"asctime": parse_logging_time}

    with Path(file).open() as fp:
        for line in fp:
            parsed = parse(fmt, line, custom_parsers)
            if parsed is not None:
                yield parsed
            else:  # try parsing the stacktrace
                chunk += line
                try:
                    yield ParsedException.from_string(chunk)
                    chunk = ""
                except (IndexError, ValueError):
                    pass


if __name__ == "__main__":
    for parsed_record in from_log(
        file="so.log",
        fmt="{asctime:asctime} - {module} - {levelname} - {message}"
    ):
        print(parsed_record)

When executed, this yields

<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 46, 985000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 56, 645000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 47, 46, 488000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 48, 48, 180000), 'module': '__main__', 'levelname': 'ERROR', 'message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n'}>
ParsedException('NameError', "name 'numFilesDownloaded' is not defined", frames=[{'filepath': '<ipython-input-16-132cda1c011d>', 'lineno': '10', 'funcname': '<module>', 'source_line': 'if numFilesDownloaded == 0:'}])
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 17, 918000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 32, 160000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 39, 329000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 53, 30, 706000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>

Notes

If you specify the log format using the { style, chances are high that you can simply pass the logging format string to parse and it will just work. In this example, I had to improvise and use a custom parser for timestamps to match the question's requirements; if the timestamps were in a common format, e.g. ISO 8601, one could just use fmt="{asctime:ti} - {module} - {levelname} - {message}" and drop parse_logging_time and custom_parsers from the example code. parse supports several common timestamp formats out of the box; check out the "Format Specification" section in its readme.

parse.Result objects are dict-like, so parsed_record["message"] returns the parsed message, and so on.

Notice the ParsedException object printed - this is the exception parsed from the traceback.
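
To illustrate the first note, here is a minimal stdlib-only sketch of configuring logging with a { style format string, so that one string describes the records on the writing side. The logger name brace_style_demo and the use of {name} for the second field are assumptions; the question does not show its formatter.

```python
import io
import logging
import re

# Assumed format; the question's actual formatter is not shown.
LOG_FORMAT = "{asctime} - {name} - {levelname} - {message}"

stream = io.StringIO()
handler = logging.StreamHandler(stream)
# style="{" tells logging to interpret the format string with str.format fields.
handler.setFormatter(logging.Formatter(LOG_FORMAT, style="{"))

logger = logging.getLogger("brace_style_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Starting to Wait for Files")
line = stream.getvalue()
print(line, end="")
```

The emitted line has the same shape as the records in the question, with the default "YYYY-MM-DD HH:MM:SS,mmm" timestamp.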

Lincoln Randall McFarland · Accepted Answer · 2016-11-12 04:05:10Z

You can get the fields you are looking for directly from the regex using groups. You can even name them:

>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
... 
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'

Then create a date class whose constructor's keyword arguments match the group names. Use a little ** magic to create an instance directly from the regex groupdict and you are cooking with gas. In the constructor you can then figure out whether 2016 is a leap year and Feb 29 exists.
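
A minimal sketch of that idea, assuming a hypothetical LogDate class; the class name, its attributes, and the use of calendar.monthrange for the leap-year check are illustrative choices, not part of the answer above.

```python
import calendar
import re

DATE_RE = re.compile(
    r"(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) "
    r"(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)"
)


class LogDate:
    """Hypothetical date holder whose keyword arguments mirror the group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        self.year = int(a_year)
        self.month = int(a_month)
        self.day = int(a_day)
        self.hour = int(an_hour)
        self.minute = int(a_minute)
        self.second = float(a_second)
        # monthrange() returns (weekday of the 1st, number of days in the month),
        # so Feb 29 is only accepted in leap years.
        days_in_month = calendar.monthrange(self.year, self.month)[1]
        if not 1 <= self.day <= days_in_month:
            raise ValueError(
                f"day {self.day} does not exist in {self.year}-{self.month:02d}"
            )


found = DATE_RE.match("2016-02-29 12:34:56.789")
if found is not None:
    # ** unpacks the group names straight into the constructor's keyword arguments.
    stamp = LogDate(**found.groupdict())
    print(stamp.year, stamp.month, stamp.day, stamp.second)
```

Because the group names and the constructor parameters are identical, adding a field means changing the regex and the signature together, which keeps the two from drifting apart.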

-lrm

Iván · Accepted Answer · 2019-12-25 22:04:52Z

list = []
with open('bla.txt', 'r') as file:
  for line in file.readlines():
    if len(line.split(' - ')) >= 4:
      d = dict()
      d['Date'] = line.split(' - ')[0]
      d['Type'] = line.split(' - ')[2]
      d['Message'] = line.split(' - ')[3]
      list.append(d)
print(list)

Output:

[{
    'Date': '2015-05-22 16:46:46,985',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:46:56,645',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:47:46,488',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:48:48,180',
    'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
    'Type': 'ERROR'
}, {
    'Date': '2015-05-22 16:49:17,918',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:32,160',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:39,329',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:53:30,706',
    'Message': 'Starting to Wait for Files',
    'Type': 'INFO'
}]

➡️ My fellas, never name a variable 'list'
– Anton Frolov
Commented May 6, 2022 at 23:48

Deepak · Accepted Answer · 2016-03-20 17:43:52Z


The solution provided by @steven.levey is perfect. One addition I would make is to use this regex pattern both to check that a line is a proper record and to extract the required values, so that we don't have to split the line again after checking its format with a regex.

pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'
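
A minimal sketch of how this pattern might be applied, assuming a hypothetical helper named parse_line. Note that the pattern hard-codes the logger name __main__, so it only fits logs like the one in the question.

```python
import re

PATTERN = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')


def parse_line(line):
    """Return a record dict for a log line, or None for a continuation line."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # The three capture groups are the timestamp, the level, and the message.
    date, level, message = match.groups()
    return {"Date": date, "Type": level, "Message": message.rstrip("\n")}


record = parse_line(
    "2015-05-22 16:47:46,488 - __main__ - INFO - "
    "Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n"
)
continuation = parse_line("Traceback (most recent call last):\n")
print(record)
print(continuation)
```

A None result marks a line that belongs to the previous record, such as a traceback line, so the caller can append it to that record's message.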


Brian Tompsett - 汤莱恩 · Accepted Answer · 2023-06-17 14:00:06Z

The easiest solution, without regex or complex functions:

def main():
    finall = []

    # Read the log file line by line
    with open("\\data\\logFile1.txt") as f:
        for ln in f:
            if ln.startswith('2015'):
                # New record: split on the first three ' - ' only, so a
                # message that itself contains ' - ' is kept whole
                temp = ln.split(' - ', 3)
                curdic = {'Date': temp[0], 'Type': temp[2], 'Message': temp[3]}
                finall.append(curdic)
            elif finall:
                # Continuation line (e.g. a traceback): append to the previous message
                finall[-1]['Message'] += ln

    for i in finall:
        print(i)

if __name__ == '__main__':
    main()
