13

I am using Python's logging module to generate log files during processing, and I am trying to read those log files into a list/dict, which will then be converted to JSON and loaded into a NoSQL database for processing.

The file gets generated with the following format.

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

NOTE: There is actually a \n line break before each new date you see, but I can't seem to represent it here.

Basically I am trying to read in this text file and produce a json object that looks like this:

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

The problem I am having:

I can add each line to a list or dict, but the ERROR message sometimes spans multiple lines, so I end up splitting it up incorrectly.

Tried:

I have tried code like the below to split the lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think that's a possible solution, but I can't seem to find the right regex to use. I have no clue how it works, so I tried a bunch of copy-paste without any success.

with open(filename,'r') as f:
    for key,group in it.groupby(f,lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)

Tried some crazy regex but no luck here either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)

Would appreciate any help...thanks

EDIT:

Posted a Solution below for anyone else struggling with the same thing.

1
  • @PralhadNarsinhSonar thanks for the link, but I'm not sure you are following what I am asking. I already create the log file using the logging library. What I need to do is read each log record into a list or dict so that I can generate the JSON as described. Commented Jun 4, 2015 at 6:37

7 Answers

12

Using @Joran Beasley's answer I came up with the following solution and it seems to work:

Main Points:

  • My log files ALWAYS follow the same structure: {Date} - {Logger} - {Type} - {Message}, so I used string slicing and splitting to break the items up how I needed them. For example, the {Date} is always 23 characters and I only want the first 19.
  • Using line.startswith("2015") is crazy, as dates will change eventually, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could get specific.
  • The file is read by the first function, "generateDicts()", which calls the "matchDate()" function to see IF the line being processed starts with the {Date} format I am looking for.
  • A NEW dict is created every time a valid {Date} is found, and every following line is added to it until the NEXT valid {Date} is encountered.

Function to split up the log files.

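def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        # matchDate() returns "NONE" when the line does not start with a date.
        if matchDate(line) != "NONE":
            if currentDict:
                yield currentDict
            # split("-", 5) gives: year, month, "day time", logger name, level, message.
            fields = line.split("-", 5)
            currentDict = {"date": line[:19], "type": fields[4].strip(), "text": fields[5].lstrip()}
        elif currentDict:
            # Continuation line (e.g. part of a traceback): append it to the current message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))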

Function to see if the line being processed starts with a {Date} that matches the format I am looking for

import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp, or "NONE" if the line has none.
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        return matched.group()
    return "NONE"
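
Since the end goal in the question is JSON, here is a minimal sketch of that last step. It is not the code from this answer: records_from_lines and SAMPLE are hypothetical names, the sample lines are shortened from the question's log, and the field split assumes every record line has the "date - logger - level - message" layout shown there.

```python
import json
import re

# Assumed timestamp prefix, e.g. "2015-05-22 16:46:46" (taken from the sample log).
DATE_RE = re.compile(r"\d{4}-\d\d-\d\d \d\d:\d\d:\d\d")

def records_from_lines(lines):
    """Yield one dict per log record; traceback lines extend the message."""
    current = None
    for line in lines:
        if DATE_RE.match(line):
            if current is not None:
                yield current
            # At most 3 splits, so a " - " inside the message survives.
            date, _logger, level, message = line.rstrip("\n").split(" - ", 3)
            current = {"Date": date, "Type": level, "Message": message}
        elif current is not None:
            # Continuation line: keep it on its own line inside the message.
            current["Message"] += "\n" + line.rstrip("\n")
    if current is not None:
        yield current

# Shortened sample lines (hypothetical input for illustration).
SAMPLE = [
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1\n",
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n",
    "Traceback (most recent call last):\n",
    "NameError: name 'numFilesDownloaded' is not defined\n",
]

records = list(records_from_lines(SAMPLE))
print(json.dumps(records, indent=4))
```

Passing an open file object instead of SAMPLE should work the same way, since both are just iterables of lines.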
1
  • 1
    It should probably be "type": line.split("-", 5)[4] instead of [3]?
    – Phi
    Commented Jan 27, 2017 at 13:57
11

Create a generator (I'm on a generator bend today):

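def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # Split on " - " (at most 3 times) so the hyphens inside the date
            # and any " - " inside the message are left alone.
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append to the current record.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))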

There may be a few minor typos... I didn't actually run this.
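
One possible "better check" than startswith("2015") is to ask datetime whether the start of the line parses as a timestamp. This is only a sketch: starts_with_timestamp is a hypothetical helper, and the format string and width of 23 characters are assumptions based on the sample log in the question.

```python
from datetime import datetime

def starts_with_timestamp(line, fmt="%Y-%m-%d %H:%M:%S,%f", width=23):
    """Return True if the first `width` characters parse as a timestamp."""
    try:
        datetime.strptime(line[:width], fmt)
    except ValueError:
        # Too short, wrong layout, or an impossible date such as month 13.
        return False
    return True

print(starts_with_timestamp("2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files"))
print(starts_with_timestamp("Traceback (most recent call last):"))
```

Unlike a pure pattern match, this also rejects lines that look like dates but are not valid calendar dates.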

3
  • {'date': '2015', 'text': ' 0 and FileCount 1\n', 'type': '22 16:49:39,329 '} the date is separated by hyphens
    – heinst
    Commented Jun 3, 2015 at 18:45
  • Oh, whoops... well, I'm sure you can get the date out (as you were before)... the question wasn't really how to parse the date part... Commented Jun 3, 2015 at 18:49
  • @JoranBeasley Thanks for this. It gets me closer. I need to replace line.startswith("2015") with something like line.startswith(validDateFormat) or line.startswith(dateFormat(yyyy-MM-dd HH:mm:ss)). Would you have any ideas on that? I tried regex but failed. Commented Jun 4, 2015 at 6:44
6

I recently had a similar task of parsing log records, but along with the exception tracebacks for further analysis. Instead of banging my head against home-brewed regular expressions, I used two wonderful libraries: parse for parsing records (this is actually a very cool library, practically an inverse of the stdlib's str.format) and boltons for parsing tracebacks. Here is sample code extracted from my implementation and adapted to the log in question:

import datetime
import logging
import os
from pathlib import Path
from boltons.tbutils import ParsedException
from parse import parse, with_pattern


LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"


# TODO better pattern
@with_pattern(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d")
def parse_logging_time(raw):
    return datetime.datetime.strptime(raw, LOGGING_DEFAULT_DATEFMT)


def from_log(file: os.PathLike, fmt: str):
    chunk = ""
    custom_parsers = {"asctime": parse_logging_time}

    with Path(file).open() as fp:
        for line in fp:
            parsed = parse(fmt, line, custom_parsers)
            if parsed is not None:
                yield parsed
            else:  # try parsing the stacktrace
                chunk += line
                try:
                    yield ParsedException.from_string(chunk)
                    chunk = ""
                except (IndexError, ValueError):
                    pass


if __name__ == "__main__":
    for parsed_record in from_log(
        file="so.log",
        fmt="{asctime:asctime} - {module} - {levelname} - {message}"
    ):
        print(parsed_record)

When executed, this yields

<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 46, 985000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 56, 645000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 47, 46, 488000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 48, 48, 180000), 'module': '__main__', 'levelname': 'ERROR', 'message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n'}>
ParsedException('NameError', "name 'numFilesDownloaded' is not defined", frames=[{'filepath': '<ipython-input-16-132cda1c011d>', 'lineno': '10', 'funcname': '<module>', 'source_line': 'if numFilesDownloaded == 0:'}])
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 17, 918000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 32, 160000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 39, 329000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 53, 30, 706000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>

Notes

If you are specifying the log format using the { style, chances are high that you can simply pass the logging format string to parse and it will just work. In this example, I had to improvise and use a custom parser for timestamps to match the question's requirements; if the timestamps were in a common format, e.g. ISO 8601, one could just use fmt="{asctime:ti} - {module} - {levelname} - {message}" and throw out parse_logging_time and custom_parsers from the example code. parse supports several common timestamp formats out of the box; check out the "Format Specification" section in its readme.
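
To see what the custom timestamp parser is doing, and how the logging default differs from ISO 8601, here is a stdlib-only sketch. to_iso8601 is a hypothetical helper for illustration; it is not part of parse or boltons.

```python
import datetime
import logging

# Same format string as LOGGING_DEFAULT_DATEFMT in the example code:
# logging's default time format plus ",%f" for the fractional part.
LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"

def to_iso8601(asctime):
    """Convert a default logging timestamp string to an ISO 8601 string."""
    parsed = datetime.datetime.strptime(asctime, LOGGING_DEFAULT_DATEFMT)
    return parsed.isoformat(timespec="milliseconds")

print(to_iso8601("2015-05-22 16:46:46,985"))
```

The ISO form is also convenient later on if the records are serialized, since datetime objects are not JSON-serializable by default.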

The parse.Result objects are dict-like, so parsed_record["message"] returns the parsed message, and so on.

Notice the ParsedException object printed - this is the exception parsed from the traceback.

5

You can get the fields you are looking for directly from the regex using groups. You can even name them:

>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
... 
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'

Then create a date class whose constructor's kwargs match the group names. Use a little ** magic to create an instance of the object directly from the regex groupdict and you are cooking with gas. In the constructor you can then figure out whether 2016 is a leap year and Feb 29 exists.
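
A minimal sketch of that idea follows. LogDate and parse_date are hypothetical names, and the regex is a slightly adapted assumption (it also accepts the comma that logging puts before the milliseconds); the calendar check is delegated to datetime rather than written by hand.

```python
import datetime
import re

# Group names follow the regex above; the optional [.,] fraction is an assumption
# so that logging's "46,985" style is captured as well as "56.789".
DATE_RE = re.compile(
    r"(?P<a_year>\d{4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) "
    r"(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}(?:[.,]\d+)?)"
)

class LogDate:
    """Hypothetical date class whose keyword arguments mirror the group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        whole, _, fraction = a_second.replace(",", ".").partition(".")
        # Pad or trim the fraction to microseconds: "985" -> 985000.
        micro = int(fraction.ljust(6, "0")[:6]) if fraction else 0
        # datetime validates the calendar, so Feb 29 in a non-leap year raises ValueError.
        self.value = datetime.datetime(
            int(a_year), int(a_month), int(a_day),
            int(an_hour), int(a_minute), int(whole), micro,
        )

def parse_date(text):
    """Build a LogDate straight from the groupdict, or return None if no match."""
    found = DATE_RE.match(text)
    return LogDate(**found.groupdict()) if found else None

print(parse_date("2016-02-29 12:34:56.789").value)
```

Because the keyword names and group names are identical, adding a field to the regex only requires adding the matching constructor parameter.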

-lrm

2
records = []
with open('bla.txt', 'r') as file:
  for line in file:
    # Note: a message that itself contains ' - ' is cut off at parts[3].
    parts = line.split(' - ')
    if len(parts) >= 4:
      records.append({'Date': parts[0], 'Type': parts[2], 'Message': parts[3]})
print(records)

Output:

[{
    'Date': '2015-05-22 16:46:46,985',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:46:56,645',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:47:46,488',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:48:48,180',
    'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
    'Type': 'ERROR'
}, {
    'Date': '2015-05-22 16:49:17,918',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:32,160',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:39,329',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:53:30,706',
    'Message': 'Starting to Wait for Files',
    'Type': 'INFO'
}]
1
  • 3
    ➡️ My fellas, never name a variable 'list' Commented May 6, 2022 at 23:48
0

The solution provided by @steven.levey is perfect. One addition I would make is to use the regex pattern below both to check that a line is a valid record and to extract the required values, so we don't have to split the line again after the regex has confirmed its format.

pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'
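
As a rough sketch of how the pattern might be applied line by line: parse_lines and SAMPLE are hypothetical names, the sample lines are shortened from the question's log, and the pattern only matches records whose logger name is __main__, as in that log.

```python
import re

# The pattern from above, compiled once. Groups: date, level, message.
PATTERN = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')

def parse_lines(lines):
    """Return one dict per record; non-matching lines extend the last message."""
    records = []
    for line in lines:
        matched = PATTERN.match(line)
        if matched:
            date, level, message = matched.groups()
            records.append({"Date": date, "Type": level, "Message": message})
        elif records:
            # Traceback or other continuation line.
            records[-1]["Message"] += line
    return records

# Shortened sample lines (hypothetical input for illustration).
SAMPLE = [
    "2015-05-22 16:47:46,488 - __main__ - INFO - Return Code - 0 and FileCount 1\n",
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n",
    "Traceback (most recent call last):\n",
    "NameError: name 'numFilesDownloaded' is not defined\n",
]

for record in parse_lines(SAMPLE):
    print(record)
```

Replacing __main__ in the pattern with a group such as ([^\s]+) would be one way to accept other logger names, at the cost of one more field to unpack.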
0

Easiest solution without using Regex and complex functions.

def main():
    finall = []

    # read the log file
    with open("\\data\\logFile1.txt") as f:
        for ln in f:
            if ln.startswith('2015'):
                # split at most 3 times so a ' - ' inside the message is kept
                temp = ln.split(' - ', 3)
                finall.append({'Date': temp[0], 'Type': temp[2], 'Message': temp[3]})
            elif finall:
                # continuation line (e.g. a traceback): append to the previous record
                finall[-1]['Message'] += ln

    for i in finall:
        print(i)

if __name__ == '__main__':
    main()
