I have around 300 log files in a directory, and each file contains around 3,300,000 lines. I need to read each file line by line, extract the hostname from each line, and count how many times each hostname appears. I wrote basic code for this task, but it takes more than an hour to run and uses a lot of memory. How can I improve this code to make it run faster?
import os
import gzip
import pandas as pd

directory = os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25")  # folder with 300 log files
listi = os.listdir(directory)  # list the log files in the directory
df_final = pd.DataFrame(columns=['hostname'])  # dataframe that accumulates hostnames from every file

for file in listi:  # take each log file in the list
    tt = os.path.join(directory, file)  # join the log file name with the directory path
    with gzip.open(tt, 'rt') as f:  # open the gzipped log file as text
        rows = []  # clear the list for every file
        for line in f:  # read each line in the file
            s = len(line.split('|'))
            a = line.split('|')[s - 3]
            b = a.split('/')[0]  # slice just the hostname out of the line
            if len(b.split('.')) == None:
                ''
            else:
                b = b.split('.')[0]
            rows.append(b)  # append it to the list
    df_temp = pd.DataFrame(columns=['hostname'], data=rows)  # build a dataframe from the list after each file is read
    df_final = df_final.append(df_temp, ignore_index=True)  # append to the accumulating dataframe to avoid overwriting
    del df_temp  # delete the temp dataframe to free memory

df_final = df_final.groupby(["hostname"]).size().reset_index(name="Topic_Count")  # do the count
Sample log lines (the hostnames to be extracted are wokpa22 and wokdd333):
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokdd333.sc.sc.com/16604/#001b0001|7632663/2344|342344|23244
Desired output
A table with one row per distinct hostname and a Topic_Count column giving how many log lines that hostname appears on.
The variable s is not necessary; you can do a = line.split('|')[-3] directly and it should return the right value.
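A minimal sketch of how that suggestion could be applied, combined with keeping the counts in a collections.Counter instead of building a DataFrame per file. The directory path, the field positions, and the Topic_Count column name come from the question; everything else is an illustrative assumption rather than the original code:

import os
import gzip
from collections import Counter

import pandas as pd

directory = "/home/scratch/mdsadmin/publisher_report/2018-07-25"  # same folder as in the question
counts = Counter()  # hostname -> number of lines it appears on

for name in os.listdir(directory):
    path = os.path.join(directory, name)
    with gzip.open(path, 'rt') as f:
        for line in f:
            # third-from-last '|'-separated field, then the text before the first '/' and '.'
            hostname = line.split('|')[-3].split('/')[0].split('.')[0]
            counts[hostname] += 1  # count in place instead of storing every row

df_final = pd.DataFrame(list(counts.items()), columns=['hostname', 'Topic_Count'])

Counting while the lines are read avoids holding roughly 300 x 3,300,000 hostname strings in memory at once; only one small dict of distinct hostnames is kept until the final DataFrame is built.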