I have around 300 log files in a directory, and each file contains around 3,300,000 lines. I need to read each file line by line, extract the hostname from each line, and count how many times each hostname appears. I wrote basic code for this task, but it takes more than an hour to run and uses a lot of memory. How can I improve this code to make it run faster?
import os
import gzip
import pandas as pd

directory = os.fsdecode("/home/scratch/mdsadmin/publisher_report/2018-07-25")  # folder with 300 log files
listi = os.listdir(directory)  # list the log files in the directory
df_final = pd.DataFrame(columns=['hostname'])  # dataframe that accumulates hostnames from every file

for file in listi:  # take each log file in the list
    tt = os.path.join(directory, file)  # join the log file name with the directory path
    with gzip.open(tt, 'rt') as f:  # open the gzipped log file as text
        rows = []  # clear the list for every file
        for line in f:  # read each line in the file
            s = len(line.split('|'))
            a = line.split('|')[s - 3]
            b = a.split('/')[0]  # slice just the hostname out of the line
            if len(b.split('.')) == None:
                ''
            else:
                b = b.split('.')[0]
            rows.append(b)  # append it to the list
    df_temp = pd.DataFrame(columns=['hostname'], data=rows)  # build a dataframe from the list after each file is read
    df_final = df_final.append(df_temp, ignore_index=True)  # append to the accumulating dataframe to avoid overwriting
    del df_temp  # delete the temp dataframe to free memory

df_final = df_final.groupby(["hostname"]).size().reset_index(name="Topic_Count")  # do the count
Sample log lines (the hostnames to be extracted are wokpa22 and wokdd333):
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokpa22.sx.sx.com/16604/#001b0001|244/5664|2344|455
tx:2018-05-05T20:44:37:626 BST|rx:2018-05-05T20:44:37:626 BST|dt:0|wokdd333.sc.sc.com/16604/#001b0001|7632663/2344|342344|23244
Desired output
A table with one row per distinct hostname and a Topic_Count column giving how many log lines that hostname appears on.
The variable s is not necessary; you can do a = line.split('|')[-3] directly and it should return the right value.
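A minimal sketch of how that suggestion could be applied, combined with keeping the counts in a collections.Counter instead of building a DataFrame per file. The directory path, the field positions, and the Topic_Count column name come from the question; everything else is an illustrative assumption rather than the original code:

import os
import gzip
from collections import Counter

import pandas as pd

directory = "/home/scratch/mdsadmin/publisher_report/2018-07-25"  # same folder as in the question
counts = Counter()  # hostname -> number of lines it appears on

for name in os.listdir(directory):
    path = os.path.join(directory, name)
    with gzip.open(path, 'rt') as f:
        for line in f:
            # third-from-last '|'-separated field, then the text before the first '/' and '.'
            hostname = line.split('|')[-3].split('/')[0].split('.')[0]
            counts[hostname] += 1  # count in place instead of storing every row

df_final = pd.DataFrame(list(counts.items()), columns=['hostname', 'Topic_Count'])

Counting while the lines are read avoids holding roughly 300 x 3,300,000 hostname strings in memory at once; only one small dict of distinct hostnames is kept until the final DataFrame is built.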