I frequently run a script similar to the one below to analyze an arbitrary number of files in parallel on a computer with 8 cores.
I use Popen to control each subprocess, but sometimes run into problems when a process writes a lot to stdout or stderr, as the pipe buffer fills up and blocks it. I work around this by reading from the streams frequently. I also print the streams from one of the processes to help me follow the progress of the analysis.
I'm curious about alternative ways to parallelize this in Python, and about general comments on the implementation, which, as always, has room for improvement. For comparison, I've put a rough sketch of one alternative I've been considering after the script. Thanks!
import os, sys
import time
import subprocess
def parallelize(analysis_program_path, filenames, N_CORES):
    '''
    Parallelize an analysis of a list of files over N_CORES cores.
    '''
    running = []
    sys.stderr.write('Starting analyses\n')
    while filenames or running:
        while filenames and len(running) < N_CORES:
            # Submit a new analysis
            filename = filenames.pop(0)
            cmd = '%s %s' % (analysis_program_path, filename)
            p = subprocess.Popen(cmd, shell=True,
                                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            sys.stderr.write('Analyzing %s\n' % filename)
            running.append((cmd, p))
        i = 0
        while i < len(running):
            (cmd, p) = running[i]
            returncode = p.poll()
            st_out = p.stdout.read()
            st_err = p.stderr.read()  # Read the buffer! Otherwise it fills up and blocks the script
            if i == 0:  # Just print one of the processes
                sys.stderr.write(st_err)
            if returncode is not None:
                # The process has finished: read whatever is left and drop it from the list
                st_out = p.stdout.read()
                st_err = p.stderr.read()
                sys.stderr.write(st_err)
                running.remove((cmd, p))
            else:
                i += 1
        time.sleep(1)
    sys.stderr.write('Completely done!\n')
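Here is the rough sketch of the alternative I mentioned, using concurrent.futures and subprocess.run instead of managing Popen objects by hand. The analyze wrapper, the default pool size of 8, and the reliance on Python 3.7+ for capture_output are my own assumptions, not part of the script above; because subprocess.run collects each process's output in memory, the pipe-buffer problem shouldn't come up.

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze(analysis_program_path, filename):
    # Hypothetical wrapper: run one analysis and capture its output,
    # so the pipe buffers can never fill up and block the child process.
    result = subprocess.run([analysis_program_path, filename],
                            capture_output=True, text=True)
    return filename, result.returncode, result.stderr

def parallelize(analysis_program_path, filenames, n_cores=8):
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        futures = [pool.submit(analyze, analysis_program_path, name)
                   for name in filenames]
        for future in as_completed(futures):
            filename, returncode, stderr = future.result()
            sys.stderr.write('Finished %s (exit code %s)\n' % (filename, returncode))

The threads here only wait on the external programs, so the GIL shouldn't matter; the heavy work still runs in the separate analysis processes.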