I frequently run a script similar to the one below to analyze an arbitrary number of files in parallel on a computer with 8 cores.
I use Popen to control each subprocess, but sometimes run into problems when a process writes a lot to stdout or stderr, as the pipe buffer fills up and blocks it. I work around this by reading from the streams frequently. I also print the streams from one of the processes to help me follow the progress of the analysis.
I'm curious about alternative ways to parallelize this in Python, and about general comments on the implementation, which, as always, has room for improvement. For comparison, I've put a rough sketch of one alternative I've been considering after the script. Thanks!
import os, sys
import time
import subprocess
def parallelize(analysis_program_path, filenames, N_CORES):
    '''
    Parallelize an analysis of a list of files over N_CORES cores.
    '''
    running = []
    sys.stderr.write('Starting analyses\n')
    while filenames or running:
        while filenames and len(running) < N_CORES:
            # Submit a new analysis
            filename = filenames.pop(0)
            cmd = '%s %s' % (analysis_program_path, filename)
            p = subprocess.Popen(cmd, shell=True,
                                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            sys.stderr.write('Analyzing %s\n' % filename)
            running.append((cmd, p))
        i = 0
        while i < len(running):
            (cmd, p) = running[i]
            returncode = p.poll()
            st_out = p.stdout.read()
            st_err = p.stderr.read()  # Read the buffer! Otherwise it fills up and blocks the script
            if i == 0:  # Just print one of the processes
                sys.stderr.write(st_err)
            if returncode is not None:
                # The process has finished: read whatever is left and drop it from the list
                st_out = p.stdout.read()
                st_err = p.stderr.read()
                sys.stderr.write(st_err)
                running.remove((cmd, p))
            else:
                i += 1
        time.sleep(1)
    sys.stderr.write('Completely done!\n')
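Here is the rough sketch of the alternative I mentioned, using concurrent.futures and subprocess.run instead of managing Popen objects by hand. The analyze wrapper, the default pool size of 8, and the reliance on Python 3.7+ for capture_output are my own assumptions, not part of the script above; because subprocess.run collects each process's output in memory, the pipe-buffer problem shouldn't come up.

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze(analysis_program_path, filename):
    # Hypothetical wrapper: run one analysis and capture its output,
    # so the pipe buffers can never fill up and block the child process.
    result = subprocess.run([analysis_program_path, filename],
                            capture_output=True, text=True)
    return filename, result.returncode, result.stderr

def parallelize(analysis_program_path, filenames, n_cores=8):
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        futures = [pool.submit(analyze, analysis_program_path, name)
                   for name in filenames]
        for future in as_completed(futures):
            filename, returncode, stderr = future.result()
            sys.stderr.write('Finished %s (exit code %s)\n' % (filename, returncode))

The threads here only wait on the external programs, so the GIL shouldn't matter; the heavy work still runs in the separate analysis processes.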