Skip to main content
2 of 5
bug fixes

faster tests with dependency analysis

I have a performance issue causing test->code->test cycle to be really slow. The slowness is hard to avoid, since I am doing some heavy image processing, and I am trying to accelerate things a bit by not running functions when it is not needed. For ex: compute some numbers from big images -> serialize results in text files -> resume computations from text files.

Currently, I commit the text files containing the results to share them with the team, and let other people run the tests quickly. But I have to make sure that whenever one dependency is updated (src images, preprocessing functions, etc), those value are recomputed. I looked for a tool which would allow me to describe dependencies and production rules to automate the process and avoid committing data in the codebase. I thought about makefile, but I read a lot of negative advices against it. I found stuff like Scons and other build automation tools (http://en.wikipedia.org/wiki/List_of_build_automation_software), but those are for building softwares, and they do not seem adapted for my task.

I decided to write a small lib to tackle this issue (my environnement is mainly python & some c). The goal was to have objects knowing the production rule for their own output, and aware of the other objects they depend upon. An object know if its output is up to date by comparing the current md5 checksum of the file against the last one (stored somewhere in a small temp file).

My questions are: Am I reinventing a wheel here? Is there a tool out there I should use instead of a custom lib? If not, is this a good pattern for taking care of this problem?

edit: some bug fixes

import os
import path
from md5 import md5 as stdMd5

hashDir = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    outPath = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(hashDir, md5(self.outPath))

    def getOutHash(self):
        with open(self.outPath) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updated(self):
        if not os.path.exists(self.outPath):
            return True
        if not os.path.exists(self.hashPath):
            return True
        with open(self.outPath) as f:
            fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash != fileHash

    def storeHash(self):
        if not os.path.exists(hashDir):
            os.makedirs(hashDir)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        args = dict([(key, inp.get()) for key, inp in self.inputs.items()])
        updateNeeded = any(inp.updated for inp in self.inputs.values())
        for inp in self.inputs.values():
            inp.storeHash()
        args['out'] = self.outPath
        if updateNeeded:
            self.func(**args)
        return self.outPath

class StableSrc(BaseRule):
    'source file that never change'
    inputs = {}
    def __init__(self, path):
        self.outPath = path
    @property
    def updated(self):
        return False

class Src(BaseRule):
    inputs = {}
    def func(self, out):
        return None
    def __init__(self, path):
        self.outPath = path

def rule(**kwargs):
    'decorator used to declare dependencies'
    class Rule(BaseRule):
        inputs = dict([(key, val) for key, val in kwargs.items() if key != 'out'])
        outPath = kwargs['out']
    return Rule

def copyTest():
    'test function'
    import shutil
    @rule(inp=Src('test.txt'), out='test2.txt')
    def newTest(inp, out):
        print 'copy'
        shutil.copy(inp, out)
    @rule(inp=newTest, out='test3.txt')
    def newNewTest(inp, out):
        print 'copy2'
        shutil.copy(inp, out)
    newNewTest.get()
    newNewTest.storeHash()

if __name__ == '__main__':
    # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()

note: I could not find relevant tags for this question. Is this the good place for this question?