Skip to main content
deleted 117 characters in body; edited title; edited tags
Source Link
Jamal
  • 35.2k
  • 13
  • 134
  • 238

faster Faster tests with dependency analysis

Currently, I commit the text files containing the results to share them with the team, and let other people run the tests quickly. But I have to make sure that whenever one dependency is updated (src images, preprocessing functions, etc), those value are recomputed. I looked for a tool which would allow me to describe dependencies and production rules to automate the process and avoid committing data in the codebase. I thought about makefile, but I read a lot of negative advicesadvice against it. I found stuff like Scons and other build automation tools (http://en.wikipedia.org/wiki/List_of_build_automation_softwareautomation tools), but those are for building softwaressoftware, and they do not seem adapted for my task.

I decided to write a small lib to tackle this issue (my environnementenvironment is mainly python &Python and some cC). The goal was to have objects knowing the production rule for their own output, and aware of the other objects they depend upon. An object knowknows if its output is up to date-to-date by comparing the current md5MD5 checksum of the file against the last one (stored somewhere in a small temp file).

My questions are: Am I reinventing a wheel here? Is there a tool out there I should use instead of a custom lib? If not, is this a good pattern for taking care of this problem?

note: I could not find relevant tags for this question. Is this the good place for this question?

faster tests with dependency analysis

Currently, I commit the text files containing the results to share them with the team, and let other people run the tests quickly. But I have to make sure that whenever one dependency is updated (src images, preprocessing functions, etc), those value are recomputed. I looked for a tool which would allow me to describe dependencies and production rules to automate the process and avoid committing data in the codebase. I thought about makefile, but I read a lot of negative advices against it. I found stuff like Scons and other build automation tools (http://en.wikipedia.org/wiki/List_of_build_automation_software), but those are for building softwares, and they do not seem adapted for my task.

I decided to write a small lib to tackle this issue (my environnement is mainly python & some c). The goal was to have objects knowing the production rule for their own output, and aware of the other objects they depend upon. An object know if its output is up to date by comparing the current md5 checksum of the file against the last one (stored somewhere in a small temp file).

My questions are: Am I reinventing a wheel here? Is there a tool out there I should use instead of a custom lib? If not, is this a good pattern for taking care of this problem?

note: I could not find relevant tags for this question. Is this the good place for this question?

Faster tests with dependency analysis

Currently, I commit the text files containing the results to share them with the team, and let other people run the tests quickly. But I have to make sure that whenever one dependency is updated (src images, preprocessing functions, etc), those value are recomputed. I looked for a tool which would allow me to describe dependencies and production rules to automate the process and avoid committing data in the codebase. I thought about makefile, but I read a lot of negative advice against it. I found stuff like Scons and other build automation tools, but those are for building software, and they do not seem adapted for my task.

I decided to write a small lib to tackle this issue (my environment is mainly Python and some C). The goal was to have objects knowing the production rule for their own output, and aware of the other objects they depend upon. An object knows if its output is up-to-date by comparing the current MD5 checksum of the file against the last one (stored somewhere in a small temp file).

Am I reinventing a wheel here? Is there a tool out there I should use instead of a custom lib? If not, is this a good pattern for taking care of this problem?

Rollback to Revision 1
Source Link
Jamal
  • 35.2k
  • 13
  • 134
  • 238

edit: some bug fixes

edit: style fix

import os
import path
from md5 import md5 as stdMd5

HASHDIR# path.root is the root of our repo
hashDir = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    OUTPATHoutPath = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(HASHDIRhashDir, md5(self.OUTPATHoutPath))

    def getOutHash(self):
        with open(self.OUTPATHoutPath) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updatedisUpToDate(self):
        if not os.path.exists(self.OUTPATHoutPath):
            return TrueFalse
        if not os.path.exists(self.hashPath):
            return TrueFalse
        with open(self.outPath) as f:
            fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash !=== fileHash

    def storeHash(self):
        if not os.path.exists(HASHDIRhashDir):
            os.makedirs(HASHDIRhashDir)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        argsinputPathes = dict([(key, inp.get()) for key, inp in self.INPUTSinputs.items())
        updateNeeded = any(inp.updated for inp in self.INPUTS.values()])
        for inpif innot self.INPUTS.valuesisUpToDate():
            inpself.storeHashfunc()
        args['out'] = selfoutPath=self.OUTPATH
        ifoutPath, updateNeeded:**inputPathes)
            self.funcstoreHash(**args)
        return self.OUTPATHoutPath

class StableSrc(BaseRule):
    'source file that never change' 

    INPUTSinputs = {} 

    def __init__(self, path):
        self.OUTPATHoutPath = path
    @property
    def updatedisUpToDate(self):
        return False
True
class Src(BaseRule):
    INPUTS = {}
    def func(self, out):
        return None
    def __init__rule(self_inputs, path):
        self.OUTPATH = path

def rule(**kwargs_outPath):
    'decorator used to declare dependencies' 

    class Rule(BaseRule):
        INPUTS = dict((key, val) for key, val in kwargs.items() if keyinputs != 'out')_inputs
        OUTPATHoutPath = kwargs['out']_outPath

    return Rule

def copyTest():
    'test function' 

    import shutil 

    @rule(inp=Src{'inp' : StableSrc('test.txt')}, out='test2'test2.txt')
    def newTest(inp, outoutPath):
        print 'copy'
        shutil.copy(inp, outoutPath) 


    @rule(inp=newTest{'inp' : newTest}, out='test3'test3.txt')
    def newNewTest(inp, outoutPath):
        print 'copy2'
        shutil.copy(inp, outoutPath)
    newNewTest.get()
    return newNewTest.storeHashget()

if __name__ == '__main__':
    copyTest() # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()

edit: some bug fixes

edit: style fix

import os
import path
from md5 import md5 as stdMd5

HASHDIR = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    OUTPATH = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(HASHDIR, md5(self.OUTPATH))

    def getOutHash(self):
        with open(self.OUTPATH) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updated(self):
        if not os.path.exists(self.OUTPATH):
            return True
        if not os.path.exists(self.hashPath):
            return True
        fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash != fileHash

    def storeHash(self):
        if not os.path.exists(HASHDIR):
            os.makedirs(HASHDIR)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        args = dict((key, inp.get()) for key, inp in self.INPUTS.items())
        updateNeeded = any(inp.updated for inp in self.INPUTS.values())
        for inp in self.INPUTS.values():
            inp.storeHash()
        args['out'] = self.OUTPATH
        if updateNeeded:
            self.func(**args)
        return self.OUTPATH

class StableSrc(BaseRule):
    'source file that never change'
    INPUTS = {}
    def __init__(self, path):
        self.OUTPATH = path
    @property
    def updated(self):
        return False

class Src(BaseRule):
    INPUTS = {}
    def func(self, out):
        return None
    def __init__(self, path):
        self.OUTPATH = path

def rule(**kwargs):
    'decorator used to declare dependencies'
    class Rule(BaseRule):
        INPUTS = dict((key, val) for key, val in kwargs.items() if key != 'out')
        OUTPATH = kwargs['out']
    return Rule

def copyTest():
    'test function'
    import shutil
    @rule(inp=Src('test.txt'), out='test2.txt')
    def newTest(inp, out):
        print 'copy'
        shutil.copy(inp, out)
    @rule(inp=newTest, out='test3.txt')
    def newNewTest(inp, out):
        print 'copy2'
        shutil.copy(inp, out)
    newNewTest.get()
    newNewTest.storeHash()

if __name__ == '__main__':
    # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()
import os
import path
from md5 import md5 as stdMd5

# path.root is the root of our repo
hashDir = os.path.join(path.root, '_hashes')

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    outPath = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(hashDir, md5(self.outPath))

    def getOutHash(self):
        with open(self.outPath) as f:
            fileHash = md5(f.read())
        return fileHash

    def isUpToDate(self):
        if not os.path.exists(self.outPath):
            return False
        if not os.path.exists(self.hashPath):
            return False
        with open(self.outPath) as f:
            fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash == fileHash

    def storeHash(self):
        if not os.path.exists(hashDir):
            os.makedirs(hashDir)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        inputPathes = dict([(key, inp.get()) for key, inp in self.inputs.items()])
        if not self.isUpToDate():
            self.func(outPath=self.outPath, **inputPathes)
            self.storeHash()
        return self.outPath

class StableSrc(BaseRule):
    'source file that never change' 

    inputs = {} 

    def __init__(self, path):
        self.outPath = path

    def isUpToDate(self):
        return True


def rule(_inputs, _outPath):
    'decorator used to declare dependencies' 

    class Rule(BaseRule):
        inputs = _inputs
        outPath = _outPath

    return Rule

def copyTest():
    'test function' 

    import shutil 

    @rule({'inp' : StableSrc('test.txt')}, 'test2.txt')
    def newTest(inp, outPath):
        print 'copy'
        shutil.copy(inp, outPath) 


    @rule({'inp' : newTest}, 'test3.txt')
    def newNewTest(inp, outPath):
        print 'copy2'
        shutil.copy(inp, outPath)

    return newNewTest.get()

if __name__ == '__main__':
    copyTest() # will copy test.txt to test2.txt and test3.txt. If ran a second time, won't copy anything
Tweeted twitter.com/#!/StackCodeReview/status/116587934521626624
deleted 28 characters in body
Source Link

edit: style fix

import os
import path
from md5 import md5 as stdMd5

hashDirHASHDIR = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    outPathOUTPATH = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(hashDirHASHDIR, md5(self.outPathOUTPATH))

    def getOutHash(self):
        with open(self.outPathOUTPATH) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updated(self):
        if not os.path.exists(self.outPathOUTPATH):
            return True
        if not os.path.exists(self.hashPath):
            return True
        with open(self.outPath) as f:
            fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash != fileHash

    def storeHash(self):
        if not os.path.exists(hashDirHASHDIR):
            os.makedirs(hashDirHASHDIR)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        args = dict([(key, inp.get()) for key, inp in self.inputsINPUTS.items()])
        updateNeeded = any(inp.updated for inp in self.inputsINPUTS.values())
        for inp in self.inputsINPUTS.values():
            inp.storeHash()
        args['out'] = self.outPathOUTPATH
        if updateNeeded:
            self.func(**args)
        return self.outPathOUTPATH

class StableSrc(BaseRule):
    'source file that never change'
    inputsINPUTS = {}
    def __init__(self, path):
        self.outPathOUTPATH = path
    @property
    def updated(self):
        return False

class Src(BaseRule):
    inputsINPUTS = {}
    def func(self, out):
        return None
    def __init__(self, path):
        self.outPathOUTPATH = path

def rule(**kwargs):
    'decorator used to declare dependencies'
    class Rule(BaseRule):
        inputsINPUTS = dict([(key, val) for key, val in kwargs.items() if key != 'out']'out')
        outPathOUTPATH = kwargs['out']
    return Rule

def copyTest():
    'test function'
    import shutil
    @rule(inp=Src('test.txt'), out='test2.txt')
    def newTest(inp, out):
        print 'copy'
        shutil.copy(inp, out)
    @rule(inp=newTest, out='test3.txt')
    def newNewTest(inp, out):
        print 'copy2'
        shutil.copy(inp, out)
    newNewTest.get()
    newNewTest.storeHash()

if __name__ == '__main__':
    # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()
import os
import path
from md5 import md5 as stdMd5

hashDir = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    outPath = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(hashDir, md5(self.outPath))

    def getOutHash(self):
        with open(self.outPath) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updated(self):
        if not os.path.exists(self.outPath):
            return True
        if not os.path.exists(self.hashPath):
            return True
        with open(self.outPath) as f:
            fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash != fileHash

    def storeHash(self):
        if not os.path.exists(hashDir):
            os.makedirs(hashDir)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        args = dict([(key, inp.get()) for key, inp in self.inputs.items()])
        updateNeeded = any(inp.updated for inp in self.inputs.values())
        for inp in self.inputs.values():
            inp.storeHash()
        args['out'] = self.outPath
        if updateNeeded:
            self.func(**args)
        return self.outPath

class StableSrc(BaseRule):
    'source file that never change'
    inputs = {}
    def __init__(self, path):
        self.outPath = path
    @property
    def updated(self):
        return False

class Src(BaseRule):
    inputs = {}
    def func(self, out):
        return None
    def __init__(self, path):
        self.outPath = path

def rule(**kwargs):
    'decorator used to declare dependencies'
    class Rule(BaseRule):
        inputs = dict([(key, val) for key, val in kwargs.items() if key != 'out'])
        outPath = kwargs['out']
    return Rule

def copyTest():
    'test function'
    import shutil
    @rule(inp=Src('test.txt'), out='test2.txt')
    def newTest(inp, out):
        print 'copy'
        shutil.copy(inp, out)
    @rule(inp=newTest, out='test3.txt')
    def newNewTest(inp, out):
        print 'copy2'
        shutil.copy(inp, out)
    newNewTest.get()
    newNewTest.storeHash()

if __name__ == '__main__':
    # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()

edit: style fix

import os
import path
from md5 import md5 as stdMd5

HASHDIR = os.path.join(path.root, '_hashes') #path.root is our repo root

def md5(toHash):
    return stdMd5(toHash).hexdigest()

class BaseRule(object):
    '''base object, managing work like checking if results are up to date, and
    calling production rules if they are not'''

    OUTPATH = None

    def __init__(self, func):
        self.func = func

    @property
    def hashPath(self):
        return os.path.join(HASHDIR, md5(self.OUTPATH))

    def getOutHash(self):
        with open(self.OUTPATH) as f:
            fileHash = md5(f.read())
        return fileHash

    @property
    def updated(self):
        if not os.path.exists(self.OUTPATH):
            return True
        if not os.path.exists(self.hashPath):
            return True
        fileHash = self.getOutHash()
        with open(self.hashPath) as f:
            storedHash = f.read().strip()
        return storedHash != fileHash

    def storeHash(self):
        if not os.path.exists(HASHDIR):
            os.makedirs(HASHDIR)
        with open(self.hashPath, 'w') as f:
            f.write(self.getOutHash())

    def get(self):
        args = dict((key, inp.get()) for key, inp in self.INPUTS.items())
        updateNeeded = any(inp.updated for inp in self.INPUTS.values())
        for inp in self.INPUTS.values():
            inp.storeHash()
        args['out'] = self.OUTPATH
        if updateNeeded:
            self.func(**args)
        return self.OUTPATH

class StableSrc(BaseRule):
    'source file that never change'
    INPUTS = {}
    def __init__(self, path):
        self.OUTPATH = path
    @property
    def updated(self):
        return False

class Src(BaseRule):
    INPUTS = {}
    def func(self, out):
        return None
    def __init__(self, path):
        self.OUTPATH = path

def rule(**kwargs):
    'decorator used to declare dependencies'
    class Rule(BaseRule):
        INPUTS = dict((key, val) for key, val in kwargs.items() if key != 'out')
        OUTPATH = kwargs['out']
    return Rule

def copyTest():
    'test function'
    import shutil
    @rule(inp=Src('test.txt'), out='test2.txt')
    def newTest(inp, out):
        print 'copy'
        shutil.copy(inp, out)
    @rule(inp=newTest, out='test3.txt')
    def newNewTest(inp, out):
        print 'copy2'
        shutil.copy(inp, out)
    newNewTest.get()
    newNewTest.storeHash()

if __name__ == '__main__':
    # will copy test.txt to test2.txt and test3.txt.
    # If ran a second time, won't copy anything
    copyTest()
bug fixes
Source Link
Loading
Source Link
Loading