
GitHub here: https://github.com/cmancone/mygrations

I'm working on a different kind of database migration system for helping manage web applications. I asked a question about part of the system just yesterday, and now I'm looking to get the main processing loop looked at. You can read my post from yesterday (or the GitHub readme) for more background, but what I'm posting about here is the real center of the whole system. This is the "migration" calculator. It takes in one or two databases, finds the differences between them, calculates the operations needed to make the structure of one match the structure of the other, and (most importantly!) takes into account foreign key constraints so that a 1215 error won't be triggered during the migration.

The latter constraint means that it has to be smart about the order of operations. It can't add a table until all of that table's dependencies have been added, it can't remove a column if another table's foreign key still depends on that column, it has to account for tables that might depend upon each other, and so on.

I also mentioned that this takes one or two databases as input. It might seem strange that it accepts only one database as input, but this is actually a necessary special case: if you have an empty database and need to create a bunch of tables on it, the system simply has to consider the proper order of table creation, and doesn't have to worry about changing/dropping tables, which complicates the process of determining the right order of operations.

Quick notes regarding codereview standards:

  1. I've converted to PEP8 before posting here, but the actual project uses a slight variation on PEP8. If you notice any deviations from PEP8, feel free to edit the question.
  2. I have extensive comments in the main processing loop. I normally try to avoid that (comment noise and what-not), but since this is fairly complicated (and in particular some steps rely implicitly on previous steps), I figured it was better to be very detailed with comments here

Things I'm specifically looking for:

  1. Mainly, this doesn't seem very elegant to me. The process by which I iteratively decide which operations to apply to the database, and when, seems fairly ad-hoc to me and, as a result, error-prone (although it is passing all the tests I've written thus far). I'm effectively just looping through all the tables repeatedly to find ones that have all their foreign key constraints fulfilled, adding those, and then repeating until I'm out of tables or nothing more can be added (i.e. tables which mutually depend upon each other). I had initially considered coming up with a dependency tree and using that to determine operation order, but didn't end up doing so (partially because I didn't think it would help with determining the order while modifying tables). I'm especially interested in any suggestions that would help me redesign the algorithm to be less hacky, and would be happy to start up a bounty to give extra rep to an answer that gives coherent help in that area.
  2. I'm especially unhappy with the calling sequence of _process_adds and _process_updates, as they both receive input, return values, but also modify their input in-place. I normally avoid modifying input parameters like the plague, and this is the only place in the entire project where it happens. It ended up happening primarily because those methods don't really make good functions: they are tightly integrated with all the other variables in the main _process method, and I split them off only so that the _process method would be a bit more readable. Again, this suggests to me that a more streamlined algorithm is in order (a rough sketch of the calling shape I'd prefer follows this list).

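To illustrate point 2 a bit more concretely, here is a rough sketch (hypothetical, not the current implementation) of the calling shape I would prefer, where the helper returns the remaining work instead of mutating its arguments (tracking_db would need similar treatment, or at least a documented contract):

    # Hypothetical reshaping of _process_adds: work on a copy and return
    # the leftovers, so nothing the caller passed in gets modified.
    def _process_adds(self, tracking_db, tables_to_add):
        remaining = list(tables_to_add)
        errors_1215 = []
        operations = []
        # ... same FK-resolution loop as in the real code, removing names
        # from `remaining` as their constraints become fulfilled ...
        return errors_1215, operations, remaining

    # and in _process:
    errors_1215, operations, tables_to_add = self._process_adds(tracking_db, tables_to_add)
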
Onto the code:

import copy

from mygrations.formats.mysql.definitions.database import database
from mygrations.formats.mysql.mygrations.operations.alter_table import alter_table
from mygrations.formats.mysql.mygrations.operations.add_constraint import add_constraint
from mygrations.formats.mysql.mygrations.operations.remove_table import remove_table

class mygration:
    """ Creates a migration plan to update a database to a given spec.

    If only one database is passed in, it treats it as a structure to migrate
    to starting from a blank slate, which primarily means just figuring out
    what order to add tables in to account for foreign key constraints.

    If two databases are passed in, then it will treat the second as the
    current database structure, and will figure out the steps needed to
    update that second database to match the structure of the first.

    The general steps are:

    1. Check for foreign key errors: if the database we are trying to migrate to can't fulfill its own foreign key constraints, that is a show stopper
    2. Loop through tables to be added, "creating" them as their foreign key constraints are filled, and ignoring any that have interlocking dependencies
    3. Remove any FK constraints that are no longer used (done early to prevent trouble in the next step)
    4. Modify tables as needed.  If modifying tables involves adding foreign key constraints that are not fulfilled, ignore those and add them later
    5. See if we can add any remaining tables now that some table modifications have been applied
    6. If there are still any outstanding tables to add, remove any foreign key constraints that are not fulfilled and add the tables without them
    7. Apply all foreign key constraints that were previously ignored in steps 4 & 6
    8. Remove any tables that need to be removed
    """
    def __init__(self, db_to, db_from=None):
        """ Create a migration plan

        :param db_to: The target database structure to migrate to
        :param db_from: The current database structure to migrate from
        :type db_to: mygrations.formats.mysql.definitions.database
        :type db_from: mygrations.formats.mysql.definitions.database
        """

        self.db_to = db_to
        self.db_from = db_from
        [ self._errors_1215, self._operations ] = self._process()

    @property
    def operations(self):
        """ Public getter.  Returns list of operations to bring db_from to db_to

        If db_from doesn't exist then it will be a list of operations to
        create db_to.

        :returns: A list of table operations
        :rtype: [mygrations.formats.mysql.mygrations.operations.operation]
        """
        return self._operations

    @property
    def errors_1215(self):
        """ Public getter.  Returns list of 1215 errors (as strings)

        :returns: A list of 1215 error messages
        :rtype: [string]
        """
        return self._errors_1215

    def __len__(self):
        return len(self._operations)

    def __bool__(self):
        return True if len(self._operations) else False

    def __str__(self):
        return "\n".join([ str(x) for x in self._operations ])

    def __iter__(self):
        return self._operations.__iter__()

    def _differences(self, a, b):
        """
        Calculates the difference between two OrderedDicts.

        https://codereview.stackexchange.com/a/176303/140581

        Duplication!!!! (formats.mysql.create_parser).  Sue me.

        :param a: OrderedDict
        :param b: OrderedDict
        :return: (added, removed, overlap)
        """
        return (
            [key for key in b if key not in a],
            [key for key in a if key not in b],
            [key for key in a if key in b]
        )

    def _process(self):
        """ Figures out the operations (and proper order) need to get to self.db_to

        Excessively commented because there are a lot of details and this is a critical
        part of the process
        """

        # Our primary output is a list of operations, but there is more that we need
        # to make all of this happen.  We need a database to keep track of the
        # state of the database we are building after each operation is "applied"
        tracking_db = copy.deepcopy(self.db_from) if self.db_from else database()

        # a little bit of extra processing will simplify our algorithm by a good chunk.
        # The situation is much more complicated when we have a database we are migrating
        # from, because columns might be added/removed/changed, and it might be (for instance)
        # that the removal of a column breaks a foreign key constraint.  The general
        # ambiguities introduced by changes happening in different times/ways makes it
        # much more difficult to figure out when foreign key constraints can properly
        # be added without triggering a 1215 error.  The simplest way to straighten this
        # all out is to cheat: "mygrate" the "to" database all by itself.  Without a "from"
        # the operations are more straight-forward, and we can figure out with less effort
        # whether or not all FK constraints can be fulfilled.  If they aren't all fulfilled,
        # then just quit now before we do anything.  If they are all fulfilled then we
        # know our final table will be fine, so we can just split off any uncertain
        # foreign key constraints and apply them all at the end when our database is done
        # being updated.  Simple(ish)!
        if self.db_from:
            check = mygration(self.db_to)
            if check.errors_1215:
                return [ check.errors_1215, [] ]

        # First figure out the status of individual tables
        db_from_tables = self.db_from.tables if self.db_from else {}
        (tables_to_add, tables_to_remove, tables_to_update) = self._differences(db_from_tables, self.db_to.tables)

        # IMPORTANT! tracking db and tables_to_add are both passed in by reference
        # (like everything in python), but in this case I actually modify them by reference.
        # not my preference, but it makes it easier here
        (errors_1215, operations) = self._process_adds(tracking_db, tables_to_add)

        # if we have errors we are done
        if errors_1215:
            return [ errors_1215, operations ]

        # now apply table updates.  This acts differently: it returns a dictionary with
        # two sets of operations: one to update the tables themselves, and one to update
        # the foreign keys.  The latter are applied after everything else has happened.
        fk_operations = []
        split_operations = self._process_updates(tracking_db, tables_to_update)
        # remove fks first to avoid 1215 errors caused by trying to remove a column
        # that is being used in a FK constraint that hasn't yet been removed
        if split_operations['removed_fks']:
            operations.extend(split_operations['removed_fks'])
        if split_operations['kitchen_sink']:
            operations.extend(split_operations['kitchen_sink'])
        if split_operations['fks']:
            fk_operations.extend(split_operations['fks'])

        # now that we got some tables modified let's try adding tables again
        # if we have any left.  Remember that tracking_db and tables_to_add
        # are modified in-place.  The point here is that there may be some
        # tables to add that we were not able to add before because they
        # relied on adding a column to a table before a foreign key could
        # be supported.
        if tables_to_add:
            (errors_1215, more_operations) = self._process_adds(tracking_db, tables_to_add)
            if more_operations:
                operations.extend(more_operations)
            if errors_1215:
                if fk_operations:
                    operations.extend(fk_operations)
                return [ errors_1215, operations ]

        # At this point in time if we still have tables to add it is because
        # they have mutually-dependent foreign key constraints.  The way to
        # fix that is to be a bit smarter (but not too smart) and remove
        # from the tables all foreign key constraints that can't be added yet.
        # Then run the CREATE TABLE operations, and add the foreign key
        # constraints afterward
        for table_to_add in tables_to_add:
            new_table = self.db_to.tables[table_to_add]
            bad_constraints = tracking_db.unfulfilled_fks(new_table)
            new_table_copy = copy.deepcopy(new_table)
            create_fks = alter_table(table_to_add)
            for constraint in bad_constraints.values():
                create_fks.add_operation(add_constraint(constraint['foreign_key']))
                new_table_copy.remove_constraint(constraint['foreign_key'])
            operations.append(new_table_copy.create())
            fk_operations.append(create_fks)

        # process any remaining fk constraints
        if fk_operations:
            operations.extend(fk_operations)

        # finally remove any tables
        for table_to_remove in tables_to_remove:
            operations.append(remove_table(table_to_remove))
            tracking_db.remove_table(table_to_remove)

        # all done!!!
        return [ errors_1215, operations ]

    def _process_adds(self, tracking_db, tables_to_add):
        """ Runs through tables_to_add and resolves FK constraints to determine order to add tables in

        tracking_db and tables_to_add are passed in by reference and modified

        :returns: A list of 1215 error messages and a list of mygration operations
        :rtype: ( [{'error': string, 'foreign_key': mygrations.formats.mysql.definitions.constraint}], [mygrations.formats.mysql.mygrations.operations.operation] )
        """
        errors_1215 = []
        operations = []
        good_tables = {}

        # keep looping through tables as long as we find some to process
        # the while loop will stop under two conditions: if all tables
        # are processed or if we stop adding tables, which happens if we
        # have tables with mutually-dependent foreign key constraints
        last_number_to_add = 0
        while tables_to_add and len(tables_to_add) != last_number_to_add:
            last_number_to_add = len(tables_to_add)
            # note: iterate over a copy, since we remove entries from tables_to_add
            for new_table_name in list(tables_to_add):
                new_table = self.db_to.tables[new_table_name]
                bad_constraints = tracking_db.unfulfilled_fks(new_table)

                # if we found no problems then we can add this table to our
                # tracking db and add the "CREATE TABLE" operation to our list of operations
                if not bad_constraints:
                    tables_to_add.remove(new_table_name)
                    operations.append(new_table.create())
                    tracking_db.add_table(new_table)
                    continue

                # the next question is whether this is a valid constraint
                # that simply can't be added yet (because it has dependencies
                # that have not been added) or if there is an actual problem
                # with the constraint.
                if new_table_name in good_tables:
                    continue

                # If we are here we have to decide if this table is fulfillable
                # eventually, or if there is a mistake with a foreign key that
                # we can't fix.  To tell the difference we just check if the
                # database we are migrating to can fulfill these foreign keys.
                broken_constraints = self.db_to.unfulfilled_fks(new_table)
                if not broken_constraints:
                    good_tables[new_table_name] = True
                    continue

                # otherwise it is no good: record as such
                tables_to_add.remove(new_table_name)
                for error in broken_constraints.values():
                    errors_1215.append(error['error'])

        return (errors_1215, operations)

    def _process_updates(self, tracking_db, tables_to_update):
        """ Runs through tables_to_update and resolves FK constraints to determine order to add them in

        tracking_db is passed in by reference and modified

        This doesn't return a list of 1215 errors because those would have been
        taken care of on the first run-through, when the "to" database was mygrated
        by itself.  Instead, this separates alters and addition/modification of
        foreign key updates into different operations so the foreign key updates
        can be processed separately.

        :returns: a dict
        :rtype: {'removed_fks': list, 'fks': list, 'kitchen_sink': list}
        """

        tables_to_update = tables_to_update[:]

        operations = {
            'removed_fks':  [],
            'fks':          [],
            'kitchen_sink': []
        }

        for update_table_name in tables_to_update:
            target_table = self.db_to.tables[update_table_name]
            source_table = self.db_from.tables[update_table_name]

            more_operations = source_table.to(target_table, True)
            if 'removed_fks' in more_operations:
                operations['removed_fks'].append(more_operations['removed_fks'])
                for operation in more_operations['removed_fks']:
                    tracking_db.apply_operation(update_table_name, operation)
            if 'fks' in more_operations:
                operations['fks'].append(more_operations['fks'])
                for operation in more_operations['fks']:
                    tracking_db.apply_operation(update_table_name, operation)
            if 'kitchen_sink' in more_operations:
                operations['kitchen_sink'].append(more_operations['kitchen_sink'])
                for operation in more_operations['kitchen_sink']:
                    tracking_db.apply_operation(update_table_name, operation)

        return operations

To be clear, this class is responsible for mapping out the operations required to bring one database up-to-spec with another (or to generate a database from scratch). Otherwise, it doesn't do anything itself. Its primary output comes from its errors_1215 and operations properties, which are then used by other classes to update an actual database, present a migration plan, etc.

It primarily interacts with database objects: the class that I posted for review in my last post (Resolving MySQL 1215 errors in a declarative MySQL migration system).
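
For a sense of how it gets used, here is a hypothetical driver (the database objects come from the project's parsers; `target_structure` and `current_database` are made-up names):

    # Hypothetical usage; both arguments are
    # mygrations.formats.mysql.definitions.database objects built elsewhere.
    migration = mygration(target_structure, current_database)
    if migration.errors_1215:
        for error in migration.errors_1215:
            print('1215 error: %s' % error)
    elif migration:                      # __bool__: are there any operations?
        for operation in migration:     # __iter__ yields operations in order
            print(str(operation))       # e.g. CREATE TABLE / ALTER TABLE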

  • You complain of ad hocery. Ok, fair enough, but what's implemented seems a sensible approach. Perhaps you could refine your complaint: I'm concerned about time complexity. Or guaranteed termination. Or correctly identifying a cycle when given incorrect inputs. From my perspective, I'm hoping to read comments describing loop invariants, and was slightly disappointed I didn't see topo sort mentioned: en.wikipedia.org/wiki/Topological_sorting
    – J_H
    Commented Oct 12, 2017 at 19:42
  • I'm primarily concerned about overall time complexity and an algorithm that is easy to "reason about". I initially planned on implementing something like topological sorting (although I hadn't gotten so far as researching the options), but ideally I would like to find something that can handle both proper add-table order and proper modify-table order. I have a suspicion that topological sorting will only do the former (although that wouldn't stop me from using it if it helps). I'm also still suspicious about this algorithm failing on edge cases, but haven't finished searching for them yet.
    Commented Oct 12, 2017 at 19:49
  • Just a quick GitHub tip; please keep your source code on your main branch. The master branch is pointless as you can just move over its files.
    – T145
    Commented Oct 15, 2017 at 7:40
  • @T145 I don't follow your last sentence. Care to clarify? That being said, I'm following the git-flow model. It's a bit of overkill here, but it is how I keep us organized at work, and so I used it here out of habit. The master branch will have stuff just as soon as I have something that is releasable.
    Commented Oct 15, 2017 at 11:19
  • As it is now I figured the branch would remain empty indefinitely, and I'm not a fan of leaving the main branch empty and scattering the project across multiple other branches. Even so, it's generally good practice to keep the part of your code that's under active development as the main branch, even if it's not the final product.
    – T145
    Commented Oct 15, 2017 at 17:29

1 Answer

from mygrations.formats.mysql.definitions.database import database
from mygrations.formats.mysql.mygrations.operations.alter_table import alter_table
from mygrations.formats.mysql.mygrations.operations.add_constraint import add_constraint
from mygrations.formats.mysql.mygrations.operations.remove_table import remove_table

No biggie, but these names seem inconveniently long. It's common for __init__.py, using relative imports, to offer shorter names to callers.

If only one database is passed in, ... blank slate.

It seems like the public API is more complex to explain than necessary. Maybe it should always accept two databases, and the docs could helpfully describe how to pass in an empty "blank slate" object for the DB init case. I was kind of hoping to see __init__() create such an object if db_from is None, as that would simplify the explanation. Turns out it shows up much later as else database().
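
A minimal sketch of that idea, reusing your names and assuming `database()` constructs a valid empty structure:

    # Sketch: normalize to "always two databases" at the boundary, so the
    # rest of the class never has to ask whether db_from exists.
    def __init__(self, db_to, db_from=None):
        self.db_to = db_to
        self.db_from = db_from if db_from is not None else database()
        self._errors_1215, self._operations = self._process()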

... general steps ... 4. modify, 5. maybe add, 6. remove / add

Yes, I agree with you, that sounds ad hocish. Maybe an early test case was winning after applying those steps, so you adopted them. I'm frankly not convinced that it always wins. I'm worried a devious adversary could use those steps to design FK cycles and similar nonsense.

I advocate in favor of a topo sort. I would love to see an argument along the lines of "with each invocation, a certain work list shrinks by at least one" so we're convinced it will (efficiently) terminate. There's also the matter of guaranteeing we detect invalid inputs, such as cyclic dependencies, and reporting as a fatal error. Kahn or DFS may be relevant. Summary: I'm on your side, but I'm afraid I'm not yet convinced.
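
To make that concrete, here is a minimal Kahn-style sketch, deliberately standalone rather than written against your classes: `depends_on` maps each table name to the set of tables its foreign keys reference. Whatever remains when the queue drains is exactly the set of tables involved in cycles, i.e. your "create without FKs, add constraints afterward" bucket:

    from collections import deque

    def creation_order(depends_on):
        """Kahn's algorithm.  Returns (ordered, leftover); a non-empty
        leftover means those tables sit on a dependency cycle."""
        pending = {table: set(deps) for table, deps in depends_on.items()}
        ready = deque(t for t, deps in pending.items() if not deps)
        ordered = []
        while ready:
            table = ready.popleft()
            ordered.append(table)
            del pending[table]
            for other, deps in pending.items():
                if table in deps:
                    deps.remove(table)
                    if not deps:          # last dependency just got satisfied
                        ready.append(other)
        return ordered, set(pending)

    # creation_order({'users': set(), 'posts': {'users'}, 'tags': {'posts'}})
    # -> (['users', 'posts', 'tags'], set())

Note the invariant we get for free: each pass through the while loop removes exactly one table from `pending`, so termination and the cost bound are immediate.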

Clearly you have a relationship with the end user, who can send you both "good" and "bad" (impossible) inputs. I wonder if there is some formalism you could impose on the user, where a "batched" set of, say, three schema column changes, could be viewed as a shorthand way of expressing three distinct transactions, each making exactly one column change. Then the effect of your algorithm is to discover the proper, legal sequence of changes, or to make a sound pronouncement of "impossible" for bad inputs. That is, we define a batch of several updates as well-formed if the user could have put them in some well-formed sequence of individual updates, even though he didn't tell you the sequence.

There is a combinatorial aspect to discovering such a sequence. You might find it convenient to simplify the problem by specifying that the user shall only follow certain orderings for the unseen individual transactions. For example, when adding / deleting columns A & B, where A appears before B, valid transactions would always deal first with A and then after succeeding would deal with B, so your algorithm need not consider the other sequence.

    If db_from doesn't exist then it will be a list of operations to
    create db_to.

Good comment, perfectly clear.

Here's a nit:

def __bool__(self):
    return True if len(self._operations) else False

Please phrase as return len(self._operations) > 0

def _differences(self, a, b):

This method is very nice, and clear. a & b are sensible names, but they don't connote the arrow of time as added/removed do. Consider naming them before/after, old/new, prev/next, from/to, src/dst, whatever.

        if check.errors_1215:
            return [ check.errors_1215, [] ]

The 1st line is nice, thanks to __bool__(). The 2nd line returns a list where a pair (a 2-tuple) would be more natural.

IMPORTANT! tracking db and tables_to_add are both passed in by reference

I completely understand your trepidation here, and the need for CAPS and punctuation. But maybe forgo them, and simply create a copy of the input args?
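
For example (just a sketch; the modified copies would then travel back through the return value, which `_process` already consumes):

    # Sketch: defensive copies at the top of _process_adds
    tables_to_add = list(tables_to_add)
    tracking_db = copy.deepcopy(tracking_db)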

    (errors_1215, operations) = self._process_adds(tracking_db, tables_to_add)

Maybe lose the extra parens: errors_1215, operations = ...

    # now apply table updates.  This acts differently: it returns a dictionary with
    # two sets of operations: one to update the tables themselves, and one to update
    # the foreign keys.  The latter are applied after everything else has happened.

This is a good comment. It's not obvious to me this is the Right or Best strategy. The comment leaves me wondering if we should maybe be cycling through, identifying one operation to perform, in greedy fashion, and then we cycle through again and pick up another operation.
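
In sketch form, with `try_apply` as a hypothetical caller-supplied callback that checks an operation's preconditions against the tracking state and applies it only if they hold:

    def greedy_plan(candidates, try_apply):
        """Greedy, one operation per pass: keep scanning for the first
        operation that applies cleanly; stop when none does."""
        plan = []
        remaining = list(candidates)
        while remaining:
            applied = next((op for op in remaining if try_apply(op)), None)
            if applied is None:
                break        # stuck: cyclic or unsatisfiable preconditions
            plan.append(applied)
            remaining.remove(applied)
        return plan, remaining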

    split_operations = self._process_updates(tracking_db, tables_to_update)
    # remove fks first to avoid 1215 errors caused by trying to remove a column
    # that is being used in a FK constraint that hasn't yet been removed
    if split_operations['removed_fks']:
        operations.extend(split_operations['removed_fks'])
    if split_operations['kitchen_sink']:
        operations.extend(split_operations['kitchen_sink'])
    if split_operations['fks']:
        fk_operations.extend(split_operations['fks'])

The API doesn't feel natural here. A dict could optionally have 'fks' or the other two. This is more of a struct, so namedtuple seems the more natural way to represent the return value.
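
For instance (sketch):

    from collections import namedtuple

    # Sketch: a named, fixed-shape return value for _process_updates
    SplitOperations = namedtuple('SplitOperations', ['removed_fks', 'fks', 'kitchen_sink'])

    # and the consumers in _process then read naturally:
    #     operations.extend(split_operations.removed_fks)
    #     operations.extend(split_operations.kitchen_sink)
    #     fk_operations.extend(split_operations.fks)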

    # ... there may be some
    # tables to add that we were not able to add before because they
    # relied on adding a column to a table before a foreign key could
    # be supported.

I didn't put those words in your mouth. You're totally arguing in favor of a topo sort.

More code follows. Yeah, I'm completely with you, feels very ad hocish. As though we are likely to get nearer to the goal, much of the time, but it's not guaranteed. I really want to see a comment or assert that says a certain list is getting smaller by one each time through a loop, so we know all needed work has been accomplished once length gets to zero. I'm sorry, this might involve ditching some code and writing more from scratch. But look on the positive side, now you have a good handle on what the true requirements really are.
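
Schematically, something like this (the loop body is elided; in your code the failure branch would presumably hand the leftovers to the break-the-FKs step rather than crash):

    # Sketch: make the progress guarantee explicit in the retry loop
    while tables_to_add:
        count_before = len(tables_to_add)
        # ... one round: create every table whose FKs are now fulfilled ...
        assert len(tables_to_add) < count_before, \
            'no table added this round: mutually-dependent foreign keys?'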

Overall, this appears to be some solid well-written code. My concern is that it may be resting on the foundation of an algorithm that is a bit shaky. There is room for adding comments / asserts that make stronger assurances.

  • Thanks for all the comments! One point to clarify, which probably doesn't change the overall gist of your response: this being MySQL, circular dependencies are actually allowed. They are uncommon, but I use them myself maybe 1 out of 100 times. In such cases the only way to handle it is to add the tables without the constraints and then add the constraints (effectively, to break edges).
    Commented Oct 13, 2017 at 10:33
  • Having done some reading/thinking now on topological sorting, I think it is a good way to approach this. The only issue is that (in Wikipedia speak) I do not have a directed acyclic graph. I.e. I have directed cycles, and all the algorithms I have found so far are intended for use cases where there are no directed cycles. I have to find an algorithm that can first find and break enough edges to remove any directed cycles (effectively, decide which foreign keys to apply afterwards), and then apply a more typical algorithm. Any suggestions?
    Commented Oct 13, 2017 at 13:06
  • On having cycles vs. DAG: ok, that's weird. Fine, maybe useful, I'll grant you, but it never occurred to me that that would be valid input. Two ways out, both manual (i.e., "not my problem" from the algorithm's perspective). Just like one can do a 3NF normalization exercise, perhaps More Tables are needed so there's no cycles in production. Alternatively, keep your production schema but insist on the user supplying edge breaking advice, e.g. instead of arbitrary first/last lexical rule picking "paper" or "scissors", user specifies "rock". This could appear in column comments, or a file.
    – J_H
    Commented Oct 13, 2017 at 14:31
  • I think that breaking up the problem is the answer. Right now I'm trying to figure out if there is a way to determine where the cycles are and break them up automatically to boil it down to a standard topological sort. Then I can apply changes according to the topological sort and simply apply the foreign keys that I "broke" off previously.
    Commented Oct 13, 2017 at 14:38
  • Upon reading your vignette, it sounds to me like tasks and repeating_tasks should each take references to base_tasks. Also, perhaps rename the first one to scheduled_tasks, indicating it holds the one-offs.
    – J_H
    Commented Oct 13, 2017 at 14:49
