Splitting up large SQLAlchemy model

Question

Anybody have advice on splitting up a large SQLAlchemy model into smaller parts? I have a ~2000 line model called Article that is becoming difficult to manage. We often have to scroll and scroll to find code that needs updating. It's doing four main things, with each of these methods calling smaller methods, all on the same model:

self.add() - set various fields by calling related models, machine learning APIs, or other external APIs (60% of code)
self.update() - updates some fields via a redis queue (15% of code)
self.to_dict() - creates a dictionary representation of the model that can be sent to elasticsearch (20% of code)
self.store() - stores the result of self.to_dict() in elasticsearch (5% of code)

Here's how I am thinking about fixing it:

1. Rather than a single file models/article.py, create a package like: /article, __init__.py, add.py, models.py, store.py, to_dict.py, etc. In this case the model is simple, and all the functionality is moved into other files as functions. The con is that it could be harder to refer to shared code that was part of the model, so I'll need to keep some properties on the model. (preferred option)
2. Spread the code out among other models. A lot of the time we are referring to another model in self.add(). What I could do is move much of the code from Article into that other related model.
3. Create subclasses of Article that perform the main functionality. This is what ChatGPT seems to recommend, but I like it the least due to the added complexity of subclasses.

What option would you prefer the most? Any idea I'm missing?

This question is hard to answer without knowing the specifics, mainly the data dependencies inside the current model. I would figure out what depends on what, and which moving parts need access to which information, and if the right answer isn't clear after that, come back here to ask a more specific question. (As an aside, seeing "ChatGPT" and "recommend" in the same sentence is alarming, much like seeing someone get financial advice from a Magic 8 Ball.)
– Jasmijn, Commented Jan 2, 2024 at 8:15

J_H · Accepted Answer · 2024-01-03 19:12:12Z

single responsibility

I'm accustomed to seeing lots of SqlAlchemy model classes that are narrowly focused on representing rows and on relational integrity (like foreign keys). Such classes will only occasionally need a """docstring""", as they are self explanatory.

But in your case, it sounds like the first thing you need to do is write a one-sentence docstring explaining what the single responsibility of that code is.

RDBMS integrity, like UNIQUE index or FK, is definitely within scope.

Here are some things that I feel are out of scope:

supporting JOINs that a higher level library routine, or app level routine, should handle. (Consider creating a named JOIN query by issuing CREATE VIEW and defining a SqlAlchemy model for that relation.)
ML APIs (Consider creating ML report table(s) which JOIN to the Article table using FK.)
other APIs (same as for ML APIs)
redis queuing (other than possibly offering a field where some higher level library handler can store an essential redis ID, if PK does not suffice. Or maybe we want a redis queueing table on the side, with a FK back to Article.)
any presentation layer code that converts to an ES-compatible dict
any ES handlers that interact with an ElasticSearch backend

In other words, the Article model should focus on Codd & Date relational theory, without being distracted by the many things in the world that will sometimes happen to interact with a stored relation.
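As a rough illustration of what such a narrowly focused model might look like, here is a minimal sketch. The column names, the `uq_article_url` constraint, and the `MLReport` side table are assumptions made up for the example, not details from the question.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text, UniqueConstraint
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Article(Base):
    """One row per article; owns relational integrity and nothing else."""

    __tablename__ = "article"
    # UNIQUE constraint: in scope, because it is RDBMS integrity.
    __table_args__ = (UniqueConstraint("url", name="uq_article_url"),)

    id = Column(Integer, primary_key=True)
    url = Column(String(500), nullable=False)
    title = Column(String(300), nullable=False)
    body = Column(Text, nullable=False, default="")

    # ML output lives in its own table and points back here via a FK.
    ml_reports = relationship("MLReport", back_populates="article")


class MLReport(Base):
    """One row per ML result for an article (hypothetical side table)."""

    __tablename__ = "ml_report"

    id = Column(Integer, primary_key=True)
    article_id = Column(Integer, ForeignKey("article.id"), nullable=False)
    label = Column(String(100), nullable=False)

    article = relationship("Article", back_populates="ml_reports")
```

Note that nothing here calls an API, touches redis, or knows about ES; code in higher layers would create `MLReport` rows after doing that work.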

transition

I recommend you do this using a new name, like class Article1, as it clearly will be a Breaking Change. Now, go clean up all the breakage by implementing additional code layered atop the Article1 model. Hopefully your existing automated integration tests will help guide this effort.

Alternatively, break things a little bit at a time. Start by evicting any ES interactions to some new layer. Then any dict presentation. Then APIs, including those for machine learning. Then calls to other models (prefer declarative SQL, like FK or a VIEW that contains JOIN(s), in order to tell SqlAlchemy and the RDBMS backend how those models relate to an article).

When you add new code or a new module, be sure to write a docstring for it. That way you, and future maintenance engineers, will know what is in- or out-of-scope, and won't be tempted to def kitchen_sink() in a place where it doesn't belong.

layers

Spread the code out among other models. ... move much of the code from Article into other related model.

Your second approach is the one I most closely agree with. But it sounds like you contemplate pushing e.g. ES interactions into some other SqlAlchemy model that currently happens to be "small". Don't do that.

Organize your code in layers, and only allow dependencies in one direction: down the abstraction hierarchy. So "app" is most abstract and sits at the top, above a "search" layer that interacts with ES, and that's above a "persistence" layer that knows about table rows. Possibly a machine_learning.py module would similarly be in the middle, and would not interact with "search" at all. The OP doesn't offer enough details so it's hard to say exactly where the isolation boundaries would go. Occasionally audit your import statements to verify that code only depends on lower abstraction layers. If you ever encounter a cyclic dependency, you will know at once that you have done the Wrong Thing and should undo it.
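The import audit could be partly automated. The sketch below uses the standard library's `ast` module to flag imports that do not point strictly down the hierarchy. The layer names and their ranks in `LAYER_RANK` are assumptions chosen to mirror the example layers above; adjust them to your real package names.

```python
from __future__ import annotations

import ast

# Hypothetical layer ranks: a higher number means more abstract.
# "search" and "machine_learning" share a rank, so neither may import the other.
LAYER_RANK = {"app": 3, "search": 2, "machine_learning": 2, "persistence": 1}


def imported_top_level_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def layering_violations(layer: str, source: str) -> list[str]:
    """List layers imported by `layer` that are not strictly below it."""
    rank = LAYER_RANK[layer]
    return sorted(
        name
        for name in imported_top_level_modules(source)
        if name in LAYER_RANK and name != layer and LAYER_RANK[name] >= rank
    )
```

For example, `layering_violations("persistence", "from search.index import store")` reports `["search"]`, because the persistence layer would be reaching upward. A check like this could run over each module's source in CI.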

Be sure to write automated test suites which exercise the newly created layers. Keep running the tests while you're authoring new target code, and you'll wind up with decent coverage. Pointing the SqlAlchemy connect string at a throw-away sqlite RDBMS file offers a very convenient way to produce small integration tests. If, for example, your ES backend is not running or otherwise unavailable, the other tests should still offer a Green bar.
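A small integration test along those lines might look like the following sketch. The `Article` class here is a stand-in with a made-up `title` column, and the test uses an in-memory SQLite database; a throw-away file URL such as `sqlite:///scratch.db` would work the same way.

```python
import unittest

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Article(Base):
    """Stand-in for the slimmed-down model (columns are hypothetical)."""

    __tablename__ = "article"

    id = Column(Integer, primary_key=True)
    title = Column(String(300), nullable=False)


class ArticlePersistenceTest(unittest.TestCase):
    def setUp(self):
        # "sqlite://" is an in-memory database that vanishes after the test.
        self.engine = create_engine("sqlite://")
        Base.metadata.create_all(self.engine)

    def tearDown(self):
        self.engine.dispose()

    def test_round_trip(self):
        # Write in one session, read back in another.
        with Session(self.engine) as session:
            session.add(Article(title="Hello"))
            session.commit()
        with Session(self.engine) as session:
            titles = [a.title for a in session.query(Article).all()]
        self.assertEqual(titles, ["Hello"])
```

Because this test touches only the persistence layer, it needs no ES, redis, or ML service to pass.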

documentation

Write a ReadMe.md or Confluence wiki entry describing

the old design
pain points induced by the old design
the new design (can be an evolving section!)
architectural summary of the new design

The first three will be helpful to you and your colleagues as you work on paying down some of the accumulated technical debt. The fourth item will help future maintenance engineers figure out if they are "doing it the Right Way" or "falling down that same old rabbit hole again" when they add new features and fix bugs.

Dogweather · Accepted Answer · 2024-01-03 05:56:40Z

IMO the frame of mind you could have going into this refactor is: "the SQLAlchemy model is responsible for just two things: (1) CRUD DB operations, and (2) validation: ensuring its data is consistent. Everything else can move to a better place."

There are some established patterns for handling these issues.

self.to_dict() - creates a dictionary representation of the model that can be sent to elasticsearch (20% of code)

This is a great candidate to move to a Presenter. At its core, a Presenter just delegates all method calls to the 'presented' object. Then, you move code like to_es_dict() to the presenter. Creating a new presenter usually looks like: object = present(object).
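Here is a minimal sketch of that idea in Python, assuming delegation via `__getattr__`. The class name `ArticlePresenter`, the fields used in `to_es_dict()`, and the `SimpleNamespace` stand-in for a model instance are all invented for illustration.

```python
from types import SimpleNamespace


class ArticlePresenter:
    """Wraps an article and adds presentation-only methods."""

    def __init__(self, article):
        self._article = article

    def __getattr__(self, name):
        # Called only when normal lookup fails, so anything not defined
        # on the presenter is delegated to the wrapped article.
        return getattr(self._article, name)

    def to_es_dict(self):
        """Build the dictionary that the search layer will index."""
        return {"id": self._article.id, "title": self._article.title.strip()}


def present(article):
    """Wrap an article in its presenter."""
    return ArticlePresenter(article)


# Usage with a stand-in object instead of a real model instance.
article = present(SimpleNamespace(id=7, title="  Hello  ", author="Ann"))
```

After wrapping, `article.author` still works through delegation, while `article.to_es_dict()` lives on the presenter and the model itself no longer needs to know about ES.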

self.store() - stores the result of self.to_dict() in elasticsearch (5% of code)

This, or most of it, can easily be moved to an es module because it's data layer code. Search index updates are often decoupled from DB updates, so I think this would be its own library. That is, a business logic function would handle this.
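One possible shape for such a module is sketched below. The `SearchClient` protocol is a hypothetical minimal interface defined for this example, not the API of any particular Elasticsearch client; you would adapt `store_article()` to whatever client your project actually uses. The default index name `"articles"` is also an assumption.

```python
from __future__ import annotations

from typing import Any, Protocol


class SearchClient(Protocol):
    """Minimal interface this module needs (hypothetical, for illustration)."""

    def index(self, *, index: str, id: str, document: dict[str, Any]) -> Any: ...


def store_article(client: SearchClient, doc: dict[str, Any], index: str = "articles") -> None:
    """Send one already-built article dictionary to the search index."""
    client.index(index=index, id=str(doc["id"]), document=doc)


class FakeSearchClient:
    """In-memory stand-in, useful when no search backend is running."""

    def __init__(self) -> None:
        self.docs: dict[tuple[str, str], dict[str, Any]] = {}

    def index(self, *, index: str, id: str, document: dict[str, Any]) -> None:
        self.docs[(index, id)] = document
```

Passing the client in as an argument keeps the function free of global state, and the fake lets tests for this layer run without a live backend.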

self.add() - set various fields by calling related models, machine learning APIs, or other external APIs (60% of code)

I would write functions and strongly typed DTO classes for all these various calls. These functions would do all the API work. I'd pass these to the S.A. model in its __new__().
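A sketch of that split, under stated assumptions: `TopicResult`, `classify_topic()`, and the `call_api` callable are invented names standing in for one of the ML or external API calls, and the plain `Article` class stands in for the SQLAlchemy model. The sketch hands the DTO to the model through its ordinary constructor.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TopicResult:
    """Typed result of a (hypothetical) topic-classification API call."""

    topic: str
    confidence: float


def classify_topic(text: str, call_api: Callable[[str], dict]) -> TopicResult:
    """Do the API work outside the model and return a typed DTO."""
    raw = call_api(text)
    return TopicResult(topic=str(raw["topic"]), confidence=float(raw["confidence"]))


class Article:
    """Stand-in for the model; it only receives finished values."""

    def __init__(self, title: str, topic: TopicResult):
        self.title = title
        self.topic = topic.topic
        self.topic_confidence = topic.confidence
```

With this arrangement the model never makes a network call, and `classify_topic()` can be tested by passing a stub for `call_api`.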

self.update() - updates some fields via a redis queue (15% of code)

These business logic features can also be moved to functions.
