Start with a clearer understanding of the current code's drawbacks. You
described your current implementation as prototype code. While criticizing it,
you expressed a desire for an implementation "with smaller function size, more
concise and elegant code" and conveyed the idea that "it needs to be split into
smaller methods". Perhaps those things are true. But I would suggest something
different. The current code is plenty concise. Its primary drawback is its
algorithmic complexity and its lack of any mechanisms to help the reader
understand what's going on. There are no code comments, for example, to guide
the reader. The logic rests substantially on opaque list indexes, which carry
no meaningful information. To illustrate the last point with one line of code,
consider what something as simple as a few constants could do to enhance
readability:
# Current code: making the reader work hard.
if row[0] >= last[1] or row[2] != last[2]:
...
# Add a few constants to give meaning to those indexes.
BEGIN, END, DATA = (0, 1, 2)
...
if row[BEGIN] >= last[END] or row[DATA] != last[DATA]:
...
Smarter data, simpler code. As with your password program, you are
under-investing in data and over-investing in algorithm. As Fred Brooks, Rob
Pike, and other computing
giants have emphasized, the key to programming is data. If you select the right
data structures, and create the right data objects, algorithms tend to simplify
and code tends to become more readable. Your current data structures are bare
bones (a list of triples) -- hence the need to rely on numeric indexes rather
than meaningful and self-documenting attributes. One good move would be to
define a simple data object to represent one of those intervals/ranges/triples.
from dataclasses import dataclass
@dataclass(order = True)
class Interval:
begin: int
end: int
data: object
Speaking of data structures, consider using an interval
tree. This problem seems
well-suited for a data structure specialized to answer questions about
overlapping intervals. The
intervaltree library seems to be
active and reasonably maintained. Here's an illustration of how you might use
it. Compared to your current implementation, the code is a little bit shorter
and, much more important, a lot more readable (even without the comments). But
notice the role that the comments play in providing a narrative to guide the
reader through the process, clarifying intent, and explaining anything that
might not be directly evident from the code.
from intervaltree import IntervalTree, Interval as TreeInverval
def merge_rows(rows):
# Convert the row data to an IntervalTree.
t = IntervalTree()
for begin, end, data in rows:
# Add the row to the tree.
iv = TreeInverval(begin, end + 1, data)
t.add(iv)
# Slice all intervals in the tree at the current interval's endpoints.
t.slice(iv.begin)
t.slice(iv.end)
# After those slices, intervals in the tree that overlap the current
# interval can be replaced by new intervals holding the current data.
for ov in t.overlap(iv):
new = TreeInverval(ov.begin, ov.end, iv.data)
t.remove(ov)
t.add(new)
# Merge abutting intervals having the same data value.
# Because TreeInverval instances are immutable, we will
# convert them to our own Interval instances.
ivs = []
prev = None
for iv in sorted(Interval(*iv) for iv in t):
if prev and iv.data == prev.data and prev.end == iv.begin:
prev.end = iv.end
else:
ivs.append(iv)
prev = iv
# Convert back to the "row" data format that we started with.
# Or not: perhaps the rest of your program would benefit from
# using Interval instances rather than raw triples.
return [
[iv.begin, iv.end - 1, iv.data]
for iv in ivs
]