Return to Answer

fixed typo in code

Source Link

edited Jun 23, 2023 at 14:54

3.7k
13
48

from intervaltree import IntervalTree, Interval as TreeInvervalTreeInterval

def merge_rows(rows):
    # Convert the row data to an IntervalTree.
    t = IntervalTree()
    for begin, end, data in rows:
        # Add the row to the tree.
        iv = TreeInvervalTreeInterval(begin, end + 1, data)
        t.add(iv)

        # Slice all intervals in the tree at the current interval's endpoints.
        t.slice(iv.begin)
        t.slice(iv.end)

        # After those slices, intervals in the tree that overlap the current
        # interval can be replaced by new intervals holding the current data.
        for ov in t.overlap(iv):
            new = TreeInvervalTreeInterval(ov.begin, ov.end, iv.data)
            t.remove(ov)
            t.add(new)

    # Merge abutting intervals having the same data value.
    # Because TreeInvervalTreeInterval instances are immutable, we will
    # convert them to our own Interval instances.
    ivs = []
    prev = None
    for iv in sorted(Interval(*iv) for iv in t):
        if prev and iv.data == prev.data and prev.end == iv.begin:
            prev.end = iv.end
        else:
            ivs.append(iv)
            prev = iv

    # Convert back to the "row" data format that we started with.
    # Or not: perhaps the rest of your program would benefit from
    # using Interval instances rather than raw triples.
    return [
        [iv.begin, iv.end - 1, iv.data]
        for iv in ivs
    ]

from intervaltree import IntervalTree, Interval as TreeInverval

def merge_rows(rows):
    # Convert the row data to an IntervalTree.
    t = IntervalTree()
    for begin, end, data in rows:
        # Add the row to the tree.
        iv = TreeInverval(begin, end + 1, data)
        t.add(iv)

        # Slice all intervals in the tree at the current interval's endpoints.
        t.slice(iv.begin)
        t.slice(iv.end)

        # After those slices, intervals in the tree that overlap the current
        # interval can be replaced by new intervals holding the current data.
        for ov in t.overlap(iv):
            new = TreeInverval(ov.begin, ov.end, iv.data)
            t.remove(ov)
            t.add(new)

    # Merge abutting intervals having the same data value.
    # Because TreeInverval instances are immutable, we will
    # convert them to our own Interval instances.
    ivs = []
    prev = None
    for iv in sorted(Interval(*iv) for iv in t):
        if prev and iv.data == prev.data and prev.end == iv.begin:
            prev.end = iv.end
        else:
            ivs.append(iv)
            prev = iv

    # Convert back to the "row" data format that we started with.
    # Or not: perhaps the rest of your program would benefit from
    # using Interval instances rather than raw triples.
    return [
        [iv.begin, iv.end - 1, iv.data]
        for iv in ivs
    ]

from intervaltree import IntervalTree, Interval as TreeInterval

def merge_rows(rows):
    # Convert the row data to an IntervalTree.
    t = IntervalTree()
    for begin, end, data in rows:
        # Add the row to the tree.
        iv = TreeInterval(begin, end + 1, data)
        t.add(iv)

        # Slice all intervals in the tree at the current interval's endpoints.
        t.slice(iv.begin)
        t.slice(iv.end)

        # After those slices, intervals in the tree that overlap the current
        # interval can be replaced by new intervals holding the current data.
        for ov in t.overlap(iv):
            new = TreeInterval(ov.begin, ov.end, iv.data)
            t.remove(ov)
            t.add(new)

    # Merge abutting intervals having the same data value.
    # Because TreeInterval instances are immutable, we will
    # convert them to our own Interval instances.
    ivs = []
    prev = None
    for iv in sorted(Interval(*iv) for iv in t):
        if prev and iv.data == prev.data and prev.end == iv.begin:
            prev.end = iv.end
        else:
            ivs.append(iv)
            prev = iv

    # Convert back to the "row" data format that we started with.
    # Or not: perhaps the rest of your program would benefit from
    # using Interval instances rather than raw triples.
    return [
        [iv.begin, iv.end - 1, iv.data]
        for iv in ivs
    ]

Source Link

answered May 23, 2023 at 22:01

FMc

13.1k
2
29
40

Start with a clearer understanding of the current code's drawbacks. You described your current implementation as prototype code. While criticizing it, you expressed a desire for an implementation "with smaller function size, more concise and elegant code" and conveyed the idea that "it needs to be split into smaller methods". Perhaps those things are true. But I would suggest something different. The current code is plenty concise. Its primary drawback is its algorithmic complexity and its lack of any mechanisms to help the reader understand what's going on. There are no code comments, for example, to guide the reader. The logic rests substantially on opaque list indexes, which carry no meaningful information. To illustrate the last point with one line of code, consider what something as simple as a few constants could do to enhance readability:

# Current code: making the reader work hard.
if row[0] >= last[1] or row[2] != last[2]:
    ...

# Add a few constants to give meaning to those indexes.
BEGIN, END, DATA = (0, 1, 2)
...
if row[BEGIN] >= last[END] or row[DATA] != last[DATA]:
    ...

Smarter data, simpler code. As with your password program, you are under-investing in data and over-investing in algorithm. As Fred Brooks, Rob Pike, and other computing giants have emphasized, the key to programming is data. If you select the right data structures, and create the right data objects, algorithms tend to simplify and code tends to become more readable. Your current data structures are bare bones (a list of triples) -- hence the need to rely on numeric indexes rather than meaningful and self-documenting attributes. One good move would be to define a simple data object to represent one of those intervals/ranges/triples.

from dataclasses import dataclass

@dataclass(order = True)
class Interval:
    begin: int
    end: int
    data: object

Speaking of data structures, consider using an interval tree. This problem seems well-suited for a data structure specialized to answer questions about overlapping intervals. The intervaltree library seems to be active and reasonably maintained. Here's an illustration of how you might use it. Compared to your current implementation, the code is a little bit shorter and, much more important, a lot more readable (even without the comments). But notice the role that the comments play in providing a narrative to guide the reader through the process, clarifying intent, and explaining anything that might not be directly evident from the code.

from intervaltree import IntervalTree, Interval as TreeInverval

def merge_rows(rows):
    # Convert the row data to an IntervalTree.
    t = IntervalTree()
    for begin, end, data in rows:
        # Add the row to the tree.
        iv = TreeInverval(begin, end + 1, data)
        t.add(iv)

        # Slice all intervals in the tree at the current interval's endpoints.
        t.slice(iv.begin)
        t.slice(iv.end)

        # After those slices, intervals in the tree that overlap the current
        # interval can be replaced by new intervals holding the current data.
        for ov in t.overlap(iv):
            new = TreeInverval(ov.begin, ov.end, iv.data)
            t.remove(ov)
            t.add(new)

    # Merge abutting intervals having the same data value.
    # Because TreeInverval instances are immutable, we will
    # convert them to our own Interval instances.
    ivs = []
    prev = None
    for iv in sorted(Interval(*iv) for iv in t):
        if prev and iv.data == prev.data and prev.end == iv.begin:
            prev.end = iv.end
        else:
            ivs.append(iv)
            prev = iv

    # Convert back to the "row" data format that we started with.
    # Or not: perhaps the rest of your program would benefit from
    # using Interval instances rather than raw triples.
    return [
        [iv.begin, iv.end - 1, iv.data]
        for iv in ivs
    ]